New Tricks: Data Mining With Google Spreadsheets

March 22, 2010

Happily, I stumbled across the following link:

Now You Can Mine Data With Google Queries Too

The interesting bit is below the comic where they actually reveal a method I hadn’t thought of: 

Using a query embedded in Google Spreadsheets to mine and graph data in Google’s engine.

While the actual instructions are terse, I was able to get things up and running by visiting the actual example, and then copying and pasting the individual cells for examination.

Here is the blow by blow:

First, decide what you want to mine.  One of the examples is for income, we will use this one.

Open up Google spreadsheets and in cell A2 put (complete as printed here):

=””””&”I make $”&B2&” per year”””

[NOTE:  WordPress jacks up the quotes, so you are going to have to replace all of the quotes in the above with double quotes, or it won’t work!]

Initially it is gonna look like this “I make $ per year”.

Now in B2 put a dollar amount: 45,000.

You should see your number populate in B1 now.

Finally, the magic that actually gets the query info.

Put the following in C2:

=importXML(“http://www.google.com/search?num=100&q=”&A2,&#8221://p%5B@id=’resultStats’%5D/b%5B3%5D”)

[NOTE: Same problem here – WordPress tries to mess with the multiple quotes.  Replace all double AND single quotes manually and you will be fine, otherwise you will get an Error.]

After a brief load time you should see a number returned.  This is the number of returns that included your statement in cell A2.

Now copy and paste A2 and C2 down the line and change your values accordingly as you move down.

To create the graph, simply open “Insert->Chart” and choose your graph type.

To populate the graph with your data, make sure to clear the box right under “What Data?” and then click and drag down column C on your spreadsheet.  Make sure to remove Column C as labels.  You should see your data represented in the preview.

That’s it!  The world is now your oyster!  I can’t wait to apply this in some cases I am working on, I am still mulling over where this can be most useful, but the possibilities boggle the mind.


YouTube Struggles With a Wretched Hive of Scum and Villainy

March 5, 2009

Information Week Article: “YouTube Wrestles With Scammer-Generated Content

InformationWeek reports that YouTube is “struggling” with posted videos showing such things as stolen credit cards, PINs, etc.  They go on to talk about how difficult it is to screen video content.

A single line mentions that meta-content can be used for screening (searching for keywords that can identify the content), but a YouTube spokesman goes on to say that they rely “on our community to know our community guidelines and flag content that violates the guidelines.”

First of all, the type of community that will be looking for that niche content isn’t going to be all that quick to flag it.

Secondly,  how hard would it be to build a signature base of meta-word and behavioral screening to remove the largest portion of objectionable (illegal) content?  Here are a few ideas to think about as you read the article – feel free to post your own:

  • Spam assassin for content anyone?  Use the meta data to help weight the red flag.
  • Watch topics that users post to/visit and use this to weight a flag.  For instance, a little old lady that is concerned about “poodles” and “identity theft” will not affect the weight as much as someone looking for “Free credit card numbers” and “MS Windows licenses”.
  • Use Natural Language Processing techniques to identify and weight actual posts (remember the “StupidFilter“?).

I realize full well that these techniques can be gamed just like anything else, but it seems to me that they are viable, not so hard to implement (I use components of them in my work – although the scale is different!), and a darn spot better than relying on the crooks to report themselves!