New Tricks: Data Mining With Google Spreadsheets

March 22, 2010

Happily, I stumbled across the following link:

Now You Can Mine Data With Google Queries Too

The interesting bit is below the comic where they actually reveal a method I hadn’t thought of: 

Using a query embedded in Google Spreadsheets to mine and graph data in Google’s engine.

While the instructions there are terse, I was able to get things up and running by visiting the linked example and then copying and pasting the individual cells for examination.

Here is the blow by blow:

First, decide what you want to mine.  One of the examples is for income; we will use that one here.

Open up Google Spreadsheets and in cell A2 put (exactly as printed here):

=""""&"I make $"&B2&" per year"""

[NOTE: WordPress jacks up the quotes, so if any of the quotes above render as curly "smart" quotes, replace every one of them with a straight double quote, or it won't work!]

Initially, with B2 still empty, it is gonna look like this: "I make $ per year".
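The quadruple-quote trick is easier to see outside the spreadsheet. Here is the same string-building sketched in Python (just an illustration; the phrase format mirrors the formula above):

```python
def build_phrase(income):
    # Surround the statement with literal double quotes so the search
    # engine treats it as an exact phrase -- the same job the doubled-up
    # quotes do in the spreadsheet formula.
    return '"I make $' + income + ' per year"'

print(build_phrase("45,000"))  # -> "I make $45,000 per year"
```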

Now in B2 put a dollar amount: 45,000.

You should see your number populate in A2 now.

Finally, the magic that actually gets the query info.

Put the following in C2:

=importXML("http://www.google.com/search?num=100&q="&A2,"//p[@id='resultStats']/b[3]")

[NOTE: Same problem here – WordPress tries to mess with the multiple quotes.  Replace all double AND single quotes manually with straight ones and you will be fine; otherwise you will get an Error.]

After a brief load time you should see a number returned.  This is the number of search results that matched the exact phrase in cell A2.
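For the curious, what the importXML XPath does can be sketched in Python too: grab the result-stats paragraph and pull out its third bold element. (A sketch only – Google's result markup changes frequently, so the pattern below is an illustrative assumption, not a stable interface.)

```python
import re

def parse_result_count(html):
    """Mimic the XPath //p[@id='resultStats']/b[3]: return the number
    in the third <b> inside the result-stats paragraph, or None."""
    stats = re.search(r"<p[^>]*id=['\"]resultStats['\"][^>]*>(.*?)</p>",
                      html, re.S)
    if not stats:
        return None
    # Collect every bolded number inside the paragraph.
    bolds = re.findall(r"<b>([\d,]+)</b>", stats.group(1))
    if len(bolds) < 3:
        return None
    # b[3] is 1-indexed in XPath, so take the third match.
    return int(bolds[2].replace(",", ""))
```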

Now copy and paste A2 and C2 down their columns, entering a different dollar amount in column B for each row.
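The fill-down step is also easy to script outside the spreadsheet. A sketch that builds one query URL per income bracket (the bracket values are just examples):

```python
from urllib.parse import quote_plus

def query_url(income):
    # Same phrase as cell A2, URL-encoded; num=100 matches the
    # search URL used in the C2 formula.
    phrase = '"I make $' + income + ' per year"'
    return "http://www.google.com/search?num=100&q=" + quote_plus(phrase)

for income in ["25,000", "45,000", "65,000", "85,000"]:
    print(income, query_url(income))
```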

To create the graph, simply open “Insert->Chart” and choose your graph type.

To populate the graph with your data, make sure to clear the box right under “What Data?” and then click and drag down column C on your spreadsheet.  Make sure to uncheck the option that uses Column C as labels.  You should see your data represented in the preview.

That’s it!  The world is now your oyster!  I can’t wait to apply this in some cases I am working on; I am still mulling over where it will be most useful, but the possibilities boggle the mind.


Financial Institutions Using Live Data Sets in Test Environments

March 16, 2010

A recently released survey finds 83% of financial firms use production data for testing.  What this means (for the non-developers) is that your customer data is used unmasked and in its full form to test systems that, by the very fact that they should be TEST systems, have an unknown level of security and integrity.

Even though the study was commissioned by a company that works specifically with data protection in test environments (important to call out bias!), I believe the numbers on this one – especially when I go back and research the number of financial institution data breaches that have occurred because “live” customer data sets were in the hands of a third party contractor, or other employee off-site.

I have done development work on health data and I understand full well the challenges of creating meaningful data sets (as well as the enormous expense) for testing purposes.  The bottom line comes to this:  There is no excuse that justifies exposing personal data in this manner.  Period.
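Masking is not hard to approximate in-house, either.  A minimal Python sketch of the idea (the field names and the salted-hash scheme are my own illustration, not anything from the study):

```python
import hashlib

def mask_record(record, secret="rotate-me-regularly"):
    """Replace direct identifiers with stable pseudonyms so test data
    keeps its shape (joins and lookups still work) without exposing
    the real customer.  Field names here are hypothetical."""
    masked = dict(record)
    for field in ("name", "ssn"):
        if field in masked:
            # Salted hash: deterministic for the same input, so the
            # same customer maps to the same test pseudonym.
            digest = hashlib.sha256(
                (secret + str(masked[field])).encode("utf-8")
            ).hexdigest()
            masked[field] = "TEST-" + digest[:12]
    return masked
```

Non-identifying fields pass through untouched, so the masked set still behaves like production data in the test environment.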

In performing penetration tests, a common tactic we use during the “recon” phase is to look for servers that are obviously development systems.  We do this because patch levels and security are typically at a minimum on these systems, making them the “low-hanging fruit”.

So it makes me wonder – just what justification can possibly make these guys think this is OK?

Here are some more stats from the study that should give you pause:

  • identity compliance procedures (used by only 56 percent of companies surveyed);
  • intrusion detection systems (used by only 47 percent of companies surveyed);
  • data loss prevention (DLP) technology (used by only 41 percent of companies surveyed); and
  • Social Security number usage (88 percent of those surveyed still use this as a primary identifier).

Remember these findings the next time you read a news release regarding a financial institution data breach and some chuckle-head says that they are quite certain no sensitive data was taken or misused.  The very next question to ask is: How would you even know?

Sources:

http://money.cnn.com/news/newsfeeds/articles/globenewswire/185342.htm

http://cpwr.client.shareholder.com/releasedetail.cfm?ReleaseID=448389


ID Theft: It’s Not Just For Credit Cards Anymore

March 10, 2010

George Jenkins, the writer of the “I’ve Been Mugged” blog (http://ivebeenmugged.typepad.com), writes about a recently released survey discussing medical identity theft.  While this has been going on for a while (I had my first case involving electronic MedID theft 8 years ago), it serves as an excellent proactive warning:  THINK about any and all information systems that you give your ID to, and QUESTION the flow of information.  We are no longer living in an age where blind trust/acceptance is acceptable.

The study was performed by the Ponemon Institute and sponsored by Experian.  One of the privacy analysts with Ponemon was quoted (emphasis added):

“The two results that stood out to me were the more than $20,000 average cost to consumers who suffered ID/credit fraud as a result of a medical data breach, as well as the potential for physical harm to those who have their medical records ‘polluted’ due to healthcare fraud,” says Mike Spinney, a senior privacy analyst at Ponemon Institute.

The residual issue of “physical harm” due to a corruption of medical records gives plenty to ponder – especially given the efforts to aggregate medical records in an electronic environment.  Also particularly interesting is the number of people who were aware they had a problem and did not report it.  I wonder about the psychology of that.

By the way – George is an exceptionally well-informed writer on these types of stories, and his blog is definitely worth a follow.

George Jenkins’ Link:

Survey: 5.8% Of US Adults Have Been Medical Identity Theft Victims