Graph analytics – the new super power

Graph analytics – is it just hype, or is it technology that has come of age?  Mike Hoskins, CTO of Actian, sums it up well in this article from InfoWorld:
“One area where graph analytics particularly earns its stripes is in data discovery. While most of the discussion around big data has centered on how to answer a particular question or achieve a specific outcome, graph analytics enables us, in many cases, to discover the “unknown unknowns” — to see patterns in the data when we don’t know the right question to ask in the first place.”

Read Mike’s full article at InfoWorld.

In the remainder of this post I outline a few more of my thoughts on this topic and give you pointers to some more resources to help you understand what to do next.

Continue reading “Graph analytics – the new super power”

HP Scanning on OSX Yosemite [SOLVED]

It was time to solve my HP scanning problem on OSX Yosemite.  My HP LaserJet CM1312 MFP multifunction printer is getting a bit old, or at least HP’s support for it is falling by the wayside. Don’t bother with the HP software, as it’s built for 10.7. It did work at one point on this laptop but then gave up entirely.

I spent too long trying to find new drivers that enable scanning, with no luck, until I found Vuescan.  There’s a free trial (with a watermark).  I’m sure it supports lots of scanners – just try it and see before paying.

After a $29 PayPal payment and a 7MB download I was scanning instantly – I didn’t even have to point it at my printer.

SPARQL Query for Graph Density Analysis

Friends graph sample viewed in Gephi
SPARQLcity Graph Analytics Engine
SPARQLverse is the graph analytics engine produced by SPARQLcity. Standards compliant and super fast!

I’ve been spending a lot of time this past year running queries against the open source SPARQLverse graph analytics engine.  It’s amazing how simple some queries can look, and yet how much work is being done behind the scenes.

My current project requires building up a set of query examples that allow typical kinds of graph/network analytics – starting with the kinds of queries needed for Social Network Analysis (SNA), i.e. find friends of friends, graph density and more.

In this post I go through computing graph density in detail. Continue reading “SPARQL Query for Graph Density Analysis”
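As background for the full post, here’s the standard density definition (my summary of the textbook formula, not a quote from the post itself): density is the ratio of edges present to edges possible.

```latex
% Density of a directed graph G=(V,E): edges present over edges possible
D = \frac{|E|}{|V|\,(|V|-1)}
% Undirected case, where each pair counts once:
% D = \frac{2|E|}{|V|\,(|V|-1)}
```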

Kafka Topic Clearing after Producing Messages

[UPDATE: Check out the Kafka Web Console to more easily administer your Kafka topics]

This week I’ve been working with the Kafka messaging system in a project.

Basic C# Methods for Kafka Producer

To publish to Kafka I built a C# app that uses the Kafka4n libraries – it doesn’t get much simpler than that.

I was reading from various event and performance monitoring logs and pushing them through just fine. Continue reading “Kafka Topic Clearing after Producing Messages”
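The post’s actual fix sits behind the link above, but as a rough sketch: with the Kafka 0.8-era command-line tools, clearing a topic usually meant deleting and recreating it. The topic name and ZooKeeper address below are placeholders, and the broker must have delete.topic.enable=true.

```shell
# Hypothetical topic and ZooKeeper address - adjust for your cluster.
TOPIC="perfmon-logs"
ZK="localhost:2181"

# Delete the topic (requires delete.topic.enable=true on the broker),
# then recreate it empty.
kafka-topics.sh --zookeeper "$ZK" --delete --topic "$TOPIC" || true
kafka-topics.sh --zookeeper "$ZK" --create --topic "$TOPIC" \
  --partitions 1 --replication-factor 1 || true
```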

Twitter cards made easy in WordPress

Just a note to self and others using WordPress – the Yoast WordPress SEO plugin did the trick to get Twitter cards working well, with only a couple of clicks.   If you are tempted to hack your own HTML headers, you are going in the wrong direction – follow this tutorial instead: http://www.wpbeginner.com/

Thanks WPBeginner and Yoast!

Social Graph – Year in Review (Preparation)

Graph developed from Tyler Mitchell's LinkedIn connections.

I pulled this visualization of my LinkedIn social graph together in just a few minutes while working through a tutorial.  What about other social networks?  Give me your input… Continue reading “Social Graph – Year in Review (Preparation)”

Data Sharing Saved My Life – or How an Insurer Reduced My Healthcare Claim Costs

Wellness FX chart of ALT
A chart showing some of my earlier tests - loaded into WellnessFX.com for visualisation.

It’s not every day that you receive snail mail with life-changing information in it, but when it does come, it can come from the unlikeliest sources.

Healthcare data shown in a list of bio sample test results
My initial test results showing problems with the liver

A year ago, when doing a simple change of health insurance vendors, I had to give the requisite blood sample.  I knew the drill… nurse comes to the house, takes blood, a month later I get new insurance documents in the mail.

But this time the package included something new: the results of my tests.

The report was a list of 13 metrics and their values, including a brief description of what they meant and what my scores should be.  One in particular was outside the norm.  My ALT score, a marker of liver function, was about 50% higher than the expected range.

Simple Data Can Be Valuable

Here is the key point: I then followed up with my family doctor, with data in hand.  I did not have to wait to see symptoms of a systemic issue and get him to figure it out. We had a number, right there, in black and white. Something was wrong.

Naturally, I had a follow-up test to see if it was just a blip.  However, my second test showed even worse results – twice as high, in fact!  This led to an ultrasound and more follow-up tests.

In the end, I had (non-alcoholic) Fatty Liver Disease.  Though it’s most commonly seen in alcoholics, it came as a surprise since I don’t drink.  It was due solely to my diet and the weight I had put on over several years.

It was a breaking point for my system and the data was a big red flag calling me to change before it was too late.

Not impressed with my weight or my other scores, I made simple but dramatic changes to improve my health.*  The changes were so dramatic that my healthcare provider was very curious about my methods.

By changing my diet alone I was able to get my numbers to a healthy level in just a few months.  In the process I lost 46 pounds in 8 months and recovered from various other symptoms.  The pending train wreck was averted.

Long Term Value in Sharing Healthcare Data

It’s been one year this week, so I’m celebrating – and it’s thanks to Manulife, or whoever does their lab tests, for taking the initiative to send me my results.

It doesn’t take long to see the business value in doing so, does it?   I took action on the information and now I’m healthier than I have been in almost 20 years.  I have fewer health issues, will use their systems less, will cost them less money, etc.

Ideally it benefits the group plan I’m in too, as a lower-cost user of the system.  I hope both insurers and employers take this to heart and follow suit, giving people the data they need to make life-changing, cost-reducing decisions like this.

One final thought… how many people are taking these tests right now?  Just imagine what you could do with a bit of data analysis of their results.  With these types of test results, companies could be making health predictions for their customers and health professionals to review.  That’s why I’m jumping onto “biohacking” sites like WellnessFX.com to track all my scores these days, and to get expert advice on next steps or access to additional services.

I’m happy with any data sharing, but why give me just the raw data when I still have to interpret it?  I took the initiative to act on the results, but what if I had needed more incentive?  If I had been told “lower your ALT or your premiums will be 5% higher,” I would have appreciated that.

What’s your price?  If your doctor or insurer said “do this and save $100” – would you do it?  What if they laid the data out before you and showed you where your quality of life was headed, would it make a difference to you?

I’m glad I had this opportunity to improve my health, but at this point I just say thanks for the data … and pass the salad please!

Tyler


* I transitioned to a Whole Food – Plant Based diet (read Eat to Live and The China Study).  You can read more about the massive amounts of nutrition science coming out every year at NutritionFacts.org or read research papers yourself.

HBase queries from Bash – a couple simple REST examples

Screenshot of HBase REST query examples

Learn how to do some simple queries to extract data from the Hadoop/HDFS-based HBase database using its REST API.

Are you getting stuck trying to figure out HBase queries via the REST API?  Me too.  The main HBase docs are pretty limited in terms of examples – I guess it’s all there, just not that easy for new users to understand.

As an aside, during my searches for help I also wanted to apply filters – if you’re interested in HBase filters, you’ll want to check out Marc’s examples here.

What docs do you find most useful?  Leave a comment.  Should someone write more books or something else?

My Use Cases

There were two things I wanted to do – query HBase via REST to see if a table exists (before running the rest of my script, for example).  Then I wanted to grab the latest timestamp from that table.  Here we go…

Does a specific table exist in HBase?

First, checking if a table exists can be done in a couple of ways.  The simplest is to request the table name with the “exists” path after it and see what result comes back.

Here I use curl’s “-i” option to return the detailed info/headers so I can see the HTTP response (200 vs 404).  The plain-text result is either blank (if the table exists) or “Not found” if it does not.
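A sketch of that request – the endpoint and table name here are assumptions for illustration:

```shell
# Assumed HBase REST (Stargate) endpoint and table name.
HBASE_REST="http://localhost:8080"
TABLE="mytable"

# -i includes the response headers so the HTTP status is visible:
# 200 OK with an empty body if the table exists,
# 404 Not Found with "Not found" in the body if it does not.
curl -s -i "$HBASE_REST/$TABLE/exists" || true
```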

Let’s roll it into a simple Bash script and use a wildcard search to see if the negative status is found:
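Something like this – same assumed endpoint, with the script pattern-matching for the negative response text:

```shell
# Assumed HBase REST endpoint and table name.
HBASE_REST="http://localhost:8080"
TABLE="mytable"

RESPONSE=$(curl -s "$HBASE_REST/$TABLE/exists" || true)

# Wildcard match on the body - "Not found" means the table is absent.
case "$RESPONSE" in
  *"Not found"*) echo "Table $TABLE does not exist - stopping." ;;
  *)             echo "Table $TABLE exists - carrying on." ;;
esac
```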

Extract a timestamp from an HBase scanner/query

Now that I know the table exists, I want to get the latest timestamp value from it.  I thought I’d need to use some filter attributes like I do in HBase shell:
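Something along these lines – the table name is a placeholder, and REVERSED scans need a reasonably recent HBase (0.98+):

```shell
# Grab the newest row from the HBase shell by scanning in reverse
# and limiting to one result (placeholder table name).
SCAN_CMD="scan 'mytable', {LIMIT => 1, REVERSED => true}"
echo "$SCAN_CMD" | hbase shell || true
```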

With curl, you instead use HBase scanner techniques (covered in what seems to be the shortest section of the official docs).

It’s a two-stage operation – first you initialise a scanner session, then you request the results.  Bash can obviously help pull the results together easily for you, but let’s go step by step:
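Stage one looks roughly like this (endpoint and table name are assumptions):

```shell
# Assumed HBase REST endpoint and table name.
HBASE_REST="http://localhost:8080"
TABLE="mytable"

# Create a scanner session; the batch attribute in the XML body
# limits how many records come back per fetch.
curl -s -i -X PUT -H "Content-Type: text/xml" \
  -d '<Scanner batch="1"/>' \
  "$HBASE_REST/$TABLE/scanner" || true
```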

Note the XML chunk in the statement that tells it how many records to return in the batch.  That’s as simple as it gets here!

Amongst the output of this command you’ll see a Location value returned – this is the URL to use to access the results of the query.  Results are truncated and line-wrapped so you can see the meaningful bits:
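Stage two, sketched with a made-up scanner URL of the kind the Location header returns:

```shell
# The scanner URL below is fabricated for illustration - use the
# one from your own Location header.
SCANNER_URL="http://localhost:8080/mytable/scanner/140000000000abcdef"

curl -s "$SCANNER_URL" || true
# Abbreviated response; row keys, column names, and cell values are
# base64-encoded in the XML:
# <CellSet>
#   <Row key="cm93MQ==">
#     <Cell column="Y2Y6Y29sMQ==" timestamp="1419986606000">dmFsdWUx</Cell>
#   </Row>
# </CellSet>
```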

Ugh, XML… if you want JSON instead, just add an Accept header to the request:
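For example (same fabricated scanner URL as above):

```shell
SCANNER_URL="http://localhost:8080/mytable/scanner/140000000000abcdef"

# Ask for JSON instead of XML; in the JSON form the "$" field
# holds the (base64-encoded) cell value.
curl -s -H "Accept: application/json" "$SCANNER_URL" || true
```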

For now we’ll hack some sed to get to the value we want – first for the JSON response, then for the XML response.  Just pipe the curl command into this sed:
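A minimal sketch, run here against inlined sample responses so the sed patterns are easy to try out:

```shell
# Sample responses in the shapes the HBase REST API returns.
JSON='{"Row":[{"key":"cm93MQ==","Cell":[{"column":"Y2Y6Y29sMQ==","timestamp":1419986606000,"$":"dmFsdWUx"}]}]}'
XML='<CellSet><Row key="cm93MQ=="><Cell column="Y2Y6Y29sMQ==" timestamp="1419986606000">dmFsdWUx</Cell></Row></CellSet>'

# Strip everything except the timestamp digits.
echo "$JSON" | sed -e 's/.*"timestamp":\([0-9]*\).*/\1/'
echo "$XML"  | sed -e 's/.*timestamp="\([0-9]*\)".*/\1/'
# Both print: 1419986606000
```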

Now you can create a basic script that grabs the latest timestamp from the HBase query and decides what to do with it.  Here we just assign it to a variable and let you go back and implement as needed.
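Putting the pieces together – the endpoint and table are still assumptions, and error handling is omitted:

```shell
# Assumed HBase REST endpoint and table name.
HBASE_REST="http://localhost:8080"
TABLE="mytable"

# Open a scanner and pull the scanner URL out of the Location header.
LOCATION=$(curl -s -i -X PUT -H "Content-Type: text/xml" \
  -d '<Scanner batch="1"/>' "$HBASE_REST/$TABLE/scanner" \
  | sed -n 's/^Location: //p' | tr -d '\r')

# Fetch the batch as JSON and extract the timestamp into a variable.
LATEST_TS=$(curl -s -H "Accept: application/json" "$LOCATION" \
  | sed -e 's/.*"timestamp":\([0-9]*\).*/\1/')

echo "Latest timestamp: $LATEST_TS"
```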

Thanks for reading!

If you like this, follow me on Twitter at http://twitter.com/1tylermitchell or with any of the other methods below.

Running Gephi graph visualization on OSX Mavericks (10.9.5)

Gephi visualization running on OSX
Gephi running on OSX - showing Tyler's social network from LinkedIn

Having trouble launching the latest Gephi on OSX?  I’m running Mavericks, but I’m sure this will also help others who have upgraded or who are still running older versions of OSX.

From the command line, use the jdkhome parameter when launching Gephi and point it at the system Java 1.6 install:
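For example – the paths below are typical for Gephi 0.8.x and Apple’s Java 6 on OSX, but both are assumptions to adjust for your machine:

```shell
# Assumed install locations - adjust as needed.
GEPHI="/Applications/Gephi.app/Contents/MacOS/gephi"
JAVA6="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home"

# Launch Gephi against the Java 1.6 runtime.
"$GEPHI" --jdkhome "$JAVA6" || true
```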


Analytics Dashboard – Kibana 3 – a few quick tips

After you’ve loaded log files into Elasticsearch, you can start to visualize them using the Kibana web app and build your own dashboard. While using Kibana for a week or so, I found it tricky to find the docs or tutorials to get up to speed quickly with some of the more advanced/hidden features.

In this Kibana dashboard video:

  1. build TopN automated classification queries
  2. view the TopN values of a particular column from the table panel
  3. manually create multiple queries to appear as series in your charts