This week I’ve been working with the Kafka messaging system in a project.

Basic C# Methods for Kafka Producer

To publish to Kafka I built a C# app that uses the Kafka4n libraries – it doesn’t get much simpler than this:

I was reading from various event and performance monitoring logs and pushing them through just fine.

Basic Python Kafka Consumer

For watching the realtime feed, I created a consumer on a Linux machine using the Python kafka-python package:
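A minimal sketch of a consumer along these lines, assuming a broker at localhost:9092 and a topic called "logs" (both placeholders of mine). It uses the KafkaConsumer class from current kafka-python releases, which may differ from the API my original script was written against:

```python
def format_message(offset, value):
    """Render one Kafka message as a printable line."""
    return "{}: {}".format(offset, value.decode("utf-8", errors="replace"))


def watch_topic(topic="logs", servers="localhost:9092", limit=None):
    """Print messages from `topic` as they arrive (needs a running broker)."""
    # Imported here so the helper above works without kafka-python installed.
    from kafka import KafkaConsumer  # pip install kafka-python

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        # With no stored offset, start from the oldest retained message.
        auto_offset_reset="earliest",
    )
    try:
        for count, message in enumerate(consumer, start=1):
            print(format_message(message.offset, message.value))
            if limit is not None and count >= limit:
                break
    finally:
        consumer.close()
```

Calling watch_topic() with no arguments keeps printing until interrupted; passing limit=10 stops after ten messages.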

This worked great until I started pushing in a lot of data both in size and quantity.  Eventually I started getting an error that seems to relate to the max size my consumer could request:

So I tweaked my publisher to make sure it wasn’t putting in really large messages (which wasn’t needed for my application anyway) and then cleared the Kafka topic.

Clearing Kafka Topics with Python

After trying a few different approaches to clearing the topic, I found this python approach to be simplest, using the zc.zk module.  First I listed the contents from Zookeeper:

This showed the topics and consumers that were of interest to me.  I found that deleting topics was not enough; I also had to reset the consumer data here so my script would not try to pick up where it left off:

To delete the topic and consumers, it only takes a couple more commands:
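A hedged sketch of that clean-up. It uses the kazoo client instead of zc.zk, and assumes ZooKeeper at localhost:2181, a topic named "logs" and a consumer group named "my-group" (all placeholders). The /brokers/topics and /consumers paths are where ZooKeeper-based Kafka releases keep this state:

```python
def kafka_znodes(topic, group):
    """ZooKeeper paths for a topic's metadata and a consumer group's state."""
    return ["/brokers/topics/" + topic, "/consumers/" + group]


def clear_topic(topic, group, hosts="localhost:2181"):
    """List the registered topics, then delete the topic and group nodes."""
    # Imported here so kafka_znodes() works without kazoo installed.
    from kazoo.client import KazooClient  # pip install kazoo

    zk = KazooClient(hosts=hosts)
    zk.start()
    try:
        print(zk.get_children("/brokers/topics"))  # topics currently registered
        for path in kafka_znodes(topic, group):
            if zk.exists(path):
                zk.delete(path, recursive=True)
    finally:
        zk.stop()
```

For example, clear_topic("logs", "my-group") would remove both nodes if they exist.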

Then I started streaming new data back into my topic which will be auto-created.

Increasing Buffer Size

The default buffer size for my Python consumer script was set to some small size that prevents it from getting too many messages.

To fix this I added one more line to my Python consumer script, setting the max buffer size.  See what the default is by getting the value for:

Here I set it to 1 MB and it streams on happily for now.  Setting it to zero seems to let it go infinitely:
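For comparison, a sketch of how a fetch-size limit can be set with the KafkaConsumer class in current kafka-python releases. This is my assumption about the nearest modern equivalent, not the attribute used in my original script: max_partition_fetch_bytes caps how much data is fetched per partition in one request, and the broker address is a placeholder:

```python
ONE_MB = 1024 * 1024


def consumer_settings(max_fetch_bytes=ONE_MB, servers="localhost:9092"):
    """Keyword arguments for KafkaConsumer with an explicit fetch-size cap."""
    return {
        "bootstrap_servers": servers,
        "max_partition_fetch_bytes": max_fetch_bytes,
    }


def build_consumer(topic, **overrides):
    """Create a consumer with the settings above (needs a running broker)."""
    # Imported here so consumer_settings() works without kafka-python installed.
    from kafka import KafkaConsumer  # pip install kafka-python

    settings = consumer_settings()
    settings.update(overrides)
    return KafkaConsumer(topic, **settings)
```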


Data Sharing Saved My Life – or How an Insurer Reduced My Healthcare Claim Costs

It’s not every day that you receive snail mail with life-changing information in it, but when it does come, it can come from the unlikeliest sources.

Healthcare data shown in a list of bio sample test results
My initial test results showing problems with the liver

A year ago, when doing a simple change of health insurance vendors, I had to give the requisite blood sample.  I knew the drill… nurse comes to the house, takes blood, a month later I get new insurance documents in the mail.

But this time the package included something new: the results of my tests.

The report was a list of 13 metrics and their values, including a brief description about what they meant and what my scores should be.  One in particular was out of the norm.  My ALT score, which helps measure liver malfunction, was about 50% higher than the expected range.

Simple Data Can Be Valuable

Here is the key point: I then followed up with my family doctor, with data in hand.  I did not have to wait to see symptoms of a systemic issue and get him to figure it out. We had a number, right there, in black and white. Something was wrong.

Naturally, I had a follow-up test to see if it was just a blip.  However, my second test showed even worse results, twice as high in fact!  This led to an ultrasound and more follow-up tests.

In the end, I had (non-alcoholic) Fatty Liver Disease.  Most commonly seen in alcoholics, it was a surprise as I don’t drink.  It was solely due to my diet and the weight I had put on over several years.

It was a breaking point for my system and the data was a big red flag calling me to change before it was too late.

Wellness FX chart of ALT
A chart showing some of my earlier tests – loaded into WellnessFX.com for visualisation.

Not impressed with my weight nor all my other scores, I made simple but dramatic changes to improve my health.*  Changes were so dramatic that my healthcare provider was very curious about my methods.

By only making changes to my diet I was able to get my numbers to a healthy level in just a few months.  In the process I lost 46 pounds in 8 months and recovered from various other symptoms.  The pending train wreck is over.

Long Term Value in Sharing Healthcare Data

It’s been one year this week, so I’m celebrating, and it is thanks to Manulife, or whoever does their lab tests, for taking the initiative to send me my lab results.

It doesn’t take long to see the business value in doing so, does it?   I took action on the information and now I’m healthier than I have been in almost 20 years.  I have fewer health issues, will use their systems less, will cost them less money, etc.

Ideally it benefits the group plan I’m in too as a lower cost user of the system.  I hope both insurers and employers take this to heart and follow suit to give the data their people need to make life changing and cost reducing decisions like this.

One final thought.. how many people are taking these tests right now?  Just imagine what you could do with a bit of data analysis of their results.  Taking these types of test results, companies could be making health predictions for their customers and health professionals to review.  That’s why I’m jumping onto “biohacking” sites like WellnessFX.com to track all my scores these days and to get expert advice on next steps or access to additional services.

I’m so happy with any data sharing, but why give me just the raw data when I still have to interpret it?  I took some initiative to act on the results, but what if I had needed more incentive?  If I had been told “Lower your ALT or your premiums will be 5% higher” I would have appreciated that.

What’s your price?  If your doctor or insurer said “do this and save $100” – would you do it?  What if they laid the data out before you and showed you where your quality of life was headed, would it make a difference to you?

I’m glad I had this opportunity to improve my health, but at this point I just say thanks for the data … and pass the salad please!


* I transitioned to a Whole Food – Plant Based diet (read Eat to Live and The China Study).  You can read more about the massive amounts of nutrition science coming out every year at NutritionFacts.org or read research papers yourself.

HBase queries from Bash – a couple simple REST examples

Learn how to do some simple queries to extract data from the Hadoop/HDFS based HBase database using its REST API.

Are you getting stuck trying to figure out how to query HBase via the REST API?  Me too.  The main HBase docs are pretty limited in terms of examples, but I guess it’s all there, just not that easy for new users to understand.

As an aside, during my searches for help I also wanted to apply filters – if you’re interested in HBase filters, you’ll want to check out Marc’s examples here.

What docs do you find most useful?  Leave a comment.  Should someone write more books or something else?

My Use Cases

There were two things I wanted to do – query HBase via REST to see if a table exists (before running the rest of my script, for example).  Then I wanted to grab the latest timestamp from that table.  Here we go…

Does a specific table exist in HBase?

First, checking if a table exists can be done in a couple of ways.  The simplest is to request the table name with the “exists” path after it and see what result you get back.

Here I use the curl “-i” option to return the detailed info/headers so I can see the HTTP responses (200 vs 404).  The plain text result from the command is either blank (if the table exists) or “Not found” (if it does not).

Let’s roll it into a simple Bash script and use a wildcard search to see if the negative status is found:
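The same check can also be sketched in Python instead of Bash, using only the standard library. The base URL http://localhost:8080 and the table name are placeholders of mine; the 200-versus-404 logic mirrors the curl responses described above:

```python
import urllib.error
import urllib.request


def exists_url(base_url, table):
    """URL of the REST "exists" resource for a table."""
    return "{}/{}/exists".format(base_url.rstrip("/"), table)


def table_exists(base_url, table):
    """True on HTTP 200, False on 404; any other error is re-raised."""
    try:
        with urllib.request.urlopen(exists_url(base_url, table)) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

A script could then start with something like: if not table_exists("http://localhost:8080", "mytable"): raise SystemExit("table missing").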

Extract a timestamp from an HBase scanner/query

Now that I know the table exists, I want to get the latest timestamp value from it.  I thought I’d need to use some filter attributes like I do in HBase shell:

To do this with curl, you use HBase scanner techniques (the shortest section in the official docs, it seems).

It’s a two-stage operation – first you initialise a scanner session, then you request the results.  Bash can obviously help pull the results together easily for you, but let’s go step by step:

Note the XML chunk in the statement that tells it how many records to return in the batch.  That’s as simple as it gets here!

Amongst the results of this command you’ll see the Location value returned; this is the URL to use to access the results of the query.  Results are truncated and line-wrapped so you can see the meaningful bits:

Ugh, XML... if you want JSON instead, just add an Accept header to the request:

For now we’ll hack some sed to get to the value we want, first for the JSON response, second for the XML response.  Just pipe the curl command into this sed:

Now you can create a basic script that grabs the latest timestamp from the HBase query and decides what to do with it.  Here we just assign it to a variable and let you go back to implement as needed.
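The whole two-stage flow can also be sketched in Python rather than curl and sed. The base URL and table name are placeholders, and the JSON layout assumed here (a "Row" list whose entries hold "Cell" lists with a "timestamp" field) is my reading of the REST output, so check it against your own responses:

```python
import json
import urllib.request

SCANNER_XML = b'<Scanner batch="1"/>'  # ask for one cell per batch


def open_scanner(base_url, table):
    """Stage 1: create a scanner and return its Location URL."""
    request = urllib.request.Request(
        "{}/{}/scanner".format(base_url.rstrip("/"), table),
        data=SCANNER_XML,
        method="PUT",
        headers={"Content-Type": "text/xml"},
    )
    with urllib.request.urlopen(request) as response:
        return response.headers["Location"]


def fetch_batch(scanner_url):
    """Stage 2: fetch the next batch of results as parsed JSON."""
    request = urllib.request.Request(
        scanner_url, headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        body = response.read()
    # An exhausted scanner returns an empty body.
    return json.loads(body) if body else {"Row": []}


def latest_timestamp(cell_set):
    """Largest cell timestamp in a JSON result, or None if it has no cells."""
    stamps = [
        cell["timestamp"]
        for row in cell_set.get("Row", [])
        for cell in row.get("Cell", [])
    ]
    return max(stamps) if stamps else None
```

Chained together, that is latest_timestamp(fetch_batch(open_scanner("http://localhost:8080", "mytable"))).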

Thanks for reading!

Running Gephi graph visualization on OSX Mavericks (10.9.5)

Gephi visualization running on OSX
Gephi running on OSX – showing Tyler’s social network from LinkedIn

Having trouble launching latest Gephi on OSX?  I’m running Mavericks but I’m sure this will help others who have upgraded or who are still running older versions of OSX.

From the command line, use the jdkhome parameter when launching Gephi and point it to the system Java 1.6 install:


Analytics Dashboard – Kibana 3 – a few short quick tips

After you’ve loaded log files into elasticsearch you can start to visualize them using the Kibana web app and build your own dashboard. While using Kibana for a week or so, I found it tricky to find the docs or tutorials to get me up to speed quickly with some of the more advanced/hidden features.

In this video I share how to:

  1. build TopN automated classification queries
  2. view the TopN values of a particular column from the table panel
  3. manually create multiple queries to appear as series in your charts

Scripts to beat up Windows – using Powershell

This is just a shoutout to Luke Brennan’s Technet blog. This old post of his gave me some simple to use and easy to understand Powershell scripts for stress testing a Windows machine.

Saturate the CPU and fill up all your RAM easily:

Thanks for sharing Luke!

Supertunnels with SSH – multi-hop proxies

I never know what to call this process, so I’m inventing the term supertunnels via SSH for now. A lot of my work these days involves using clusters built on Amazon EC2 cloud environment. There, I have some servers that are externally accessible, i.e. web servers. Then there are support servers that are only accessible “internally” to those web servers and not accessible from the outward facing public side of the network, i.e. Hadoop clusters, databases, etc.

To help log into the “internal” machines, I have pretty much one choice – using SSH through the public machine first. No problem here, any server admin knows how to use SSH – I’ve been using it forever. However, I didn’t really use some of the more advanced features that are very helpful. Here are two…

Remote command chaining

Most of my SSH usage is for running long sessions on a remote machine. But you can also pass a command as an argument and the results come directly back to your current terminal:

$ ssh user@host "ls /usr/lib"

Take this example one step further and you can actually inject another SSH command that gets into the “internal” side of the network.

This is starting to really sound like tunneling, though it’s somewhat manual and doesn’t redirect traffic from your client side. We’ll get to that later.

As an aside, in EC2-land you often use certificate (.pem key) files during SSH login, so you don’t need to have an interactive password exchange. You specify the file with the -i argument. If that’s how you run your servers (or with authorized_keys files) then you can push in multiple levels of additional SSH commands easily.

For example, here I log into ext-host1, then from there log into int-host2 and run a command:

$ ssh -i ~/mycert.pem user@ext-host1 "ssh -i ~/mycert.pem user@int-host2 'ls /usr/lib'"

That is a bit of a long line for just getting a file listing, but it’s easy to understand and gets the job done quickly. It also works great in shell scripts, in fact you could wrap it up with a simple script to make it shorter.
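One way to wrap it up, sketched in Python rather than shell. The function names are mine, and the sketch assumes the same key file path is valid both locally and on the external host, as in the command above:

```python
import os
import shlex
import subprocess


def two_hop_command(key, jump_host, target_host, remote_command):
    """Build the argv for running a command on target_host via jump_host."""
    # The inner ssh line is parsed by the shell on jump_host, so the final
    # command is quoted to survive that extra round of parsing. The key path
    # is left unquoted there so the remote shell can expand "~".
    inner = "ssh -i {} {} {}".format(
        key, target_host, shlex.quote(remote_command)
    )
    # No local shell is involved, so expand "~" ourselves for the first hop.
    return ["ssh", "-i", os.path.expanduser(key), jump_host, inner]


def run_two_hop(key, jump_host, target_host, remote_command):
    """Run the command through both hops and return its output as text."""
    argv = two_hop_command(key, jump_host, target_host, remote_command)
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout
```

With this, the listing above becomes run_two_hop("~/mycert.pem", "user@ext-host1", "user@int-host2", "ls /usr/lib").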

Proxy config

Another way to make your command shorter and simpler is to add some proxy rules to the ~/.ssh/config file. I didn’t even know this file existed, so was thrilled to find out how it can be used.

To talk about this, let’s use the external and internal hosts as examples. And let’s assume that the internal host has an address on the 10.0.*.* network. Obviously these don’t need to be specifically public or private SSH endpoints, but it serves its purpose for this discussion.

If we are typically accessing int-host2 via ext-host1 then we can setup a Proxy rule in the config file:

Host 10.0.*.*
ProxyCommand ssh -i ~/mycert.pem user@ext-host1 -W %h:%p

This rule watches for any requests on the 10.0… network and automatically pushes the requests through ext-host1 as specified above. Furthermore, the -W option tells ssh to forward your terminal’s input and output to the target host and port (%h:%p), so the responses come back to the same terminal you are using. (Minor point, but if you miss it you may go crazy trying to find out where your responses go.)

Now I can do a simple login request on the internal host and not even have to think about how to get there.

ssh -i ~/mycert.pem user@int-host2

I think that’s a really beautiful thing – hope it helps!

Another time I’ll have to write more about port forwarding…

Converting Decimal Degree Coordinates

Converting Decimal Degree Coordinates to/from DMS Degrees Minutes Seconds

cs2cs command from GDAL/OGR toolset (gdal.org) – allows robust coordinate transformations.

If you have files or apps that have to filter or convert coordinates – then the cs2cs command is for you.  It comes with most distributions of the GDAL/OGR (gdal.org) toolset.  Here is one popular example for converting between degrees minutes and seconds (DMS) and decimal degrees (DD).

The following is an excerpt from the book: Geospatial Power Tools – Open Source GDAL/OGR Command Line Tools by me, Tyler Mitchell.  The book is a comprehensive manual as well as a guide to typical data processing workflows, such as the following short sample…

Input coordinates can come from the command line or an external file. Assuming a file containing DMS (degree, minute, seconds) style, looks like:

124d10'20"W 52d14'22"N
122d20'05"W 54d12'00"N

Use the cs2cs command, specifying how the print format will be returned, using the -f option. In this case -f "%.6f" is explicitly requesting a decimal degree number with 6 decimals:

cs2cs -f "%.6f" +proj=latlong +datum=WGS84 input.txt

Example Converting DMS to/from DD

This will return the results. Notice that no 3D/Z value was provided, so the third column comes back as zero:

-124.172222 52.239444 0.000000
-122.334722 54.200000 0.000000
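The arithmetic behind this conversion is simple enough to sketch in plain Python: degrees plus minutes/60 plus seconds/3600, negated for west or south. This only illustrates the formula, it is not how cs2cs is implemented, and the function name is mine:

```python
import re

# Degrees are required; minutes and seconds are optional, as in 54d12'N.
DMS_PATTERN = re.compile(r"""^(\d+)d(?:(\d+)')?(?:([\d.]+)")?([NSEW])$""")


def dms_to_dd(text):
    """Convert a DMS string such as 124d10'20"W to decimal degrees."""
    match = DMS_PATTERN.match(text.strip())
    if match is None:
        raise ValueError("not a DMS coordinate: {!r}".format(text))
    degrees, minutes, seconds, hemisphere = match.groups()
    value = int(degrees) + int(minutes or 0) / 60 + float(seconds or 0) / 3600
    # West and south are negative by convention.
    return -value if hemisphere in "WS" else value


for line in ['124d10\'20"W 52d14\'22"N', '122d20\'05"W 54d12\'00"N']:
    lon, lat = (dms_to_dd(part) for part in line.split())
    print("{:.6f} {:.6f}".format(lon, lat))
# -124.172222 52.239444
# -122.334722 54.200000
```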

To do the inverse, remove the formatting option and provide a list of values in decimal degrees (DD):

cs2cs +proj=latlong +datum=WGS84 inputdms.txt
124d10'19.999"W 52d14'21.998"N 0.000
122d20'4.999"W 54d12'N 0.000

Geospatial Power Tools is 350+ pages long – 100 of those pages cover these kinds of workflow topic examples. Each copy includes a complete (edited!) set of the GDAL/OGR command line documentation as well as the following topics/examples:

Workflow Table of Contents

  1. Report Raster Information – gdalinfo
  2. Web Services – Retrieving Rasters (WMS)
  3. Report Vector Information – ogrinfo
  4. Web Services – Retrieving Vectors (WFS)
  5. Translate Rasters – gdal_translate
  6. Translate Vectors – ogr2ogr
  7. Transform Rasters – gdalwarp
  8. Create Raster Overviews – gdaladdo
  9. Create Tile Map Structure – gdal2tiles
  10. MapServer Raster Tileindex – gdaltindex
  11. MapServer Vector Tileindex – ogrtindex
  12. Virtual Raster Format – gdalbuildvrt
  13. Virtual Vector Format – ogr2vrt
  14. Raster Mosaics – gdal_merge

