Transforming Cumulative Ceilometer Stats to Gauges

Overview

I’ve been dabbling a bit more with OpenStack as of late. If you know me, you can likely guess my goal: figuring out how to ingest logs, monitor resources, etc.

I’ve been trying to see how well Ceilometer, one of the core components of OpenStack that actually provides some of this stuff, would work. Initially, I was a bit bummed, but after fumbling around for a while, I am starting to see the light.

You see, the reason I almost abandoned the idea of using Ceilometer was due to the fact that some of the “meters” it provides are, well, a bit nonsensical (IMHO). For example, there’s network.outgoing.bytes, which is what you would expect… sort of. Turns out, this is a “cumulative” meter. In other words, this meter tells me the total number of bytes sent out a given Instance’s virtual NIC. Ceilometer has the following meter types:

  • Cumulative: Increasing over time (e.g.: instance hours)
  • Gauge: Discrete items (e.g.: floating IPs, image uploads) and fluctuating values (e.g.: disk I/O)
  • Delta: Changing over time (bandwidth)

Maybe I am naive, but it seems quite a bit more helpful to track this as a value based on a given period… you know, so I can get a hint of how much bandwidth a given instance is using. In Ceilometer parlance, this would be a delta metric.

I’ll take an aside here and defend the fine folks working on Ceilometer on this one. Ceilometer was initially built to generate non-repudiable billing info. Technically, AFAIK, that is the project’s only goal – though it has morphed to gain things like an alarm framework.

So, now you can see why network.outgoing.bytes would be cumulative: so you can bill a customer for serving up torrents, etc.

Anyway, I can’t imagine that I’m the only person looking to get a Delta metric out of a Cumulative one, so I thought I’d document my way of getting there. Ultimately there might be a better way, YMMV, caveat emptor, covering my backside, yadda yadda.

Transformers to the Rescue!

… no, not that kind of Transformer. Lemme ‘splain. No, there is too much. Let me sum up.

Actually, before we start diving in, let’s take a quick tour of Ceilometer’s workflow.

The general steps to the Ceilometer workflow are:

Collect -> (Optionally) Transform -> Publish -> Store -> Read

Collect

There are two methods of collection:

  1. Services (Nova, Neutron, Cinder, Glance, Swift) push data into AMQP and Ceilometer slurps them down
  2. Agents/Notification Handlers poll APIs of the services

This is where our meter data comes from.

Transform/Publish

This is the focus of this post. Transforms are done via the “Pipeline.”

The flow for the Pipeline is:

Meter -> Transformer -> Publisher -> Receiver

  • Meter: Data/Event being collected
  • Transformer: (Optional) Take meters and output new data based on those meters
  • Publisher: How you want to push the data out of Ceilometer
    • To my knowledge, there are only two options: RPC (rpc://), which hands samples back to the Ceilometer collector over the message bus, and UDP (udp://), which sends msgpack-encoded samples to a host:port of your choosing
  • Receiver: This is the system outside Ceilometer that will receive what the Publisher sends (Logstash for me, at least at the present – will likely move to StatsD + Graphite later on)

Store

While tangential to this post, I won’t leave you wondering about the “Store” part of the pipeline. Here are the storage options:

  • Default: Embedded MongoDB
  • Optional:
    • SQL
    • HBase
    • DB2

Honorable Mention: Ceilometer APIs

Like pretty much everything else in OpenStack, Ceilometer has a suite of APIs that can also be used to fetch metering data. I initially considered this route, but in the interest of efficiency (read: laziness), I opted to use the Publisher vs rolling my own code to call the APIs.

Working the Pipeline

There are two Transformers (at least that I see in the source):

  • Scaling
  • Rate of Change

In our case, we are interested in the latter, as it will give us the delta between two samples.

To change/use a given Transformer, we need to create a new pipeline via /etc/ceilometer/pipeline.yaml

Here is the default pipeline.yaml:

---
 -
     name: meter_pipeline
     interval: 600
     meters:
         - "*"
     transformers:
     publishers:
         - rpc://
 -
     name: cpu_pipeline
     interval: 600
     meters:
         - "cpu"
     transformers:
         - name: "rate_of_change"
           parameters:
               target:
                   name: "cpu_util"
                   unit: "%"
                   type: "gauge"
                   scale: "100.0 / (10**9 * (resource_metadata.cpu_number or 1))"
     publishers:
         - rpc://

The “cpu_pipeline” pipeline gives us a good example of what we will need:

  • A name for the pipeline
  • The interval (in seconds) that we want the pipeline triggered
  • Which meters we are interested in (“*” is a wildcard for everything, but you can also have an explicit list for when you want the same transformer to act on multiple meters)
  • The name of the transformation we want to use (scaling|rate_of_change)
  • Some parameters to do our transformation:
    • Name: Optionally used if you want to override the metric’s original name
    • Unit: Like Name, can be used to override the original unit (useful for things like converting network.*.bytes from B(ytes) to MB or GB)
    • Type: If you want to override the default type (remember they are: (cumulative|gauge|delta))
    • Scale: A snippet of Python for when you want to scale the result in some way (would typically be used along with Unit)
      • Side note: This one seems to be required, as when I omitted it, I got the value of the cumulative metric. Please feel free to comment if I goobered something up there.

Looking at all of this, we can see that the cpu_pipeline, er, pipeline:

  1. Multiplies the number of vCPUs in the instance (resource_metadata.cpu_number) times 1 billion (10^9, or 10**9 in Python)
    • Note the “or 1”, which is a catch for when resource_metadata.cpu_number doesn’t exist
  2. Divides 100 by the result

The end result is a value that tells us how taxed the Instance is from a CPU standpoint, expressed as a percentage.
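
To make the arithmetic concrete, here is a rough sketch in Python (fitting, since the scale expression itself is Python) of what I understand the rate_of_change transformer to be doing under the hood. The sample numbers are made up, and the actual internals may differ a bit:

# Hypothetical samples for a 2-vCPU instance; "cpu" is cumulative CPU time in nanoseconds
cpu_number = 2
prev_cpu_ns = 4.0e11   # cumulative value at the previous sample
curr_cpu_ns = 1.0e12   # cumulative value at the current sample
elapsed_s = 600        # seconds between the two samples (the pipeline interval)

# Rate of change: nanoseconds of CPU time burned per wall-clock second
rate_ns_per_s = (curr_cpu_ns - prev_cpu_ns) / elapsed_s   # 1e9 ns/s

# Apply the scale expression from pipeline.yaml to turn that rate into a percentage
scale = 100.0 / (10**9 * (cpu_number or 1))
cpu_util = rate_ns_per_s * scale                          # 50.0 (%)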

Bringing it all Home

Armed with this knowledge, here is what I came up with to get a delta metric out of the network.*.bytes metrics:

 -
     name: network_stats
     interval: 10
     meters:
         - "network.incoming.bytes"
         - "network.outgoing.bytes"
     transformers:
         - name: "rate_of_change"
           parameters:
               target:
                   type: "gauge"
                   scale: "1"
     publishers:
         - udp://192.168.255.149:31337

In this case, I’m taking the network.incoming.bytes and network.outgoing.bytes metrics and passing them through the “Rate of Change” transformer to turn what was previously a cumulative metric into a gauge.

I could have taken it a step further (and likely will) by using the scale parameter to change it from bytes to KB. For now, I am playing with OpenStack in a VM on my laptop, so the amount of traffic is small. After all, the difference between 1.1 and 1.4 in a histogram panel in Kibana isn’t very interesting looking :)
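
For reference, here is a rough sketch of what that bytes-to-KB variant might look like. I have not run this exact pipeline and the name is made up, but it follows the same pattern as the cpu_pipeline above (the scale string evaluates to a constant factor that gets applied to the computed rate):

 -
     name: network_stats_kb
     interval: 10
     meters:
         - "network.incoming.bytes"
         - "network.outgoing.bytes"
     transformers:
         - name: "rate_of_change"
           parameters:
               target:
                   unit: "KB"
                   type: "gauge"
                   scale: "1.0 / 1024.0"
     publishers:
         - udp://192.168.255.149:31337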

Oh, I forgot… the Publisher. Remember how I said the UDP Publisher uses msgpack to stuff its data in? It just so happens that Logstash has both a UDP input and a msgpack codec. As a result, my Receiver is Logstash – at least for now. Again, it would make a lot more sense to ship this through StatsD and use Graphite to visualize the data. But, even then, I can still use Logstash’s StatsD output for that. Decisions, decisions :)

Since the data is in Logstash, that means I can use Kibana to make pretty charts with the data.

Here are the bits I added to my Logstash config to make this happen:

input {
  udp {
    port => 31337
    codec => msgpack
    type => ceilometer
  }
}

At that point, I get lovely input in ElasticSearch like the following:

{
  "_index": "logstash-2014.01.16",
  "_type": "logs",
  "_id": "CDPI8-ADSDCoPiqY9YqlEw",
  "_score": null,
  "_source": {
    "user_id": "21c98bfa03f14d56bb7a3a44285acf12",
    "name": "network.incoming.bytes",
    "resource_id": "instance-00000009-41a3ff24-f47e-4e29-86ce-4f50a61f78bd-tap30829cd9-5e",
    "timestamp": "2014-01-16T21:54:56Z",
    "resource_metadata": {
      "name": "tap30829cd9-5e",
      "parameters": {},
      "fref": null,
      "instance_id": "41a3ff24-f47e-4e29-86ce-4f50a61f78bd",
      "instance_type": "bfeabe24-08dc-4ea9-9321-1f7cf74b858b",
      "mac": "fa:16:3e:95:84:b8"
    },
    "volume": 1281.7777777777778,
    "source": "openstack",
    "project_id": "81ad9bf97d5f47da9d85569a50bdf4c2",
    "type": "gauge",
    "id": "d66f268c-7ef8-11e3-98cb-000c29785579",
    "unit": "B",
    "@timestamp": "2014-01-16T21:54:56.573Z",
    "@version": "1",
    "tags": [],
    "host": "192.168.255.147"
  },
  "sort": [
    1389909296573
  ]
}

Finally, I can key a Histogram panel in Kibana on the “volume” field for these documents and graph the result, like so:

Screen Shot 2014-01-16 at 4.30.28 PM

OK, so maybe not as pretty as I sold it to be, but that’s the data’s fault – not the toolchain :) It will look much more interesting once I mirror this on a live system.

Hopefully someone out there on the Intertubes will find this useful and let me know if there’s a better way to get at this!


There’s Gold in Them Thar Logs!

Oh crap. Another post about logging…

Yep, get over it. This time, however, I’m not here to piss and moan about how much logging stinks. Rather, I want to show the power of extracting data from the logging you have.

Logstash + ElasticSearch + Kibana = Logging Bliss!

I’m not necessarily treading any new ground here, but for the initiated, let me quickly go over what is becoming a very popular logging technology stack in the DevOps space:

  • Logstash
  • ElasticSearch
  • Kibana

I will get into some of the deeper details later, but let’s take a quick tour of the components first.

Logstash

Logstash is the brain child of Jordan Sissel. At a high level, the goal of Logstash is to chew up your logs and spit out contextualized data from what would otherwise be blobs of text. Said another way, instead of mountains of plain-text log files, you get the same log content split out and stuffed into fields you define. This allows you to do all sorts of things with the logs, not the least of which is to perform focused searches in Kibana for exactly the data you need.

Going with the lumberjack analogy, think of Logstash as your lumber mill that takes your logs and spits out 2x4s, etc. There’s a reason why you don’t see tree trunks at Home Depot. Tree trunks do me no good. Studs and plywood, now that I can work with.

ElasticSearch

In this stack, ElasticSearch operates as our super-fast “database” to hold what Logstash kicks out (technically, it’s a search index, but let’s not nerd out too much here). This description sells both ElasticSearch and Logstash short (Logstash can output in many different formats, not just ElasticSearch), but I’m trying to stick to the solution at hand.

ElasticSearch is based on Apache Lucene – just like Apache Solr – and lives to gobble up data broken into fields, then query large volumes of that data near instantly. It also gives a rich query syntax, which Kibana takes advantage of.

Kibana

Kibana is where we see the fruits of Logstash’s and ElasticSearch’s efforts. Kibana is a slick web UI that queries ElasticSearch for various purposes:

  1. To search logs (again, allowing you to use Lucene search syntax)
  2. To make pretty charts
  3. To show the most common items in logs
  4. etc

As an example on the searching, think of the last time you had to visit your Apache or Tomcat access logs to play Columbo for some issue that happened. I am sure it was an exercise of grep, sed, awk, etc to find that needle in the haystack. Even then, you end up with a lot of stuff that is likely not relevant (heaven forbid you need to look for common words like “post,” “get,” or “user”). Now you can say things like, “Show me all the log statements where”:

  1. ‘user’ shows up in the referring URL
  2. The user got a 500 and the server took more than 500ms to fulfill the request
  3. My user ID exists, my boss’ user ID exists, and the word “raise” was in the request URL ;)
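
For a taste of the syntax, the second item above translates into a Kibana (Lucene) query string along these lines – using the custom field names I define later in this post:

responsecode:500 AND resp_time:>500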

And here’s a cool benefit: Normal human beings can now have access to logs with a “type what you’re looking for” interface, so, I dunno, maybe they can leave you alone to do more interesting things?

What is interesting about Kibana is that there is no server-side component. Kibana is all client-side code. All you need is a web server to pass along Kibana’s code to the browser, then the browser issues queries to ElasticSearch directly via ElasticSearch’s APIs.

While Kibana has a built-in dashboard for Logstash that allows you to search your logs, where things really get interesting is applying Kibana’s Panels to display data from your logs. Think of Panels as the Lego from which you construct your dashboard. Behind each of these is an ElasticSearch query that populates that Panel.

While I will cover all of the Panels in my dashboard, here is a teaser:

Screen Shot 2013-11-11 at 8.03.44 AM

My First Dashboard ™

Typically the best way to learn something is through an example. The primary reason I am writing this post is to help others take advantage of this tech stack, which I hope to do by walking through what I did – and shouldn’t have done – to get my Kibana dashboard in place.

Recall that the pipeline is like so:

Logging source (syslog/file/flume/etc) => Logstash => ElasticSearch

“Wait, where’s Kibana?” you might be asking. Remember that Kibana is just the UI to visualize data that Logstash, um, stashed in ElasticSearch. I’ll definitely cover Kibana in detail, but we need to start with Logstash, as that is where we add meaning to the text of our logging. Without that context, Kibana will be of limited use to us. Garbage in, garbage out – right?

Magic ain’t Free

Again, I am selling its full capabilities way short, but Logstash’s job here is to take a single line of text and – based on filters, either pre-defined or custom – assign chunks of that text to fields we define. We will use these fields later on not only to more intelligently query our logs, but to target specific data to be shown in Panels in Kibana. The general process looks like this:

  1. Be the logs
  2. Know the end goal
  3. Grok grok (no, that’s not a typo)
  4. Iterate, because you ain’t gonna get it right the first time

This will take work, but to crib from Alton Brown, “Your hard work will be rewarded.”

Be the Logs

To really get the value out of this stack, you need to know your logs, as well as what you want to get out of them.

At this stage, we just want to look at our target logging and see what kind of data might be useful. So, do the obvious: copy a typical log statement into a text editor and look at it in isolation. Logstash doesn’t have the benefit of context. It’s going to chop up those logs one line at a time, hence I find it best to start my analysis by looking at just one log statement.

As mentioned in #4 above, you are likely to find that your logging is not as “normalized” as you think, so you’ll likely have to revisit this step quite a few times along your journey.

Let’s take an example Apache access_log statement:

10.129.16.62 - - [11/Nov/2013:07:24:35 -0800] "GET /html/themes/default/images/sprites/core_sprite_left.png?t=74 HTTP/1.1" 304 - "https://my-server.cjchand.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36" 354

There is the following useful/interesting information in that single log statement (in order shown):

  • Client IP address
  • Datestamp
  • Request URL
  • HTTP status code
  • Referring URL
  • User Agent string
  • Response time

Of those, are there any items that are throwaways (e.g.: data that would not be interesting/useful as search or graphing criteria)? In this case, I would argue not, but especially with logging in custom applications, you can end up with some items that might be interesting when viewed with the log statement itself, but might not be something that we would need to filter/chart (e.g.: PID).

That said, at this point, err on the side of capturing everything you can. If you ultimately determine you don’t need that data, you can simply bit-bucket it via your grok pattern and/or Logstash config.
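
For example, a hypothetical snippet for tossing a captured-but-uninteresting field (say, a PID) uses Logstash’s mutate filter with the same remove option you will see in my config later:

filter {
  mutate {
    # Hypothetical: drop a field we captured but don't care about
    remove => [ "pid" ]
  }
}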

Similarly to knowing what meaningful data lives in your logs, it helps to know what Kibana can visualize for you. Here are the Panel types available (at the least the non-deprecated ones) at the time I am writing this:

  • Bettermap: Uses Longitude and Latitude coordinates to place a “pin” on a map
  • Column: A “wrapper” panel of sorts; used to control the layout of panels in a dashboard
  • Derivequeries: This panel auto-creates facets for a given field (e.g.: might create multiple queries for each HTTP status code seen in the current data set)
  • Histogram: A chart to show the count, total, mean, etc. of log statements (or of a numeric field within them) over the time range specified
  • Hits: Total count of lines matching a given query/set of queries
  • Terms: A “Top X” list of the most common terms found in log statements
    • Can be viewed as:
      • Chart: Pie or Bar
      • Table
  • Text: Panel to hold text or sanitized HTML (including in Markdown syntax)
  • Trends: Based on a user-configurable time frame, compare the data shown to the same data from an earlier period
    • Example: I am looking at 13:00-14:00 and I want to know – as a percentage – how many more/fewer 404’s I have seen from 13:00-14:00 today vs 13:00-14:00 yesterday

Know the End Goal

“Wait. Shouldn’t I start with this step?” Sure, you can definitely make that argument, but in many cases you have little control over what logging you get – which is often the case if you are not the one developing the code or controlling the deployment of the app.

As such, my philosophy is to see what I have to work with first. Then, I set my target for the final product.

YMMV on when you do this step, but you definitely need to perform it before you sit down to craft grok patterns.

Continuing with our Apache access log example, here are some potential target items we would want to craft grok patterns around – and why:

  • Search Filters:
    • All of these fields
  • Charting/Trending items:
    • Client IP address:
      • Top XX IP addresses
      • Geolocation of client
    • HTTP status code:
      • Histogram of unexpected statuses
    • User Agent String:
      • Client OS
      • Client browser
    • Response Time:
      • Histogram of mean response times (yes, I know that’s not statistically the best measure, but it’s what Kibana gives us. There are no confidence intervals, etc.)

Do you Grok Me?

Logstash comes with some built-in filters (called “grok patterns”). These cover common items like IP addresses, timestamps, and even common log types such as httpd access logs.

Using the Lego analogy again, most logs can be parsed using these out-of-the-box grok patterns. You can check out the full list in the Logstash GitHub.

As with most things, odds are your logging might not dovetail perfectly with these patterns and you will have to create your own. But, hey, that’s job security, right? ;)

Creating your own Grok patterns requires some heavy lifting with regex, on which grok patterns are based.

No! Don’t go! We’re just getting to the good part.

I know regex is a four-letter word for most, but in this case, we don’t have to write one massive regex for a given log statement. In fact, depending on your logging data, you likely won’t need to do any regex work at all, as Logstash comes with grok patterns for lots of things you commonly find in logs like:

  • Timestamps (including days, months, etc)
  • IP addresses
  • Class names (com.example.my.custom.class)
  • Quoted Strings (like requested and referring URLs in Apache logs)
  • Usernames
  • UUIDs
  • MAC addresses
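
As an example, the Apache access_log line from earlier can be handled almost entirely by the stock patterns. A minimal filter might look like the following – COMBINEDAPACHELOG ships with Logstash, though the trailing response-time field in my sample line would still need a small custom addition:

filter {
  grok {
    # COMBINEDAPACHELOG is a stock pattern covering the standard combined log format
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
}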

In my case, I had to do some customizations because:

  • The built-in grok pattern for “program” – which is a sub-pattern referenced by the SYSLOGBASE pattern – expects a string containing the process ID (PID) like so: processname[PID] (e.g.: httpd[12345])
    • In the case of our Tomcat localhost_access logs, the program name is customized via our syslog config. As a result, the PID is missing (e.g.: accesslog[])
    • It took me a while to figure out, but because of this Logstash would not properly parse my localhost_access logs (see the Iterate section later on :) )
  • While not discussed extensively in this post, I also created a custom grok pattern to understand the logs our Tomcat application generates so I can get things like the userID the log statement pertains to, the Java Classname, etc to help perform more targeted searches

Iterate

Like anything you write to parse strings, you’ll likely hit corner cases. Maybe a field is not populated in some instances, or what would normally be a string ends up being a number, or maybe some mobile device passes a funky User Agent string.

The point is, once you get a set of Grok patterns in place, you will want to review what Logstash gives you to ensure that you’re parsing as many logs as you can.

The reason why this is important is that if there is not a matching Grok pattern, the entire line of logging will not be split out. While it will still be searchable, you will not get the full benefit of this logging framework.

Going back to my missing PID as an example, this simple deviation caused every single line of my logging not to be parsed. Until I came across that missing PID as the root cause, I was unable to do all of the cool stuff you’ll see later when I discuss Kibana.

Grok Debugger

Unless you’re a regex whiz, you’re likely familiar with using sites like RegexPal to build a regex against a target string. Luckily, there is a similar tool for creating and testing Grok patterns: Grok Debugger.

The really cool thing about Grok Debugger is that it is Grok-aware and understands the built-in Grok patterns (e.g.: IPORHOST, URI, QS [Quoted String], etc). There is even a “discover” feature that will attempt to build a Grok pattern based on a sample line of logging you give it. While not 100% perfect, it does help make quick work of creating a custom Grok pattern, should the need arise.

Bringing it all Home

At this point I will assume you have done the following:

  • Deployed Logstash
  • Started passing logs through Logstash
  • Stored Logstash’s output in ElasticSearch
  • Used Kibana to query ElasticSearch, just to make sure things are working right

If not, here are some links you need to get these items working:

  1. Simple Logstash deployment (which includes an embedded ElasticSearch instance)
  2. Kibana deployment (simply unzip/untar and plunk in the Web Server of your choosing)

You can certainly get way more elaborate than this, but I’d recommend getting started with the simple version above. I’ve had such a setup ingest ~3GB of logs a day on a plain vanilla CentOS virtual machine with just 2 vCPU and 4GB of RAM – and it barely broke a sweat.
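
If memory serves, getting that simple setup running is as easy as grabbing the all-in-one “flatjar” and running something like the command below – adjust the jar name to whatever version you download; agent runs your config (with the embedded ElasticSearch), and web serves the bundled web UI:

java -jar logstash-1.2.2-flatjar.jar agent -f /etc/logstash/logstash.conf -- web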

IOW, crawl before you walk ;)

With that PSA out of the way, now comes the real work:

  1. Creating and verifying your customized Logstash config
  2. Creating and falling in love with your Kibana Dashboard

My Logstash Config

Note that I will not go into all of the ins and outs of Logstash file config as there is plenty of info on that across the Intert00bz, but here are some sources to get you started:

I will try – but likely not succeed – to not rehash too much of the content above as I explain why my finished Logstash config file looks like this:

input { 
   syslog {
      host => "0.0.0.0"
      port => 514
      type => "My Custom App"
   }
}

filter {
  # Split the raw message into fields using my custom patterns (defined below)
  grok {
    patterns_dir => "/opt/logstash/patterns/"
    match => [ "message", "%{TOMCAT_ACCESS}", "message", "%{APP_SERVER}" ]
  }
  # Grok captures everything as strings; convert the numeric fields so we can do math on them
  mutate {
    convert => [ "responsecode", "integer", "responsesize", "integer", "resp_time", "integer", "geoip.latitude", "float", "geoip.longitude", "float" ]
  }
  # Break the User Agent string out into browser/OS fields (ua.*)
  if [user_agent] {
    useragent {
      source => "user_agent"
      target => "ua"
    }
  }
  if [sourceip] !~ "(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" and [sourceip] {
    geoip {
         source => "sourceip"
    }
      mutate {
      # 'coords' will be kept, 'tmplat' is temporary.
      # Both of these new fields are strings.
      add_field => [ "coords", "%{[geoip][longitude]}",
             "tmplat", "%{[geoip][latitude]}" ]
      }
      mutate {
      # Merge 'tmplat' into 'coords'
      merge => [ "coords", "tmplat" ]
      }
      mutate {
      # Convert our new array of strings back to float
      convert => [ "coords", "float" ]
      # Delete our temporary latitude field
      remove => [ "tmplat" ]
      }
  } 
}

output { 
   elasticsearch { embedded => true }
}

… and here is the file containing my custom TOMCAT_ACCESS and APP_SERVER Grok patterns:

SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:
SYSLOGPROG %{PROG:program}(?:\[%{DATA:pid}\])?
TOMCAT_ACCESS %{SYSLOGBASE} %{IP:sourceip} - - %{DATA:sessionid} \[%{HAPROXYDATE}\] %{QS:request} %{INT:responsecode} %{DATA:responsesize} %{QS:referrer} %{QS:user_agent} %{INT:resp_time}
APP_SERVER %{SYSLOGBASE} %{DATA:level}  \[%{DATA:classname}\] - \[%{DATA:threadid}\] - \[%{DATA:userid}\] - \[%{DATA:sessionid}\]: %{GREEDYDATA:payload}

While I certainly could have put the patterns directly in the Logstash config file itself, by using a separate patterns file it makes the config easier to read and manage.

The reason I had to create a pattern file was twofold:

  1. I had to override the built-in SYSLOGPROG – and by extension, SYSLOGBASE – patterns because of our goofiness in not spitting out a PID along with the program name (e.g.: httpd:[])
  2. Also, since our formatting of our Tomcat Access Logs deviated from the norm, I had to create a custom filter for that

Um, WTF is the “geoip” and “coords” stuff in there?

I briefly mentioned that we can GeoLocate clients based on their IP address. This magic comes courtesy of Logstash, which ships with a built-in DB that maps IP addresses to physical locations – Longitude and Latitude coordinates, to be exact. These coordinates are based on data from the various Internet Registries (ARIN, RIPE, etc).

This leads to an issue for “internal” applications: What to do about private IP addresses? Obviously, those will not be in the various Internet Registries. Unfortunately, we can’t customize these, either. Logstash gets this Geolocation data from the kind folks at MaxMind. In order to keep this free (as in beer), the Geolocation DB is in MaxMind’s proprietary format. More on handling internal IPs coming up in a bit.

Back to the Geolocation DB: The proprietary, binary format means we can’t (not easily, at least) use a customized version of the DB to specify where private subnets actually physically reside within our organization. Of course, Logstash is Open Source, thus you could implement support for the CSV formatted DBs MaxMind provides, but that’s a topic for another post.

I will start off by saying that a vast majority of the Geolocation config was blatantly stolen from Bryan White’s post you can find here. The short version is that you have to do some conversion of the coordinates to get them into the format that Kibana needs for the Bettermap panel. The really short version is to just copy and paste what he has kindly provided and move on :)

Now, that just leaves the matter of how to handle internal IP addresses. Logstash recently added the idea of “conditionals” in its config files. This allows you to perform actions against a log statement only if certain conditions are met.

In this case, we want to skip Geolocation processing if either of these situations are present in a given log statement:

  • The sourceip field is missing
  • The value of the sourceip field falls into the range of internal IP addresses (e.g.: 10.x.x.x, 172.16.x.x through 172.31.x.x and 192.168.x.x)

The reason we want to skip Geolocation for these is the coordinates end up being 0,0 – which is somewhere West of Africa in the middle of the Atlantic. Not terribly useful data and it makes us wonder why so many people are visiting our site from the middle of the ocean ;)

As a result, I have the entire block of Bryan’s Geolocation config sitting inside this conditional:
if [sourceip] !~ "(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" and [sourceip]

What follows the if is a regex, which is likely self-explanatory, but just in case, it translates to:

  • If the field sourceip:
    • Is not in the range of private IP addresses
    • and
    • Is present (which is why you see [sourceip] after the and near the end of the statement)

… then go ahead and do Geolocation for this log statement, which is done via the geoip filter and the output of geoip is reformatted by the mutate filters to get them into the format Kibana expects.

The reason for checking that sourceip is present in the first place is that I am receiving logs from all manner of sources (e.g.: messages logs, etc), many of which will not have an IP address in them. If I did not filter these out, I would get a lot of log statements with coordinates of 0,0. I learned that the hard way (see the “Iterate” section above).

The end result of this config is:

  • I am parsing Tomcat Access Logs and my application’s custom logs to get custom fields, some of which I defined (e.g.: userID) and some of which leverage built-in field names within Logstash (e.g.: user_agent)
  • All other logs (e.g.: messages, cron, etc) are simply stored as they are received with no special field creation, as I typically don’t have a need to search and/or chart on these
    • Note: Logstash still creates some fields, depending on the log type
    • For example, it will recognize cron logs and set the program field to “cron”
  • The output is going into the embedded ElasticSearch

Kibana Dashboard

At this point, I have laid the groundwork to get the fields necessary to populate my Kibana Dashboard. Again, this Dashboard is comprised of Panels, each of which will look for the presence of a given field for a document in ElasticSearch. Based on the contents of that field – and the type of panel – Kibana will visualize that data for us.

For example, the Bettermap panel will look for the value of the coords field in each record it receives. Then, based on that value, it will place a “pin” in the location corresponding to that value. A Terms panel can be configured to look at things like sourceip and give me a list of the top X (e.g.: 10) most common source IP addresses found in the logs, along with how many times that IP address appeared in the logs for a given time frame.
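
Under the covers, a Terms panel like that boils down to a terms facet query against ElasticSearch. Stripped of the time filtering Kibana adds, a hand-rolled equivalent against my lab instance looks roughly like this (the facet name top_ips is just a label I picked):

curl -XGET 'http://10.194.190.245:9200/logstash-2013.12.05/_search?pretty' -d '{
  "facets": {
    "top_ips": {
      "terms": {
        "field": "sourceip",
        "size": 25
      }
    }
  },
  "size": 0
}'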

Before I get to the shiny, let me detail each of the panels in my Dashboard so you can better understand the resulting charts, etc.

  • “Events Over Time”: This is a Histogram panel that gives me the total number of log statements (stored as “documents” in ElasticSearch). Depending on the time range chosen, Kibana will aggregate these. For example, if you look at an hour’s worth of logs, it will aggregate the total number of documents by chunks of 30 seconds, then graph that aggregated count. Note: This is a raw count of all logging seen for the given time frame, regardless of log source.

Screen Shot 2013-11-11 at 8.03.44 AM

  • “Mean Response Time”: This is the mean value of the response time, as found in only the Tomcat Access Logs. When you configure a Panel to look at the target field, that Panel will formulate its query appropriately. For example, the “Mean Response Time” Panel is only concerned about documents in ElasticSearch where the resp_time field (which is one I custom-defined in my patterns file) is present. The result is that the Panel issues a query like this to ElasticSearch to get the data it will graph:
curl -XGET 'http://10.194.190.245:9200/logstash-2013.12.05/_search?pretty' -d '{
  "facets": {
    "2": {
      "date_histogram": {
        "key_field": "@timestamp",
        "value_field": "resp_time",
        "interval": "30s"
      },
      "facet_filter": {
        "fquery": {
          "query": {
            "filtered": {
              "query": {
                "query_string": {
                  "query": "*"
                }
              },
              "filter": {
                "bool": {
                  "must": [
                    {
                      "match_all": {}
                    },
                    {
                      "range": {
                        "@timestamp": {
                          "from": 1386258436590,
                          "to": "now"
                        }
                      }
                    },
                    {
                      "bool": {
                        "must": [
                          {
                            "match_all": {}
                          }
                        ]
                      }
                    }
                  ]
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}'

Screen Shot 2013-12-05 at 4.00.58 PM

  • “Most Frequent IPs”: As the title implies, this is the Top X (25, in this case) most frequently seen IP addresses, based on the sourceip field. As is the case with many of the fields in this Dashboard, this field is only present in Tomcat Access Logs.

Most Frequent IPs - redacted

  • “Top Programs”: This Panel uses Logstash’s built-in capability to discern which application wrote a given log statement (e.g.: cron, httpd, sshd, monit, etc)

Screen Shot 2013-11-11 at 8.02.01 AM

  • “HTTP Response Codes”: This is a histogram charting the (hopefully low) total number of HTTP response codes that are > 400, with the idea being to see how many “errors” are returned to the users.

Screen Shot 2013-11-11 at 8.02.09 AM

  • “Trends”: Using the Trends Panel, this compares the number of >300 Response Codes seen in this time frame to the same time frame in a previous period. In my case, it is 7 days ago, as my default view shows the previous hour. I chose 7 days to avoid situations where I would compare a weekday to a day on the weekend, as usage is mainly during the work week in my case.

Note that the Trends Panel is only as valuable as the data you feed it. If you have a small number of hits for a given term, you can see massive jumps in percentage compared to your trend target.

As you can see from the example below, I need to re-evaluate the usefulness of this panel. But, I did say to iterate, right? :)

Screen Shot 2013-12-05 at 4.03.22 PM

  • “Browsers”: This is a pie chart breaking down the ratio of various client browsers, based on the User Agent string.
  • “Operating Systems”: Another pie chart to show the breakdown of client OS’, again based on the User Agent string.

Screen Shot 2013-10-30 at 3.20.39 PM

  • “HTTP Errors”: A pie chart showing a breakdown of the same data shown in the HTTP Response Codes Histogram.

Screen Shot 2013-12-05 at 4.04.24 PM

  • “Server Activity”: This Terms Panel shows the relative amount of logging from each of the node types in the distributed application I am monitoring (e.g.: Tomcat, memcached, MongoDB, etc)

Screen Shot 2013-12-05 at 4.05.04 PM

  • “Client Locations”: A Bettermap Panel that uses the Geolocation coords field to show where client requests are physically coming from

Screen Shot 2013-11-11 at 8.03.55 AMScreen Shot 2013-12-05 at 4.05.44 PM

  • “Logs”: A Table Panel that shows the @timestamp, message, and sourceip fields. Clicking any row in this table expands to show all of the fields related to this log statement. This is part of the canned Logstash dashboard that ships with Kibana.

Screen Shot 2013-11-11 at 8.04.08 AM

Enough, already. Just show me the whole Dashboard!

OK, OK. Geez.

The final result is below. It is admittedly a bit busy, but I’ve built it to be as much of a tech demo as it is also a useful tool.

Kibana 3

I hope this post helps you create a useful dashboard of your own!

 

 

Update: Adding Kibana Template

I’ve had a few folks ask for the Kibana template used for the dashboards above. So, here it is! Hope you find it useful.

 

{
 "title": "example-dashboard",
 "services": {
 "query": {
 "idQueue": [
 1,
 4,
 7,
 8,
 9,
 10
 ],
 "list": {
 "0": {
 "alias": "200s",
 "query": "responsecode:[200 TO 299]",
 "id": 0,
 "color": "#7EB26D",
 "pin": false,
 "type": "lucene"
 },
 "2": {
 "id": 2,
 "color": "#6ED0E0",
 "query": "*",
 "alias": "All Requests",
 "pin": false,
 "type": "lucene"
 },
 "3": {
 "alias": "300s",
 "query": "responsecode:[300 TO 399]",
 "id": 3,
 "color": "#EF843C",
 "pin": false,
 "type": "lucene"
 },
 "5": {
 "alias": "400s",
 "query": "responsecode:[400 TO 499]",
 "id": 5,
 "color": "#1F78C1",
 "pin": false,
 "type": "lucene"
 },
 "6": {
 "id": 6,
 "color": "#BA43A9",
 "query": "responsecode:[500 TO 599]",
 "alias": "500s",
 "pin": false,
 "type": "lucene"
 }
 },
 "ids": [
 0,
 3,
 5,
 6,
 2
 ]
 },
 "filter": {
 "idQueue": [
 1,
 2,
 3
 ],
 "list": {
 "0": {
 "type": "time",
 "field": "@timestamp",
 "from": "now-1h",
 "to": "now",
 "mandate": "must",
 "active": true,
 "alias": "",
 "id": 0
 }
 },
 "ids": [
 0
 ]
 }
 },
 "rows": [
 {
 "title": "Graph",
 "height": "350px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "span": 12,
 "editable": true,
 "group": [
 "default"
 ],
 "type": "histogram",
 "mode": "count",
 "time_field": "@timestamp",
 "value_field": null,
 "auto_int": true,
 "resolution": 100,
 "interval": "30s",
 "fill": 3,
 "linewidth": 3,
 "timezone": "browser",
 "spyable": true,
 "zoomlinks": true,
 "bars": false,
 "stack": false,
 "points": false,
 "lines": true,
 "legend": true,
 "x-axis": true,
 "y-axis": true,
 "percentage": true,
 "interactive": true,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "title": "Events over time",
 "intervals": [
 "auto",
 "1s",
 "1m",
 "5m",
 "10m",
 "30m",
 "1h",
 "3h",
 "12h",
 "1d",
 "1w",
 "1M",
 "1y"
 ],
 "options": true,
 "tooltip": {
 "value_type": "cumulative",
 "query_as_alias": false
 }
 }
 ],
 "notice": false
 },
 {
 "title": "Average Response Size",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "span": 12,
 "editable": true,
 "type": "histogram",
 "loadingEditor": false,
 "mode": "mean",
 "time_field": "@timestamp",
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "value_field": "resp_time",
 "auto_int": true,
 "resolution": 100,
 "interval": "30s",
 "intervals": [
 "auto",
 "1s",
 "1m",
 "5m",
 "10m",
 "30m",
 "1h",
 "3h",
 "12h",
 "1d",
 "1w",
 "1M",
 "1y"
 ],
 "fill": 0,
 "linewidth": 3,
 "timezone": "browser",
 "spyable": true,
 "zoomlinks": true,
 "bars": false,
 "stack": false,
 "points": false,
 "lines": true,
 "legend": true,
 "x-axis": true,
 "y-axis": true,
 "percentage": false,
 "interactive": true,
 "options": true,
 "tooltip": {
 "value_type": "cumulative",
 "query_as_alias": false
 },
 "title": "Mean Response Time"
 }
 ],
 "notice": false
 },
 {
 "title": "Errors",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "error": false,
 "span": 8,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "all",
 "ids": [
 0,
 3,
 5,
 2,
 6
 ]
 },
 "field": "sourceip",
 "exclude": [
 "127.0.0.1",
 "10.42.7.5",
 "10.42.7.6"
 ],
 "missing": false,
 "other": false,
 "size": 25,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "bar",
 "counter_pos": "above",
 "spyable": true,
 "title": "Most Frequent IPs"
 },
 {
 "error": false,
 "span": 4,
 "editable": true,
 "type": "pie",
 "loadingEditor": false,
 "query": {
 "field": "program",
 "goal": 100
 },
 "queries": {
 "mode": "all",
 "ids": [
 0,
 3,
 5,
 2,
 6
 ]
 },
 "size": 10,
 "exclude": [
 "su"
 ],
 "donut": false,
 "tilt": false,
 "legend": "above",
 "labels": true,
 "mode": "terms",
 "default_field": "DEFAULT",
 "spyable": true,
 "title": "Top Programs"
 }
 ],
 "notice": false
 },
 {
 "title": "Response Codes",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "span": 8,
 "editable": true,
 "type": "histogram",
 "loadingEditor": false,
 "mode": "count",
 "time_field": "@timestamp",
 "queries": {
 "mode": "selected",
 "ids": [
 3,
 5
 ]
 },
 "value_field": "responsecode",
 "auto_int": true,
 "resolution": 100,
 "interval": "30s",
 "intervals": [
 "auto",
 "1s",
 "1m",
 "5m",
 "10m",
 "30m",
 "1h",
 "3h",
 "12h",
 "1d",
 "1w",
 "1M",
 "1y"
 ],
 "fill": 0,
 "linewidth": 3,
 "timezone": "browser",
 "spyable": true,
 "zoomlinks": true,
 "bars": false,
 "stack": true,
 "points": false,
 "lines": true,
 "legend": true,
 "x-axis": true,
 "y-axis": true,
 "percentage": false,
 "interactive": true,
 "options": false,
 "tooltip": {
 "value_type": "cumulative",
 "query_as_alias": false
 },
 "title": "HTTP Response Codes"
 },
 {
 "span": 2,
 "editable": true,
 "type": "trends",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 3,
 5,
 2,
 6
 ]
 },
 "style": {
 "font-size": "16pt"
 },
 "ago": "1d",
 "arrangement": "vertical",
 "spyable": true,
 "title": "Trends"
 },
 {
 "error": false,
 "span": 2,
 "editable": true,
 "type": "derivequeries",
 "loadingEditor": false,
 "loading": false,
 "label": "Search",
 "query": "*",
 "ids": [
 0,
 1,
 3,
 4,
 5
 ],
 "field": "responsecode",
 "fields": [
 "@timestamp",
 "@version",
 "classname",
 "device",
 "facility",
 "facility_label",
 "haproxy_hour",
 "haproxy_milliseconds",
 "haproxy_minute",
 "haproxy_month",
 "haproxy_monthday",
 "haproxy_second",
 "haproxy_time",
 "haproxy_year",
 "host",
 "level",
 "logsource",
 "major",
 "message",
 "minor",
 "name",
 "os",
 "os_major",
 "os_minor",
 "os_name",
 "patch",
 "payload",
 "pid",
 "priority",
 "program",
 "referrer",
 "request",
 "resp_time",
 "responsecode",
 "responsesize",
 "sessionid",
 "severity",
 "severity_label",
 "sourceip",
 "tags",
 "threadid",
 "timestamp",
 "type",
 "ua.device",
 "ua.major",
 "ua.minor",
 "ua.name",
 "ua.os",
 "ua.os_major",
 "ua.os_minor",
 "ua.os_name",
 "ua.patch",
 "user_agent",
 "userid"
 ],
 "spyable": true,
 "rest": false,
 "size": 5,
 "mode": "terms only",
 "exclude": [],
 "history": [
 "*",
 "206"
 ],
 "remember": 10,
 "title": "HTTP Response Codes"
 }
 ],
 "notice": false
 },
 {
 "title": "OS and Browser Info",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "field": "ua.name",
 "exclude": [
 "other"
 ],
 "missing": false,
 "other": false,
 "size": 10,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "pie",
 "counter_pos": "above",
 "spyable": true,
 "title": "Browsers"
 },
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "field": "ua.os_name",
 "exclude": [
 "other",
 "x",
 "os",
 "7",
 "8",
 "10",
 "xp",
 "vista"
 ],
 "missing": false,
 "other": false,
 "size": 10,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "pie",
 "counter_pos": "above",
 "spyable": true,
 "title": "Operating Systems"
 },
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "pie",
 "loadingEditor": false,
 "query": {
 "field": "responsecode",
 "goal": 100
 },
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "size": 10,
 "exclude": [
 "200",
 "304",
 "302"
 ],
 "donut": false,
 "tilt": false,
 "legend": "above",
 "labels": true,
 "mode": "terms",
 "default_field": "DEFAULT",
 "spyable": true,
 "title": "HTTP Statuses"
 },
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "field": "logsource",
 "exclude": [
 "example.com",
 "1",
 "2",
 "3",
 "server"
 ],
 "missing": false,
 "other": false,
 "size": 10,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "pie",
 "counter_pos": "above",
 "spyable": true,
 "title": "Server Activity"
 }
 ],
 "notice": false
 },
 {
 "title": "Logs",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "error": false,
 "span": 12,
 "editable": true,
 "type": "table",
 "loadingEditor": false,
 "status": "Stable",
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "size": 100,
 "pages": 5,
 "offset": 0,
 "sort": [
 "@timestamp",
 "desc"
 ],
 "group": "default",
 "style": {
 "font-size": "9pt"
 },
 "overflow": "min-height",
 "fields": [
 "@timestamp",
 "message",
 "sourceip"
 ],
 "highlight": [],
 "sortable": true,
 "header": true,
 "paging": true,
 "field_list": true,
 "all_fields": true,
 "trimFactor": 300,
 "normTimes": true,
 "spyable": true,
 "title": "Logs"
 }
 ],
 "notice": false
 }
 ],
 "editable": true,
 "failover": false,
 "index": {
 "interval": "day",
 "pattern": "[logstash-]YYYY.MM.DD",
 "default": "NO_TIME_FILTER_OR_INDEX_PATTERN_NOT_MATCHED"
 },
 "style": "light",
 "panel_hints": true,
 "loader": {
 "save_gist": false,
 "save_elasticsearch": true,
 "save_local": true,
 "save_default": true,
 "save_temp": true,
 "save_temp_ttl_enable": true,
 "save_temp_ttl": "30d",
 "load_gist": true,
 "load_elasticsearch": true,
 "load_elasticsearch_size": 20,
 "load_local": true,
 "hide": false
 },
 "pulldowns": [
 {
 "type": "query",
 "collapse": false,
 "notice": false,
 "enable": true,
 "query": "*",
 "pinned": true,
 "history": [
 "responsecode:[500 TO 599]",
 "responsecode:[400 TO 499]",
 "responsecode:[300 TO 399]",
 "*",
 "responsecode:[200 TO 299]",
 "responsecode:>500",
 "responsecode:\"403\"",
 "responsecode:\"404\"",
 "responsecode:\"201\"",
 "responsecode:\"206\""
 ],
 "remember": 10
 },
 {
 "type": "filtering",
 "collapse": false,
 "notice": true,
 "enable": true
 }
 ],
 "nav": [
 {
 "type": "timepicker",
 "collapse": false,
 "notice": false,
 "enable": true,
 "status": "Stable",
 "time_options": [
 "5m",
 "15m",
 "1h",
 "6h",
 "12h",
 "24h",
 "2d",
 "7d",
 "30d"
 ],
 "refresh_intervals": [
 "5s",
 "10s",
 "30s",
 "1m",
 "5m",
 "15m",
 "30m",
 "1h",
 "2h",
 "1d"
 ],
 "timefield": "@timestamp",
 "now": true,
 "filter_id": 0
 }
 ],
 "refresh": "5m"
}

Defuse your landmines… before it is too late

Anyone involved with development, support – heck even usage – of software knows that there are always little “gotchas” that you learn to avoid. We should eject thumb drives before removing them. Don’t delete your System or Windows directory. Or the dreaded “Saving… Don’t power off your console” message.

Some of these make sense or are pretty obvious. Also, there’s only so much you can do to prevent a user from shooting themselves in the foot. It is not feasible – at least from a cost/benefit standpoint – to design a game console that is capable of gracefully recovering your save if you kill the power mid-flight, just to cite one example.

But, there are lots of other scenarios where there is much to be lost if things go bad. As totally pissed as I would be if my Forza save went “poof,” I believe I’d still live. If I lose all of my family pictures from the past decade because of a bug or a rare caveat, that’s a totally different story.

“Hello. My name is Chris and I’m a landmine denier.”

Admit it: You know your software has these types of problems. There are areas where we can reasonably expect users to tread, but we do not put up the proper fencing. Things where it’s not a matter of if it will happen – it’s a matter of when. In a way, I shouldn’t complain. It’s crap like this that keeps me employed.

That said, I view my job as making myself unemployed. My software should be so simple, so robust, so intuitive that you don’t need a geek like me to deploy, maintain, or troubleshoot it… or at least fewer geeks like me :) [I’ll let you in on a little secret: we geeks would rather be solving different, more interesting problems, anyway.]

Landmines: The Home Version

I’ll offer up a non-tech example of this. We’ve lived in our current house for over 2 years. Thanks to our <sarcasm>lovely HOA</sarcasm>, our trash cans can’t be left by the alley. So, we’ve kept them on one side of the driveway as a “solution.” Simple enough and we meet the letter of the law to keep the HOA police off our back.

Because of this, parking a car on that side of the garage required pretty precise driving – both coming and going. I knew that some day, likely when we were in a hurry, one of us was going to get it wrong. Put the car into the wall of the garage, run over the trash cans, whatever.

But, I knew the caveat. I had “documented” the workaround: Just don’t run into the garage. Simple, right?

Did I put any effort into a proper solution? Absolutely! I thought about digging out a space on the other side of the driveway, installing some bricks (we have a stack taking up room in the garage, anyway), and parking the cans over there. That would leave the driveway 100% clear.

OK… But did I do anything about it? Well, not really. I “didn’t have time.” There were other, more important things going on.

That said, I did finally manage to find time a couple of weekends ago… only after the side view mirror was ripped off my wife’s car. Amazing how things like that suddenly change our priorities (more on this later).

Rear View Mirror Damage 2 (Not actually my mirror, but you get the point…)

Now, not only did I install the bricks – which I should have done months ago, back when it wasn’t ~100 degrees outside – but I had to drop about $200 in parts to replace the mirror.

Bringing it Back to Tech

We all know of – or even create and support – software that has similar landmines in it. Sure, we can train people to step around these landmines. While the audience is small, this is manageable.

What happens when your product is the wonderful success you want it to be? Does a model that requires special insider info or meticulous attention to details in documentation bode well for a great experience in the field?

In my trash can situation, I thought I had it all under control. Things went well for a long time. So long, in fact, that I convinced myself that implementing a proper fix wasn’t a priority. “Nothing has blown up so far. I probably don’t need to dig up the yard, after all.”

But, as I was picking up pieces of broken mirror from the driveway, I kicked myself for not acting sooner. Moreover, I had to rearrange my weekend to complete the work in a couple of days, whereas I could have taken more time (and likely done a better job) had I started earlier and paced myself.

Think back in your career and I am sure you can find examples of this in your product’s history. A major customer hits a critical issue, despite it being “well known” or even documented, which causes an all-hands-on-deck situation. New feature development grinds to a halt… and we were already behind on new features, because we’re human and we all suck at estimation.

“A stitch in time…,” “An ounce of prevention…”, etc. Yes, they’re all cliché, but that doesn’t make them any less true.

The point is: You can bury your skeletons before they are found or while a customer is beating you over the head with a femur. You make the call!

On the Importance of Requirements and Stakeholders

Not only am I a curmudgeon when it comes to Supportability; I also nitpick about things in meatspace :)

Take this for example:

Image

This is the ice maker in one of our break rooms. Let’s take a look at how we operate this thing:

  1. The user flips a switch (or two) to select ice, water, or both
  2. The user presses the button, which dispenses water and/or ice, depending on the selections in Step 1

Seems pretty straightforward, right? What’s wrong with this design, you ask? Here are a couple of things, for starters:

  1. The operation is sub-optimal.
    1. Say I wanted ice, but not water. But, the person who used the machine before me just wanted water.
      1. Side rant: Getting just water from this machine is silly, because we have a dedicated water machine about 3 feet away. Despite this, I find the ice maker in this state quite often.
    2. Now, I have to perform three operations to get what I want:
      1. Switch water to “off.”
      2. Switch ice to “on.”
      3. Press button (this one is common to all operations; no getting away from that).
  2. It does not allow for differing ratios of ice to water. Once I push that button, ice and/or water dumps out at whatever rate the machine wants.
  3. If the user is not paying attention, they can get something other than what they wanted.
    1. Say they pour a soda into a cup, then they just press that button. How good will their user experience be when both ice and water come out? Great if you like your soda watered down, but nigh impossible to fix once you push that button.

Bringing this back to coding and Agile, this is a prime example of issues I see all the time in software design:

  1. Improper requirements gathering
  2. Engineering implementing what was spec’d, despite the fact that a better solution might exist at a slightly higher cost
  3. Lack of stakeholder input

Let’s break each of these down.

Improper Requirements Gathering

I can envision the genesis of this feature. It’s an ice machine, so it likely had the single “Push” button for its sole end-user function: gimme some ice!

Then, someone came up with an enhancement:

User: “Could I get both ice and water… at the same time, from the same machine?”

Product Owner: “Hey, that’s a nice idea. I will run this by engineering.”

Here’s where the problem starts. The Product Owner simply took the requirement as the customer stated it. They let the customer design the solution to their own problem… which, with all due respect to our customers, may not be the best plan of attack.

Any feature exists to address a problem/need/desire. A good Product Owner would not gather requirements from the customer by simply parroting the customer’s proposed solution. Rather, they would ask “the question behind the question.”

Asking “the question behind the question” gets into actual requirements gathering. Simply taking the feature request at face value makes you an “order taker.” Also, if one customer has a need, it’s almost guaranteed that others have the same. By backing up to find out what itch needs to be scratched, a good Product Owner works hand-in-hand with the customer and Engineering to find out:

  1. What is the problem/need/desire to be addressed?
  2. Given technological, resource, and time limitations, what is the optimal solution to the problem/need/desire that can be used by the most customers?
    1. See also, Jason Fried’s book, Getting Real, and his comments about building your product around a small number of customers

Here is an example I encountered that illustrates this point (as a refresher, I am in the Enterprise Social Software biz.):

We had a customer ask for what I lovingly termed “the firehose.” They wanted a running activity stream of all activity across the entire system. Every time anyone created a post, discussion forum thread, updated their status, etc, they wanted to see it.

Sounds ridiculous, right? How would that be at all useful? It would be like watching The Matrix scroll by – total information overload.

Instead of taking that request to Product Management as-is, I asked, “What issue are you trying to solve?”

The answer: “We had this feature in a previous piece of software and it helped us find other people with similar interests, old colleagues, etc. Sure, it was unbearable during the day, but if I was working early or late, I sometimes saw content or people scroll by that were useful to me.”

Basically, they wanted to use serendipity as a sort of RDF mechanism. Also, I could argue that staring at a ticker tape of activities, hoping to find something useful, is not likely going to register with their management as a great use of time. But, I digress…

My reply: “Sure, that would work. But, how about we use data from within the system to help link you to people and Communities with common interests, geographies, job roles, etc? We can then display suggestions for people, Communities, Posts, etc to engage with.”

Turns out we already had a Suggestions feature in the works that used standard RDF triples to get at this info, which is partly why I made the suggestion. It was already near completion, it was something nearly all of our customers wanted, and – IMHO – it scratched this particular customer’s itch… even if they didn’t know it just yet.

Had we given them “the firehose,” they would have exactly what they asked for… but not what they really wanted. In this case, they were keying on previous experiences as the solution to their problem. A better approach would be to vocalize the pain point and work with us to devise a solution that best met their true need – given the resources available to all involved.

Ultimately, “the firehose” would have likely been unusable for most people and engineering resources could have been better spent elsewhere.

“It was implemented as spec’d, man. Don’t blame me!”

Going back to the ice maker, recall the design:

Image

Running with our assumption that the “Push” button was the previous design, I can take a guess at how we arrived at the design above. It probably went something like this.

In the initial design, pushing the button completed the circuit to drive the motor to dispense ice. Keeping with that same design, when they added a valve to dispense water, they took the same design paradigm.

Engineer1: But, wait… there’s only one button. How about we put a switch between the button and the motor to turn the ice dispensing on and off. We need to hook the valve for the water up to the same button anyway, so let’s just do the same for the water.

Engineer2: Awesome! We’re totally going to get a bonus for coming up with a working solution so quickly.

Yes, this would indeed address the product requirement given to them:

  • Give the user the ability to dispense ice, water, or both

Certainly meets the letter of the law. I can’t argue against that.

But, did they actually take the time to use their creation? Did they take a step back and challenge the design? Was there a better way to solve the problem – even if it meant a bit more engineering time?

Quite often, I see Developers – most of whom I know personally to be very smart and very UX-aware – simply implement a design as-is, even if they know that design is sub-optimal.

Why would they do such a thing? There simply isn’t time baked into the schedule to get the requested work done, much less time to iterate on the design. So, they implement what’s given, which sometimes ends up getting redone in a later release.

This one I would put in the lap of the User Experience team, with Management coming in a close second. Just as you can’t take a requirement from an end-user at face value, neither can you treat requirements handed down from a Product Manager as infallible. The whole idea of Agile is to iterate quickly so you can get feedback – from all parties – early and often.

Rather than just have “one” design, you should have a basis for your design – and iterate from there. You need to have slack in your schedule to accommodate this iteration.

Instead, we are often both Feature- and Time-bound, so there is no wiggle room to do anything outside of our immediate marching orders. Ultimately, quality and UX suffer.

Lack of Stakeholder Input

This is typically more of an issue with Commercial software, as your Stakeholders are every potential client and end-user. It’s unreasonable to attempt to get such a large audience in a room – virtual or otherwise – to gain consensus and gather feedback. As a result, internal audiences – like the Product Owner – are the ones who end up accepting a User Story as complete.

Unless you plan to be the sole consumer of your product (e.g.: internal applications), this is a horrible idea.

In my experience, the people who create a product are terrible at predicting all of the incorrect ways an end-user can attempt to use the product. Nor are they familiar with the wide array of business needs that exist in the field.

Because of this, the feedback loop ends up being support cases and defects – neither of which are a good way to “wow” a customer. Not only have you damaged your brand, it is also much more costly to fix a design once it is out in the field. While most teams “can’t afford” to do this iteration in their Sprints, they end up paying more for it on the backside as the defects and angry customers roll in. Pay me now or pay me later.


To understand why this happens, go back to our first item: Requirements Gathering.

We can have every intention of doing a kick-ass job of implementing a feature, but no matter how hard we try, we cannot anticipate the corner cases and preferences of the end-users. The only way to do this is to get the feedback from the customers themselves.

Short of calling Miss Cleo, there is no substitute.

Going back to our ice machine, I suspect the new design ran into all of the issues I cited at the top of this post… and more. Did these things occur to me because I’m some wonderful UX genius who can see things others can’t? As much as I’d love to say “yes,” the answer is “of course not!”

Had this design been tested – even internally, in this case – for a couple of weeks, they would have likely received the exact same feedback. Most importantly, it needs to be tested outside of the team that created it!

Our end-users are not aware of what happens behind the green curtain. They know how they would like it to work – regardless of the underpinnings and engineering involved. As such, they will not let technology lead them to a solution or rule out potential solutions, as would often be the case for those closer to the metal.

OK, time to bring this home…

Like any good story, this one has a happy ending. If you look for the current model of this machine, here is what the design looks like now:

[Image: the redesigned dispenser panel – two separate buttons, one for ice and one for water]

Apologies for the crappy pic, but you should be able to make out that the new design is quite simple: two buttons, one for ice and one for water. That’s it! No switches, no carry-over of the previous user’s preferences.

This design is not only more aesthetically pleasing and much easier to grok, it addresses the issues mentioned above:

  • Want just ice or just water – without having to jack around with switches? Check
  • Want to choose your own ice-to-water ratio, without any fiddling with switches, etc? Check
  • Are instances of getting something other than what you intended reduced? Absolutely (the only way it can happen is if the user pushes the wrong button)

I’ll be the first to admit that this is a pretty asinine way to get my point across, but this is a real (first world) problem that ultimately required a redesign to address. I see this happening time and again in software as the release, customer fury, re-design, re-release cycle persists.

Not to jump on the Microsoft bashing bandwagon, but a quick look at the Amazon reviews for Windows 8 tells you that they got it all wrong. They took the option of doing what was simpler – hastily bolting a touch-centric UI on top of the standard Desktop – rather than “doing it right.”

Am I stupid enough to say I have the perfect answer to what they should have done differently? Hell no. But as a start, if you read the reviews, there is near universal agreement that the odd jumping between Metro and the legacy Desktop is jarring.

You don’t need a lot of field research to find that one out. Yet, they decided to ship and are now suffering the backlash, ultimately delivering a way to bypass Metro entirely.

The points I am trying to drive home are:

  • Be mindful when gathering requirements
    • Ask the question behind the question and look for the need left unspoken
  • When designing and implementing, don’t tack on something to what exists if it will ultimately make things worse
    • Do it right the first time, else expect to see it come back – with high velocity – when your customer base hates it
  • Keep your Stakeholders in the loop
    • You aren’t selling ice makers to yourself, so make sure to deliver strong the first time to the people that matter – your customers!
    • You don’t have a second chance to surprise and delight
      • Ignore them during design and implementation at your peril!

“Help me, Supportability. You are my only hope.”

Supportability? Wait, don’t you mean “Serviceability?”

No, I don’t. In my mind, Serviceability is a subset of Supportability.

In order for normal humans to support your product, it definitely has to be “serviceable.” To me, Serviceability encompasses things like:

  • Managing the product
  • Configuring it
  • Deploying it successfully without intimate knowledge of its guts
  • Viewing system health/status

If I have to call you every time I want to change a feature’s behavior because its settings are strewn across 10 different properties files, that’s not a supportable product, because it is not serviceable.

Jiffy Lube != Ferrari’s F1 Pit Crew

To make a [crappy] analogy, Jiffy Lube can definitely service my car by changing the oil. But I would not take my car to them if I had issues with the engine stalling on cold mornings. That’s a support issue that is best handled by mechanics who intimately understand how an engine works and have the tools and knowledge to dig in at a low level to remedy the issue.

When people lump fixing Day2 problems/outages [AKA: “the unexpected”] with normal design/deploy/operate issues [AKA: “the expected”], they are missing a harsh reality of our industry: we have way more Jiffy Lube-types than Formula 1 mechanics.

[ Note: That was not intended to be a slam of the Jiffy Lube folks. My mechanic can’t change my oil in 5 minutes.

I’m not nearly as good a basketball player as Dwyane Wade, but I suspect I’m a touch better than he is at deciphering Tomcat logs.

We all have to specialize. But, I digress… ]

Plenty of people can configure and maintain systems. However, an overwhelming majority of people are simply not the tech equivalent of a mechanic.

They can change the oil, bolt on new wheels, maybe even install a new stereo. But, they are unable to dig into each of the “black boxes” of your solution [the engine or transmission, to continue the horrible analogy] to find out why things aren’t working. When issues arise, they poke the boxes from various angles hoping to get a desired outcome, like some kind of twisted digital pachinko machine.

DevOps as a stop-gap

Luckily, we DevOps types have emerged to help bridge the gap between coding large-scale products/platforms and keeping them afloat. But, that’s not a model that scales (though it is short- to mid-term job security ;) ).

As the person in my BU who thinks the most about holistic Supportability, my job is to make the platform so easy to support that I don’t have a job anymore. Unfortunately, “easy to support” is a (falsely) assumed feature that should just be there.

OK, the number of DevOps-savvy folks is limited. I get that. Why can’t “normal” humans do the same? You guys aren’t that damned smart.

Because you haven’t built the tools to help people support the product.

True ability to support a given platform requires not just intimate knowledge of the target platform, but proper visibility into its operation. If I call my mechanic and have him listen to the noise my engine makes, he can take a guess at the problem. But if I get the car into his hands, he can truly see what’s going on. He can hook the car up to his computers and see valve timing, air/fuel ratios, any sensor alarms, etc.

Oh, is that it? We’re all set, dude. We have very verbose logging.

Suuuuuuuure you do. As Merlin Mann puts it, everyone also thinks they’re great at french kissing. Maybe you should call some of your exes just to make sure your perception matches reality.

I’d be willing to guess a fair amount of your “verbose” logging comes from code like this:

try {
    someMethod(someArgument);
} catch (Exception e) {
    log.error("Error occurred", e);
}

… which will get you something as wonderful as this:


May 26 19:10:01 appserver-1.company.com myapp[]: DEBUG [Post] - [Post] - [] - []:Error occurred
[ ... followed by multiple KB of stacktrace ... ]

If by “verbose” you mean “voluminous,” I’ll give you that.

Pop quiz, hot shot: You’ll know what to do with the resulting stacktrace when you’re bleary-eyed and troubleshooting an issue at 3AM, right? Right? Your entire Tech Support staff are Java “Ninjas,” so it’ll never get to you anyway, right?

The awful truth is that not only is the logging pattern above exceedingly common, it ensures that almost no one outside of your Developers can tell you what in the hell is going on.

As some of us here in Texas would say, “That dog won’t hunt.”
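
Contrast that with logging that actually carries context. Below is a minimal sketch (assuming SLF4J-style logging; the class, method, exception, and field names – PostService, savePost, StorageUnavailableException, postId, userId – are hypothetical, not from any real product):

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class PostService {
    private static final Logger log = LoggerFactory.getLogger(PostService.class);

    public void savePost(String postId, String userId) {
        try {
            persist(postId, userId);
        } catch (StorageUnavailableException e) {
            // Say what failed, for whom, and what to check next.
            log.error("Failed to save post {} for user {}: storage backend unreachable. "
                    + "Check connectivity to the posts database before escalating.",
                    postId, userId, e);
        }
    }

    private void persist(String postId, String userId) throws StorageUnavailableException {
        // ... actual persistence logic would live here ...
    }

    // Hypothetical checked exception representing a known, explainable failure mode.
    static class StorageUnavailableException extends Exception {
        StorageUnavailableException(String message) { super(message); }
    }
}

The specific names don’t matter; the point is that the message tells whoever is on the hook at 3AM what broke, for which record, and what to check next – before anyone has to go spelunking through a stacktrace.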

We will put in Supportability features in the next release…

I know. You’d love to add this kind of stuff, but you have code to write, deadlines to meet. I get it. You’re totally swamped. No worries. We’ll get around to making the product supportable by people other than the Developers in the next release. I promise.

If you have the fortune of making a product that is successful, you will wish you had invested in making that product easier to support. Once you ship – and customers ask for more, better, and faster – you won’t have time to right the SS Supportability. She will be horribly off course and headed for a watery grave.

Either do it properly from the start or be willing to suffer the lovely consequences, which include:

  • Pulling your rockstar Devs away from new features to work on issues that can’t be supported by anyone else
  • Causing schedule slips
  • Pissing off your customers
  • Ruining your “brand”

Few companies get Supportability right because it is not a feature; nor does it fit into a User Story. Supportability has to be embedded in the culture of your team. It is not something ScrumA can do and ScrumB can put off until the next release (which is French for “It’ll never happen”). It has to permeate and affect everything you do. 

Supportability: It’s not just for logging anymore!

While I’ve harped on logging here, Supportability applies to anything that makes the product run more smoothly. The reason I’m beating up on logging is that I have supported multiple products and I have yet to see one that really gets this right. Thus, I felt it deserved some air time.

Note, when I say products, I am talking not just about the ones I directly support – though I (perhaps incorrectly) like to think we’ve done a decent job of being better than most. I’m also talking about the infrastructure components (DBs, etc).

Case in point: Have fun hunting down a suspected corrupted key/value pair your app is storing in memcached – especially if you’re using some driver/abstraction layer to manage said key/value pairs (I’m looking at you, Hibernate). If you don’t have the key handy, there’s no really feasible way to find it – especially if you shard across multiple memcached servers.
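
One cheap mitigation – and this is purely a sketch, with a hypothetical CacheClient wrapper rather than an API from memcached or Hibernate – is to log the derived key (and which shard it landed on) at write time, so there’s at least a breadcrumb trail when a value goes sideways:

public class LoggingCacheClient {
    private static final org.slf4j.Logger log =
            org.slf4j.LoggerFactory.getLogger(LoggingCacheClient.class);

    private final CacheClient delegate; // hypothetical underlying driver/abstraction layer

    public LoggingCacheClient(CacheClient delegate) {
        this.delegate = delegate;
    }

    public void put(String entityType, String entityId, Object value) {
        // Derive the key the same way the abstraction layer does (an assumption here).
        String key = entityType + ":" + entityId;
        log.debug("cache write: type={} id={} key={} shard={}",
                entityType, entityId, key, delegate.shardFor(key));
        delegate.set(key, value);
    }

    // Stand-in interface for whatever client you actually use.
    public interface CacheClient {
        void set(String key, Object value);
        String shardFor(String key);
    }
}

It’s not glamorous, but a debug log like that turns “hunt blindly across N shards” into “grep for the ID and go straight to the key.”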

Self-healing/auto-correcting code, proactive alerts [ more on this in a later post ], simpler integrations, more rapid deployment [e.g.: Continuous Integration], less downtime. This all comes together to create the overall experience of supporting the platform, not just servicing it.

This keeps it running and makes it do more, better, faster – hopefully without the need for help from folks like me.