There’s Gold in Them Thar Logs!

Oh crap. Another post about logging…

Yep, get over it. This time, however, I’m not here to piss and moan about how much logging stinks. Rather, I want to show the power of extracting data from the logging you have.

Logstash + ElasticSearch + Kibana = Logging Bliss!

I’m not necessarily treading any new ground here, but for the uninitiated, let me quickly go over what is becoming a very popular logging technology stack in the DevOps space:

  • Logstash
  • ElasticSearch
  • Kibana

I will get into some of the deeper details later, but let’s take a quick tour of the components first.

Logstash

Logstash is the brainchild of Jordan Sissel. At a high level, the goal of Logstash is to chew up your logs and spit out contextualized data from what would otherwise be blobs of text. Said another way, instead of mountains of plain-text log files, you get the same log content split out and stuffed into fields you define. This allows you to do all sorts of things with the logs, not the least of which is to perform focused searches in Kibana for exactly the data you need.

Going with the lumberjack analogy, think of Logstash as your lumber mill that takes your logs and spits out 2x4s, etc. There’s a reason why you don’t see tree trunks at Home Depot. Tree trunks do me no good. Studs and plywood, now that I can work with.

ElasticSearch

In this stack, ElasticSearch operates as our super-fast “database” to hold what Logstash kicks out (technically, it’s a search index, but let’s not nerd out too much here). This description sells both ElasticSearch and Logstash short (Logstash can output in many different formats, not just ElasticSearch), but I’m trying to stick to the solution at hand.

ElasticSearch is based on Apache Lucene – just like Apache Solr – and lives to gobble up data broken into fields, then query large volumes of that data near instantly. It also gives a rich query syntax, which Kibana takes advantage of.

Kibana

Kibana is where we see the fruits of Logstash’s and ElasticSearch’s efforts. Kibana is a slick web UI that queries ElasticSearch for various purposes:

  1. To search logs (again, allowing you to use Lucene search syntax)
  2. To make pretty charts
  3. To show the most common items in logs
  4. etc

As an example of the searching, think of the last time you had to visit your Apache or Tomcat access logs to play Columbo for some issue that happened. I am sure it was an exercise in grep, sed, awk, etc to find that needle in the haystack. Even then, you end up with a lot of stuff that is likely not relevant (heaven forbid you need to look for common words like “post,” “get,” or “user”). Now you can say things like, “Show me all the log statements where”:

  1. ‘user’ shows up in the referring URL
  2. The user got a 500 and the server took more than 500ms to fulfill the request
  3. My user ID exists, my boss’ user ID exists, and the word “raise” was in the request URL ;)
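
To make that concrete, here is roughly how those three wishes might look in Kibana’s query bar, using Lucene syntax against the fields we will carve out of the logs later in this post (the user IDs here are made up, obviously):

  referrer:*user*
  responsecode:500 AND resp_time:[501 TO *]
  userid:(my.id OR the.boss) AND request:*raise*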

And here’s a cool benefit: Normal human beings can now have access to logs with a “type what you’re looking for” interface, so, I dunno, maybe they can leave you alone to do more interesting things?

What is interesting about Kibana is that there is no server-side component: Kibana is all client-side code. All you need is a web server to pass along Kibana’s code to the browser; the browser then issues queries to ElasticSearch directly via ElasticSearch’s APIs.

While Kibana has a built-in dashboard for Logstash that allows you to search your logs, where things really get interesting is applying Kibana’s Panels to display data from your logs. Think of Panels as the Lego from which you construct your dashboard. Behind each of these is an ElasticSearch query that populates that Panel.

While I will cover all of the Panels in my dashboard, here is a teaser:

Screen Shot 2013-11-11 at 8.03.44 AM

My First Dashboard ™

Typically the best way to learn something is through an example. The primary reason I am writing this post is to help others take advantage of this tech stack, which I hope to do by walking through what I did – and shouldn’t have done – to get my Kibana dashboard in place.

Recall that the pipeline is like so:

Logging source (syslog/file/flume/etc) => Logstash => ElasticSearch

“Wait, where’s Kibana?” you might be asking. Remember that Kibana is just the UI to visualize data that Logstash, um, stashed in ElasticSearch. I’ll definitely cover Kibana in detail, but we need to start with Logstash, as that is where we add meaning to the text of our logging. Without that context, Kibana will be of limited use to us. Garbage in, garbage out – right?
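
In Logstash config terms, that pipeline maps directly onto the three sections of a config file. Here is a bare-bones sketch just to show the shape of things – my real config, with its custom filters, comes later in this post:

input {
  syslog {
    port => 514
    type => "syslog"
  }
}

filter {
  grok {
    # a built-in pattern for now; we will swap in custom patterns later
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
}

output {
  elasticsearch { embedded => true }
}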

Magic ain’t Free

Again, I am selling its full capabilities way short, but Logstash’s job here is to take a single line of text and – based on filters, either pre-defined or custom – assign chunks of that text to fields we define. We will use these fields later on not only to more intelligently query our logs, but to target specific data to be shown in Panels in Kibana. Getting there boils down to four steps:

  1. Be the logs
  2. Know the end goal
  3. Grok grok (no, that’s not a typo)
  4. Iterate, because you ain’t gonna get it right the first time

This will take work, but to crib from Alton Brown, “Your hard work will be rewarded.”

Be the Logs

To really get the value out of this stack, you need to know your logs, as well as what you want to get out of them.

At this stage, we just want to look at our target logging and see what kind of data might be useful. So, do the obvious: copy a typical log statement into a text editor and look at it in isolation. Logstash doesn’t have the benefit of context. It’s going to chop up those logs one line at a time, hence I find it best to start my analysis by looking at just one log statement.

As mentioned in #4 above, you are likely to find that your logging is not as “normalized” as you think, so you’ll likely have to revisit this step quite a few times along your journey.

Let’s take an example Apache access_log statement:

10.129.16.62 - - [11/Nov/2013:07:24:35 -0800] "GET /html/themes/default/images/sprites/core_sprite_left.png?t=74 HTTP/1.1" 304 - "https://my-server.cjchand.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36" 354

That single log statement contains the following useful/interesting information (in the order shown):

  • Client IP address
  • Datestamp
  • Request URL
  • HTTP status code
  • Referring URL
  • User Agent string
  • Response time

Of those, are there any items that are throwaways (e.g.: data that would not be interesting/useful as search or graphing criteria)? In this case, I would argue not, but especially with logging in custom applications, you can end up with some items that might be interesting when viewed with the log statement itself, but might not be something that we would need to filter/chart (e.g.: PID).

That said, at this point, err on the side of capturing everything you can. If you ultimately determine you don’t need that data, you can simply bit-bucket it via your grok pattern and/or Logstash config.
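
As a sketch of what that looks like for the access_log line above: Logstash’s built-in COMBINEDAPACHELOG pattern covers everything except the trailing response time, and a mutate can bit-bucket anything we decide we do not need. The resp_time field name is my choice here, and the dropped field is just an example:

filter {
  grok {
    # COMBINEDAPACHELOG handles client IP, timestamp, request, status, size, referrer, and agent;
    # the trailing INT captures the response time tacked onto the end of the line
    match => [ "message", "%{COMBINEDAPACHELOG} %{INT:resp_time}" ]
  }
  mutate {
    # example of bit-bucketing a field we decided we do not need
    remove => [ "ident" ]
  }
}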

Just as it pays to know what meaningful data lives in your logs, it helps to know what Kibana can visualize for you. Here are the Panel types available (at least the non-deprecated ones) at the time I am writing this:

  • Bettermap: Uses Longitude and Latitude coordinates to place a “pin” on a map
  • Column: A “wrapper” panel of sorts; used to control the layout of panels in a dashboard
  • Derivequeries: This panel auto-creates facets for a given field (e.g.: might create multiple queries for each HTTP status code seen in the current data set)
  • Histogram: A chart showing the count, total, mean, etc. of log statements over the time range specified
  • Hits: Total count of lines matching a given query/set of queries
  • Terms: A “Top X” list of the most common terms found in log statements
    • Can be viewed as:
      • Chart: Pie or Bar
      • Table
  • Text: Panel to hold text or sanitized HTML (including in Markdown syntax)
  • Trends: Compares the data shown to the same data from a user-configurable time offset in the past
    • Example: I am looking at 13:00-14:00 today and I want to know – as a percentage – how many more/fewer 404’s I have seen vs. 13:00-14:00 yesterday

Know the End Goal

“Wait. Shouldn’t I start with this step?” Sure, you can definitely make that argument, but in many cases you have little control over what logging you get – which is often the case if you are not the one developing the code or controlling the deployment of the app.

As such, my philosophy is to see what I have to work with first. Then, I set my target for the final product.

YMMV on when you do this step, but you definitely need to perform it before you sit down to craft grok patterns.

Continuing with our Apache access log example, here are some potential target items we would want to craft grok patterns around – and why:

  • Search Filters:
    • All of these fields
  • Charting/Trending items:
    • Client IP address:
      • Top XX IP addresses
      • Geolocation of client
    • HTTP status code:
      • Histogram of unexpected statuses
    • User Agent String:
      • Client OS
      • Client browser
    • Response Time:
      • Histogram of mean response times (yes, I know that’s not statistically the best measure, but it’s what Kibana gives us. There are no confidence intervals, etc.)

Do you Grok Me?

Logstash comes with some built-in filters (called “grok patterns”). These cover common items like IP addresses, timestamps, and even common log types such as httpd access logs.

Using the Lego analogy again, most logs can be parsed using these out-of-the-box grok patterns. You can check out the full list in the Logstash GitHub.

As with most things, odds are your logging might not dovetail perfectly with these patterns and you will have to create your own. But, hey, that’s job security, right? ;)

Creating your own Grok patterns requires some heavy lifting with regex, on which grok patterns are based.

No! Don’t go! We’re just getting to the good part.

I know regex is a four-letter word for most, but in this case, we don’t have to write one massive regex for a given log statement. In fact, depending on your logging data, you likely won’t need to do any regex work at all, as Logstash comes with grok patterns for lots of things you commonly find in logs like:

  • Timestamps (including days, months, etc)
  • IP addresses
  • Class names (com.example.my.custom.class)
  • Quoted Strings (like requested and referring URLs in Apache logs)
  • Usernames
  • UUIDs
  • MAC addresses
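
For example, a custom application log line can often be described just by stitching those built-in patterns together. Something along these lines – this particular pattern and its field names are purely illustrative, not one of the patterns I actually use later:

MYAPP_LOG %{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} \[%{JAVACLASS:classname}\] %{GREEDYDATA:payload}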

In my case, I had to do some customizations because:

  • The built-in grok pattern for “program” – which is a sub-pattern referenced by the SYSLOGBASE pattern – expects a string containing the process ID (PID) like so: processname[PID] (e.g.: httpd[12345])
    • In the case of our Tomcat localhost_access logs, the program name is customized via our syslog config. As a result, the PID is missing (e.g.: accesslog[])
    • It took me a while to figure out, but because of this Logstash would not properly parse my localhost_access logs (see the Iterate section later on :) )
  • While not discussed extensively in this post, I also created a custom grok pattern to understand the logs our Tomcat application generates so I can get things like the userID the log statement pertains to, the Java Classname, etc to help perform more targeted searches

Iterate

Like anything you write to parse strings, you’ll likely hit corner cases. Maybe a field is not populated in some instances, or what would normally be a string ends up being a number, or maybe some mobile device passes a funky User Agent string.

The point is, once you get a set of Grok patterns in place, you will want to review what Logstash gives you to ensure that you’re parsing as many logs as you can.

The reason this is important is that if there is no matching Grok pattern, the entire line of logging will not be split out into fields. While it will still be searchable, you will not get the full benefit of this logging framework.

Going back to my missing PID as an example, this simple deviation caused every single line of my logging not to be parsed. Until I came across that missing PID as the root cause, I was unable to do all of the cool stuff you’ll see later when I discuss Kibana.

Grok Debugger

Unless you’re a regex whiz, you’re likely familiar with using sites like RegexPal to build a regex against a target string. Luckily, there is a similar tool for creating and testing Grok patterns: Grok Debugger.

The really cool thing about Grok Debugger is that it is Grok-aware and understands the built-in Grok patterns (e.g.: IPORHOST, URI, QS [Quoted String], etc). There is even a “discover” feature that will attempt to build a Grok pattern based on a sample line of logging you give it. While not 100% perfect, it does help make quick work of creating a custom Grok pattern, should the need arise.

Bringing it all Home

At this point I will assume you have done the following:

  • Deployed Logstash
  • Started passing logs through Logstash
  • Stored Logstash’s output in ElasticSearch
  • Used Kibana to query ElasticSearch, just to make sure things are working right

If not, here are some links you need to get these items working:

  1. Simple Logstash deployment (which includes an embedded ElasticSearch instance)
  2. Kibana deployment (simply unzip/untar and plunk in the Web Server of your choosing)
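
If it helps, the quick-start version of those two steps looked something like the commands below for me. The exact file names and versions will differ depending on when you download them, so treat these as illustrative rather than gospel:

# Logstash: run the all-in-one "flatjar" as an agent with your config
# (the embedded ElasticSearch comes from the config's output section)
java -jar logstash-1.2.2-flatjar.jar agent -f logstash.conf

# Kibana: unzip into your web server's document root, then point config.js at your ElasticSearch host
unzip kibana-*.zip -d /var/www/html/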

You can certainly get way more elaborate than this, but I’d recommend getting started with the simple version above. I’ve had such a setup ingest ~3GB of logs a day on a plain vanilla CentOS virtual machine with just 2 vCPU and 4GB of RAM – and it barely broke a sweat.

IOW, crawl before you walk ;)

With that PSA out of the way, now comes the real work:

  1. Creating and verifying your customized Logstash config
  2. Creating and falling in love with your Kibana Dashboard

My Logstash Config

Note that I will not go into all of the ins and outs of Logstash file config as there is plenty of info on that across the Intert00bz, but here are some sources to get you started:

I will try – but likely not succeed – to not rehash too much of the content above as I explain why my finished Logstash config file looks like this:

input {
  syslog {
    host => "0.0.0.0"
    port => 514
    type => "My Custom App"
  }
}

filter {
  grok {
    patterns_dir => "/opt/logstash/patterns/"
    match => [ "message", "%{TOMCAT_ACCESS}", "message", "%{APP_SERVER}" ]
  }
  mutate {
    convert => [ "responsecode", "integer", "responsesize", "integer", "resp_time", "integer", "geoip.latitude", "float", "geoip.longitude", "float" ]
  }
  if [user_agent] {
    useragent {
      source => "user_agent"
      target => "ua"
    }
  }
  if [sourceip] !~ "(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" and [sourceip] {
    geoip {
      source => "sourceip"
    }
    mutate {
      # 'coords' will be kept, 'tmplat' is temporary.
      # Both of these new fields are strings.
      add_field => [ "coords", "%{[geoip][longitude]}",
                     "tmplat", "%{[geoip][latitude]}" ]
    }
    mutate {
      # Merge 'tmplat' into 'coords'
      merge => [ "coords", "tmplat" ]
    }
    mutate {
      # Convert our new array of strings back to float
      convert => [ "coords", "float" ]
      # Delete our temporary latitude field
      remove => [ "tmplat" ]
    }
  }
}

output {
  elasticsearch { embedded => true }
}

… and here is the file containing my custom TOMCAT_ACCESS and APP_SERVER Grok patterns:

SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:
SYSLOGPROG %{PROG:program}(?:\[%{DATA:pid}\])?
TOMCAT_ACCESS %{SYSLOGBASE} %{IP:sourceip} - - %{DATA:sessionid} \[%{HAPROXYDATE}\] %{QS:request} %{INT:responsecode} %{DATA:responsesize} %{QS:referrer} %{QS:user_agent} %{INT:resp_time}
APP_SERVER %{SYSLOGBASE} %{DATA:level}  \[%{DATA:classname}\] - \[%{DATA:threadid}\] - \[%{DATA:userid}\] - \[%{DATA:sessionid}\]: %{GREEDYDATA:payload}

While I certainly could have put the patterns directly in the Logstash config file itself, using a separate patterns file makes the config easier to read and manage.

The reason I had to create a pattern file was twofold:

  1. I had to override the built-in SYSLOGPROG – and by extension, SYSLOGBASE – patterns because of our goofiness in not spitting out a PID along with the program name (e.g.: httpd:[])
  2. Also, since our formatting of our Tomcat Access Logs deviated from the norm, I had to create a custom filter for that

Um, WTF is the “geoip” and “coords” stuff in there?

I briefly mentioned that we can GeoLocate clients based on their IP address. This magic comes courtesy of Logstash, which ships with a built-in DB that maps IP addresses to physical locations – Longitude and Latitude coordinates, to be exact. These coordinates are based on data from the various Internet Registries (ARIN, RIPE, etc).

This leads to an issue for “internal” applications: what to do about private IP addresses? Obviously, those will not be in the various Internet Registries, and unfortunately, we can’t customize the mappings either. Logstash gets this Geolocation data from the kind folks at MaxMind, and in order to keep this free (as in beer), the Geolocation DB is in MaxMind’s proprietary format. More on handling internal IPs coming up in a bit.

Back to the Geolocation DB: The proprietary, binary format means we can’t (not easily, at least) use a customized version of the DB to specify where private subnets actually physically reside within our organization. Of course, Logstash is Open Source, thus you could implement support for the CSV formatted DBs MaxMind provides, but that’s a topic for another post.

I will start off by saying that the vast majority of the Geolocation config was blatantly stolen from Bryan White’s post, which you can find here. The short version is that you have to do some conversion of the coordinates to get them into the format that Kibana needs for the Bettermap panel. The really short version is: just copy and paste what he has kindly provided and move on :)

Now, that just leaves the matter of how to handle internal IP addresses. Logstash recently added the idea of “conditionals” in its config files. This allows you to perform actions against a log statement only if certain conditions are met.

In this case, we want to skip Geolocation processing if either of these situations are present in a given log statement:

  • The sourceip field is missing
  • The value of the sourceip field falls into the range of internal IP addresses (e.g.: 10.x.x.x, 172.16.x.x through 172.31.x.x and 192.168.x.x)

The reason we want to skip Geolocation for these is the coordinates end up being 0,0 – which is somewhere West of Africa in the middle of the Atlantic. Not terribly useful data and it makes us wonder why so many people are visiting our site from the middle of the ocean ;)

As a result, I have the entire block of Bryan’s Geolocation config sitting inside this conditional:
if [sourceip] !~ "(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" and [sourceip]

What follows the if is a regex match. It is likely self-explanatory, but just in case, it translates to:

  • If the field sourceip:
    • Is not in the range of private IP addresses
    • and
    • Is present (which is why you see [sourceip] after the and near the end of the statement)

… then go ahead and do Geolocation for this log statement. This is done via the geoip filter, and the output of geoip is then reformatted by the mutate filters to get it into the format Kibana expects.

The reason for checking that sourceip is present in the first place is that I am receiving logs from all manner of sources (e.g.: messages logs, etc), many of which will not have an IP address in them. If I did not filter these out, I would get a lot of log statements with coordinates of 0,0. I learned that the hard way (see the “Iterate” section above).

The end result of this config is:

  • I am parsing Tomcat Access Logs and my application’s custom logs to get custom fields, some of which I defined (e.g.: userID) and some of which leverage built-in field names within Logstash (e.g.: user_agent)
  • All other logs (e.g.: messages, cron, etc) are simply stored as they are received with no special field creation, as I typically don’t have a need to search and/or chart on these
    • Note: Logstash still creates some fields, depending on the log type
    • For example, it will recognize cron logs and set the program field to “cron”
  • The output is going into the embedded ElasticSearch

Kibana Dashboard

At this point, I have laid the groundwork to get the fields necessary to populate my Kibana Dashboard. Again, this Dashboard is comprised of Panels, each of which will look for the presence of a given field for a document in ElasticSearch. Based on the contents of that field – and the type of panel – Kibana will visualize that data for us.

For example, the Bettermap panel will look for the value of the coords field in each record it receives. Then, based on that value, it will place a “pin” in the location corresponding to that value. A Terms panel can be configured to look at things like sourceip and give me a list of the top X (e.g.: 10) most common source IP addresses found in the logs, along with how many times that IP address appeared in the logs for a given time frame.

Before I get to the shiny, let me detail each of the panels in my Dashboard so you can better understand the resulting charts, etc.

  • “Events Over Time”: This is a Histogram panel that gives me the total number of log statements (stored as “documents” in ElasticSearch). Depending on the time range chosen, Kibana will aggregate these. For example, if you look at an hour’s worth of logs, it will aggregate the total number of documents by chunks of 30 seconds, then graph that aggregated count. Note: This is a raw count of all logging seen for the given time frame, regardless of log source.

Screen Shot 2013-11-11 at 8.03.44 AM

  • “Mean Response Time”: This is the mean value of the response time, as found in only the Tomcat Access Logs. When you configure a Panel to look at the target field, that Panel will formulate its query appropriately. For example, the “Mean Response Time” Panel is only concerned about documents in ElasticSearch where the resp_time field (which is one I custom-defined in my patterns file) is present. The result is that the Panel issues a query like this to ElasticSearch to get the data it will graph:
curl -XGET 'http://10.194.190.245:9200/logstash-2013.12.05/_search?pretty' -d '{
  "facets": {
    "2": {
      "date_histogram": {
        "key_field": "@timestamp",
        "value_field": "resp_time",
        "interval": "30s"
      },
      "facet_filter": {
        "fquery": {
          "query": {
            "filtered": {
              "query": {
                "query_string": {
                  "query": "*"
                }
              },
              "filter": {
                "bool": {
                  "must": [
                    {
                      "match_all": {}
                    },
                    {
                      "range": {
                        "@timestamp": {
                          "from": 1386258436590,
                          "to": "now"
                        }
                      }
                    },
                    {
                      "bool": {
                        "must": [
                          {
                            "match_all": {}
                          }
                        ]
                      }
                    }
                  ]
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}'

Screen Shot 2013-12-05 at 4.00.58 PM

  • “Most Frequent IPs”: As the title implies, this is the Top X (25, in this case) most frequently seen IP addresses, based on the sourceip field. As is the case with many of the fields in this Dashboard, this field is only present in Tomcat Access Logs.

Most Frequent IPs - redacted

  • “Top Programs”: This Panel uses Logstash’s built-in capability to discern which application wrote a given log statement (e.g.: cron, httpd, sshd, monit, etc)

Screen Shot 2013-11-11 at 8.02.01 AM

  • “HTTP Response Codes”: This is a histogram charting the (hopefully low) total number of HTTP response codes that are > 400, with the idea being to see how many “errors” are returned to the users.

Screen Shot 2013-11-11 at 8.02.09 AM

  • “Trends”: Using the Trends Panel, this compares the number of >300 Response Codes seen in this time frame to the same time frame in a previous period. In my case, it is 7 days ago, as my default view shows the previous hour. I chose 7 days to avoid situations where I would compare a weekday to a day on the weekend, as usage is mainly during the work week in my case.

Note that the Trends Panel is only as valuable as the data you feed it. If you have a small number of hits for a given term, you can see massive jumps in percentage compared to your trend target.

As you can see from the example below, I need to re-evaluate the usefulness of this panel. But, I did say to iterate, right? :)

Screen Shot 2013-12-05 at 4.03.22 PM

  • “Browsers”: This is a pie chart breaking down the ratio of various client browsers, based on the User Agent string.
  • “Operating Systems”: Another pie chart to show the breakdown of client OSes, again based on the User Agent string.

Screen Shot 2013-10-30 at 3.20.39 PM

  • “HTTP Errors”: A pie chart showing a breakdown of the same data shown in the HTTP Response Codes Histogram.

Screen Shot 2013-12-05 at 4.04.24 PM

  • “Server Activity”: This Terms Panel shows the relative amount of logging from each of the node types in the distributed application I am monitoring (e.g.: Tomcat, memcached, MongoDB, etc)

Screen Shot 2013-12-05 at 4.05.04 PM

  • “Client Locations”: A Bettermap Panel that uses the Geolocation coords field to show where client requests are physically coming from

Screen Shot 2013-11-11 at 8.03.55 AMScreen Shot 2013-12-05 at 4.05.44 PM

  • “Logs”: A Table Panel that shows the @timestamp, message, and sourceip fields. Clicking any row in this table expands to show all of the fields related to this log statement. This is part of the canned Logstash dashboard that ships with Kibana.

Screen Shot 2013-11-11 at 8.04.08 AM

Enough, already. Just show me the whole Dashboard!

OK, OK. Geez.

The final result is below. It is admittedly a bit busy, but I’ve built it to be as much a tech demo as a useful tool.

Kibana 3

I hope this post helps you create a useful dashboard of your own!

Update: Adding Kibana Template

I’ve had a few folks ask for the Kibana template used for the dashboards above. So, here it is! Hope you find it useful.


{
 "title": "example-dashboard",
 "services": {
 "query": {
 "idQueue": [
 1,
 4,
 7,
 8,
 9,
 10
 ],
 "list": {
 "0": {
 "alias": "200s",
 "query": "responsecode:[200 TO 299]",
 "id": 0,
 "color": "#7EB26D",
 "pin": false,
 "type": "lucene"
 },
 "2": {
 "id": 2,
 "color": "#6ED0E0",
 "query": "*",
 "alias": "All Requests",
 "pin": false,
 "type": "lucene"
 },
 "3": {
 "alias": "300s",
 "query": "responsecode:[300 TO 399]",
 "id": 3,
 "color": "#EF843C",
 "pin": false,
 "type": "lucene"
 },
 "5": {
 "alias": "400s",
 "query": "responsecode:[400 TO 499]",
 "id": 5,
 "color": "#1F78C1",
 "pin": false,
 "type": "lucene"
 },
 "6": {
 "id": 6,
 "color": "#BA43A9",
 "query": "responsecode:[500 TO 599]",
 "alias": "500s",
 "pin": false,
 "type": "lucene"
 }
 },
 "ids": [
 0,
 3,
 5,
 6,
 2
 ]
 },
 "filter": {
 "idQueue": [
 1,
 2,
 3
 ],
 "list": {
 "0": {
 "type": "time",
 "field": "@timestamp",
 "from": "now-1h",
 "to": "now",
 "mandate": "must",
 "active": true,
 "alias": "",
 "id": 0
 }
 },
 "ids": [
 0
 ]
 }
 },
 "rows": [
 {
 "title": "Graph",
 "height": "350px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "span": 12,
 "editable": true,
 "group": [
 "default"
 ],
 "type": "histogram",
 "mode": "count",
 "time_field": "@timestamp",
 "value_field": null,
 "auto_int": true,
 "resolution": 100,
 "interval": "30s",
 "fill": 3,
 "linewidth": 3,
 "timezone": "browser",
 "spyable": true,
 "zoomlinks": true,
 "bars": false,
 "stack": false,
 "points": false,
 "lines": true,
 "legend": true,
 "x-axis": true,
 "y-axis": true,
 "percentage": true,
 "interactive": true,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "title": "Events over time",
 "intervals": [
 "auto",
 "1s",
 "1m",
 "5m",
 "10m",
 "30m",
 "1h",
 "3h",
 "12h",
 "1d",
 "1w",
 "1M",
 "1y"
 ],
 "options": true,
 "tooltip": {
 "value_type": "cumulative",
 "query_as_alias": false
 }
 }
 ],
 "notice": false
 },
 {
 "title": "Average Response Size",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "span": 12,
 "editable": true,
 "type": "histogram",
 "loadingEditor": false,
 "mode": "mean",
 "time_field": "@timestamp",
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "value_field": "resp_time",
 "auto_int": true,
 "resolution": 100,
 "interval": "30s",
 "intervals": [
 "auto",
 "1s",
 "1m",
 "5m",
 "10m",
 "30m",
 "1h",
 "3h",
 "12h",
 "1d",
 "1w",
 "1M",
 "1y"
 ],
 "fill": 0,
 "linewidth": 3,
 "timezone": "browser",
 "spyable": true,
 "zoomlinks": true,
 "bars": false,
 "stack": false,
 "points": false,
 "lines": true,
 "legend": true,
 "x-axis": true,
 "y-axis": true,
 "percentage": false,
 "interactive": true,
 "options": true,
 "tooltip": {
 "value_type": "cumulative",
 "query_as_alias": false
 },
 "title": "Mean Response Time"
 }
 ],
 "notice": false
 },
 {
 "title": "Errors",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "error": false,
 "span": 8,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "all",
 "ids": [
 0,
 3,
 5,
 2,
 6
 ]
 },
 "field": "sourceip",
 "exclude": [
 "127.0.0.1",
 "10.42.7.5",
 "10.42.7.6"
 ],
 "missing": false,
 "other": false,
 "size": 25,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "bar",
 "counter_pos": "above",
 "spyable": true,
 "title": "Most Frequent IPs"
 },
 {
 "error": false,
 "span": 4,
 "editable": true,
 "type": "pie",
 "loadingEditor": false,
 "query": {
 "field": "program",
 "goal": 100
 },
 "queries": {
 "mode": "all",
 "ids": [
 0,
 3,
 5,
 2,
 6
 ]
 },
 "size": 10,
 "exclude": [
 "su"
 ],
 "donut": false,
 "tilt": false,
 "legend": "above",
 "labels": true,
 "mode": "terms",
 "default_field": "DEFAULT",
 "spyable": true,
 "title": "Top Programs"
 }
 ],
 "notice": false
 },
 {
 "title": "Response Codes",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "span": 8,
 "editable": true,
 "type": "histogram",
 "loadingEditor": false,
 "mode": "count",
 "time_field": "@timestamp",
 "queries": {
 "mode": "selected",
 "ids": [
 3,
 5
 ]
 },
 "value_field": "responsecode",
 "auto_int": true,
 "resolution": 100,
 "interval": "30s",
 "intervals": [
 "auto",
 "1s",
 "1m",
 "5m",
 "10m",
 "30m",
 "1h",
 "3h",
 "12h",
 "1d",
 "1w",
 "1M",
 "1y"
 ],
 "fill": 0,
 "linewidth": 3,
 "timezone": "browser",
 "spyable": true,
 "zoomlinks": true,
 "bars": false,
 "stack": true,
 "points": false,
 "lines": true,
 "legend": true,
 "x-axis": true,
 "y-axis": true,
 "percentage": false,
 "interactive": true,
 "options": false,
 "tooltip": {
 "value_type": "cumulative",
 "query_as_alias": false
 },
 "title": "HTTP Response Codes"
 },
 {
 "span": 2,
 "editable": true,
 "type": "trends",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 3,
 5,
 2,
 6
 ]
 },
 "style": {
 "font-size": "16pt"
 },
 "ago": "1d",
 "arrangement": "vertical",
 "spyable": true,
 "title": "Trends"
 },
 {
 "error": false,
 "span": 2,
 "editable": true,
 "type": "derivequeries",
 "loadingEditor": false,
 "loading": false,
 "label": "Search",
 "query": "*",
 "ids": [
 0,
 1,
 3,
 4,
 5
 ],
 "field": "responsecode",
 "fields": [
 "@timestamp",
 "@version",
 "classname",
 "device",
 "facility",
 "facility_label",
 "haproxy_hour",
 "haproxy_milliseconds",
 "haproxy_minute",
 "haproxy_month",
 "haproxy_monthday",
 "haproxy_second",
 "haproxy_time",
 "haproxy_year",
 "host",
 "level",
 "logsource",
 "major",
 "message",
 "minor",
 "name",
 "os",
 "os_major",
 "os_minor",
 "os_name",
 "patch",
 "payload",
 "pid",
 "priority",
 "program",
 "referrer",
 "request",
 "resp_time",
 "responsecode",
 "responsesize",
 "sessionid",
 "severity",
 "severity_label",
 "sourceip",
 "tags",
 "threadid",
 "timestamp",
 "type",
 "ua.device",
 "ua.major",
 "ua.minor",
 "ua.name",
 "ua.os",
 "ua.os_major",
 "ua.os_minor",
 "ua.os_name",
 "ua.patch",
 "user_agent",
 "userid"
 ],
 "spyable": true,
 "rest": false,
 "size": 5,
 "mode": "terms only",
 "exclude": [],
 "history": [
 "*",
 "206"
 ],
 "remember": 10,
 "title": "HTTP Response Codes"
 }
 ],
 "notice": false
 },
 {
 "title": "OS and Browser Info",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "field": "ua.name",
 "exclude": [
 "other"
 ],
 "missing": false,
 "other": false,
 "size": 10,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "pie",
 "counter_pos": "above",
 "spyable": true,
 "title": "Browsers"
 },
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "field": "ua.os_name",
 "exclude": [
 "other",
 "x",
 "os",
 "7",
 "8",
 "10",
 "xp",
 "vista"
 ],
 "missing": false,
 "other": false,
 "size": 10,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "pie",
 "counter_pos": "above",
 "spyable": true,
 "title": "Operating Systems"
 },
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "pie",
 "loadingEditor": false,
 "query": {
 "field": "responsecode",
 "goal": 100
 },
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "size": 10,
 "exclude": [
 "200",
 "304",
 "302"
 ],
 "donut": false,
 "tilt": false,
 "legend": "above",
 "labels": true,
 "mode": "terms",
 "default_field": "DEFAULT",
 "spyable": true,
 "title": "HTTP Statuses"
 },
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "field": "logsource",
 "exclude": [
 "example.com",
 "1",
 "2",
 "3",
 "server"
 ],
 "missing": false,
 "other": false,
 "size": 10,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "pie",
 "counter_pos": "above",
 "spyable": true,
 "title": "Server Activity"
 }
 ],
 "notice": false
 },
 {
 "title": "Logs",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "error": false,
 "span": 12,
 "editable": true,
 "type": "table",
 "loadingEditor": false,
 "status": "Stable",
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "size": 100,
 "pages": 5,
 "offset": 0,
 "sort": [
 "@timestamp",
 "desc"
 ],
 "group": "default",
 "style": {
 "font-size": "9pt"
 },
 "overflow": "min-height",
 "fields": [
 "@timestamp",
 "message",
 "sourceip"
 ],
 "highlight": [],
 "sortable": true,
 "header": true,
 "paging": true,
 "field_list": true,
 "all_fields": true,
 "trimFactor": 300,
 "normTimes": true,
 "spyable": true,
 "title": "Logs"
 }
 ],
 "notice": false
 }
 ],
 "editable": true,
 "failover": false,
 "index": {
 "interval": "day",
 "pattern": "[logstash-]YYYY.MM.DD",
 "default": "NO_TIME_FILTER_OR_INDEX_PATTERN_NOT_MATCHED"
 },
 "style": "light",
 "panel_hints": true,
 "loader": {
 "save_gist": false,
 "save_elasticsearch": true,
 "save_local": true,
 "save_default": true,
 "save_temp": true,
 "save_temp_ttl_enable": true,
 "save_temp_ttl": "30d",
 "load_gist": true,
 "load_elasticsearch": true,
 "load_elasticsearch_size": 20,
 "load_local": true,
 "hide": false
 },
 "pulldowns": [
 {
 "type": "query",
 "collapse": false,
 "notice": false,
 "enable": true,
 "query": "*",
 "pinned": true,
 "history": [
 "responsecode:[500 TO 599]",
 "responsecode:[400 TO 499]",
 "responsecode:[300 TO 399]",
 "*",
 "responsecode:[200 TO 299]",
 "responsecode:>500",
 "responsecode:\"403\"",
 "responsecode:\"404\"",
 "responsecode:\"201\"",
 "responsecode:\"206\""
 ],
 "remember": 10
 },
 {
 "type": "filtering",
 "collapse": false,
 "notice": true,
 "enable": true
 }
 ],
 "nav": [
 {
 "type": "timepicker",
 "collapse": false,
 "notice": false,
 "enable": true,
 "status": "Stable",
 "time_options": [
 "5m",
 "15m",
 "1h",
 "6h",
 "12h",
 "24h",
 "2d",
 "7d",
 "30d"
 ],
 "refresh_intervals": [
 "5s",
 "10s",
 "30s",
 "1m",
 "5m",
 "15m",
 "30m",
 "1h",
 "2h",
 "1d"
 ],
 "timefield": "@timestamp",
 "now": true,
 "filter_id": 0
 }
 ],
 "refresh": "5m"
}

Defuse your landmines… before it is too late

Anyone involved with development, support – heck even usage – of software knows that there are always little “gotchas” that you learn to avoid. We should eject thumb drives before removing them. Don’t delete your System or Windows directory. Or the dreaded “Saving… Don’t power off your console” message.

Some of these make sense or are pretty obvious. Also, there’s only so much you can do to prevent a user from shooting themselves in the foot. It is not feasible – at least from a cost/benefit standpoint – to design a game console that is capable of gracefully recovering your save if you kill the power mid-flight, just to cite one example.

But, there are lots of other scenarios where there is much to be lost if things go bad. As totally pissed as I would be if my Forza save went “poof,” I believe I’d still live. If I lose all of my family pictures from the past decade because of a bug or a rare caveat, that’s a totally different story.

“Hello. My name is Chris and I’m a landmine denier.”

Admit it: You know your software has these types of problems. There are areas where we can reasonably expect users to tread, but we do not put up the proper fencing. Things where it’s not a matter of if it will happen – it’s a matter of when. In a way, I shouldn’t complain. It’s crap like this that keeps me employed.

That said, I view my job as making myself unemployed. My software should be so simple, so robust, so intuitive that you don’t need a geek like me to deploy, maintain, or troubleshoot it… or at least fewer geeks like me :) [I’ll let you in on a little secret: we geeks would rather be solving different, more interesting problems, anyway.]

Landmines: The Home Version

I’ll offer up a non-tech example of this. We’ve lived in our current house for over 2 years. Thanks to our <sarcasm>lovely HOA</sarcasm>, our trash cans can’t be left by the alley. So, we’ve kept them on one side of the driveway as a “solution.” Simple enough and we meet the letter of the law to keep the HOA police off our back.

Because of this, parking a car on that side of the garage required pretty precise driving – both coming and going. I knew that some day, likely when we were in a hurry, one of us was going to get it wrong. Put the car into the wall of the garage, run over the trash cans, whatever.

But, I knew the caveat. I had “documented” the workaround: Just don’t run into the garage. Simple, right?

Did I put any effort into a proper solution? Absolutely! I thought about digging out a space on the other side of the driveway, installing some bricks (we have a stack taking up room in the garage, anyway), and parking the cans over there. That would leave the driveway 100% clear.

OK… But did I do anything about it? Well, not really. I “didn’t have time.” There were other, more important things going on.

That said, I did finally manage to find time a couple of weekends ago… only after the side view mirror was ripped off my wife’s car. Amazing how things like that suddenly change our priorities (more on this later).

Rear View Mirror Damage 2 (Not actually my mirror, but you get the point…)

Now, not only did I install the bricks – which I should have done months ago, back when it wasn’t ~100 degrees outside – but I had to drop about $200 in parts to replace the mirror.

Bringing it Back to Tech

We all know of – or even create and support – software that has similar landmines in it. Sure, we can train people to step around these landmines. While the audience is small, this is manageable.

What happens when your product is the wonderful success you want it to be? Does a model that requires special insider info or meticulous attention to details in documentation bode for a great experience in the field?

In my trash can situation, I thought I had it all under control. Things went well for a long time. So long, in fact, that I convinced myself that implementing a proper fix wasn’t a priority. “Nothing has blown up so far. I probably don’t need to dig up the yard, after all.”

But, as I was picking up pieces of broken mirror from the driveway, I kicked myself for not acting sooner. Moreover, I had to rearrange my weekend to complete the work in a couple of days, whereas I could have taken more time (and likely done a better job) had I started earlier and paced myself.

Think back in your career and I am sure you can find examples of this in your product’s history. A major customer hits a critical issue, despite it being “well known” or even documented, which causes an all-hands-on-deck situation. New feature development grinds to a halt… and we were already behind on new features, because we’re human and we all suck at estimation.

“A stitch in time…,” “An ounce of prevention…”, etc. Yes, they’re all cliché, but that doesn’t make them any less true.

The point is: You can bury your skeletons before they are found or while a customer is beating you over the head with a femur. You make the call!

No, your developers should not be writing their own logging

… at least not totally on their own.

4047355843_1493968b1f_o (Image courtesy of Michael Mol)

Self-fulfilling prophecy

Maybe I missed the memo, but am I the only person who thinks Devs writing logs that only they can decipher is a bad idea? Seems so, because on the web and in books, I see statements like this:

Application debug logging is a special type of logging that is useful to application/system developers and not system operators. Such logging is typically disabled in production systems but can be enabled on request. Many of the messages in debugging logs can already be analyzed by the application developer with full knowledge of application internals and sometimes also with possession of application source code.

With all due respect to Chuvakin, et al., their statement above reflects how most people feel about debug logging:

“The Devs are the only ones who understand this, so let them put whatever they want in there. I sure as hell won’t understand it. It’ll end up in their lap, anyway.”

I can’t scream loud enough how horribly inefficient this is. It creates a self-fulfilling prophecy where people assume that Devs are the only ones who can troubleshoot their own code. So, why even try, right?

The reality is that properly informed support staff are more than capable of debugging many issues, even without access to the source code… or the humans who wrote it.

But, you have to give them enough info to do their jobs.

Am I saying proper logging will eliminate the need to pull in a Dev from time-to-time? No, but ignoring it ensures they’ll be needed much more than they should.

Devs as Support – AKA: You’re doing it wrong

You can get away with the “Devs as support” philosophy when you work in a startup. In fact, it’s actually a necessity. Justin is the node.js guy, so he does the back-end stuff. He’s a bit of a sysadmin, too. His stuff lives on the servers, after all. All the backend stuff will come to Justin. Let Justin write what makes sense to him.

Once you get into a larger-scale operation, however, people must specialize in order to gain efficiencies of scale.

If I run a 100-person Dev shop, I don’t want Justin debugging backend issues! I want him focused on new code. I want my support staff to take the first crack at figuring out the issue. They can pull him in, if needed.

In order to get to the ideal state – “Devs as Devs” mode – any support staff needs the following to show value:

  1. Logs they can understand, without having to break out the source code to do so
  2. System- and Code-level design documentation to help them understand what the code is doing, without having to break out the source code to do so
  3. Commonly hit issues with simple and speedy resolutions, so even people outside Dev and support can potentially self-service, without having to break out the source code to do so

Noticing a trend?

No, really. You. Are. Doing. It. Wrong.

Perpetuating this pattern creates (at a minimum) the following issues:

  1. It ensures that your devs will be pulled off of new, cool stuff to work on old, boring code
    1. It’s not if – it’s when
    2. Working on old code is tantamount to working in a coal mine for a developer
    3. And, no, troubleshooting someone else’s code is not a good way to bring up junior devs – see 1.1… plus you’re likely perpetuating bad habits
  2. It turns your support staff into “log proxies” (“I don’t know what the hell ‘PC Load Letter’ means. Guess I’ll punt this off to Justin.”)
  3. It dramatically increases time-to-resolution
    1. If there are 5 back-end issues going on at the same time, Justin can’t t-shoot them all simultaneously
    2. … and good luck getting Anne to jump in and figure out Justin’s code, because you didn’t have time to create that System- and Code-level design documentation, right?

My experience has been that Developers take one of three routes when generating logging:

  1. They dump out a full stacktrace for everything (as discussed in an earlier rant) – after all, it tells them where in the code we were when the system crapped itself
  2. They log something so generic as to be non-informational (“Error occurred”)
  3. They log nothing at all, which leads to vague runtime errors (see below)

The problem with #1 is that logging typically has a significant performance impact. So, kicking out dozens of KB of logging – when you typically only need the first few lines of the stack – decreases scalability. Or worse, the logging cripples the system so much you can’t turn it on in the first place.

On #2, why bother? You haven’t really given me anything, except that something bad happened. I believe the screaming customer on the other end was able to relay that message just fine.

OK, smart guy. So I have to teach the support team to code, then?

If you are going to log anything, why not log some basic info about the issue? Here is an example of a log statement that a support team has a valid chance at solving:

User phillp.j.fry is unable to access Post 478989456 - insufficient permissions

The support team’s success depends on the aforementioned documentation. If I have a documented DB schema that tells me how the code looks up permissions, my support staff can at least check the DB to ensure there’s no corruption (or anything else unexpected). Without this, they’re flying blind and will likely be Justin’s cube-buddy for a while.

Runtime logging should tell me all I need to know

For #3, sadly we tend not to get logging like the example above. Instead, let’s look at this real world example of what we typically see – taken from Tomcat’s Bugzilla:

If there is no file CATALINA_BASE/conf/logging.properties, then Tomcat will not start and the JVM outputs the following error:

Exception in thread "main" java.lang.NoClassDefFoundError:

without any additional information.

Sure, the error is technically correct: there was no logging class referenced, thus the JVM couldn’t find that class. But, it gives no one – even the Devs – any idea as to what the code is upset about. If the person who opened this bug didn’t already tell us where the issue was, we could be spending hours trying to figure this out.

Point being:
  1. If the code is looking for a file and it can’t find it, log the path and file
  2. If the code is looking for a property and can’t find it, log the missing property name
  3. If the code… you get the point – give me an idea of what went wrong, and I can likely leave the Devs alone
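
To make that concrete with the Tomcat case above, a made-up – i.e., not what Tomcat actually prints – but far more helpful version of that same failure might read:

Unable to start: logging configuration file /opt/tomcat/conf/logging.properties does not exist (path was set via -Djava.util.logging.config.file)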

Simple, silly example. But, if things this basic are left undone, what do you think the logging of the more complex stuff looks like?

Logging as a User Story

Sadly, even though this stuff is painfully obvious, it is also rarely done.

Treat your logging like it is a User Story – and your support staff are the stakeholders. After all, they are the ones consuming your logging… that is, unless you want your Devs doing all your customer support. Maybe there’s a co-dependency issue I need to explore there… Hmmm.

Anyway, sorry I can’t elaborate more. I have to go interview a replacement for Justin. He just stormed out of the office shouting something like “Unsustainable” or whatever. Feh. Whatevs.