There’s Gold in Them Thar Logs!

Oh crap. Another post about logging…

Yep, get over it. This time, however, I’m not here to piss and moan about how much logging stinks. Rather, I want to show the power of extracting data from the logging you have.

Logstash + ElasticSearch + Kibana = Logging Bliss!

I’m not necessarily treading any new ground here, but for the uninitiated, let me quickly go over what is becoming a very popular logging technology stack in the DevOps space:

  • Logstash
  • ElasticSearch
  • Kibana

I will get into some of the deeper details later, but let’s take a quick tour of the components first.

Logstash

Logstash is the brainchild of Jordan Sissel. At a high level, the goal of Logstash is to chew up your logs and spit out contextualized data from what would otherwise be blobs of text. Said another way, instead of mountains of plain-text log files, you get the same log content split out and stuffed into fields you define. This allows you to do all sorts of things with the logs, not the least of which is to perform focused searches in Kibana for exactly the data you need.

Going with the lumberjack analogy, think of Logstash as your lumber mill that takes your logs and spits out 2x4s, etc. There’s a reason why you don’t see tree trunks at Home Depot. Tree trunks do me no good. Studs and plywood, now that I can work with.

ElasticSearch

In this stack, ElasticSearch operates as our super-fast “database” to hold what Logstash kicks out (technically, it’s a search index, but let’s not nerd out too much here). This description sells both ElasticSearch and Logstash short (Logstash can output in many different formats, not just ElasticSearch), but I’m trying to stick to the solution at hand.

ElasticSearch is based on Apache Lucene – just like Apache Solr – and lives to gobble up data broken into fields, then query large volumes of that data near instantly. It also gives a rich query syntax, which Kibana takes advantage of.

Kibana

Kibana is where we see the fruits of Logstash’s and ElasticSearch’s efforts. Kibana is a slick web UI that queries ElasticSearch for various purposes:

  1. To search logs (again, allowing you to use Lucene search syntax)
  2. To make pretty charts
  3. To show the most common items in logs
  4. etc

As an example of the searching, think of the last time you had to visit your Apache or Tomcat access logs to play Columbo for some issue that happened. I am sure it was an exercise in grep, sed, awk, etc. to find that needle in the haystack. Even then, you end up with a lot of stuff that is likely not relevant (heaven forbid you need to look for common words like “post,” “get,” or “user”). Now you can say things like, “Show me all the log statements where”:

  1. ‘user’ shows up in the referring URL
  2. The user got a 500 and the server took more than 500ms to fulfill the request
  3. My user ID exists, my boss’ user ID exists, and the word “raise” was in the request URL ;)

And here’s a cool benefit: Normal human beings can now have access to logs with a “type what you’re looking for” interface, so, I dunno, maybe they can leave you alone to do more interesting things?
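To make that concrete, the second search above might look something like this in Kibana’s Lucene query syntax. This is just a sketch using the field names I define later in this post (responsecode and resp_time); your field names may differ:

responsecode:500 AND resp_time:>500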

What is interesting about Kibana is that there is not a server-side component. Kibana is all client-side code. All you need is a web server to pass along Kibana’s code to the browser, then the browser issues queries to ElasticSearch directly via ElasticSearch’s APIs.

While Kibana has a built-in dashboard for Logstash that allows you to search your logs, where things really get interesting is applying Kibana’s Panels to display data from your logs. Think of Panels as the Lego from which you construct your dashboard. Behind each of these is an ElasticSearch query that populates that Panel.

While I will cover all of the Panels in my dashboard, here is a teaser:

[Screenshot: a teaser of the finished Kibana dashboard]

My First Dashboard ™

Typically the best way to learn something is through an example. The primary reason I am writing this post is to help others take advantage of this tech stack, which I hope to do by walking through what I did – and shouldn’t have done – to get my Kibana dashboard in place.

Recall that the pipeline is like so:

Logging source (syslog/file/flume/etc) => Logstash => ElasticSearch

“Wait, where’s Kibana?” you might be asking. Remember that Kibana is just the UI to visualize data that Logstash, um, stashed in ElasticSearch. I’ll definitely cover Kibana in detail, but we need to start with Logstash, as that is where we add meaning to the text of our logging. Without that context, Kibana will be of limited use to us. Garbage in, garbage out – right?

Magic ain’t Free

Again, I am selling its full capabilities way short, but Logstash’s job here is to take a single line of text and – based on filters, either pre-defined or custom – assign chunks of that text to fields we define. We will use these fields later on not only to more intelligently query our logs, but to target specific data to be shown in Panels in Kibana. Getting there boils down to four steps:

  1. Be the logs
  2. Know the end goal
  3. Grok grok (no, that’s not a typo)
  4. Iterate, because you ain’t gonna get it right the first time

This will take work, but to crib from Alton Brown, “Your hard work will be rewarded.”

Be the Logs

To really get the value out of this stack, you need to know your logs, as well as what you want to get out of them.

At this stage, we just want to look at our target logging and see what kind of data might be useful. So, do the obvious: copy a typical log statement into a text editor and look at it in isolation. Logstash doesn’t have the benefit of context. It’s going to chop up those logs one line at a time, hence I find it best to start my analysis by looking at just one log statement.

As mentioned in #4 above, you are likely to find that your logging is not as “normalized” as you think, so you’ll likely have to revisit this step quite a few times along your journey.

Let’s take an example Apache access_log statement:

10.129.16.62 - - [11/Nov/2013:07:24:35 -0800] "GET /html/themes/default/images/sprites/core_sprite_left.png?t=74 HTTP/1.1" 304 - "https://my-server.cjchand.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36" 354

There is the following useful/interesting information in that single log statement (in order shown):

  • Client IP address
  • Datestamp
  • Request URL
  • HTTP status code
  • Referring URL
  • User Agent string
  • Response time

Of those, are there any items that are throwaways (e.g.: data that would not be interesting/useful as search or graphing criteria)? In this case, I would argue not, but especially with logging in custom applications, you can end up with some items that might be interesting when viewed with the log statement itself, but might not be something that we would need to filter/chart (e.g.: PID).

That said, at this point, err on the side of capturing everything you can. If you ultimately determine you don’t need that data, you can simply bit-bucket it via your grok pattern and/or Logstash config.
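For example, if you later decide a captured field (say, pid) is not worth keeping, a mutate filter can drop it – the same remove option you will see in my config further down. A minimal sketch:

mutate {
  # Drop a field we captured but decided we don't need
  remove => [ "pid" ]
}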

Just as it helps to know what meaningful data lives in your logs, it helps to know what Kibana can visualize for you. Here are the Panel types available (at least the non-deprecated ones) at the time I am writing this:

  • Bettermap: Uses Longitude and Latitude coordinates to place a “pin” on a map
  • Column: A “wrapper” panel of sorts; used to control the layout of panels in a dashboard
  • Derivequeries: This panel auto-creates facets for a given field (e.g.: might create multiple queries for each HTTP status code seen in the current data set)
  • Histogram: A chart showing the count, total, mean, etc. of log statements in the time range specified
  • Hits: Total count of lines matching a given query/set of queries
  • Terms: A “Top X” list of the most common terms found in log statements
    • Can be viewed as:
      • Chart: Pie or Bar
      • Table
  • Text: Panel to hold text or sanitized HTML (including in Markdown syntax)
  • Trends: Based on a user-configurable time frame, compares the data shown to the same data across the specified timeframe
    • Example: I am looking at 13:00-14:00 and I want to know – as a percentage – how many more/fewer 404s I have seen from 13:00-14:00 today vs. 13:00-14:00 yesterday

Know the End Goal

“Wait. Shouldn’t I start with this step?” Sure, you can definitely make that argument, but in many cases you have little control over what logging you get – which is often the case if you are not the one developing the code or controlling the deployment of the app.

As such, my philosophy is to see what I have to work with first. Then, I set my target for the final product.

YMMV on when you do this step, but you definitely need to perform it before you sit down to craft grok patterns.

Continuing with our Apache access log example, here are some potential target items we would want to craft grok patterns around – and why:

  • Search Filters:
    • All of these fields
  • Charting/Trending items:
    • Client IP address:
      • Top XX IP addresses
      • Geolocation of client
    • HTTP status code:
      • Histogram of unexpected statuses
    • User Agent String:
      • Client OS
      • Client browser
    • Response Time:
      • Histogram of mean response times (yes, I know that’s not statistically the best measure, but it’s what Kibana gives us. There are no confidence intervals, etc.)

Do you Grok Me?

Logstash comes with a set of built-in patterns (called “grok patterns”). These cover common items like IP addresses and timestamps, and even entire log formats such as httpd access logs.

Using the Lego analogy again, most logs can be parsed using these out-of-the-box grok patterns. You can check out the full list in the Logstash GitHub.
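For instance, a standard Apache combined-format access log (like the example earlier, minus the trailing response time our logs tack on) can be matched with the built-in COMBINEDAPACHELOG pattern. A minimal filter sketch:

filter {
  grok {
    # COMBINEDAPACHELOG ships with Logstash and splits out fields like
    # clientip, timestamp, request, response, referrer, and agent
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
}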

As with most things, odds are your logging might not dovetail perfectly with these patterns and you will have to create your own. But, hey, that’s job security, right? ;)

Creating your own Grok patterns requires some heavy lifting with regex, on which grok patterns are based.

No! Don’t go! We’re just getting to the good part.

I know regex is a four-letter word for most, but in this case, we don’t have to write one massive regex for a given log statement. In fact, depending on your logging data, you likely won’t need to do any regex work at all, as Logstash comes with grok patterns for lots of things you commonly find in logs like:

  • Timestamps (including days, months, etc)
  • IP addresses
  • Class names (com.example.my.custom.class)
  • Quoted Strings (like requested and referring URLs in Apache logs)
  • Usernames
  • UUIDs
  • MAC addresses

In my case, I had to do some customizations because:

  • The built-in grok pattern for “program” – which is a sub-pattern referenced by the SYSLOGBASE pattern – expects a string containing the process ID (PID) like so: processname[PID] (e.g.: httpd[12345])
    • In the case of our Tomcat localhost_access logs, the program name is customized via our syslog config. As a result, the PID is missing (e.g.: accesslog[])
    • It took me a while to figure out, but because of this, Logstash would not properly parse my localhost_access logs (see the Iterate section later on :) )
  • While not discussed extensively in this post, I also created a custom grok pattern to understand the logs our Tomcat application generates, so I can get things like the userID the log statement pertains to, the Java classname, etc., to help perform more targeted searches

Iterate

Like anything you write to parse strings, you’ll likely hit corner cases. Maybe a field is not populated in some instances, or what would normally be a string ends up being a number, or maybe some mobile device passes a funky User Agent string.

The point is, once you get a set of Grok patterns in place, you will want to review what Logstash gives you to ensure that you’re parsing as many logs as you can.

The reason why this is important is that if there is not a matching Grok pattern, the entire line of logging will not be split out. While it will still be searchable, you will not get the full benefit of this logging framework.
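A handy way to spot-check this: when the grok filter cannot match a line, Logstash tags that event with _grokparsefailure by default. A quick Kibana query along these lines surfaces everything that slipped past your patterns:

tags:"_grokparsefailure"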

Going back to my missing PID as an example, this simple deviation caused every single line of my logging not to be parsed. Until I came across that missing PID as the root cause, I was unable to do all of the cool stuff you’ll see later when I discuss Kibana.

Grok Debugger

Unless you’re a regex whiz, you’re likely familiar with using sites like RegexPal to build a regex against a target string. Luckily, there is a similar tool for creating and testing Grok patterns: Grok Debugger.

The really cool thing about Grok Debugger is that it is Grok-aware and understands the built-in Grok patterns (e.g.: IPORHOST, URI, QS [Quoted String], etc). There is even a “discover” feature that will attempt to build a Grok pattern based on a sample line of logging you give it. While not 100% perfect, it does help make quick work of creating a custom Grok pattern, should the need arise.

Bringing it all Home

At this point I will assume you have done the following:

  • Deployed Logstash
  • Started passing logs through Logstash
  • Stored Logstash’s output in ElasticSearch
  • Used Kibana to query ElasticSearch, just to make sure things are working right

If not, here are some links you need to get these items working:

  1. Simple Logstash deployment (which includes an embedded ElasticSearch instance)
  2. Kibana deployment (simply unzip/untar and plunk in the Web Server of your choosing)

You can certainly get way more elaborate than this, but I’d recommend getting started with the simple version above. I’ve had such a setup ingest ~3GB of logs a day on a plain vanilla CentOS virtual machine with just 2 vCPU and 4GB of RAM – and it barely broke a sweat.
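If it helps, the simple deployment really can be a one-liner. With the all-in-one “flatjar” from this era, something along these lines starts the agent with your config, the embedded ElasticSearch instance, and the bundled Kibana web UI (the version number is illustrative – use whatever you downloaded):

java -jar logstash-1.2.2-flatjar.jar agent -f logstash.conf -- web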

IOW, crawl before you walk ;)

With that PSA out of the way, now comes the real work:

  1. Creating and verifying your customized Logstash config
  2. Creating and falling in love with your Kibana Dashboard

My Logstash Config

Note that I will not go into all of the ins and outs of Logstash file config as there is plenty of info on that across the Intert00bz, but here are some sources to get you started:

I will try – but likely not succeed – to not rehash too much of the content above as I explain why my finished Logstash config file looks like this:

input {
  # Receive syslog messages from the app servers
  syslog {
    host => "0.0.0.0"
    port => 514
    type => "My Custom App"
  }
}

filter {
  # Apply my custom patterns (defined in the patterns file shown below)
  grok {
    patterns_dir => "/opt/logstash/patterns/"
    match => [ "message", "%{TOMCAT_ACCESS}", "message", "%{APP_SERVER}" ]
  }
  # Convert numeric fields so Kibana can do math on them
  mutate {
    convert => [ "responsecode", "integer", "responsesize", "integer", "resp_time", "integer", "geoip.latitude", "float", "geoip.longitude", "float" ]
  }
  # Only parse the User Agent string if one was captured
  if [user_agent] {
    useragent {
      source => "user_agent"
      target => "ua"
    }
  }
  # Skip Geolocation for missing or private IP addresses (more on this below)
  if [sourceip] !~ "(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" and [sourceip] {
    geoip {
      source => "sourceip"
    }
    mutate {
      # 'coords' will be kept, 'tmplat' is temporary.
      # Both of these new fields are strings.
      add_field => [ "coords", "%{[geoip][longitude]}",
                     "tmplat", "%{[geoip][latitude]}" ]
    }
    mutate {
      # Merge 'tmplat' into 'coords'
      merge => [ "coords", "tmplat" ]
    }
    mutate {
      # Convert our new array of strings back to float
      convert => [ "coords", "float" ]
      # Delete our temporary latitude field
      remove => [ "tmplat" ]
    }
  }
}

output {
  elasticsearch { embedded => true }
}

… and here is the file containing my custom TOMCAT_ACCESS and APP_SERVER Grok patterns:

SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:
SYSLOGPROG %{PROG:program}(?:\[%{DATA:pid}\])?
TOMCAT_ACCESS %{SYSLOGBASE} %{IP:sourceip} - - %{DATA:sessionid} \[%{HAPROXYDATE}\] %{QS:request} %{INT:responsecode} %{DATA:responsesize} %{QS:referrer} %{QS:user_agent} %{INT:resp_time}
APP_SERVER %{SYSLOGBASE} %{DATA:level}  \[%{DATA:classname}\] - \[%{DATA:threadid}\] - \[%{DATA:userid}\] - \[%{DATA:sessionid}\]: %{GREEDYDATA:payload}

While I certainly could have put the patterns directly in the Logstash config file itself, using a separate patterns file makes the config easier to read and manage.

The reason I had to create a pattern file was twofold:

  1. I had to override the built-in SYSLOGPROG – and by extension, SYSLOGBASE – patterns because of our goofiness in not spitting out a PID along with the program name (e.g.: httpd:[])
  2. Also, since our formatting of our Tomcat Access Logs deviated from the norm, I had to create a custom filter for that

Um, WTF is the “geoip” and “coords” stuff in there?

I briefly mentioned that we can GeoLocate clients based on their IP address. This magic comes courtesy of Logstash, which ships with a built-in DB that maps IP addresses to physical locations – Longitude and Latitude coordinates, to be exact. These coordinates are based on data from the various Internet Registries (ARIN, RIPE, etc.).

This leads to an issue for “internal” applications: What to do about private IP addresses? Obviously, those will not be in the various Internet Registries. Unfortunately, we can’t customize these, either. Logstash gets this Geolocation data from the kind folks at MaxMind. In order to keep this free (as in beer), the Geolocation DB is in MaxMind’s proprietary format. More on handling internal IPs coming up in a bit.

Back to the Geolocation DB: The proprietary, binary format means we can’t (not easily, at least) use a customized version of the DB to specify where private subnets actually physically reside within our organization. Of course, Logstash is Open Source, thus you could implement support for the CSV formatted DBs MaxMind provides, but that’s a topic for another post.

I will start off by saying that the vast majority of the Geolocation config was blatantly stolen from Bryan White’s post you can find here. The short version is that you have to do some conversion of the coordinates to get them into the format that Kibana needs for the Bettermap panel. The really short version is to just copy and paste what he has kindly provided and move on :)

Now, that just leaves the matter of how to handle internal IP addresses. Logstash recently added the idea of “conditionals” in its config files. This allows you to perform actions against a log statement only if certain conditions are met.

In this case, we want to skip Geolocation processing if either of these situations are present in a given log statement:

  • The sourceip field is missing
  • The value of the sourceip field falls into the range of internal IP addresses (e.g.: 10.x.x.x, 172.16.x.x through 172.31.x.x and 192.168.x.x)

The reason we want to skip Geolocation for these is the coordinates end up being 0,0 – which is somewhere West of Africa in the middle of the Atlantic. Not terribly useful data and it makes us wonder why so many people are visiting our site from the middle of the ocean ;)

As a result, I have the entire block of Bryan’s Geolocation config sitting inside this conditional:
if [sourceip] !~ "(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" and [sourceip]

What follows the if is a regex, which is likely self-explanatory, but just in case, it translates to:

  • If the field sourceip:
    • Is not in the range of private IP addresses
    • and
    • Is present (which is why you see [sourceip] after the and near the end of the statement)

… then go ahead and do Geolocation for this log statement. This is done via the geoip filter, whose output is then reformatted by the mutate filters to get it into the format Kibana expects.

The reason for checking that sourceip is present in the first place is because I am receiving logs from all manner of locations (e.g.: messages logs, etc), many of which will not have an IP address in them. If I did not filter these out, I would get a lot of log statements with coordinates of 0,0. I learned that the hard way (see the “Iterate” section above).

The end result of this config is:

  • I am parsing Tomcat Access Logs and my application’s custom logs to get custom fields, some of which I defined (e.g.: userID) and some of which leverage built-in field names within Logstash (e.g.: user_agent)
  • All other logs (e.g.: messages, cron, etc) are simply stored as they are received with no special field creation, as I typically don’t have a need to search and/or chart on these
    • Note: Logstash still creates some fields, depending on the log type
    • For example, it will recognize cron logs and set the program field to “cron”
  • The output is going into the embedded ElasticSearch

Kibana Dashboard

At this point, I have laid the groundwork to get the fields necessary to populate my Kibana Dashboard. Again, this Dashboard is comprised of Panels, each of which will look for the presence of a given field for a document in ElasticSearch. Based on the contents of that field – and the type of panel – Kibana will visualize that data for us.

For example, the Bettermap panel will look for the value of the coords field in each record it receives. Then, based on that value, it will place a “pin” in the location corresponding to that value. A Terms panel can be configured to look at things like sourceip and give me a list of the top X (e.g.: 10) most common source IP addresses found in the logs, along with how many times that IP address appeared in the logs for a given time frame.
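Under the hood, a Terms panel like that issues a terms facet to ElasticSearch. Here is a rough sketch of what such a query looks like (host and index name are placeholders; the style mirrors the Mean Response Time query shown below):

curl -XGET 'http://localhost:9200/logstash-2013.12.05/_search?pretty' -d '{
  "facets": {
    "terms": {
      "terms": { "field": "sourceip", "size": 25, "order": "count" }
    }
  },
  "size": 0
}'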

Before I get to the shiny, let me detail each of the panels in my Dashboard so you can better understand the resulting charts, etc.

  • “Events Over Time”: This is a Histogram panel that gives me the total number of log statements (stored as “documents” in ElasticSearch). Depending on the time range chosen, Kibana will aggregate these. For example, if you look at an hour’s worth of logs, it will aggregate the total number of documents by chunks of 30 seconds, then graph that aggregated count. Note: This is a raw count of all logging seen for the given time frame, regardless of log source.

[Screenshot: the Events Over Time panel]

  • “Mean Response Time”: This is the mean value of the response time, as found in only the Tomcat Access Logs. When you configure a Panel to look at the target field, that Panel will formulate its query appropriately. For example, the “Mean Response Time” Panel is only concerned about documents in ElasticSearch where the resp_time field (which is one I custom-defined in my patterns file) is present. The result is that the Panel issues a query like this to ElasticSearch to get the data it will graph:
curl -XGET 'http://10.194.190.245:9200/logstash-2013.12.05/_search?pretty' -d '{
  "facets": {
    "2": {
      "date_histogram": {
        "key_field": "@timestamp",
        "value_field": "resp_time",
        "interval": "30s"
      },
      "facet_filter": {
        "fquery": {
          "query": {
            "filtered": {
              "query": {
                "query_string": {
                  "query": "*"
                }
              },
              "filter": {
                "bool": {
                  "must": [
                    {
                      "match_all": {}
                    },
                    {
                      "range": {
                        "@timestamp": {
                          "from": 1386258436590,
                          "to": "now"
                        }
                      }
                    },
                    {
                      "bool": {
                        "must": [
                          {
                            "match_all": {}
                          }
                        ]
                      }
                    }
                  ]
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}'

[Screenshot: the Mean Response Time panel]

  • “Most Frequent IPs”: As the title implies, this is the Top X (25, in this case) most frequently seen IP addresses, based on the sourceip field. As is the case with many of the fields in this Dashboard, this field is only present in Tomcat Access Logs.

[Screenshot: the Most Frequent IPs panel (redacted)]

  • “Top Programs”: This Panel uses Logstash’s built-in capability to discern which application wrote a given log statement (e.g.: cron, httpd, sshd, monit, etc)

[Screenshot: the Top Programs panel]

  • “HTTP Response Codes”: This is a histogram charting the (hopefully low) total number of HTTP response codes that are > 400, with the idea being to see how many “errors” are returned to the users.

[Screenshot: the HTTP Response Codes panel]

  • “Trends”: Using the Trends Panel, this compares the number of >300 Response Codes seen in this time frame to the same time frame in a previous period. In my case, it is 7 days ago, as my default view shows the previous hour. I chose 7 days to avoid situations where I would compare a weekday to a day on the weekend, as usage is mainly during the work week in my case.

Note that the Trends Panel is only as valuable as the data you feed it. If you have a small number of hits for a given term, you can see massive jumps in percentage compared to your trend target.

As you can see from the example below, I need to re-evaluate the usefulness of this panel. But, I did say to iterate, right? :)

[Screenshot: the Trends panel]

  • “Browsers”: This is a pie chart breaking down the ratio of various client browsers, based on the User Agent string.
  • “Operating Systems”: Another pie chart to show the breakdown of client OSes, again based on the User Agent string.

[Screenshot: the Browsers and Operating Systems panels]

  • “HTTP Errors”: A pie chart showing a breakdown of the same data shown in the HTTP Response Codes Histogram.

[Screenshot: the HTTP Errors panel]

  • “Server Activity”: This Terms Panel shows the relative amount of logging from each of the node types in the distributed application I am monitoring (e.g.: Tomcat, memcached, MongoDB, etc)

[Screenshot: the Server Activity panel]

  • “Client Locations”: A Bettermap Panel that uses the Geolocation coords field to show where client requests are physically coming from

[Screenshots: the Client Locations (Bettermap) panel]

  • “Logs”: A Table Panel that shows the @timestamp, message, and sourceip fields. Clicking any row in this table expands to show all of the fields related to this log statement. This is part of the canned Logstash dashboard that ships with Kibana.

[Screenshot: the Logs panel]

Enough, already. Just show me the whole Dashboard!

OK, OK. Geez.

The final result is below. It is admittedly a bit busy, but I’ve built it to be as much a tech demo as a useful tool.

[Screenshot: the complete Kibana dashboard]

I hope this post helps you create a useful dashboard of your own!

Update: Adding Kibana Template

I’ve had a few folks ask for the Kibana template used for the dashboards above. So, here it is! Hope you find it useful.

{
 "title": "example-dashboard",
 "services": {
 "query": {
 "idQueue": [
 1,
 4,
 7,
 8,
 9,
 10
 ],
 "list": {
 "0": {
 "alias": "200s",
 "query": "responsecode:[200 TO 299]",
 "id": 0,
 "color": "#7EB26D",
 "pin": false,
 "type": "lucene"
 },
 "2": {
 "id": 2,
 "color": "#6ED0E0",
 "query": "*",
 "alias": "All Requests",
 "pin": false,
 "type": "lucene"
 },
 "3": {
 "alias": "300s",
 "query": "responsecode:[300 TO 399]",
 "id": 3,
 "color": "#EF843C",
 "pin": false,
 "type": "lucene"
 },
 "5": {
 "alias": "400s",
 "query": "responsecode:[400 TO 499]",
 "id": 5,
 "color": "#1F78C1",
 "pin": false,
 "type": "lucene"
 },
 "6": {
 "id": 6,
 "color": "#BA43A9",
 "query": "responsecode:[500 TO 599]",
 "alias": "500s",
 "pin": false,
 "type": "lucene"
 }
 },
 "ids": [
 0,
 3,
 5,
 6,
 2
 ]
 },
 "filter": {
 "idQueue": [
 1,
 2,
 3
 ],
 "list": {
 "0": {
 "type": "time",
 "field": "@timestamp",
 "from": "now-1h",
 "to": "now",
 "mandate": "must",
 "active": true,
 "alias": "",
 "id": 0
 }
 },
 "ids": [
 0
 ]
 }
 },
 "rows": [
 {
 "title": "Graph",
 "height": "350px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "span": 12,
 "editable": true,
 "group": [
 "default"
 ],
 "type": "histogram",
 "mode": "count",
 "time_field": "@timestamp",
 "value_field": null,
 "auto_int": true,
 "resolution": 100,
 "interval": "30s",
 "fill": 3,
 "linewidth": 3,
 "timezone": "browser",
 "spyable": true,
 "zoomlinks": true,
 "bars": false,
 "stack": false,
 "points": false,
 "lines": true,
 "legend": true,
 "x-axis": true,
 "y-axis": true,
 "percentage": true,
 "interactive": true,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "title": "Events over time",
 "intervals": [
 "auto",
 "1s",
 "1m",
 "5m",
 "10m",
 "30m",
 "1h",
 "3h",
 "12h",
 "1d",
 "1w",
 "1M",
 "1y"
 ],
 "options": true,
 "tooltip": {
 "value_type": "cumulative",
 "query_as_alias": false
 }
 }
 ],
 "notice": false
 },
 {
 "title": "Average Response Size",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "span": 12,
 "editable": true,
 "type": "histogram",
 "loadingEditor": false,
 "mode": "mean",
 "time_field": "@timestamp",
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "value_field": "resp_time",
 "auto_int": true,
 "resolution": 100,
 "interval": "30s",
 "intervals": [
 "auto",
 "1s",
 "1m",
 "5m",
 "10m",
 "30m",
 "1h",
 "3h",
 "12h",
 "1d",
 "1w",
 "1M",
 "1y"
 ],
 "fill": 0,
 "linewidth": 3,
 "timezone": "browser",
 "spyable": true,
 "zoomlinks": true,
 "bars": false,
 "stack": false,
 "points": false,
 "lines": true,
 "legend": true,
 "x-axis": true,
 "y-axis": true,
 "percentage": false,
 "interactive": true,
 "options": true,
 "tooltip": {
 "value_type": "cumulative",
 "query_as_alias": false
 },
 "title": "Mean Response Time"
 }
 ],
 "notice": false
 },
 {
 "title": "Errors",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "error": false,
 "span": 8,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "all",
 "ids": [
 0,
 3,
 5,
 2,
 6
 ]
 },
 "field": "sourceip",
 "exclude": [
 "127.0.0.1",
 "10.42.7.5",
 "10.42.7.6"
 ],
 "missing": false,
 "other": false,
 "size": 25,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "bar",
 "counter_pos": "above",
 "spyable": true,
 "title": "Most Frequent IPs"
 },
 {
 "error": false,
 "span": 4,
 "editable": true,
 "type": "pie",
 "loadingEditor": false,
 "query": {
 "field": "program",
 "goal": 100
 },
 "queries": {
 "mode": "all",
 "ids": [
 0,
 3,
 5,
 2,
 6
 ]
 },
 "size": 10,
 "exclude": [
 "su"
 ],
 "donut": false,
 "tilt": false,
 "legend": "above",
 "labels": true,
 "mode": "terms",
 "default_field": "DEFAULT",
 "spyable": true,
 "title": "Top Programs"
 }
 ],
 "notice": false
 },
 {
 "title": "Response Codes",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "span": 8,
 "editable": true,
 "type": "histogram",
 "loadingEditor": false,
 "mode": "count",
 "time_field": "@timestamp",
 "queries": {
 "mode": "selected",
 "ids": [
 3,
 5
 ]
 },
 "value_field": "responsecode",
 "auto_int": true,
 "resolution": 100,
 "interval": "30s",
 "intervals": [
 "auto",
 "1s",
 "1m",
 "5m",
 "10m",
 "30m",
 "1h",
 "3h",
 "12h",
 "1d",
 "1w",
 "1M",
 "1y"
 ],
 "fill": 0,
 "linewidth": 3,
 "timezone": "browser",
 "spyable": true,
 "zoomlinks": true,
 "bars": false,
 "stack": true,
 "points": false,
 "lines": true,
 "legend": true,
 "x-axis": true,
 "y-axis": true,
 "percentage": false,
 "interactive": true,
 "options": false,
 "tooltip": {
 "value_type": "cumulative",
 "query_as_alias": false
 },
 "title": "HTTP Response Codes"
 },
 {
 "span": 2,
 "editable": true,
 "type": "trends",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 3,
 5,
 2,
 6
 ]
 },
 "style": {
 "font-size": "16pt"
 },
 "ago": "1d",
 "arrangement": "vertical",
 "spyable": true,
 "title": "Trends"
 },
 {
 "error": false,
 "span": 2,
 "editable": true,
 "type": "derivequeries",
 "loadingEditor": false,
 "loading": false,
 "label": "Search",
 "query": "*",
 "ids": [
 0,
 1,
 3,
 4,
 5
 ],
 "field": "responsecode",
 "fields": [
 "@timestamp",
 "@version",
 "classname",
 "device",
 "facility",
 "facility_label",
 "haproxy_hour",
 "haproxy_milliseconds",
 "haproxy_minute",
 "haproxy_month",
 "haproxy_monthday",
 "haproxy_second",
 "haproxy_time",
 "haproxy_year",
 "host",
 "level",
 "logsource",
 "major",
 "message",
 "minor",
 "name",
 "os",
 "os_major",
 "os_minor",
 "os_name",
 "patch",
 "payload",
 "pid",
 "priority",
 "program",
 "referrer",
 "request",
 "resp_time",
 "responsecode",
 "responsesize",
 "sessionid",
 "severity",
 "severity_label",
 "sourceip",
 "tags",
 "threadid",
 "timestamp",
 "type",
 "ua.device",
 "ua.major",
 "ua.minor",
 "ua.name",
 "ua.os",
 "ua.os_major",
 "ua.os_minor",
 "ua.os_name",
 "ua.patch",
 "user_agent",
 "userid"
 ],
 "spyable": true,
 "rest": false,
 "size": 5,
 "mode": "terms only",
 "exclude": [],
 "history": [
 "*",
 "206"
 ],
 "remember": 10,
 "title": "HTTP Response Codes"
 }
 ],
 "notice": false
 },
 {
 "title": "OS and Browser Info",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "field": "ua.name",
 "exclude": [
 "other"
 ],
 "missing": false,
 "other": false,
 "size": 10,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "pie",
 "counter_pos": "above",
 "spyable": true,
 "title": "Browsers"
 },
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "field": "ua.os_name",
 "exclude": [
 "other",
 "x",
 "os",
 "7",
 "8",
 "10",
 "xp",
 "vista"
 ],
 "missing": false,
 "other": false,
 "size": 10,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "pie",
 "counter_pos": "above",
 "spyable": true,
 "title": "Operating Systems"
 },
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "pie",
 "loadingEditor": false,
 "query": {
 "field": "responsecode",
 "goal": 100
 },
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "size": 10,
 "exclude": [
 "200",
 "304",
 "302"
 ],
 "donut": false,
 "tilt": false,
 "legend": "above",
 "labels": true,
 "mode": "terms",
 "default_field": "DEFAULT",
 "spyable": true,
 "title": "HTTP Statuses"
 },
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "field": "logsource",
 "exclude": [
 "example.com",
 "1",
 "2",
 "3",
 "server"
 ],
 "missing": false,
 "other": false,
 "size": 10,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "pie",
 "counter_pos": "above",
 "spyable": true,
 "title": "Server Activity"
 }
 ],
 "notice": false
 },
 {
 "title": "Logs",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "error": false,
 "span": 12,
 "editable": true,
 "type": "table",
 "loadingEditor": false,
 "status": "Stable",
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "size": 100,
 "pages": 5,
 "offset": 0,
 "sort": [
 "@timestamp",
 "desc"
 ],
 "group": "default",
 "style": {
 "font-size": "9pt"
 },
 "overflow": "min-height",
 "fields": [
 "@timestamp",
 "message",
 "sourceip"
 ],
 "highlight": [],
 "sortable": true,
 "header": true,
 "paging": true,
 "field_list": true,
 "all_fields": true,
 "trimFactor": 300,
 "normTimes": true,
 "spyable": true,
 "title": "Logs"
 }
 ],
 "notice": false
 }
 ],
 "editable": true,
 "failover": false,
 "index": {
 "interval": "day",
 "pattern": "[logstash-]YYYY.MM.DD",
 "default": "NO_TIME_FILTER_OR_INDEX_PATTERN_NOT_MATCHED"
 },
 "style": "light",
 "panel_hints": true,
 "loader": {
 "save_gist": false,
 "save_elasticsearch": true,
 "save_local": true,
 "save_default": true,
 "save_temp": true,
 "save_temp_ttl_enable": true,
 "save_temp_ttl": "30d",
 "load_gist": true,
 "load_elasticsearch": true,
 "load_elasticsearch_size": 20,
 "load_local": true,
 "hide": false
 },
 "pulldowns": [
 {
 "type": "query",
 "collapse": false,
 "notice": false,
 "enable": true,
 "query": "*",
 "pinned": true,
 "history": [
 "responsecode:[500 TO 599]",
 "responsecode:[400 TO 499]",
 "responsecode:[300 TO 399]",
 "*",
 "responsecode:[200 TO 299]",
 "responsecode:>500",
 "responsecode:\"403\"",
 "responsecode:\"404\"",
 "responsecode:\"201\"",
 "responsecode:\"206\""
 ],
 "remember": 10
 },
 {
 "type": "filtering",
 "collapse": false,
 "notice": true,
 "enable": true
 }
 ],
 "nav": [
 {
 "type": "timepicker",
 "collapse": false,
 "notice": false,
 "enable": true,
 "status": "Stable",
 "time_options": [
 "5m",
 "15m",
 "1h",
 "6h",
 "12h",
 "24h",
 "2d",
 "7d",
 "30d"
 ],
 "refresh_intervals": [
 "5s",
 "10s",
 "30s",
 "1m",
 "5m",
 "15m",
 "30m",
 "1h",
 "2h",
 "1d"
 ],
 "timefield": "@timestamp",
 "now": true,
 "filter_id": 0
 }
 ],
 "refresh": "5m"
}

Why Latency – not Bandwidth – is Your Page Load Time Enemy

Preface

If you’re one of my non-nerd friends reading this, you might want to bail now. What follows is likely to not be your cup of tea.

tl;dr

If you are a nerd, yet don’t want to wade through the information below, here is a summary. Assuming all other variables (bandwidth, RTT, browser, etc) remain constant, you can use the following rules-of-thumb:
  • Page load times (PLT) grow linearly as Round Trip Time (RTT) increases
    • In the case of this study, it grows roughly according to this formula: PLT = 10 x RTT + 2 seconds – YMMV depending on the page in question
    • The exact formula depends on the web app in question and the browser used, but the (near) linearity is universal*
  • For modern web apps – that is, ones that use various, typically large JavaScript libraries and/or large, non-cacheable assets like images that change often – clients should have a bare minimum of 512kbps of bandwidth to/from the servers hosting the app/assets
    • Preferably, 1Mbps or greater for the best experience

* Note: Nothing is universal. I’m painting with a relatively broad brush, so please hold the “Yeah, but…” comments. I know there’s always other variables to consider and corner cases to be found. I’m trying to keep this as “simple” as I can :)

Overview

The intent of this Post is to get some high-level insight into how latency (Round Trip Time, or RTT) and bandwidth affect page load performance for modern web apps.

This is not a new topic, per se. I’m not claiming to plant any flags here. The extra piece I have to add – which I haven’t had luck in finding, at least not all in one place – is understanding why we see these results… not just that we see them, period.

Also, I plan to follow this up with a study on how these two variables – RTT and bandwidth – interact. In the real world, there are various “profiles” of these combinations, such as:
  • Branch Office: Low RTT but limited – and contentious – bandwidth
  • Teleworker: Low/Mid RTT and limited bandwidth (often without any QoS)
  • Coffee Shop WiFi: Mid RTT and highly contentious, relatively low bandwidth
  • Hotel WiFi (AKA: damn near a 56k modem): Mid RTT and significantly oversubscribed bandwidth
  • Overseas worker: Plenty of bandwidth, but 150ms+ latency

Testing Methodology

I took measurements of various load times and subtracted server-side delay.
The items measured were:
  • Server-side response time (server-side delay)
  • Time to fully download the framework HTML
    • This particular web app uses Facebook’s BigPipe framework, which helps improve perceived performance on “slower” browsers (read: IE 8 and lower)
  • Time to the DOMContentLoaded event
  • Time to the Page Load event
Note that the Page Load event may or may not correlate with the end-user’s actual perceived page load time, but should be pretty close. To truly accurately measure that, we would need to use an automation framework like Selenium, which is outside the scope of this exercise.

These times were obtained via the built-in Dev Tools for Chrome and Safari, and via Firebug for Firefox.

The variables changed to observe their effect were:
  • Round Trip Time (RTT): From 0 – 300ms added delay
  • Bandwidth: From 64k – unrestricted WiFi (again, exact bandwidth won’t matter, as you’ll read later)

These variables were manipulated using Charles Proxy. Unfortunately, Charles Proxy does not have a facility to add random packet loss, so this was not tested.

Also, I opted to use WiFi, as this is the connection medium of choice for most end users. Short of adding a touch of latency, it is not expected that WiFi would introduce any significant delays. I did measure the OS-level RTT times, which were ~59ms to the server(s). Keep that in mind when looking at the charts below. In other words, where you see 50ms in the RTT table below, that is actually closer to 110ms RTT in real life. Baseline (shown as 0), in this case, is ~59ms.

Summary

Unsurprisingly, the results mimicked prior studies on this subject. Similar findings from Googler Ilya Grigorik are posted here, which themselves are similar to this post by Stuart Cheshire back in 1996.

So, we’re not talking about a recent discovery. In the age of modern web applications, however, it has become an increasingly important one.

At the risk of spoiling the rest of this Post, in general:
  • Page load times increase linearly when RTT is increased
  • Page load times increase on a curve when bandwidth is reduced

What I hope to add to the conversation is to detail some of the reasons why these “truths” are, well, truths.

I will also see what rules-of-thumb we might be able to come up with regarding how these variables affect page load times.

Finally, I will also look at different browsers to see how they affect page load times.

Let’s get to it.

RTT

Here is a graph of the page load times in Chrome as RTT was increased:

[Chart: Chrome page load times vs. RTT]

You can see that the longer RTT affects the speed at which we get the initial page, but not very much. It does, however, affect times to load the DOM and – most importantly – the page as a whole.

The question is: Why?

This is because the first page you receive has references to tons of other items such as JavaScript and images. As RTT increases, the browser spends more of its time waiting for each request to be fulfilled. This, in turn, delays when other resources can be downloaded.

This screenshot from Firebug shows the times needed to request and receive various items on the page:

[Screenshot: Firebug waterfall of page requests]

Notice how it looks like a Gantt Chart, where you see roughly 6 items being downloaded at once. This goes back to our old friend, “The (typically) 6 TCP Session Limit.” For those new to this limitation, most modern browsers will only open about 6 connections to a given host, so only 6 requests can be outstanding at any one time.

[True, these can be tweaked, but I am trying to stick to a “stock” config here. Telling users to hack their registry or fiddle about in “about://settings” isn’t something non-geeks will be prone to try.]

To elaborate, let’s say that we need to download items 1 – 100 to fully load a given page. To do this, we consume all 6 TCP sessions getting the first 6 items. We can’t get the 7th item (or any other items) until one of those 6 requests is fulfilled. If one of those first 6 items takes a while to get, it holds up the other 94 behind it.

This is why you see the staggered waterfall in the image above. Most of these requests come back after roughly the same amount of delay, hence this “stair step”/Gantt Chart-like effect. Items 1-6 come back, then 7-12, then 13-18, and so on.

To help illustrate how the combination of longer RTT and the 6 TCP session limit affects page load times, take note of the brownish color in the image above. That indicates the amount of time that the browser was “blocked” before it could send that particular request. This blocking comes from the TCP session limit; it is the time a given request spends “waiting” for one of those 6 sessions to free up.

Here is an example breakdown for one of these affected requests at 300ms RTT. Again, note the brown bar, indicating how long the request is blocked:

[Screenshot: Firebug timing breakdown for a single blocked request]

That’s almost half a second (425ms)… just waiting to send the request. There were others later on in the page load waiting > 1.5 seconds at 300ms RTT.

When you consider that many modern web applications will average close to 100 requests (my Gmail page involves ~95 such requests, for example), this is not insignificant – which is what the RTT data shows us.
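Some rough back-of-the-envelope math drives this home: 100 requests fetched 6 at a time works out to roughly 17 “waves” of requests. At 300ms RTT, that alone is on the order of 5 seconds spent just waiting on round trips – which lines up nicely with the measured numbers in the next section.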

RTT Rule-of-Thumb

OK, great. How do we get to a simple formula to help understand how RTT affects page load time? Simple: We use a trend line.

Looking at the trend line in the RTT chart above, we see the formula for that line is:
  • y = 9.119x + 2.124

Translated, this means we can estimate page load times for my test app in Chrome by:
  • Multiplying the observed RTT by 9 (with RTT in fractions of a second, e.g.: 100ms = 0.1s)
    • Run a ping to get the current RTT from a given client
  • Adding 2 seconds to the result

Again, we are not looking to be 100% accurate; it is a rule-of-thumb, not an absolute. Also, remember that we are talking about changing RTT only. Once you start playing with multiple variables (RTT and bandwidth, as would be the case for a telecommuter), the story is not so straightforward.

The rules-of-thumb for Safari and Firefox are similar, except that the multipliers for each are roughly 11 and 10, respectively.

To make things simple, we can split the difference and say that:
  • Page load time = (10 x RTT) + 2
    • This assumes other variables like bandwidth, browser used, etc. remain constant

Again: This ROT is based on one page in a modern web application. Pages with more or fewer HTTP requests will be affected more or less, and hence have different values in their ROT.

The point is that for any given page, it is a reasonable assumption that page load times increase roughly linearly as RTT increases.
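As a quick worked example of that rule-of-thumb: a client seeing 150ms RTT to the app should land somewhere around (10 x 0.15) + 2 ≈ 3.5 seconds for this particular page, while a 300ms client would be closer to (10 x 0.3) + 2 = 5 seconds.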

Bandwidth

Developing a formula for how reduced bandwidth affects page load times is a bit more tricky.

Here are page load times for my test web app starting at 64k, working up to T1 speed and through to WiFi connectivity:

[Chart: Chrome page load times vs. bandwidth]

An aside: The Y axis is seconds and the X axis is kbps. Note that the “1600 kbps” entry is data for unrestricted WiFi connectivity. If I put that value at 802.11n speeds in the chart, it squishes everything to the “left” such that it was just a cliff. OK, back to the Post…

The trend line for this one is not so simple – nor conducive to a simple rule-of-thumb. What we can see is that bandwidth significantly affects page load times once you get below 512kbps. At 256kbps, we see a small uptick, but at 128kbps and 64kbps page load times jump drastically.

Given that most of the target audience for modern web apps is on connections faster than 512k, bandwidth will typically not be a significant contributor to page load times for most users.

But, keep this in mind the next time you go to load GMail while on Gogo or at crappy coffee house WiFi (crappy can define the coffee house or their WiFi, or both – your call :) ).
In conclusion, our rule-of-thumb for bandwidth is:
  • Minimum bandwidth: 512kbps
  • Preferred bandwidth: 1M or greater

Comparing Browsers

We saw that the browsers track similarly when bandwidth is constrained, and we know that bandwidth is not a significant issue if available bandwidth is >=512k.

We also saw that page load time tracks linearly with RTT. What about different browsers? How does this change things?

Comparing Chrome, Safari, and Firefox, there was not much to differentiate the browsers. The numbers speak for themselves.

An Aside: Internet Explorer

This chart will become more interesting when I add Internet Explorer in a future update, as Internet Explorer (at least through IE9) has some well-known limitations for “modern” websites. Namely, it downloads JavaScript files serially whereas other browsers download them in parallel.

True, JavaScript still has to be processed “in order,” but that doesn’t mean you need to wait to download each .js file until the one before it has been downloaded and processed. In many cases, one of the first chunks of JavaScript to be downloaded will be a large library like Dojo or jQuery. This means for IE that the user will sit for a long time, showing no apparent page load progress, while it downloads this JavaScript. It logjams everything behind it.

RTT

[Chart: page load times vs. RTT, by browser]

Bandwidth

Interestingly, there was a bit more of a delta between the browsers when bandwidth is seriously constrained. That said, I’m not sure that the difference between 21-second (Chrome) and 27-second (Firefox) page load times matters. You’ve likely lost the user to some other distraction at that point.

[Chart: page load times vs. bandwidth, by browser]

OK, So What Can We Do About This?

Again, everything above isn’t newly discovered. Because of this, a few different schools of thought have come about to mitigate the effect of RTT and bandwidth constraints, including:

  1. Split content across multiple hosts to get around the 6 TCP session limit
  2. Content Delivery Networks
  3. Minifying
  4. Lazy loading
  5. The oldest of them all: Caching

The Internet is full of Posts covering these topics, so I won’t rehash those, but the gist is that they all aim to do one or more of the following:

  1. Minimize the size of content needed to load a given page
  2. Minimize the number of requests needed to load a given page
  3. Parallelize the requests needed to load a given page

While reducing the size of content helps bandwidth-constrained clients, the bigger win from a smaller payload is often that it frees up the TCP session sooner, thus unblocking later requests faster.

If we look at this from the bigger picture, there is really only one answer to “how do we make pages load faster”: Reduce the number of requests. That ultimately has the biggest bang for the buck.

CDNs, caching, etc. all help, but if I still have to wait for the server to tell me that a given piece of content is unchanged (304), I still had to wait for that response, which holds up later requests, and so on.

As a case in point, for the RTT tests shown above, about 2/3 of the requests got a 304 back. The same number of requests – and the same mix of 200s vs. 304s – came back in all tests. So, it’s not a matter of needing a bigger pipe. The only solution is to send fewer requests to begin with. This is one (of the many) things that minification intends to solve. Without minification and JavaScript bundling, the story would be significantly worse.

Conclusion

If we take the info above and apply it to where things are going – namely mobility, both apps and the mobile web – we can see that while 3G vs LTE makes a difference, our biggest enemy is latency. Yes, LTE definitely gives us a bigger “pipe” but you’d be surprised at how much of the speed you perceive is related to the lower latency from your device to the carrier on LTE vs 3G technologies.

In fact, Ars Technica recently posted about how ViaSat – a satellite-based ISP – has speeds that rank well with the big boys. But ask anyone who has been forced to use a high-latency technology and you will hear tales of woe. My Dad, for example, ditched satellite Internet the second that he could get DSL in his area. He likes his low(er)-latency 1.5Mbps DSL way better than his satellite link, which would top out around 12Mbps on SpeedTest.net.

To crib from SNL’s “Lazy Sunday” skit, “It’s all about the latency, baby!”

… and I’m Ghost like Swayze.