Driving Service Discovery with Sensu: Two Great Tastes that Taste Great Together…

Prologue (AKA: WTF are you doing this?)

In my new gig, I’ve been given some new and interesting challenges to solve. One of them is Service Discovery (SD).

Aside: If you are unfamiliar with Service Discovery, check out this post by Julia Evans over at Stripe. Not only does it provide a succinct explanation and real-world application, but in typical Julia fashion, it has a cute comic that explains it all.

Another thing I am tasked with is to modernize our monitoring, enabling a more DevOps-y way of building monitors (read: “Build your own damned monitors… erm, I mean, let’s help enable you to create your own monitoring”).

For me, that means bringing Sensu to the party. I am a huge fan of using standalone checks, rather than driving monitors from the server down. Not only does this allow for that DevOps-y workflow, but it means I can focus on improving my monitoring platform and enabling teams to use it. But, I digress…

After an internal bakeoff, we settled on Consul as our SD platform of choice.

OK, We Picked Consul… What Now?

We loved Consul, but there was a problem: YAN (Yet Another Agent).

Being both a large Enterprise and a telco, we have a fair amount of hoop-jumping (InfoSec, customer data protection, etc.) to do before adding anything to a box – especially when it’s something you want on all boxes.

That said, even if we did install the Consul agent, it doesn’t pay us any benefit without the service owners either registering services themselves or using config management to create the required JSON files on each host.

While we have some teams who are living in Microserviceville, an overwhelming majority of our services are aging J2EE stacks built with proprietary frameworks; “Java middleware and sadness,” as Bridget Kromhout would call it. This means many of them have very little automation (which is something my team was brought in to fix!).

Rather than pack up the SD effort, I realized I could come at this from another direction. Consul has an HTTP API, so why not use the Sensu checks we have for each service endpoint to drive the registration – and state – of those services?

[Image: “go on…” reaction meme]

That way, we get a “twofer” – monitoring and service discovery. Two great tastes… wait, I did that one already.

Making Some Reese’s

Now we get to the fun part. It turns out that having Sensu drive the SD updates is pretty easy. You just need a couple of moving parts:

  1. Metadata about the service
  2. Handler

Note: Being relatively new, I’m working to find out what permissions I need in order to put my docker-compose based PoC on GitHub. I’ll update this post with the URL when/if that happens. That said, most of the important bits are detailed below.

Service Metadata

This part just involves adding a bit of metadata to whatever sensu-client check is monitoring the service in question. You are monitoring the services you care about, right?

Say you have a web server you are monitoring like so:

{
    "checks": {
        "check-http": {
            "command": "/opt/sensu/embedded/bin/check-http.rb -u http://localhost",
            "occurrences": 2,
            "handler": "consul",
            "subscribers": ["web"],
            "standalone": true,
            "interval": 60,
            "ttl": 120,
            "service_registry": {
                "service_name": "web",
                "port": 80,
                "tags": ["green"]
            }
        }
    }
}

It’s that “service_registry” section which drives the interaction with Consul. While it’s pretty self-explanatory, let’s cover the pieces just to be sure:

  • service_name: How you want the service to appear in Consul
  • port: The port on which the service can be found
  • tags: This is optional, but tags allow you to find hosts based on things other than the service(s) they host and their hostname
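
As a quick aside: once a service is registered this way, consumers can find it with a simple query against a Consul agent – something along these lines, using the “web” service and “green” tag from the example above:

curl http://localhost:8500/v1/catalog/service/web?tag=green

# Or, only the instances whose check is currently passing:
curl http://localhost:8500/v1/health/service/web?passing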

Consul Handler

This metadata is used by the Consul handler to drive updates to the status of the service in Consul. Luckily, if we update the state of a service in Consul – and that service does not yet exist – then Consul will add it. This makes the logic of the handler pretty simple.

When an event is triggered, the handler will shoot JSON to Consul’s /v1/catalog/register endpoint:

{
    "Datacenter": "qlab01",
    "Node": "web-node-01",
    "Address": "172.16.15.15",
    "Service": {
        "Service": "web",
        "Tags": ["green"],
        "Address": "172.16.15.15",
        "Port": 80
    },
    "Check": {
        "Node": "web-node-01",
        "CheckID": "web:web-node-01",
        "Name": "sensu-driven check",
        "Status": "failing",
        "ServiceID": "web",
        "Output": "CheckHTTP CRITICAL: Request error: Failed to open TCP connection to localhost:80 (Connection refused - connect(2) for \"localhost\" port 80)"
    }
}

This causes the service to be updated in Consul, like so:

[Screenshot: the web service as it appears in the Consul UI]

Likewise, when the check clears, it will post back the same JSON, except setting the status to “passing”.
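
I won’t paste the whole handler here, but the logic is small enough to sketch. Here is a rough, untested Python approximation of the approach (Sensu pipe handlers receive the event JSON on stdin; CONSUL_URL is a placeholder):

#!/usr/bin/env python
# Rough sketch of the Consul handler logic -- an illustration, not the production handler.
import json
import sys
from urllib.request import Request, urlopen

CONSUL_URL = "http://localhost:8500/v1/catalog/register"  # placeholder address

def main():
    event = json.load(sys.stdin)                      # Sensu pipes the event JSON to handlers via stdin
    registry = event["check"].get("service_registry")
    if not registry:
        return                                        # no service_registry metadata, nothing to do

    status = "passing" if event["check"]["status"] == 0 else "failing"
    payload = {
        # "Datacenter" omitted: Consul defaults to the local agent's datacenter
        "Node": event["client"]["name"],
        "Address": event["client"]["address"],
        "Service": {
            "Service": registry["service_name"],
            "Tags": registry.get("tags", []),
            "Address": event["client"]["address"],
            "Port": registry["port"],
        },
        "Check": {
            "Node": event["client"]["name"],
            "CheckID": "%s:%s" % (registry["service_name"], event["client"]["name"]),
            "Name": "sensu-driven check",
            "Status": status,
            "ServiceID": registry["service_name"],
            "Output": event["check"]["output"],
        },
    }
    req = Request(CONSUL_URL, data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"}, method="PUT")
    urlopen(req)

if __name__ == "__main__":
    main()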

Getting back to my penchant for “pushing monitoring to the left,” this means my Developers can add a standalone check on their boxes to get both a monitor and service registry, two great… you get the point.

But Wait! There’s More!

It turns out there is a maturity model you can enable with this setup, allowing you to get some of the benefits of SD as your teams gain the maturity to do SD the right way.

But, that’s a post for another day.

Some Shortcomings

There is one glaring shortcoming of this approach: The handler only triggers on a state transition. I am hoping it is just a matter of education on my part to find a way to get Sensu to auto-push the data to Consul. Worst case, I can have a check on the Sensu server that iterates the list of checks looking for the service_registry metadata, but that feels a little dirty.
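
If I do end up going that route, it would look something like this rough sketch against the Sensu API (untested, and it assumes the results returned by the API’s /results endpoint carry the custom service_registry attribute):

# Rough sketch of a periodic "re-register everything" job against the Sensu API.
# Untested; assumes /results exposes the custom service_registry attribute.
import json
from urllib.request import urlopen

SENSU_API = "http://localhost:4567"   # placeholder address

def checks_with_registry():
    results = json.load(urlopen(SENSU_API + "/results"))
    for result in results:
        check = result.get("check", {})
        if "service_registry" in check:
            # Hand these off to the same Consul registration logic the handler uses
            yield result["client"], check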

Again, I hope to be able to post my working prototype once I get permission. But, hopefully this helps someone either directly or just by helping people think outside the box and optimize for the constraints that surround their work.

Bloglet: Tales from WTFLand – nginx-auth-ldap misconfig causes crash with no error logged

… other than the crash itself, of course :)

Will keep this one short and sweet (not my norm, I know). Just wanted to post it out there, as I didn’t find a hit despite some GoogleFu.

I was refactoring some Ansible automation, which involved using conditionals in my templates to handle whether LDAP was enabled or not. I missed a change and ended up with this section in one of my .conf files:

 auth_ldap "Closed content";
 auth_ldap_servers {{ nginx_ldap.server }};

… but the other conditionals were working properly, so the config block that would have defined the LDAP server referenced above was not present.

Turns out nginx-auth-ldap doesn’t handle this kind of screwup gracefully. Instead, this is all you get in /var/log/messages:

Mar 20 09:27:59 logging-01 systemd: Starting The nginx HTTP and reverse proxy server...
Mar 20 09:27:59 logging-01 kernel: nginx[427]: segfault at 8 ip 00007f68ff51ca3a sp 00007fffd67e5a30 error 4 in nginx[7f68ff457000+f8000]
Mar 20 09:27:59 logging-01 systemd: nginx.service: control process exited, code=killed status=11
Mar 20 09:27:59 logging-01 systemd: Failed to start The nginx HTTP and reverse proxy server.
Mar 20 09:27:59 logging-01 systemd: Unit nginx.service entered failed state.

Nothing is logged to /var/log/nginx/error.log (or any other nginx logs).
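
For completeness, the fix was simply to guard those directives in the template like the other conditionals – something along these lines (the exact conditional/flag shown here is just illustrative):

 {% if nginx_ldap is defined %}
 auth_ldap "Closed content";
 auth_ldap_servers {{ nginx_ldap.server }};
 {% endif %}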

Anyway, hope that ends up saving someone from the couple of hours of head scratching I did last night trying to grok WTF happened here.

Home is Where the Track Is

My only steady hobby is racing my RC car. It’s a blast to drive, but I am pretty terrible at it.

Now that I can keep it off the walls (mostly), I am working on laying down consistent laps. I am having trouble making any significant improvements because I can only make it out to Mike’s one day a week to practice. Couple that with the fact that they’re changing the track layout almost weekly and you’ll understand my dilemma: I’ve basically plateaued.

Since I am not able to make it to the track as often as I’d like, I’ve decided to try and bring the track to me.

My Christmas toy this year was a Mini-Z Sports. I opted for the Sauber C9 since I am a fan of the Sauber F1 team.

Anyway, the idea behind getting the Mini-Z – besides having another cool toy :) – was to turn laps at home. Sure, it’s not 100% like-for-like with my 1:10 F1 car, but they are both essentially pan cars. Here is a lengthy, but thorough description of what a pan car is, for those who are interested.

Long and short, they should drive similarly. Even if not, I would still be refining my hand-eye coordination, as these Mini-Z’s are pretty quick and nimble.

Now that I had the car, it was time to focus on making a track for my garage – taking into account these key limitations:

  1. The end product had to aim for as high of a Wife Acceptance Factor as possible
  2. Make setup and teardown as quick and simple as possible
  3. Part of hitting #1, above, is the amount of room the solution takes up when put away
  4. Try to replicate an ozite carpet surface, as closely as possible
  5. As always, minimize cost
  6. Finally, I’ve only got 13′ x 21′ of usable garage space to play with

Options Considered

With that in mind, it was time to focus on the two key items for the track:

  1. Track surface/material
  2. Track walls/dividers

Track Surface

Being the obsessive person I am, I spent way too much time reading many accounts of others’ experiences with their home tracks. Ultimately, a few front-runners for the track surface emerged.

  1. RCP: This is far and away the #1 track surface for Mini-Z’s in the US (though carpet is the choice in Asia)
  2. EVA Foam Tiles: EVA foam is the stuff those interlocking tiles for kids’ playrooms, gyms, etc. are made from
  3. Rolled rubber/neoprene/etc
  4. Canvas
  5. Tar paper
  6. Carpet

RCP

RCP is popular because it has high levels of grip. It also gets cool points for the Lego-like “snapability”, not to mention the rails to keep the cars on the track. RCP is the most professional-looking out of all of these options, hands down.

I passed on RCP for a few reasons, though:

  1. Cost: Professionalism costs :) A typical RCP setup is to buy two of these “Wide L” setups… at $210 apiece retail. I don’t think that includes shipping, either, but I might be wrong. It just seemed wrong to spend ~$500 on a track to drive a $160 car.
  2. Space: One Wide L set would be too simplistic of a track, but I didn’t have room to accommodate two full Wide L’s. So, I’d be paying for track that I couldn’t use.
  3. Grip: Having seen these cars in action at Inside Line Racing, I can say they stick to RCP pretty damned well. That doesn’t fit with my goal of replicating the medium levels of grip found on the ozite at Mike’s.
  4. Limited layout options: The shape of RCP tiles is pre-defined. A straight is a straight. All curves are 90 degrees (unless you spend $70-140 for one 45 degree turn). This limits your options on track layouts.

On a side note, the owners at Inside Line Racing are super cool peeps. Definitely stop by if you’re in the South Bay area, even if it is just to watch some laps.

EVA Foam Tiles

The story on EVA tiles was a mixed bag. You had some people saying that they were nearly identical to RCP and others – who also seemed to be higher-skilled drivers, BTW – saying there was a major difference between the two.

In either case, there was a fair amount of prep work that would need to go into the EVA foam in order to get the mold release agent (read: grease) off of the tiles before you could run on them.

EVA didn’t make the cut because:

  1. Cost: Looking around many places on the Internet, the best deal I could find for EVA tiles was in the ballpark of $0.85 per sqft. That’s about 2x the cost of what I ultimately settled on.
  2. Setup/Teardown: I would be looking at 60 or so 2’x2′ pieces that I would have to assemble every time I wanted to drive. Given that I will only have an hour or two here and there, I’d rather spend that precious time driving.
  3. Storage: At 3/8″, we’re talking 7.5 cubic feet (if I’m doing my math right) of tiles to store. Either way, we’re talking a minimum of 2′ in from any of the garage walls, which is quite a bit when you only have about 20′ total width… and you’re trying to fit two cars along with enough space between to actually get out of the cars.
  4. Limited layout options: Same issue as RCP

Rolled Rubber/Neoprene

These were interesting options, but they ultimately didn’t pan out for a few reasons:

  1. Grip: This would likely be even grippier than RCP. They use something like this in Europe, so there is precedent, but recall that I am trying to approximate carpet-levels of grip.
  2. Cost: Best I could find was about $1/sqft
  3. Durability (maybe imagined): The rolled rubber – at the $1/sqft price point – was made of chunks of cut rubber, which made me wonder how it would fare being rolled and unrolled repeatedly
  4. Material width: The widest roll I could find – again, at a reasonable price point – was 6′, but that was much more than $1/sqft. I was not keen on having to lay two or three runs of this together and hoping there isn’t a big seam, having to tape them together, etc.

Canvas

I came across a couple of posts where people mentioned using canvas for their track. Honestly, this wouldn’t have occurred to me, but they universally agreed that grip levels were really great.

I passed on canvas because:

  1. Grip: Sounded like the grip was too good
  2. Difficult to lay flat: There would be seams and creases in the material
  3. Durability: On a per sqft basis, the price was similar to my final solution, but I was concerned how well it would hold up over time
  4. Material width: Ultimately, there wasn’t a cost-appropriate setup that would have fit my garage without having to tape multiple sheets together, etc.

Tar Paper

I read many accounts of people using tar paper for Mini Z tracks. The issues here were several, and similar to the other “rolled” options:

  1. Longevity. Tar paper (“roofing felt”) simply isn’t made to be rolled and unrolled. Plus, it tears easily.
  2. Cost
  3. Material Width
  4. Grip: From what I read, sounds like the grip would have been too good

Walls/Rails

RCP and (somewhat) EVA foam come with their own rails, which is nice. That said, both are limited in their shape, thus limiting layout options. Besides, they were out of the running due to cost, anyway.

Other posts turned up some good ideas, but here are the ones I passed on:

PVC Pipe

  • Too expensive
  • Inflexible (literally, and in options for layout)

Pool Noodles

  • Same issues as PVC, save flexibility
  • Much more expensive, unless I found them on clearance or something

Garden Hose with Sand

Some people said they used garden hose with sand inside, just to give the hose weight and some “positionability.”

“Final” Product

Ultimately, here is what I settled on:

  • “Commercial,” low-pile indoor/outdoor carpet
  • Garden hose, with dowels to keep the hose straight and give them some weight

Carpet

Carpet won out for the following reasons:

  1. Cost: $0.49/sqft, which was about half the cost of the next-closest material
  2. Width: Carpet is available in 12′ rolls, which uses almost all of the 13′ available
  3. Ease of setup/teardown: Roll out and roll up :)
  4. Closest resemblance to ozite
  5. Can get locally, thus not pay shipping costs (which are non-trivial)
  6. This style of carpet is what Kyosho themselves use
    • The folks at Kyosho were kind enough to tell me the following about their carpet:
      • Manufacturer: Shaw
      • Style: “Putting Green” is the name, though it’s not a grass-like pile – and it’s black, not green :)
      • Bought at: Lowes

In the process, I learned more about carpet than I rightly should have. Here are the key takeaways:

  1. Ozite is pretty damned expensive – and must be ordered in large quantities, plus comes in 6′ wide rolls
  2. Mini Z’s are pretty sensitive to the height/pile of the carpet, as fibers start binding wheels, get caught in gears, etc
  3. There are three main carpet materials: polyester, nylon and Olefin. You want Olefin for RC cars, as it is the most resistant to generating static electricity.
  4. Ideally, you want a needlepunch carpet, as this minimizes the amount of fuzz that comes off the carpet
  5. The outdoor-style carpet has a different type of backing vs “normal” carpet (more on this in a bit)

Here is the carpet I chose, for those who are interested.

An Unexpected Discovery

Reading further, it seems that you really need foam tires to run Mini-Zs well – and I don’t have foams at the moment.

Indeed, the rear was pretty squirrelly with slicks on the carpet. The front seemed planted, but the rear was all over the place – which is the opposite of my experience at Mike’s, where the rear sticks well, but the front pushes if I take a corner too hot.

I noticed that the backside of the carpet was a rough, semi-rubbery surface. Seemed kind of like canvas in a way. So, on a whim, I flipped the carpet and ran on that.

Oddly enough, the grip and handling were pretty similar to my F1 car at Mike’s.

Here’s a picture of the underside of the carpet to show you what it looks like:

[Photo: the underside of the carpet]

It looks fuzzy as all heck, but the car mainly just comes off the track with a bit of dust on it – which I understand is an issue with RCP, as well.

Garden Hose

I opted for a hose that was not necessarily the cheapest for a few reasons:

  1. Diameter: It seemed that 3/8″ inside diameter hose wouldn’t keep the car on the track
  2. Finish: The cheaper hoses had a rubbery finish, which I suspected would not let the car glance off the “wall”
  3. Color: Vain, but I liked the dark gray over green or tan :)

I ended up getting 100′ of 5/8″ inside diameter hose, as I needed enough for the perimeter (12’x2 + 20’x2 = 64′) as well as making the walls inside the track.

Initially, I thought I could straighten out the hose by bending it opposite to the way it was rolled up. This helped, but it didn’t work that well. It did not take long for the hose to revert back to a curve/circle.

As mentioned above, I was thinking about using sand in the hose to weigh it down, but I feared that would not address the curling issue. So, I opted to get 4′ x 1/2″ dowels to keep the hose straight. The 3/8″ dowels were 2/3 the cost, but they seemed a bit too flimsy for the job, as the hose was pretty robust and didn’t want to be straight :)

Originally, I just planned on using circles of hose at the corners – kind of like the “dots” you’ll see at most on-road tracks. But, I found that the dowels can be used not just to make a run of hose straight, but also to hook runs of hose together – including the circles of hose. This allows me to make pretty clean, curved corners.

Of course… Duct Tape

The one major gap with my creation is that any free-floating pieces of hose don’t have enough weight to keep from being knocked around. Even those that are hooked together can be moved over time.

For now, my answer to those issues is duct tape. Keeps with the simple setup/teardown criteria and gives it that DIY feel :)

Here’s a pic of the end result. Again, this is just one of an endless number of possible layouts.

[Photo: the finished track layout]

Future Plans

The main thing I need now is a timing solution. I tried using Easy Lap Counter. It actually worked pretty well, but it unfortunately (and oddly) doesn’t announce the lap times.

Knowing immediately how a lap went is the best feedback tool to know what you’re doing right – and wrong. Not having that really diminishes the value of this setup.

So, my next plan is to use the Ping))) sonic rangefinder with my Arduino and modify the sketch for the “Flip Side Dio” so I can use the excellent Flip Side Lap Counter software.

Ultimately, I might retool it to use RFID, but since it’s just me for now, this should work just fine.

Transforming Cumulative Ceilometer Stats to Gauges

Overview

I’ve been dabbling a bit more with OpenStack as of late. If you know me, you can likely guess my goal: figuring out how to ingest logs, monitor resources, etc.

I’ve been trying to see how well Ceilometer, one of the core components of OpenStack that actually provides some of this stuff, would work. Initially, I was a bit bummed, but after fumbling around for a while, I am starting to see the light.

You see, the reason I almost abandoned the idea of using Ceilometer was due to the fact that some of the “meters” it provides are, well, a bit nonsensical (IMHO). For example, there’s network.outgoing.bytes, which is what you would expect… sort of. Turns out, this is a “cumulative” meter. In other words, this meter tells me the total number of bytes sent out a given Instance’s virtual NIC. Ceilometer has the following meter types:

  • Cumulative: Increasing over time (e.g.: instance hours)
  • Gauge: Discrete items (e.g.: floating IPs, image uploads) and fluctuating values (e.g.: disk I/O)
  • Delta: Changing over time (e.g.: bandwidth)

Maybe I am naive, but it seems quite a bit more helpful to track this as a value based on a given period… you know, so I can get a hint of how much bandwidth a given instance is using. In Ceilometer parlance, this would be a delta metric.

I’ll take an aside here and defend the fine folks working on Ceilometer on this one. Ceilometer was initially built to generate non-repudiable billing info. Technically, AFAIK, that is the project’s only goal – though it has morphed to gain things like an alarm framework.

So, now you can see why network.outgoing.bytes would be cumulative: so you can bill a customer for serving up torrents, etc.

Anyway, I can’t imagine that I’m the only person looking to get a Delta metric out of a Cumulative one, so I thought I’d document my way of getting there. Ultimately there might be a better way, YMMV, caveat emptor, covering my backside, yadda yadda.

Transformers to the Rescue!

… no, not that kind of Transformer. Lemme ‘splain. No, there is too much. Let me sum up.

Actually, before we start diving in, let’s take a quick tour of Ceilometer’s workflow.

The general steps to the Ceilometer workflow are:

Collect -> (Optionally) Transform -> Publish -> Store -> Read

Collect

There are two methods of collection:

  1. Services (Nova, Neutron, Cinder, Glance, Swift) push data into AMQP and Ceilometer slurps them down
  2. Agents/Notification Handlers poll APIs of the services

This is where our meter data comes from.

Transform/Publish

This is the focus of this post. Transforms are done via the “Pipeline.”

The flow for the Pipeline is:

Meter -> Transformer -> Publisher -> Receiver

  • Meter: Data/Event being collected
  • Transformer: (Optional) Take meters and output new data based on those meters
  • Publisher: How you want to push the data out of Ceilometer
    • To my knowledge, there are only two options: rpc:// (the default, which sends samples back over the message bus to the collector) and udp://, which packs each sample with msgpack and fires it at the host:port you specify
  • Receiver: This is the system outside Ceilometer that will receive what the Publisher sends (Logstash for me, at least at present – will likely move to StatsD + Graphite later on)

Store

While tangential to this post, I won’t leave you wondering about the “Store” part of the pipeline. Here are the storage options:

  • Default: Embedded MongoDB
  • Optional:
    • SQL
    • HBase
    • DB2

Honorable Mention: Ceilometer APIs

Like pretty much everything else in OpenStack, Ceilometer has a suite of APIs that can also be used to fetch metering data. I initially considered this route, but in the interest of efficiency (read: laziness), I opted to use the Publisher vs rolling my own code to call the APIs.
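
For reference, going the API route would have meant polling something like the v2 statistics endpoint – roughly along these lines (host and token are placeholders; 8777 is Ceilometer’s default API port):

# Hypothetical example: pull per-60-second statistics for a meter via the Ceilometer v2 API
curl -H "X-Auth-Token: $TOKEN" \
  "http://ceilometer-host:8777/v2/meters/network.outgoing.bytes/statistics?period=60"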

Working the Pipeline

There are two Transformers (at least that I see in the source):

  • Scaling
  • Rate of Change

In our case, we are interested in the latter, as it will give us the delta between two samples.

To change/use a given Transformer, we need to create a new pipeline via /etc/ceilometer/pipeline.yaml

Here is the default pipeline.yaml:

---
 -
     name: meter_pipeline
     interval: 600
     meters:
         - "*"
     transformers:
     publishers:
         - rpc://
 -
     name: cpu_pipeline
     interval: 600
     meters:
         - "cpu"
     transformers:
         - name: "rate_of_change"
           parameters:
               target:
                   name: "cpu_util"
                   unit: "%"
                   type: "gauge"
                   scale: "100.0 / (10**9 * (resource_metadata.cpu_number or 1))"
     publishers:
         - rpc://

The “cpu_pipeline” pipeline gives us a good example of what we will need:

  • A name for the pipeline
  • The interval (in seconds) that we want the pipeline triggered
  • Which meters we are interested in (“*” is a wildcard for everything, but you can also have an explicit list for when you want the same transformer to act on multiple meters)
  • The name of the transformation we want to use (scaling|rate_of_change)
  • Some parameters to do our transformation:
    • Name: Optionally used if you want to override the metric’s original name
    • Unit: Like Name, can be used to override the original unit (useful for things like converting network.*.bytes from B(ytes) to MB or GB)
    • Type: If you want to override the default type (remember they are: (cumulative|gauge|delta))
    • Scale: A snippet of Python for when you want to scale the result in some way (would typically be used along with Unit)
      • Side note: This one seems to be required, as when I omitted it, I got the value of the cumulative metric. Please feel free to comment if I goobered something up there.

Looking at all of this, we can see that the cpu_pipeline, er, pipeline:

  1. Multiplies the number of vCPUs in the instance (resource_metadata.cpu_number) times 1 billion (10^9, or 10**9 in Python)
    • Note the “or 1”, which is a catch for when resource_metadata.cpu_number doesn’t exist
  2. Divides 100 by the result

The end result is a value that tells us how taxed the Instance is from a CPU standpoint, expressed as a percentage.
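
If it helps to see the arithmetic, here is a back-of-the-envelope version of what that works out to between two samples (my own simplification in Python, not the actual transformer code):

# Rough illustration of the cpu_util math -- not the actual Ceilometer implementation.
# The "cpu" meter is cumulative CPU time consumed, in nanoseconds.

def cpu_util(prev_ns, curr_ns, prev_ts, curr_ts, cpu_count=1):
    """Approximate CPU utilization (as a percentage) between two cumulative samples."""
    rate_ns_per_sec = (curr_ns - prev_ns) / float(curr_ts - prev_ts)  # rate of change
    # scale = 100.0 / (10**9 * cpu_count): nanoseconds/sec -> percent of available CPU
    return rate_ns_per_sec * 100.0 / (10**9 * cpu_count)

# Example: a 2-vCPU instance that burned 12 seconds of CPU time over a 60-second window
print(cpu_util(0, 12 * 10**9, 0, 60, cpu_count=2))   # -> 10.0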

Bringing it all Home

Armed with this knowledge, here is what I came up with to get a delta metric out of the network.*.bytes metrics:

 -
    name: network_stats
    interval: 10
    meters:
        - "network.incoming.bytes"
        - "network.outgoing.bytes"
    transformers:
        - name: "rate_of_change"
          parameters:
              target:
                  type: "gauge"
                  scale: "1"
    publishers:
        - udp://192.168.255.149:31337

In this case, I’m taking the network.incoming.bytes and network.outgoing.bytes metrics and passing them through the “Rate of Change” transformer to spit a gauge out of what was previously a cumulative metric.

I could have taken it a step further (and likely will) by using the scale parameter to change it from bytes to KB. For now, I am playing with OpenStack in a VM on my laptop, so the amount of traffic is small. After all, the difference between 1.1 and 1.4 in a histogram panel in Kibana isn’t very interesting looking :)

Oh, I forgot… the Publisher. Remember how I said the UDP Publisher uses msgpack to stuff its data in? It just so happens that Logstash has both a UDP input and a msgpack codec. As a result, my Receiver is Logstash – at least for now. Again, it would make a lot more sense to ship this through StatsD and use Graphite to visualize the data. But, even then, I can still use Logstash’s StatsD output for that. Decisions, decisions :)

Since the data is in Logstash, that means I can use Kibana to make pretty charts with the data.

Here are the bits I added to my Logstash config to make this happen:

input {
  udp {
    port => 31337
    codec => msgpack
    type => ceilometer
  }
}

At that point, I get lovely input in ElasticSearch like the following:

{
  "_index": "logstash-2014.01.16",
  "_type": "logs",
  "_id": "CDPI8-ADSDCoPiqY9YqlEw",
  "_score": null,
  "_source": {
    "user_id": "21c98bfa03f14d56bb7a3a44285acf12",
    "name": "network.incoming.bytes",
    "resource_id": "instance-00000009-41a3ff24-f47e-4e29-86ce-4f50a61f78bd-tap30829cd9-5e",
    "timestamp": "2014-01-16T21:54:56Z",
    "resource_metadata": {
      "name": "tap30829cd9-5e",
      "parameters": {},
      "fref": null,
      "instance_id": "41a3ff24-f47e-4e29-86ce-4f50a61f78bd",
      "instance_type": "bfeabe24-08dc-4ea9-9321-1f7cf74b858b",
      "mac": "fa:16:3e:95:84:b8"
    },
    "volume": 1281.7777777777778,
    "source": "openstack",
    "project_id": "81ad9bf97d5f47da9d85569a50bdf4c2",
    "type": "gauge",
    "id": "d66f268c-7ef8-11e3-98cb-000c29785579",
    "unit": "B",
    "@timestamp": "2014-01-16T21:54:56.573Z",
    "@version": "1",
    "tags": [],
    "host": "192.168.255.147"
  },
  "sort": [
    1389909296573
  ]
}

Finally, I can point a Histogram panel in Kibana at the “volume” field for these documents and graph the result, like so:

[Screenshot: a Kibana histogram of the volume field]

OK, so maybe not as pretty as I sold it to be, but that’s the data’s fault – not the toolchain :) It will look much more interesting once I mirror this on a live system.

Hopefully someone out there on the Intertubes will find this useful and let me know if there’s a better way to get at this!

There’s Gold in Them Thar Logs!

Oh crap. Another post about logging…

Yep, get over it. This time, however, I’m not here to piss and moan about how much logging stinks. Rather, I want to show the power of extracting data from the logging you have.

Logstash + ElasticSearch + Kibana = Logging Bliss!

I’m not necessarily treading any new ground here, but for the initiated, let me quickly go over what is becoming a very popular logging technology stack in the DevOps space:

  • Logstash
  • ElasticSearch
  • Kibana

I will get into some of the deeper details later, but let’s take a quick tour of the components first.

Logstash

Logstash is the brainchild of Jordan Sissel. At a high level, the goal of Logstash is to chew up your logs and spit out contextualized data from what would otherwise be blobs of text. Said another way, instead of mountains of plain-text log files, you get the same log content split out and stuffed into fields you define. This allows you to do all sorts of things with the logs, not the least of which is to perform focused searches in Kibana for exactly the data you need.

Going with the lumberjack analogy, think of Logstash as your lumber mill that takes your logs and spits out 2x4s, etc. There’s a reason why you don’t see tree trunks at Home Depot. Tree trunks do me no good. Studs and plywood, now that I can work with.

ElasticSearch

In this stack, ElasticSearch operates as our super-fast “database” to hold what Logstash kicks out (technically, it’s a search index, but let’s not nerd out too much here). This description sells both ElasticSearch and Logstash short (Logstash can output in many different formats, not just ElasticSearch), but I’m trying to stick to the solution at hand.

ElasticSearch is based on Apache Lucene – just like Apache Solr – and lives to gobble up data broken into fields, then query large volumes of that data near instantly. It also gives a rich query syntax, which Kibana takes advantage of.

Kibana

Kibana is where we see the fruits of Logstash’s and ElasticSearch’s efforts. Kibana is a slick web UI that queries ElasticSearch for various purposes:

  1. To search logs (again, allowing you to use Lucene search syntax)
  2. To make pretty charts
  3. To show the most common items in logs
  4. etc

As an example on the searching, think of the last time you had to visit your Apache or Tomcat access logs to play Columbo for some issue that happened. I am sure it was an exercise of grep, sed, awk, etc to find that needle in the haystack. Even then, you end up with a lot of stuff that is likely not relevant (heaven forbid you need to look for common words like “post,” “get,” or “user”). Now you can say things like, “Show me all the log statements where”:

  1. ‘user’ shows up in the referring URL
  2. The user got a 500 and the server took more than 500ms to fulfill the request
  3. My user ID exists, my boss’ user ID exists, and the word “raise” was in the request URL ;)

And here’s a cool benefit: Normal human beings can now have access to logs with a “type what you’re looking for” interface, so, I dunno, maybe they can leave you alone to do more interesting things?

What is interesting about Kibana is that there is not a server-side component. Kibana is all client-side code. All you need is a web server to pass along Kibana’s code to the browser, then the browser issues queries to ElasticSearch directly via ElasticSearch’s APIs.

While Kibana has a built-in dashboard for Logstash that allows you to search your logs, where things really get interesting is applying Kibana’s Panels to display data from your logs. Think of Panels as the Lego from which you construct your dashboard. Behind each of these is an ElasticSearch query that populates that Panel.

While I will cover all of the Panels in my dashboard, here is a teaser:

[Screenshot: a teaser of the finished Kibana dashboard]

My First Dashboard ™

Typically the best way to learn something is through an example. The primary reason I am writing this post is to help others take advantage of this tech stack, which I hope to do by walking through what I did – and shouldn’t have done – to get my Kibana dashboard in place.

Recall that the pipeline is like so:

Logging source (syslog/file/flume/etc) => Logstash => ElasticSearch

“Wait, where’s Kibana?” you might be asking. Remember that Kibana is just the UI to visualize data that Logstash, um, stashed in ElasticSearch. I’ll definitely cover Kibana in detail, but we need to start with Logstash, as that is where we add meaning to the text of our logging. Without that context, Kibana will be of limited use to us. Garbage in, garbage out – right?

Magic ain’t Free

Again, I am selling its full capabilities way short, but Logstash’s job here is to take a single line of text and – based on filters, either pre-defined or custom – assign chunks of that text to fields we define. We will use these fields later on not only to more intelligently query our logs, but to target specific data to be shown in Panels in Kibana. The general process looks like this:

  1. Be the logs
  2. Know the end goal
  3. Grok grok (no, that’s not a typo)
  4. Iterate, because you ain’t gonna get it right the first time

This will take work, but to crib from Alton Brown, “Your hard work will be rewarded.”

Be the Logs

To really get the value out of this stack, you need to know your logs, as well as what you want to get out of them.

At this stage, we just want to look at our target logging and see what kind of data might be useful. So, do the obvious: copy a typical log statement into a text editor and look at it in isolation. Logstash doesn’t have the benefit of context. It’s going to chop up those logs one line at a time, hence I find it best to start my analysis by looking at just one log statement.

As mentioned in #4 above, you are likely to find that your logging is not as “normalized” as you think, so you’ll likely have to revisit this step quite a few times along your journey.

Let’s take an example Apache access_log statement:

10.129.16.62 - - [11/Nov/2013:07:24:35 -0800] "GET /html/themes/default/images/sprites/core_sprite_left.png?t=74 HTTP/1.1" 304 - "https://my-server.cjchand.com/" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36" 354

There is the following useful/interesting information in that single log statement (in order shown):

  • Client IP address
  • Datestamp
  • Request URL
  • HTTP status code
  • Referring URL
  • User Agent string
  • Response time

Of those, are there any items that are throwaways (e.g.: data that would not be interesting/useful as search or graphing criteria)? In this case, I would argue not, but especially with logging in custom applications, you can end up with some items that might be interesting when viewed with the log statement itself, but might not be something that we would need to filter/chart (e.g.: PID).

That said, at this point, err on the side of capturing everything you can. If you ultimately determine you don’t need that data, you can simply bit-bucket it via your grok pattern and/or Logstash config.

Similar to knowing what meaningful data lives in your logs, it helps to know what Kibana can visualize for you. Here are the Panel types available (at least the non-deprecated ones) at the time I am writing this:

  • Bettermap: Uses Longitude and Latitude coordinates to place a “pin” on a map
  • Column: A “wrapper” panel of sorts; used to control the layout of panels in a dashboard
  • Derivequeries: This panel auto-creates facets for a given field (e.g.: might create multiple queries for each HTTP status code seen in the current data set)
  • Histogram: A chart to show the count, total, mean, etc. of log statements in the time range specified
  • Hits: Total count of lines matching a given query/set of queries
  • Terms: A “Top X” list of the most common terms found in log statements
    • Can be viewed as:
      • Chart: Pie or Bar
      • Table
  • Text: Panel to hold text or sanitized HTML (including in Markdown syntax)
  • Trends: Based on a user-configurable time frame, compare the data shown to the same data across the specified timeframe
    • Example: I am looking at 13:00-14:00 and I want to know – as a percentage – how many more/fewer 404’s I have seen from 13:00-14:00 today vs 13:00-14:00 yesterday

Know the End Goal

“Wait. Shouldn’t I start with this step?” Sure, you can definitely make that argument, but in many cases you have little control over what logging you get – which is often the case if you are not the one developing the code or controlling the deployment of the app.

As such, my philosophy is to see what I have to work with first. Then, I set my target for the final product.

YMMV on when you do this step, but you definitely need to perform it before you sit down to craft grok patterns.

Continuing with our Apache access log example, here are some potential target items we would want to craft grok patterns around – and why:

  • Search Filters:
    • All of these fields
  • Charting/Trending items:
    • Client IP address:
      • Top XX IP addresses
      • Geolocation of client
    • HTTP status code:
      • Histogram of unexpected statuses
    • User Agent String:
      • Client OS
      • Client browser
    • Response Time:
      • Histogram of mean response times (yes, I know that’s not statistically the best measure, but it’s what Kibana gives us. There are no confidence intervals, etc.)

Do you Grok Me?

Logstash comes with some built-in filters (called “grok patterns”). These cover common items like IP addresses, timestamps, and even common log types such as httpd access logs.

Using the Lego analogy again, most logs can be parsed using these out-of-the-box grok patterns. You can check out the full list in the Logstash GitHub.
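
For example, the Apache access log line from earlier can be handled almost entirely by the stock patterns – a first pass at the filter could look something like this (the trailing resp_time field name is my own):

filter {
  grok {
    # COMBINEDAPACHELOG covers everything up through the User Agent string;
    # the trailing integer is the response time our logs tack onto the end
    match => [ "message", "%{COMBINEDAPACHELOG} %{INT:resp_time}" ]
  }
}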

As with most things, odds are your logging might not dovetail perfectly with these patterns and you will have to create your own. But, hey, that’s job security, right? ;)

Creating your own Grok patterns requires some heavy lifting with regex, on which grok patterns are based.

No! Don’t go! We’re just getting to the good part.

I know regex is a four-letter word for most, but in this case, we don’t have to write one massive regex for a given log statement. In fact, depending on your logging data, you likely won’t need to do any regex work at all, as Logstash comes with grok patterns for lots of things you commonly find in logs like:

  • Timestamps (including days, months, etc)
  • IP addresses
  • Class names (com.example.my.custom.class)
  • Quoted Strings (like requested and referring URLs in Apache logs)
  • Usernames
  • UUIDs
  • MAC addresses

In my case, I had to do some customizations because:

  • The built-in grok pattern for “program” – which is a sub-pattern referenced by the SYSLOGBASE pattern – expects a string containing the process ID (PID) like so: processname[PID] (e.g.: httpd[12345])
    • In the case of our Tomcat localhost_access logs, the program name is customized via our syslog config. As a result, the PID is missing (e.g.: accesslog[])
    • It took me a while to figure out, but because of this Logstash would not properly parse my localhost_access logs (see the Iterate section later on :) )
  • While not discussed extensively in this post, I also created a custom grok pattern to understand the logs our Tomcat application generates so I can get things like the userID the log statement pertains to, the Java Classname, etc to help perform more targeted searches

Iterate

Like anything you write to parse strings, you’ll likely hit corner cases. Maybe a field is not populated in some instances, or what would normally be a string ends up being a number, or maybe some mobile device passes a funky User Agent string.

The point is, once you get a set of Grok patterns in place, you will want to review what Logstash gives you to ensure that you’re parsing as many logs as you can.

The reason why this is important is that if there is not a matching Grok pattern, the entire line of logging will not be split out (Logstash just tags those events with _grokparsefailure, which at least makes them easy to hunt down). While it will still be searchable, you will not get the full benefit of this logging framework.

Going back to my missing PID as an example, this simple deviation caused every single line of my logging not to be parsed. Until I came across that missing PID as the root cause, I was unable to do all of the cool stuff you’ll see later when I discuss Kibana.

Grok Debugger

Unless you’re a regex whiz, you’re likely familiar with using sites like RegexPal to build a regex against a target string. Luckily, there is a similar tool for creating and testing Grok patterns: Grok Debugger.

The really cool thing about Grok Debugger is that it is Grok-aware and understands the built-in Grok patterns (e.g.: IPORHOST, URI, QS [Quoted String], etc). There is even a “discover” feature that will attempt to build a Grok pattern based on a sample line of logging you give it. While not 100% perfect, it does help make quick work of creating a custom Grok pattern, should the need arise.

Bringing it all Home

At this point I will assume you have done the following:

  • Deployed Logstash
  • Started passing logs through Logstash
  • Stored Logstash’s output in ElasticSearch
  • Used Kibana to query ElasticSearch, just to make sure things are working right

If not, here are some links you need to get these items working:

  1. Simple Logstash deployment (which includes an embedded ElasticSearch instance)
  2. Kibana deployment (simply unzip/untar and plunk in the Web Server of your choosing)

You can certainly get way more elaborate than this, but I’d recommend getting started with the simple version above. I’ve had such a setup ingest ~3GB of logs a day on a plain vanilla CentOS virtual machine with just 2 vCPU and 4GB of RAM – and it barely broke a sweat.

IOW, crawl before you walk ;)

With that PSA out of the way, now comes the real work:

  1. Creating and verifying your customized Logstash config
  2. Creating and falling in love with your Kibana Dashboard

My Logstash Config

Note that I will not go into all of the ins and outs of Logstash file config as there is plenty of info on that across the Intert00bz, but here are some sources to get you started:

I will try – but likely not succeed – to not rehash too much of the content above as I explain why my finished Logstash config file looks like this:

input { 
   syslog {
      host => "0.0.0.0"
      port => 514
      type => "My Custom App"
   }
}

filter {
  grok {
    patterns_dir => "/opt/logstash/patterns/"
     match => [ "message", "%{TOMCAT_ACCESS}", "message", "%{APP_SERVER}" ]
   }
  mutate {
       convert => [ "responsecode", "integer", "responsesize", "integer", "resp_time", "integer", "geoip.latitude", "float", "geoip.longitude", "float" ]
  }
  if [user_agent] {
    useragent {
      source => "user_agent"
      target => "ua"
    }
  }
  if [sourceip] !~ "(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" and [sourceip] {
    geoip {
         source => "sourceip"
    }
      mutate {
      # 'coords' will be kept, 'tmplat' is temporary.
      # Both of these new fields are strings.
      add_field => [ "coords", "%{[geoip][longitude]}",
             "tmplat", "%{[geoip][latitude]}" ]
      }
      mutate {
      # Merge 'tmplat' into 'coords'
      merge => [ "coords", "tmplat" ]
      }
      mutate {
      # Convert our new array of strings back to float
      convert => [ "coords", "float" ]
      # Delete our temporary latitude field
      remove => [ "tmplat" ]
      }
  } 
}

output { 
   elasticsearch { embedded => true }
}

… and here is the file containing my custom TOMCAT_ACCESS and APP_SERVER Grok patterns:

SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:
SYSLOGPROG %{PROG:program}(?:\[%{DATA:pid}\])?
TOMCAT_ACCESS %{SYSLOGBASE} %{IP:sourceip} - - %{DATA:sessionid} \[%{HAPROXYDATE}\] %{QS:request} %{INT:responsecode} %{DATA:responsesize} %{QS:referrer} %{QS:user_agent} %{INT:resp_time}
APP_SERVER %{SYSLOGBASE} %{DATA:level}  \[%{DATA:classname}\] - \[%{DATA:threadid}\] - \[%{DATA:userid}\] - \[%{DATA:sessionid}\]: %{GREEDYDATA:payload}

While I certainly could have put the patterns directly in the Logstash config file itself, using a separate patterns file makes the config easier to read and manage.

The reason I had to create a pattern file was twofold:

  1. I had to override the built-in SYSLOGPROG – and by extension, SYSLOGBASE – patterns because of our goofiness in not spitting out a PID along with the program name (e.g.: httpd:[])
  2. Also, since our formatting of our Tomcat Access Logs deviated from the norm, I had to create a custom filter for that

Um, WTF is the “geoip” and “coords” stuff in there?

I briefly mentioned that we can GeoLocate clients based on their IP address. This magic comes courtesy of Logstash, which ships with a built-in DB that maps IP addresses to physical locations – Longitude and Latitude coordinates, to be exact. These coordinates are based on data from the various Internet Registries (ARIN, RIPE, etc).

This leads to an issue for “internal” applications: What to do about private IP addresses? Obviously, those will not be in the various Internet Registries. Unfortunately, we can’t customize these, either. Logstash gets this Geolocation data from the kind folks at MaxMind. In order to keep this free (as in beer), the Geolocation DB is in MaxMind’s proprietary format. More on handling internal IPs coming up in a bit.

Back to the Geolocation DB: The proprietary, binary format means we can’t (not easily, at least) use a customized version of the DB to specify where private subnets actually physically reside within our organization. Of course, Logstash is Open Source, thus you could implement support for the CSV formatted DBs MaxMind provides, but that’s a topic for another post.

I will start off by saying that a vast majority of the Geolocation config was blatantly stolen from Bryan White’s post you can find here. The short version is that you have to do some conversion of the coordinates to get them into the format that Kibana needs for the Bettermap panel. The really short version is just copy and paste what he has kindly provided and move on :)

Now, that just leaves the matter of how to handle internal IP addresses. Logstash recently added the idea of “conditionals” in its config files. This allows you to perform actions against a log statement only if certain conditions are met.

In this case, we want to skip Geolocation processing if either of these situations are present in a given log statement:

  • The sourceip field is missing
  • The value of the sourceip field falls into the range of internal IP addresses (e.g.: 10.x.x.x, 172.16.x.x through 172.31.x.x and 192.168.x.x)

The reason we want to skip Geolocation for these is the coordinates end up being 0,0 – which is somewhere West of Africa in the middle of the Atlantic. Not terribly useful data and it makes us wonder why so many people are visiting our site from the middle of the ocean ;)

As a result, I have the entire block of Bryan’s Geolocation config sitting inside this conditional:
if [sourceip] !~ "(^127\.0\.0\.1)|(^10\.)|(^172\.1[6-9]\.)|(^172\.2[0-9]\.)|(^172\.3[0-1]\.)|(^192\.168\.)" and [sourceip]

What follows the if is a regex, which is likely self-explanatory, but just in case, it translates to:

  • If the field sourceip:
    • Is not in the range of private IP addresses
    • and
    • Is present (which is why you see [sourceip] after the and near the end of the statement)

… then go ahead and do Geolocation for this log statement, which is done via the geoip filter and the output of geoip is reformatted by the mutate filters to get them into the format Kibana expects.

The reason for checking that sourceip is present in the first place is because I am receiving logs from all manner of locations (e.g.: messages logs, etc), many of which will not have an IP address in them. If I did not filter these out, I would get a lot of log statements with coordinates of 0,0. I learned that the hard way (see the “Iterate” section above).

The end result of this config is:

  • I am parsing Tomcat Access Logs and my application’s custom logs to get custom fields, some of which I defined (e.g.: userID) and some of which leverage built-in field names within Logstash (e.g.: user_agent)
  • All other logs (e.g.: messages, cron, etc) are simply stored as they are received with no special field creation, as I typically don’t have a need to search and/or chart on these
    • Note: Logstash still creates some fields, depending on the log type
    • For example, it will recognize cron logs and set the program field to “cron”
  • The output is going into the embedded ElasticSearch

Kibana Dashboard

At this point, I have laid the groundwork to get the fields necessary to populate my Kibana Dashboard. Again, this Dashboard is comprised of Panels, each of which will look for the presence of a given field for a document in ElasticSearch. Based on the contents of that field – and the type of panel – Kibana will visualize that data for us.

For example, the Bettermap panel will look for the value of the coords field in each record it receives. Then, based on that value, it will place a “pin” in the location corresponding to that value. A Terms panel can be configured to look at things like sourceip and give me a list of the top X (e.g.: 10) most common source IP addresses found in the logs, along with how many times that IP address appeared in the logs for a given time frame.

Before I get to the shiny, let me detail each of the panels in my Dashboard so you can better understand the resulting charts, etc.

  • “Events Over Time”: This is a Histogram panel that gives me the total number of log statements (stored as “documents” in ElasticSearch). Depending on the time range chosen, Kibana will aggregate these. For example, if you look at an hour’s worth of logs, it will aggregate the total number of documents by chunks of 30 seconds, then graph that aggregated count. Note: This is a raw count of all logging seen for the given time frame, regardless of log source.

[Screenshot: the Events Over Time histogram]

  • “Mean Response Time”: This is the mean value of the response time, as found in only the Tomcat Access Logs. When you configure a Panel to look at the target field, that Panel will formulate its query appropriately. For example, the “Mean Response Time” Panel is only concerned about documents in ElasticSearch where the resp_time field (which is one I custom-defined in my patterns file) is present. The result is that the Panel issues a query like this to ElasticSearch to get the data it will graph:
curl -XGET 'http://10.194.190.245:9200/logstash-2013.12.05/_search?pretty' -d '{
  "facets": {
    "2": {
      "date_histogram": {
        "key_field": "@timestamp",
        "value_field": "resp_time",
        "interval": "30s"
      },
      "facet_filter": {
        "fquery": {
          "query": {
            "filtered": {
              "query": {
                "query_string": {
                  "query": "*"
                }
              },
              "filter": {
                "bool": {
                  "must": [
                    {
                      "match_all": {}
                    },
                    {
                      "range": {
                        "@timestamp": {
                          "from": 1386258436590,
                          "to": "now"
                        }
                      }
                    },
                    {
                      "bool": {
                        "must": [
                          {
                            "match_all": {}
                          }
                        ]
                      }
                    }
                  ]
                }
              }
            }
          }
        }
      }
    }
  },
  "size": 0
}'

[Screenshot: the Mean Response Time histogram]

  • “Most Frequent IPs”: As the title implies, this is the Top X (25, in this case) most frequently seen IP addresses, based on the sourceip field. As is the case with many of the fields in this Dashboard, this field is only present in Tomcat Access Logs.

[Screenshot: the Most Frequent IPs panel (redacted)]

  • “Top Programs”: This Panel uses Logstash’s built-in capability to discern which application wrote a given log statement (e.g.: cron, httpd, sshd, monit, etc)

[Screenshot: the Top Programs panel]

  • “HTTP Response Codes”: This is a histogram charting the (hopefully low) total number of HTTP response codes that are > 400, with the idea being to see how many “errors” are returned to the users.

[Screenshot: the HTTP Response Codes histogram]

  • “Trends”: Using the Trends Panel, this compares the number of >300 Response Codes seen in this time frame to the same time frame in a previous period. In my case, it is 7 days ago, as my default view shows the previous hour. I chose 7 days to avoid situations where I would compare a weekday to a day on the weekend, as usage is mainly during the work week in my case.

Note that the Trends Panel is only as valuable as the data you feed it. If you have a small number of hits for a given term, you can see massive jumps in percentage compared to your trend target.

As you can see from the example below, I need to re-evaluate the usefulness of this panel. But, I did say to iterate, right? :)

[Screenshot: the Trends panel]

  • “Browsers”: This is a pie chart breaking down the ratio of various client browsers, based on the User Agent string.
  • “Operating Systems”: Another pie chart to show the breakdown of client OS’, again based on the User Agent string.

[Screenshot: the Browsers and Operating Systems pie charts]

  • “HTTP Errors”: A pie chart showing a breakdown of the same data shown in the HTTP Response Codes Histogram.

[Screenshot: the HTTP Errors pie chart]

  • “Server Activity”: This Terms Panel shows the relative amount of logging from each of the node types in the distributed application I am monitoring (e.g.: Tomcat, memcached, MongoDB, etc)

[Screenshot: the Server Activity panel]

  • “Client Locations”: A Bettermap Panel that uses the Geolocation coords field to show where client requests are physically coming from

[Screenshot: the Client Locations Bettermap panel]

  • “Logs”: A Table Panel that shows the @timestamp, message, and sourceip fields. Clicking any row in this table expands to show all of the fields related to this log statement. This is part of the canned Logstash dashboard that ships with Kibana.

[Screenshot: the Logs panel]

Enough, already. Just show me the whole Dashboard!

OK, OK. Geez.

The final result is below. It is admittedly a bit busy, but I’ve built it to be as much of a tech demo as it is also a useful tool.

[Screenshot: the full Kibana dashboard]

I hope this post helps you create a useful dashboard of your own!

 

 

Update: Adding Kibana Template

I’ve had a few folks ask for the Kibana template used for the dashboards above. So, here it is! Hope you find it useful.

 

{
 "title": "example-dashboard",
 "services": {
 "query": {
 "idQueue": [
 1,
 4,
 7,
 8,
 9,
 10
 ],
 "list": {
 "0": {
 "alias": "200s",
 "query": "responsecode:[200 TO 299]",
 "id": 0,
 "color": "#7EB26D",
 "pin": false,
 "type": "lucene"
 },
 "2": {
 "id": 2,
 "color": "#6ED0E0",
 "query": "*",
 "alias": "All Requests",
 "pin": false,
 "type": "lucene"
 },
 "3": {
 "alias": "300s",
 "query": "responsecode:[300 TO 399]",
 "id": 3,
 "color": "#EF843C",
 "pin": false,
 "type": "lucene"
 },
 "5": {
 "alias": "400s",
 "query": "responsecode:[400 TO 499]",
 "id": 5,
 "color": "#1F78C1",
 "pin": false,
 "type": "lucene"
 },
 "6": {
 "id": 6,
 "color": "#BA43A9",
 "query": "responsecode:[500 TO 599]",
 "alias": "500s",
 "pin": false,
 "type": "lucene"
 }
 },
 "ids": [
 0,
 3,
 5,
 6,
 2
 ]
 },
 "filter": {
 "idQueue": [
 1,
 2,
 3
 ],
 "list": {
 "0": {
 "type": "time",
 "field": "@timestamp",
 "from": "now-1h",
 "to": "now",
 "mandate": "must",
 "active": true,
 "alias": "",
 "id": 0
 }
 },
 "ids": [
 0
 ]
 }
 },
 "rows": [
 {
 "title": "Graph",
 "height": "350px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "span": 12,
 "editable": true,
 "group": [
 "default"
 ],
 "type": "histogram",
 "mode": "count",
 "time_field": "@timestamp",
 "value_field": null,
 "auto_int": true,
 "resolution": 100,
 "interval": "30s",
 "fill": 3,
 "linewidth": 3,
 "timezone": "browser",
 "spyable": true,
 "zoomlinks": true,
 "bars": false,
 "stack": false,
 "points": false,
 "lines": true,
 "legend": true,
 "x-axis": true,
 "y-axis": true,
 "percentage": true,
 "interactive": true,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "title": "Events over time",
 "intervals": [
 "auto",
 "1s",
 "1m",
 "5m",
 "10m",
 "30m",
 "1h",
 "3h",
 "12h",
 "1d",
 "1w",
 "1M",
 "1y"
 ],
 "options": true,
 "tooltip": {
 "value_type": "cumulative",
 "query_as_alias": false
 }
 }
 ],
 "notice": false
 },
 {
 "title": "Average Response Size",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "span": 12,
 "editable": true,
 "type": "histogram",
 "loadingEditor": false,
 "mode": "mean",
 "time_field": "@timestamp",
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "value_field": "resp_time",
 "auto_int": true,
 "resolution": 100,
 "interval": "30s",
 "intervals": [
 "auto",
 "1s",
 "1m",
 "5m",
 "10m",
 "30m",
 "1h",
 "3h",
 "12h",
 "1d",
 "1w",
 "1M",
 "1y"
 ],
 "fill": 0,
 "linewidth": 3,
 "timezone": "browser",
 "spyable": true,
 "zoomlinks": true,
 "bars": false,
 "stack": false,
 "points": false,
 "lines": true,
 "legend": true,
 "x-axis": true,
 "y-axis": true,
 "percentage": false,
 "interactive": true,
 "options": true,
 "tooltip": {
 "value_type": "cumulative",
 "query_as_alias": false
 },
 "title": "Mean Response Time"
 }
 ],
 "notice": false
 },
 {
 "title": "Errors",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "error": false,
 "span": 8,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "all",
 "ids": [
 0,
 3,
 5,
 2,
 6
 ]
 },
 "field": "sourceip",
 "exclude": [
 "127.0.0.1",
 "10.42.7.5",
 "10.42.7.6"
 ],
 "missing": false,
 "other": false,
 "size": 25,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "bar",
 "counter_pos": "above",
 "spyable": true,
 "title": "Most Frequent IPs"
 },
 {
 "error": false,
 "span": 4,
 "editable": true,
 "type": "pie",
 "loadingEditor": false,
 "query": {
 "field": "program",
 "goal": 100
 },
 "queries": {
 "mode": "all",
 "ids": [
 0,
 3,
 5,
 2,
 6
 ]
 },
 "size": 10,
 "exclude": [
 "su"
 ],
 "donut": false,
 "tilt": false,
 "legend": "above",
 "labels": true,
 "mode": "terms",
 "default_field": "DEFAULT",
 "spyable": true,
 "title": "Top Programs"
 }
 ],
 "notice": false
 },
 {
 "title": "Response Codes",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "span": 8,
 "editable": true,
 "type": "histogram",
 "loadingEditor": false,
 "mode": "count",
 "time_field": "@timestamp",
 "queries": {
 "mode": "selected",
 "ids": [
 3,
 5
 ]
 },
 "value_field": "responsecode",
 "auto_int": true,
 "resolution": 100,
 "interval": "30s",
 "intervals": [
 "auto",
 "1s",
 "1m",
 "5m",
 "10m",
 "30m",
 "1h",
 "3h",
 "12h",
 "1d",
 "1w",
 "1M",
 "1y"
 ],
 "fill": 0,
 "linewidth": 3,
 "timezone": "browser",
 "spyable": true,
 "zoomlinks": true,
 "bars": false,
 "stack": true,
 "points": false,
 "lines": true,
 "legend": true,
 "x-axis": true,
 "y-axis": true,
 "percentage": false,
 "interactive": true,
 "options": false,
 "tooltip": {
 "value_type": "cumulative",
 "query_as_alias": false
 },
 "title": "HTTP Response Codes"
 },
 {
 "span": 2,
 "editable": true,
 "type": "trends",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 3,
 5,
 2,
 6
 ]
 },
 "style": {
 "font-size": "16pt"
 },
 "ago": "1d",
 "arrangement": "vertical",
 "spyable": true,
 "title": "Trends"
 },
 {
 "error": false,
 "span": 2,
 "editable": true,
 "type": "derivequeries",
 "loadingEditor": false,
 "loading": false,
 "label": "Search",
 "query": "*",
 "ids": [
 0,
 1,
 3,
 4,
 5
 ],
 "field": "responsecode",
 "fields": [
 "@timestamp",
 "@version",
 "classname",
 "device",
 "facility",
 "facility_label",
 "haproxy_hour",
 "haproxy_milliseconds",
 "haproxy_minute",
 "haproxy_month",
 "haproxy_monthday",
 "haproxy_second",
 "haproxy_time",
 "haproxy_year",
 "host",
 "level",
 "logsource",
 "major",
 "message",
 "minor",
 "name",
 "os",
 "os_major",
 "os_minor",
 "os_name",
 "patch",
 "payload",
 "pid",
 "priority",
 "program",
 "referrer",
 "request",
 "resp_time",
 "responsecode",
 "responsesize",
 "sessionid",
 "severity",
 "severity_label",
 "sourceip",
 "tags",
 "threadid",
 "timestamp",
 "type",
 "ua.device",
 "ua.major",
 "ua.minor",
 "ua.name",
 "ua.os",
 "ua.os_major",
 "ua.os_minor",
 "ua.os_name",
 "ua.patch",
 "user_agent",
 "userid"
 ],
 "spyable": true,
 "rest": false,
 "size": 5,
 "mode": "terms only",
 "exclude": [],
 "history": [
 "*",
 "206"
 ],
 "remember": 10,
 "title": "HTTP Response Codes"
 }
 ],
 "notice": false
 },
 {
 "title": "OS and Browser Info",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "field": "ua.name",
 "exclude": [
 "other"
 ],
 "missing": false,
 "other": false,
 "size": 10,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "pie",
 "counter_pos": "above",
 "spyable": true,
 "title": "Browsers"
 },
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "field": "ua.os_name",
 "exclude": [
 "other",
 "x",
 "os",
 "7",
 "8",
 "10",
 "xp",
 "vista"
 ],
 "missing": false,
 "other": false,
 "size": 10,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "pie",
 "counter_pos": "above",
 "spyable": true,
 "title": "Operating Systems"
 },
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "pie",
 "loadingEditor": false,
 "query": {
 "field": "responsecode",
 "goal": 100
 },
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "size": 10,
 "exclude": [
 "200",
 "304",
 "302"
 ],
 "donut": false,
 "tilt": false,
 "legend": "above",
 "labels": true,
 "mode": "terms",
 "default_field": "DEFAULT",
 "spyable": true,
 "title": "HTTP Statuses"
 },
 {
 "error": false,
 "span": 3,
 "editable": true,
 "type": "terms",
 "loadingEditor": false,
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "field": "logsource",
 "exclude": [
 "example.com",
 "1",
 "2",
 "3",
 "server"
 ],
 "missing": false,
 "other": false,
 "size": 10,
 "order": "count",
 "style": {
 "font-size": "10pt"
 },
 "donut": false,
 "tilt": false,
 "labels": true,
 "arrangement": "horizontal",
 "chart": "pie",
 "counter_pos": "above",
 "spyable": true,
 "title": "Server Activity"
 }
 ],
 "notice": false
 },
 {
 "title": "Logs",
 "height": "150px",
 "editable": true,
 "collapse": false,
 "collapsable": true,
 "panels": [
 {
 "error": false,
 "span": 12,
 "editable": true,
 "type": "table",
 "loadingEditor": false,
 "status": "Stable",
 "queries": {
 "mode": "selected",
 "ids": [
 2
 ]
 },
 "size": 100,
 "pages": 5,
 "offset": 0,
 "sort": [
 "@timestamp",
 "desc"
 ],
 "group": "default",
 "style": {
 "font-size": "9pt"
 },
 "overflow": "min-height",
 "fields": [
 "@timestamp",
 "message",
 "sourceip"
 ],
 "highlight": [],
 "sortable": true,
 "header": true,
 "paging": true,
 "field_list": true,
 "all_fields": true,
 "trimFactor": 300,
 "normTimes": true,
 "spyable": true,
 "title": "Logs"
 }
 ],
 "notice": false
 }
 ],
 "editable": true,
 "failover": false,
 "index": {
 "interval": "day",
 "pattern": "[logstash-]YYYY.MM.DD",
 "default": "NO_TIME_FILTER_OR_INDEX_PATTERN_NOT_MATCHED"
 },
 "style": "light",
 "panel_hints": true,
 "loader": {
 "save_gist": false,
 "save_elasticsearch": true,
 "save_local": true,
 "save_default": true,
 "save_temp": true,
 "save_temp_ttl_enable": true,
 "save_temp_ttl": "30d",
 "load_gist": true,
 "load_elasticsearch": true,
 "load_elasticsearch_size": 20,
 "load_local": true,
 "hide": false
 },
 "pulldowns": [
 {
 "type": "query",
 "collapse": false,
 "notice": false,
 "enable": true,
 "query": "*",
 "pinned": true,
 "history": [
 "responsecode:[500 TO 599]",
 "responsecode:[400 TO 499]",
 "responsecode:[300 TO 399]",
 "*",
 "responsecode:[200 TO 299]",
 "responsecode:>500",
 "responsecode:\"403\"",
 "responsecode:\"404\"",
 "responsecode:\"201\"",
 "responsecode:\"206\""
 ],
 "remember": 10
 },
 {
 "type": "filtering",
 "collapse": false,
 "notice": true,
 "enable": true
 }
 ],
 "nav": [
 {
 "type": "timepicker",
 "collapse": false,
 "notice": false,
 "enable": true,
 "status": "Stable",
 "time_options": [
 "5m",
 "15m",
 "1h",
 "6h",
 "12h",
 "24h",
 "2d",
 "7d",
 "30d"
 ],
 "refresh_intervals": [
 "5s",
 "10s",
 "30s",
 "1m",
 "5m",
 "15m",
 "30m",
 "1h",
 "2h",
 "1d"
 ],
 "timefield": "@timestamp",
 "now": true,
 "filter_id": 0
 }
 ],
 "refresh": "5m"
}

Why Latency – not Bandwidth – is Your Page Load Time Enemy

Preface

If you’re one of my non-nerd friends reading this, you might want to bail now. What follows is likely to not be your cup of tea.

tl;dr

If you are a nerd, yet don’t want to wade through the information below, here is a summary. Assuming all other variables (bandwidth, RTT, browser, etc) remain constant, you can use the following rules-of-thumb:
  • Page load times (PLT) grow linearly as Round Trip Time (RTT) increases
    • In the case of this study, they grow roughly according to this formula: PLT = (10 x RTT) + 2 seconds – YMMV depending on the page in question
    • The exact formula depends on the web app in question and the browser used, but the (near) linearity is universal*
  • For modern web apps – that is, ones that pull in various (typically large) JavaScript libraries and/or large, non-cacheable assets like frequently changing images – clients should have a bare minimum of 512kbps of bandwidth to/from the servers hosting the app/assets
    • Preferably, 1Mbps or greater for the best experience

* Note: Nothing is universal. I’m painting with a relatively broad brush, so please hold the “Yeah, but…” comments. I know there are always other variables to consider and corner cases to be found. I’m trying to keep this as “simple” as I can :)

Overview

The intent of this Post is to get some high-level insight into how latency (Round Trip Time, or RTT) and bandwidth affect page load performance for modern web apps.
This is not a new topic, per se. I’m not claiming to plant any flags here. The extra piece I have to add – which I haven’t had luck in finding, at least not all in one place – is understanding why we see these results… not just that we see them, period.
Also, I plan to follow this up with a study of how these two variables – RTT and bandwidth – interact. In the real world, there are various “profiles” of these combinations, such as:
  • Branch Office: Low RTT but limited – and contentious – bandwidth
  • Teleworker: Low/Mid RTT and limited bandwidth (often without any QoS)
  • Coffee Shop WiFi: Mid RTT and highly contentious, relatively low bandwidth
  • Hotel WiFi (AKA: damn near a 56k modem): Mid RTT and significantly oversubscribed bandwidth
  • Overseas worker: Plenty of bandwidth, but 150ms+ latency

Testing Methodology

I took measurements of various load times and subtracted server-side delay.
The items measured were:
  • Server-side response time (server-side delay)
  • Time to fully download the framework HTML
    • This particular web app uses Facebook’s BigPipe framework, which helps improve perceived performance on “slower” browsers (read: IE 8 and lower)
  • Time to the DOMContentLoaded event
  • Time to the Page Load event
Note that the Page Load event may or may not correlate with the end-user’s actual perceived page load time, but should be pretty close. To truly accurately measure that, we would need to use an automation framework like Selenium, which is outside the scope of this exercise.
These times were obtained via the built-in Dev Tools in Chrome and Safari, and via Firebug in Firefox.
The variables changed to observe their effect were:
  • Round Trip Time (RTT): From 0 to 300ms of added delay
  • Bandwidth: From 64kbps up to unrestricted WiFi (again, the exact bandwidth won’t matter, as you’ll read later)
These variables were manipulated using Charles Proxy. Unfortunately, Charles Proxy does not have a facility to add random packet loss, so this was not tested.
Also, I opted to use WiFi, as this is the connection medium of choice for most end users. Beyond adding a touch of latency, WiFi is not expected to introduce any significant delays. I did measure the OS-level RTT, which was ~59ms to the server(s). Keep that in mind when looking at the charts below: where you see 50ms in the RTT table, that is actually closer to 110ms RTT in real life. Baseline (shown as 0), in this case, is ~59ms.

Summary

Unsurprisingly, the results mimicked prior studies on this subject. Similar findings from Googler Ilya Grigorik are posted here, which themselves are similar to this post by Stuart Cheshire back in 1996.
So, we’re not talking about a recent discovery. In the age of modern web applications, however, it has become an increasingly important one.
At the risk of spoiling the rest of this Post, in general:
  • Page load times increase linearly when RTT is increased
  • Page load times increase on a curve when bandwidth is reduced
What I hope to add to the conversation is to detail some of the reasons why these “truths” are, well, truths.
I will also see what rules-of-thumb we might be able to come up with regarding how these variables affect page load times.
Finally, I will also look at different browsers to see how they affect page load times.
Let’s get to it.

RTT

Here is a graph of the page load times in Chrome, as RTT was increased:
 Chrome RTT
You can see that the longer RTT affects the speed at which we get the initial page, but not very much. It does, however, affect times to load the DOM and – most importantly – the page as a whole.
The question is: Why?
This is because the first page you receive has references to tons of other items, such as JavaScript files and images. As RTT increases, the browser spends more of its time waiting for each request to be fulfilled. This, in turn, delays when other resources can be downloaded.
This screenshot from Firebug shows the times needed to request and receive various items on the page:
Screen Shot 2013-09-05 at 11.43.00 AM
Notice how it looks like a Gantt Chart, where you see roughly 6 items being downloaded at once. This goes back to our old friend, “The (typically) 6 TCP Session Limit.” For those new to this limitation, almost all modern browsers will only keep 6 requests to a given server outstanding at any one time.
[True, these can be tweaked, but I am trying to stick to a “stock” config here. Telling users to hack their registry or fiddle about in “about://settings” isn’t something non-geeks will be prone to try.]
To elaborate, let’s say that we need to download items 1 – 100 to fully load a given page. To do this, we consume all 6 TCP sessions getting the first 6 items. We can’t get the 7th item (or any other items) until one of those 6 requests is fulfilled. If one of those first 6 items takes a while to get, it holds up the other 94 behind it.
This is why you see the staggered waterfall in the image above. Most of these requests come back after roughly the same amount of delay, hence this “stair step”/Gantt Chart-like effect. Items 1-6 come back, then 7-12, then 13-18, and so on.
To help illustrate how the combination of longer RTT and the 6 TCP session limit affects page load times, take note of the brownish color in the image above. That indicates the amount of time that the browser was “blocked” before it could send that particular request. This blocking comes from the TCP session limit; it is the time a given request spends “waiting” on one of those 6 sessions to free up.
Here is an example breakdown for one of these affected requests at 300ms RTT. Again, note the brown bar, indicating how long the request is blocked:
Screen Shot 2013-09-05 at 11.43.57 AM
That’s almost half a second (425ms)… just waiting to send the request. There were others later on in the page load waiting > 1.5 seconds at 300ms RTT.
When you consider that many modern web applications will average close to 100 requests (my Gmail page involves ~95 such requests, for example), this is not insignificant – which is what the RTT data shows us.
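If you want to convince yourself of the math, here is a rough back-of-the-envelope model of my own (it ignores transfer time, TCP slow start, and server-side delay) showing how the 6-session limit turns RTT into serialized “waves” of requests:

import math

def blocked_waterfall(num_requests, rtt_seconds, sessions=6):
    # Crude model: requests go out in waves of `sessions` at a time,
    # and each wave costs roughly one round trip.
    waves = math.ceil(num_requests / sessions)
    return waves * rtt_seconds

# ~100 requests, like the Gmail example above
for rtt_ms in (50, 100, 200, 300):
    total = blocked_waterfall(100, rtt_ms / 1000.0)
    print("%dms RTT -> ~%.1fs spent just cycling through request waves" % (rtt_ms, total))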

RTT Rule-of-Thumb

OK, great. How do we get to a simple formula to help understand how RTT affects page load time? Simple: We use a trend line.
Looking at the trend line in the RTT chart above, we see the formula for that line is:
  • y = 9.119x + 2.124
Translated, this means we can estimate page load times for my test app in Chrome by:
  • Multiplying the observed RTT (in fractions of a second, e.g.: 100ms = 0.1s) by 9
    • Run a ping to get the current RTT from a given client
  • Adding 2 seconds to the result
Again, we are not looking to be 100% accurate; it is a rule-of-thumb, not an absolute. Also, remember that we are talking about changing RTT only. Once you start playing with multiple variables (RTT and bandwidth, as would be the case for a telecommuter), the story is not so straightforward.
The rules-of-thumb for Safari and Firefox are similar, except that their multipliers are roughly 11 and 10, respectively.
To make things simple, we can split the difference and say that:
  • Page load time = (10 x RTT) + 2
    • This assumes other variables like bandwidth, browser used, etc. remain constant
Again: This ROT is based on one page in a modern web application. Pages with more or fewer HTTP requests will be affected more or less, and hence will have different values in their ROT.
The point is that for any given page, it is a reasonable assumption that page load times increase roughly linearly as RTT increases.
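If you want that rule-of-thumb in a form you can paste into a Python console, here is a trivial helper. The 10x multiplier and the 2-second constant come from this one app’s trend line, so treat the output as a ballpark, not gospel:

def estimated_plt(rtt_seconds, multiplier=10, base_seconds=2):
    # PLT ~= (multiplier x RTT) + base, per the trend line discussed above
    return multiplier * rtt_seconds + base_seconds

print(estimated_plt(0.100))  # 100ms RTT -> ~3.0 seconds
print(estimated_plt(0.300))  # 300ms RTT -> ~5.0 seconds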

Bandwidth

Developing a formula for how reduced bandwidth affects page load times is a bit trickier.
Here are page load times for my test web app starting at 64k, working up to T1 speed and through to WiFi connectivity:
Chrome - Bandwidth
An aside: The Y axis is seconds and the X axis is kbps. Note that the “1600 kbps” entry is actually the data for unrestricted WiFi connectivity. If I plotted that point at true 802.11n speeds, it would squish everything to the “left” such that the chart was just a cliff. OK, back to the Post…
The trend line for this one is not so simple – nor conducive to a simple rule-of-thumb. What we can see is that bandwidth significantly affects page load times once you get below 512kbps. At 256kbps, we see a small uptick, but at 128kbps and 64kbps, page load times jump drastically.
Given that most of the target audience for modern web apps is on connections faster than 512kbps, bandwidth will typically not be a significant contributor to page load times for most users.
But, keep this in mind the next time you go to load GMail while on Gogo or at crappy coffee house WiFi (crappy can define the coffee house or their WiFi, or both – your call :) ).
In conclusion, our rule-of-thumb for bandwidth is:
  • Minimum bandwidth: 512kbps
  • Preferred bandwidth: 1Mbps or greater

Comparing Browsers

We have seen that page load times only really suffer when bandwidth gets seriously constrained, and that bandwidth is not a significant issue as long as at least 512kbps is available.
We also saw that page load time tracks linearly with RTT. So what about different browsers? How do they change things?
Comparing Chrome, Safari, and Firefox, there was not much to differentiate any of them. The numbers speak for themselves.

An Aside: Internet Explorer

This chart will become more interesting when I add Internet Explorer in a future update, as Internet Explorer (at least through IE9) has some well-known limitations for “modern” websites. Namely, it downloads JavaScript files serially, whereas other browsers download them in parallel.
True, JavaScript still has to be processed “in order,” but that doesn’t mean you need to wait to download each .js file until the one before it has been downloaded and processed. In many cases, one of the first chunks of JavaScript to be downloaded will be a large library like Dojo or jQuery. This means an IE user will sit for a long time, with no apparent page load progress, while the browser downloads that JavaScript. It logjams everything behind it.

RTT

RTT - Browsers

Bandwidth

Interestingly, there was a bit more of a delta between the browsers when bandwidth is seriously constrained. That said, I’m not sure that the difference between a 21-second (Chrome) and a 27-second (Firefox) page load time matters. You’ve likely lost the user to some other distraction at that point.
Bandwidth - Browsers

OK, So What Can We Do About This?

Again, everything above isn’t newly discovered. Because of this, a few different schools of thought have come about to mitigate the effect of RTT and bandwidth constraints, including:

  1. Split content across multiple hosts to get around the 6 TCP session limit
  2. Content Delivery Networks
  3. Minifying
  4. Lazy loading
  5. The oldest of them all: Caching

The Internet is full of Posts covering these topics, so I won’t rehash them, but the gist is that they all aim to:

  1. Minimize the size of the content needed to load a given page
  2. Minimize the number of requests needed to load a given page
  3. Parallelize the requests needed to load a given page

While reducing content size obviously helps bandwidth-constrained clients, the bigger win from shrinking a given payload is that it frees up that TCP session sooner, which unblocks later requests faster.

If we look at this from the bigger picture, there is really only one answer to “how do we make pages load faster”: Reduce the number of requests. That ultimately has the biggest bang for the buck.

CDNs, caching, etc. all help, but if I still have to wait for the server to tell me that a given piece of content is unchanged (a 304), I still pay for that round trip, which holds up later requests, and so on.

As a case in point, for the RTT tests shown above, about 2/3 of the requests got a 304 back. The same number of requests – and the same mix of 200s vs. 304s – came back in every test. So, it’s not a matter of needing a bigger pipe. The only solution is to send fewer requests to begin with. This is one (of the many) things that minification intends to solve. Without minification and JavaScript bundling, the story would be significantly worse.
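
To make the “send fewer requests” point concrete, here is the same crude wave math applied to a hypothetical before/after bundling scenario – the request counts are invented for illustration:

import math

def waves(num_requests, sessions=6):
    # Number of serialized "waves" forced by the 6-session limit
    return math.ceil(num_requests / sessions)

rtt = 0.1          # 100ms round trip
unbundled = 100    # hypothetical: ~100 requests, even if ~2/3 come back as 304s
bundled = 30       # hypothetical: after bundling/minifying JS and CSS

print("Unbundled: ~%.1fs of serialized round trips" % (waves(unbundled) * rtt))
print("Bundled:   ~%.1fs of serialized round trips" % (waves(bundled) * rtt))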

Conclusion

If we take the info above and apply it to where things are going – namely mobility, both apps and the mobile web – we can see that while 3G vs LTE makes a difference, our biggest enemy is latency. Yes, LTE definitely gives us a bigger “pipe” but you’d be surprised at how much of the speed you perceive is related to the lower latency from your device to the carrier on LTE vs 3G technologies.

In fact, Ars Technica recently posted about how ViaSat – a satellite-based ISP – has speeds that rank well with the big boys. But ask anyone who has been forced to use the high-latency technology and you will hear tales of woe. My Dad, for example, ditched satellite Internet the second he could get DSL in his area. They love their low(er)-latency 1.5Mbps DSL way better than their satellite link, which would top out around 12Mbps on SpeedTest.net.

To crib from SNL’s “Lazy Sunday” skit, “It’s all about the latency, baby!”

… and I’m Ghost like Swayze.

Great Kid-Friendly RPG: Hero Kids

Hero Kids - Cover - Landscape - Awards - 800x600

Being the son of a geek, my son, Zachary, is absolutely obsessed with video games. Left unchecked, I’m sure he’d end up like one of those folks in an Internet café, dead due to playing games non-stop for 48+ hours.

I needed to find an outlet that, well, didn’t require an electrical outlet. I wanted to get him to exercise his imagination and find something that involved, you know, actually interacting with other humans.

I suspect you know where I am going with this (especially based on the title). My thoughts turned to when I was a couple of years older than he is now: the classic days of Dungeons and Dragons.

After all, he loved the exploration and leveling aspects of video games. And, while I won’t earn any parenting awards for this, he loved watching me play Skyrim and other RPGs. So, I was pretty sure he’d take to it.

In an interesting coincidence, there has been a resurgence of late for the older, simpler times of Basic D&D. Just Google retro-clone or “fantasy heartbreaker” and you’ll come across a plethora of re-spins of various versions of D&D. The issue is that in my quest for a kid-friendly pen-and-paper RPG, even the streamlined “retro-clones” (not to mention originals like Basic D&D) seemed a bit too much for an 8-year-old cutting his teeth on this stuff.

Luckily, my searching came across (almost) exactly what I was looking for: Hero Kids, by Justin Halliday.

tl;dr

  • Hero Kids nails its target demographic perfectly
  • It is a highly-produced, well thought-out indie product (what’s not to love about supporting indie devs?)
  • There are several pre-baked adventures that are very well written and produced (which is highly-valued when your kids want to play a new adventure each night)
  • It’s currently $15 for the core ruleset and all the adventures. If you are at all interested, just buy it, already.
  • Despite all of the praise above, there are some chinks in the armor – but none that keep me from wholeheartedly recommending Hero Kids

Hero Kids Review: The Adult Version

Hero Kids is exactly what you would expect it to be: cartoony (but not overly so) young kids tackling big challenges. Kids, say, around 4–10, which is the suggested age range for the game. They go on large adventures and tackle many foes pulled from the normal RPG bestiary (skeletons, giant rats, giant spiders, werewolves, etc), but in a way that’s not overly scary or gruesome. Think Saturday morning cartoon-style, not Heavy Metal or anime.

Game Mechanics

The game runs on a pretty simple mechanic: The attacker will have a pool of dice (anywhere from 1–3, depending on the character’s build) and the defender will have the same (again, 1–3, depending). Both sides roll their pool’s worth of d6’s, and the side with the highest number shown – not the highest sum – wins (with ties going to the attacker).

This means even the really young ones can count the pips on each die and get what’s going on. No modifiers, no addition. Given the target audience, Justin has designed this well. It is quick, simple, and fun.
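
For the probability-curious, the “highest die wins, ties go to the attacker” mechanic is easy to brute-force. This sketch is mine, not something from the rulebook, but it shows how much a single extra die swings the odds – which becomes relevant when we get to character advancement below:

from itertools import product

def attacker_win_probability(attack_dice, defense_dice):
    # Both sides roll pools of d6s; the single highest die wins, ties go to the attacker.
    wins = 0
    total = 0
    for attack_roll in product(range(1, 7), repeat=attack_dice):
        for defense_roll in product(range(1, 7), repeat=defense_dice):
            total += 1
            if max(attack_roll) >= max(defense_roll):
                wins += 1
    return float(wins) / total

for attack in (1, 2, 3):
    for defense in (1, 2, 3):
        pct = 100 * attacker_win_probability(attack, defense)
        print("%d attack vs %d defense: %.0f%% attacker win" % (attack, defense, pct))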

Speaking of design, Justin obviously put a lot of thought into the mechanic. You can read more on his blog about how he arrived at the final product.

Of course, there are also character abilities that can be active (“Split your <melee|ranged|magic> dice to attack multiple targets”) and passive (“Gain 1 extra dice to attack a target that an ally attacked since your last turn”, which suits the Rogue quite well).

On the downside, this system does not leave much room for character advancement. As discussed at the link above, simply adding a +1 modifier significantly affects the success rate, to say nothing of adding a die to the dice pool. That said, while I am not a game designer, I have some ideas for how we can add some character growth that I will bring up in a bit.

Intro to RPG Combat Tactics

Battles are fought on a standard grid, so this does give our wee players some introduction to combat tactics. There are terrain penalties (though few, in order to keep the game moving), cover bonuses (at DM’s discretion), and penalties for (or outright denial of) certain attacks depending on the range to the target.

The aforementioned character abilities (think: skills or feats) give the player a chance to engage in some simple tactics and teamwork.

There are also Ability Checks, based on the character’s dice pool. For example, moving a large boulder might call for an Ability Check against strength, represented by the Melee dice pool, in this case. Again, it’s about the highest number – not the sum – to determine success.

These are all gentle, yet very “big-boy”, concepts that will help your kiddo transition from Hero Kids to other, more complex RPGs when the time comes.

A quick aside: Justin has also authored “Heroes Against Darkness” (AKA: HAD). HAD is a “retro-clone” of sorts, even though it implements some of the more recent D&D mechanics. It is available for the very agreeable price of FREE(!) and, just like Hero Kids, is extremely polished and ready-to-go… unlike a lot of other free RPG rulesets. HAD leans more towards streamlined combat, which is right up my alley. If you are looking for something to graduate your little one to once they have outgrown Hero Kids, you could definitely do worse than HAD.

Hero Kids: Adventures

To continue with a recurring thread, the published Adventures for Hero Kids are very well produced. Each adventure:

  • States the typical time to complete (usually 30–60 minutes, which is great for pint-sized attention spans)
  • States the difficulty
  • Employs clearly boxed-in text to clue the GM on what to read to the players
  • Has blocks for each encounter explaining:
    • An overview of the encounter itself
    • The tactics of the monsters in the encounter

As you can imagine, the story lines are pretty simple, but what can you expect to flesh out in under an hour? They definitely lean more towards an “intro-combat-next room-more combat-boss-conclusion” style, which, again, suits the target audience. A couple of the adventures call for a bit more role play, but any GM worth their salt can (and will) tweak the material to what the group wants.

Monsters

Part of each adventure is – of course – the bad guys. Inside, you’ll find pre-scaled amounts and types of baddies for each encounter, depending on the number of players. Very handy for when your kid grabs you immediately after you walk in the door and says, “Can we play the next adventure right now, please?” Having this info already at hand makes it simple to give the adventure a quick skim and jump right in. There are even Hit Point boxes for each of the monsters so you have a convenient place to track their damage (though many only take one hit to KO).

Maps and Pawns

Best of all, each adventure comes with full-page 1″ grid maps for that adventure, ready to be printed. It also comes with character sheets for the various types of bad guys, each with their own cutout pawn.

Player Characters

Speaking of character sheets, the core rulebook comes with a variety of the classic classes: Fighter, Mage (called a Warlock), Healer, Ranger, etc. – in both boy and girl designs. Of course, there are blank character sheets you can use to craft your own character, too.

In the case of Hero Kids, there is no “rolling” your character’s stats. Rather, you have 4 “dice” to allocate to the 4 types of core abilities:

  • Melee
  • Ranged
  • Magic
  • Armor

Ranged and Magic each cost 2 “dice” for the first die in their respective dice pools, but Melee and Armor are straight 1:1. This means that if you opt for a non-Fighter, you will have to go either Ranged or Magic, as going with both would leave zero dice in your Armor dice pool. That leaves the offensive combos as:

  • Melee
  • Melee + Ranged
  • Melee + Magic
  • Magic
  • Ranged

That, along with the fact that your Special Action and Bonus Ability are not limited to a skills/feats list, gives your child plenty of opportunity for creating a character that is uniquely their own. Of course, you’ll have to temper these a bit. Having a Bonus Ability like “All of your damage is healed at the end of your turn” would not make for a challenging dungeon romp.

Leveling? We don’t need no stinkin’ leveling!

There are some limitations around character development that do crop up. Namely, there is no real character advancement: no leveling, no XP, no way to change your dice pools, gain modifiers, etc. (at least not without the Equipment cards, which I’ll try to use as salvation for this limitation in a bit).

So, if your player is on the older end of the 4–10 range, they likely are familiar with XP, leveling, and characters getting more powerful as they engage in new adventures. From Pokémon to Clash of Clans, kids have been exposed to this in many different ways. Leveling is a core part – at least for me – of the RPG experience. It is one of the more rewarding parts and one of the big motivators. Who doesn’t want to grow to be an all-powerful bad ass that smites foes with impunity?

A (half-baked) idea for the Leveling issue

As you can read in this blog post, this shortcoming isn’t lost on the author. With so few “knobs” to play with, it is a challenging issue.

I have not playtested this yet, but I did have the following thought:

  • Getting treasure (e.g.: GP) is another common trope of RPGs, yet it’s not in Hero Kids
  • Taking that gold to town to buy new/better stuff is also another worn out mechanic… but, we still use it because it works :)
  • So, what if we awarded GP based on encounters?
  • Players could then use that gold to buy stuff – namely items from the Equipment pack – to pimp out their character

Many of the Equipment cards work similarly to the improved skills and/or equipment that would be obtained in a “big-boy” RPG. For example, one shield adds one die to your armor dice pool at the expense of –2 movement.

What I am totally lost on is how to work the “economy” of such a system. Again, I’m not a game designer :) But, it’s something that I plan to play around with and see how it works. I’ll keep the super powerful equipment from the store until I get the numbers right.

OK, but what does your kid think of it?

Oh, yeah. I did buy this for my son – definitely not for me. Totally selfless act, I can assure you.

I could go into a lot of detail, but the best way to put it is this: my son has never done his chores and met expectations as quickly – and without prompting – as he has since we started playing Hero Kids. He wants to play a new adventure each night.

I couldn’t keep up, so I diverted his attention into making a character. Here’s the drawing he did last night of the Elf (sort-of) Ranger he’s working on:

Zach's Elf

(Not sure why his arms are so stiff, but…) I’ve never seen him take to a project with the level of interest he has shown in this character.

He is asking, however, for info on how he levels his character up – which leads back to the previous section. I suspect this will be a stumbling block for older kids who are familiar with the leveling concept. We’ll see how my cockamamie GP scheme works out.

Also, despite the fact that he’ll be 9 in December, Zach isn’t really one for scary stuff or blood and guts. This works well with the characters in Hero Kids, as they are drawn appropriately for the age group. I just wonder if other kids his age – some of whom have played games like Halo – might be turned off by the art. Again, those kids would likely be candidates to transition to HAD or the Pathfinder Beginner Box.

Parting Thoughts

Again, despite any negatives you’ve read above, if you’re even remotely interested in this type of game for your kids, I say you’ve wasted time reading my ramblings when you could have just downloaded the PDFs and started playing already. My nitpicking is minor and is only because when a product is this close to perfection, you want to see it get there.

Two thumbs up, 5/5, goes to 11… whatever you want, but it’s a must-buy if you’re in the market.