Transforming Cumulative Ceilometer Stats to Gauges

Overview

I’ve been dabbling a bit more with OpenStack as of late. If you know me, you can likely guess that my goal is figuring out how to ingest logs, monitor resources, etc.

I’ve been trying to see how well Ceilometer, one of the core components of OpenStack that actually provides some of this stuff, would work. Initially, I was a bit bummed, but after fumbling around for a while, I am starting to see the light.

You see, the reason I almost abandoned the idea of using Ceilometer was that some of the “meters” it provides are, well, a bit nonsensical (IMHO). For example, there’s network.outgoing.bytes, which is what you would expect… sort of. Turns out, this is a “cumulative” meter. In other words, this meter tells me the total number of bytes sent out a given Instance’s virtual NIC. Ceilometer has the following meter types:

  • Cumulative: Increasing over time (e.g.: instance hours)
  • Gauge: Discrete items (e.g.: floating IPs, image uploads) and fluctuating values (e.g.: disk I/O)
  • Delta: Changing over time (bandwidth)

Maybe I am naive, but it seems quite a bit more helpful to track this as a value over a given period… you know, so I can get a hint of how much bandwidth a given instance is using. In Ceilometer parlance, this would be a delta metric.
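To make that concrete, here’s a toy sketch (plain Python with made-up numbers, not anything out of Ceilometer) of the math I want done for me: take two cumulative readings and turn them into a per-second rate.

# Toy example: two cumulative readings of network.outgoing.bytes
# for the same instance, ten seconds apart.
samples = [
    {"timestamp": 0,  "volume": 1000000},   # total bytes sent so far
    {"timestamp": 10, "volume": 1500000},
]

delta_bytes = samples[1]["volume"] - samples[0]["volume"]
delta_secs = samples[1]["timestamp"] - samples[0]["timestamp"]

# 50000.0 bytes/sec over that window -- the number I actually care about
print(delta_bytes / float(delta_secs))

That per-period number is exactly what the Rate of Change transformer (more on that below) can hand me.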

I’ll take an aside here and defend the fine folks working on Ceilometer on this one. Ceilometer was initially built to generate non-repudiable billing info. Technically, AFAIK, that is the project’s only goal – though it has morphed to gain things like an alarm framework.

So, now you can see why network.outgoing.bytes would be cumulative: so you can bill a customer for serving up torrents, etc.

Anyway, I can’t imagine that I’m the only person looking to get a delta metric out of a cumulative one, so I thought I’d document my way of getting there. Ultimately there might be a better way, YMMV, caveat emptor, covering my backside, yadda yadda.

Transformers to the Rescue!

… no, not that kind of Transformer. Lemme ‘splain. No, there is too much. Let me sum up.

Actually, before we start diving in, let’s take a quick tour of Ceilometer’s workflow.

The general steps to the Ceilometer workflow are:

Collect -> (Optionally) Transform -> Publish -> Store -> Read

Collect

There are two methods of collection:

  1. Services (Nova, Neutron, Cinder, Glance, Swift) push data into AMQP and Ceilometer slurps them down
  2. Agents/Notification Handlers poll APIs of the services

This is where our meter data comes from.

Transform/Publish

This is the focus of this post. Transforms are done via the “Pipeline.”

The flow for the Pipeline is:

Meter -> Transformer -> Publisher -> Receiver

  • Meter: Data/Event being collected
  • Transformer: (Optional) Take meters and output new data based on those meters
  • Publisher: How you want to push the data out of Ceilometer
    • To my knowledge, there are only two options:
      • RPC: push samples back onto the message bus (AMQP) for the Ceilometer collector – this is the default (rpc://)
      • UDP: pack each sample with msgpack and send it to a host:port of your choosing (udp://)
  • Receiver: This is the system outside Ceilometer that will receive what the Publisher sends (Logstash for me, at least at the present – will likely move to StatsD + Graphite later on)

Store

While tangential to this post, I won’t leave you wondering about the “Store” part of the pipeline. Here are the storage options:

  • Default: Embedded MongoDB
  • Optional:
    • SQL
    • HBase
    • DB2
Honorable Mention: Ceilometer APIs

Like pretty much everything else in OpenStack, Ceilometer has a suite of open APIs that can also be used to fetch metering data. I initially considered this route, but in the interest of efficiency (read: laziness), I opted to use the Publisher vs rolling my own code to call the APIs.
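For the curious, the road not taken would look roughly like this – a sketch that pulls samples from Ceilometer’s v2 REST API with the requests library. The endpoint host and token below are placeholders, not values from my setup:

# Rough sketch of polling the Ceilometer v2 API instead of using a Publisher.
# "controller" and the token are placeholders; 8777 is Ceilometer's usual API port.
import requests

CEILOMETER_URL = "http://controller:8777"
TOKEN = "<keystone-token>"   # fetch a real token from Keystone first

resp = requests.get(
    CEILOMETER_URL + "/v2/meters/network.outgoing.bytes",
    headers={"X-Auth-Token": TOKEN},
)

# Each entry is one sample (counter name, volume, timestamp, resource, ...)
for sample in resp.json():
    print(sample)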

Working the Pipeline

There are two Transformers (at least that I see in the source):

  • Scaling
  • Rate of Change

In our case, we are interested in the latter, as it will give us the delta between two samples.

To change/use a given Transformer, we need to create a new pipeline via /etc/ceilometer/pipeline.yaml.

Here is the default pipeline.yaml:

---
 -
     name: meter_pipeline
     interval: 600
     meters:
         - "*"
     transformers:
     publishers:
         - rpc://
 -
     name: cpu_pipeline
     interval: 600
     meters:
         - "cpu"
     transformers:
         - name: "rate_of_change"
           parameters:
               target:
                   name: "cpu_util"
                   unit: "%"
                   type: "gauge"
                   scale: "100.0 / (10**9 * (resource_metadata.cpu_number or 1))"
     publishers:
         - rpc://

The “cpu_pipeline” pipeline gives us a good example of what we will need:

  • A name for the pipeline
  • The interval (in seconds) that we want the pipeline triggered
  • Which meters we are interested in (“*” is a wildcard for everything, but you can also have an explicit list for when you want the same transformer to act on multiple meters)
  • The name of the transformation we want to use (scaling|rate_of_change)
  • Some parameters to do our transformation:
    • Name: Optionally used if you want to override the metric’s original name
    • Unit: Like Name, can be used to override the original unit (useful for things like converting network.*.bytes from B(ytes) to MB or GB)
    • Type: If you want to override the default type (remember they are: (cumulative|gauge|delta))
    • Scale: A snippet of Python for when you want to scale the result in some way (would typically be used along with Unit)
      • Side note: This one seems to be required, as when I omitted it, I got the value of the cumulative metric. Please feel free to comment if I goobered something up there.

Looking at all of this, we can see that the cpu_pipeline, er, pipeline:

  1. Multiplies the number of vCPUs in the instance (resource_metadata.cpu_number) times 1 billion (10^9, or 10**9 in Python)
    • Note the “or 1”, which is a catch for when resource_metadata.cpu_number doesn’t exist
  2. Divides 100 by the result

That scale factor gets applied to the rate of change of the (cumulative) cpu meter, and the end result is a value that tells us how taxed the Instance is from a CPU standpoint, expressed as a percentage.
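If you want to sanity-check that math, here’s a quick back-of-the-envelope version with made-up numbers (the cpu meter is cumulative CPU time in nanoseconds, which is where the 10**9 comes from):

# Made-up numbers to check the cpu_util arithmetic.
cpu_number = 2                 # vCPUs from resource_metadata.cpu_number
delta_cpu_ns = 300 * 10**9     # CPU time accrued between two samples
delta_wall_s = 600             # seconds between those samples (the interval)

rate_ns_per_s = delta_cpu_ns / float(delta_wall_s)   # 5e8 ns of CPU per second
cpu_util = rate_ns_per_s * (100.0 / (10**9 * (cpu_number or 1)))

print(cpu_util)   # 25.0 -- the instance is using 25% of its two vCPUs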

Bringing it all Home

Armed with this knowledge, here is what I came up with to get a delta metric out of the network.*.bytes metrics:

-
    name: network_stats
    interval: 10
    meters:
        - "network.incoming.bytes"
        - "network.outgoing.bytes"
    transformers:
        - name: "rate_of_change"
          parameters:
              target:
                  type: "gauge"
                  scale: "1"
    publishers:
        - udp://192.168.255.149:31337

In this case, I’m taking the network.incoming.bytes and network.outgoing.bytes meters and passing them through the “Rate of Change” transformer to spit out a gauge from what was previously a cumulative metric.
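Before pointing this at anything fancy, a throwaway listener is a handy way to eyeball what the UDP Publisher is actually sending. Here’s a minimal sketch, assuming the msgpack Python library is installed and using the same port as the publisher above:

# Quick-and-dirty listener to peek at what Ceilometer's UDP publisher emits.
import socket
import msgpack

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 31337))   # match the port in the udp:// publisher

while True:
    data, addr = sock.recvfrom(65535)
    sample = msgpack.unpackb(data)   # each datagram is one msgpack'd sample
    print(sample)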

I could have taken it a step further (and likely will) and used the scale parameter to change it from bytes to KB. For now, I am playing with OpenStack in a VM on my laptop, so the amount of traffic is small. After all, the difference between 1.1 and 1.4 in a histogram panel in Kibana isn’t very interesting looking :)

Oh, I forgot… the Publisher. Remember how I said the UDP Publisher uses msgpack to stuff its data in? It just so happens that Logstash has both a UDP input and a msgpack codec. As a result, my Receiver is Logstash – at least for now. Again, it would make a lot more sense to ship this through StatsD and use Graphite to visualize the data. But, even then, I can still use Logstash’s StatsD output for that. Decisions, decisions :)

Since the data is in Logstash, that means I can use Kibana to make pretty charts with the data.

Here are the bits I added to my Logstash config to make this happen:

input {
  udp {
    port => 31337
    codec => msgpack
    type => ceilometer
  }
}

At that point, I get lovely documents in Elasticsearch like the following:

{
  "_index": "logstash-2014.01.16",
  "_type": "logs",
  "_id": "CDPI8-ADSDCoPiqY9YqlEw",
  "_score": null,
  "_source": {
    "user_id": "21c98bfa03f14d56bb7a3a44285acf12",
    "name": "network.incoming.bytes",
    "resource_id": "instance-00000009-41a3ff24-f47e-4e29-86ce-4f50a61f78bd-tap30829cd9-5e",
    "timestamp": "2014-01-16T21:54:56Z",
    "resource_metadata": {
      "name": "tap30829cd9-5e",
      "parameters": {},
      "fref": null,
      "instance_id": "41a3ff24-f47e-4e29-86ce-4f50a61f78bd",
      "instance_type": "bfeabe24-08dc-4ea9-9321-1f7cf74b858b",
      "mac": "fa:16:3e:95:84:b8"
    },
    "volume": 1281.7777777777778,
    "source": "openstack",
    "project_id": "81ad9bf97d5f47da9d85569a50bdf4c2",
    "type": "gauge",
    "id": "d66f268c-7ef8-11e3-98cb-000c29785579",
    "unit": "B",
    "@timestamp": "2014-01-16T21:54:56.573Z",
    "@version": "1",
    "tags": [],
    "host": "192.168.255.147"
  },
  "sort": [
    1389909296573
  ]
}

Finally, I can point a Histogram panel in Kibana at the “volume” field for these documents and graph the result, like so:

[Screenshot: Kibana histogram panel keyed on the volume field]

OK, so maybe not as pretty as I sold it to be, but that’s the data’s fault – not the toolchain :) It will look much more interesting once I mirror this on a live system.

Hopefully someone out there on the Intertubes will find this useful and let me know if there’s a better way to get at this!