… at least not totally on their own.
(Image courtesy of Michael Mol)
Maybe I missed the memo, but am I the only person who thinks Devs writing logs that only they can decipher is a bad idea? Seems so, because on the web and in books, I see statements like this:
Application debug logging is a special type of logging that is useful to application/system developers and not system operators. Such logging is typically disabled in production systems but can be enabled on request. Many of the messages in debugging logs can already be analyzed by the application developer with full knowledge of application internals and sometimes also with possession of application source code.
With all due respect to Chuvakin et al., their statement above reflects how most people feel about debug logging:
“The Devs are the only ones who understand this, so let them put whatever they want in there. I sure as hell won’t understand it. It’ll end up in their lap, anyway.”
I can’t scream loud enough how horribly inefficient this is. It creates a self-fulfilling prophecy where people assume that Devs are the only ones who can troubleshoot their own code. So, why even try, right?
The reality is that properly informed support staff are more than capable of debugging many issues, even without access to the source code… or the humans who wrote it.
But, you have to give them enough info to do their jobs.
Am I saying proper logging will eliminate the need to pull in a Dev from time to time? No, but ignoring it ensures they’ll be needed much more often than they should be.
Devs as Support – AKA: You’re doing it wrong
You can get away with the “Devs as support” philosophy when you work in a startup. In fact, it’s actually a necessity. Justin is the node.js guy, so he does the back-end stuff. He’s a bit of a sysadmin, too. His stuff lives on the servers, after all. All the backend stuff will come to Justin. Let Justin write what makes sense to him.
Once you get into a larger-scale operation, however, people must specialize in order to gain efficiencies of scale.
If I run a 100-person Dev shop, I don’t want Justin debugging backend issues! I want him focused on new code. I want my support staff to take the first crack at figuring out the issue. They can pull him in, if needed.
In order to get to the ideal state – “Devs as Devs” mode – any support staff needs the following to show value:
- Logs they can understand, without having to break out the source code to do so
- System- and Code-level design documentation to help them understand what the code is doing, without having to break out the source code to do so
- Commonly hit issues with simple and speedy resolutions, so even people outside Dev and support can potentially self-service, without having to break out the source code to do so
Noticing a trend?
No, really. You. Are. Doing. It. Wrong.
Perpetuating this pattern creates (at a minimum) the following issues:
1. It ensures that your devs will be pulled off of new, cool stuff to work on old, boring code
   1.1. It’s not if – it’s when
   1.2. Working on old code is tantamount to working in a coal mine for a developer
   1.3. And, no, troubleshooting someone else’s code is not a good way to bring up junior devs – see 1.1… plus you’re likely perpetuating bad habits
2. It turns your support staff into “log proxies” (“I don’t know what the hell ‘PC Load Letter’ means. Guess I’ll punt this off to Justin.”)
3. It dramatically increases time-to-resolution
   3.1. If there are 5 back-end issues going on at the same time, Justin can’t t-shoot them all simultaneously
   3.2. … and good luck getting Anne to jump in and figure out Justin’s code, because you didn’t have time to create that System- and Code-level design documentation, right?
My experience has been that Developers take one of three routes when generating logging:
1. They dump out a full stacktrace for everything (as discussed in an earlier rant) – after all, it tells them where in the code we were when the system crapped itself
2. They log something so generic as to be non-informational (“Error occurred”)
3. They log nothing at all, which leads to vague runtime errors (see below)
The problem with #1 is that logging typically has a significant performance impact. So, kicking out dozens of KB of logging – when you typically only need the first few lines of the stack – decreases scalability. Or worse, the logging cripples the system so much you can’t turn it on in the first place.
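To make the point about #1 concrete, here is a minimal Python sketch (my own illustration, not code from any particular product) that logs the error plus only the few stack frames nearest the failure. The names `short_trace`, `handle_request`, and the `backend` logger are hypothetical. One caveat: Python prints the most recent call last, so the frames you care about sit at the bottom of the trace, not at the top as they do in Java.

```python
import logging
import traceback

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("backend")


def short_trace(exc: BaseException, frames: int = 3) -> str:
    """Return the exception text plus only the innermost `frames` stack entries."""
    # A negative limit keeps the LAST N entries, i.e. the frames closest
    # to where the exception was raised.
    lines = traceback.format_exception(
        type(exc), exc, exc.__traceback__, limit=-frames
    )
    return "".join(lines).rstrip()


def handle_request(work):
    """Run `work()`; on failure, log a short trace instead of the full stack."""
    try:
        return work()
    except Exception as exc:
        logger.error("Request failed: %s\n%s", exc, short_trace(exc))
        return None
```

Whether three frames is the right number depends on your code base. The idea is simply to cap the output so that error logging stays cheap enough to leave switched on.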
On #2, why bother? You haven’t really given me anything, except that something bad happened. I believe the screaming customer on the other end was able to relay that message just fine.
OK, smart guy. So I have to teach the support team to code, then?
If you are going to log anything, why not log some basic info about the issue? Here is an example of a log statement that gives a support team a real chance of solving the problem:
User phillp.j.fry is unable to access Post 478989456 - insufficient permissions
The support team’s success depends on the aforementioned documentation. If I have a documented DB schema that tells me how the code looks up permissions, my support staff can at least check the DB to ensure there’s no corruption (or anything else unexpected). Without this, they’re flying blind and will likely be Justin’s cube-buddy for a while.
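For what it's worth, producing a message like the one above takes almost no extra effort. Here is a rough Python sketch, assuming a hypothetical `check_access` function and a plain set of permitted usernames standing in for whatever permission lookup your system really does. None of these names come from a real code base.

```python
import logging

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("posts")


class AccessDenied(Exception):
    """Raised when a user is not allowed to read a post."""


def describe_denial(username: str, post_id: int, reason: str) -> str:
    """Build a message that names who, what, and why in plain language."""
    return f"User {username} is unable to access Post {post_id} - {reason}"


def check_access(username: str, post_id: int, allowed_users: set) -> None:
    """Log and raise a readable error if `username` may not read the post."""
    if username not in allowed_users:
        message = describe_denial(username, post_id, "insufficient permissions")
        logger.warning(message)
        raise AccessDenied(message)
```

The message carries the user, the object, and the reason, which is exactly what support needs in order to go check the permissions table without opening the source.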
Runtime logging should tell me all I need to know
For #3, sadly we tend not to get logging like the example above. Instead, let’s look at this real-world example of what we typically see – taken from Tomcat’s Bugzilla:
If there is no file CATALINA_BASE/conf/logging.properties, then Tomcat will not start and the JVM outputs the following error:
Exception in thread "main" java.lang.NoClassDefFoundError:
without any additional information.
Sure, the error is technically correct: there was no logging class referenced, thus the JVM couldn’t find that class. But, it gives no one – even the Devs – any idea as to what the code is upset about. If the person who opened this bug didn’t already tell us where the issue was, we could be spending hours trying to figure this out.
1. If the code is looking for a file and it can’t find it, log the path and file
2. If the code is looking for a property and can’t find it, log the missing property name
3. If the code… you get the point – give me an idea of what went wrong, and I can likely leave the Devs alone
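The first two rules above might look something like this in practice. This is a sketch in Python, not how Tomcat does it; `load_setting`, `ConfigError`, and the `startup` logger are names I made up for the example, and it assumes an INI-style file readable by the standard `configparser` module.

```python
import configparser
import logging
from pathlib import Path

# Hypothetical logger name, used only for this illustration.
logger = logging.getLogger("startup")


class ConfigError(Exception):
    """Raised when required configuration is missing."""


def load_setting(config_path: str, section: str, key: str) -> str:
    """Read one setting, naming the exact file or property that is missing."""
    path = Path(config_path).resolve()

    # Rule 1: if the file can't be found, log the full path and file name.
    if not path.is_file():
        message = f"Configuration file not found: {path}"
        logger.error(message)
        raise ConfigError(message)

    parser = configparser.ConfigParser()
    parser.read(path, encoding="utf-8")

    # Rule 2: if the property can't be found, log the property name.
    if not parser.has_option(section, key):
        message = f"Missing property '{key}' in section [{section}] of {path}"
        logger.error(message)
        raise ConfigError(message)

    return parser.get(section, key)
```

With messages like these, whoever picks up the ticket knows which file to restore or which property to add, and nobody has to read a stack trace to find out.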
Simple, silly example. But, if things this basic are left undone, what do you think the logging of the more complex stuff looks like?
Logging as a User Story
Sadly, even though this stuff is painfully obvious, it is also rarely done.
Treat your logging like it is a User Story – and your support staff are the stakeholders. After all, they are the ones consuming your logging… that is, unless you want your Devs doing all your customer support. Maybe there’s a co-dependency issue I need to explore there… Hmmm.
Anyway, sorry I can’t elaborate more. I have to go interview a replacement for Justin. He just stormed out of the office shouting something like “Unsustainable” or whatever. Feh. Whatevs.