Diagnose Like a Doctor: What IT Professionals Can Learn from Medicine
Over the past few years, my girlfriend has gone through medical school. That has given me a wonderful opportunity to get some exposure to what I am convinced is one of the most rigorous and most effective training regimens to learn troubleshooting. Talking with her and many of her classmates, I've been able to see the strong parallels between medicine and IT and apply some of the procedures taught in medical school to my work (after all, isn't CPR pretty much just percussive maintenance?).
The frameworks that medical school teaches for diagnosing patients adapt remarkably well to IT troubleshooting, and because of them I've had more effective troubleshooting sessions, faster problem resolution, and a lot more fun along the way. Now I think my understanding of that adaptation is mature enough to share. In this post, I'll talk a little more about the similarities and differences between medicine and IT, describe two diagnostic procedures, and walk through a sample scenario that applies them.
Comparing Medicine and IT
In my opinion, the training that medicine has in place and the general level of competence almost everyone in the field exhibits are things the IT world should aspire to. And the reason medicine does this so well is the rigor of its training.
Someone who wants to be a doctor will finish their undergraduate degree, attend medical school (4 years), complete a residency (2-5 years), and possibly complete a fellowship (2-8 years) to enjoy a position roughly equivalent to Tier 3 helpdesk. The reason there's even content for that much education and training is that today's medicine is the culmination of something like 5,000 years of focused research on the human body, which hasn't changed all that much in that time. In comparison, computing has only been around for about 50 years, and someone released a new JavaScript framework in the time it took to read this paragraph.
Medical school, in particular, really focuses on two things: the things that can go wrong with the human body and how to figure out which thing is going wrong right now. In fact, someone interested in pure research (solutions architecture?) would not even be required to go to medical school. Those developing vaccines, for example, are much more likely to have PhDs in immunology or microbiology than MDs.
Put through the lens of medical education, our current system of putting Computer Science graduates into helpdesk roles seems flawed. "Oh, you think you broke your leg? Press 1 to speak with a microbiologist."
Imagine a system like medical school for IT: all the knowledge the wizened sysadmin in the corner shouts out, taught in a classroom setting. How does the default Java Virtual Machine (JVM) configuration perform under different types of load? What does deadlock in the database look like? What's an effective response to different types of DDoS attacks?
That sort of education is impossible in the current state of IT. Things change so fast, and companies' implementations differ so greatly, that there is no way to keep coursework relevant and up-to-date. Imagine learning all about network troubleshooting and then going to work for Google: oh SNAP, they don't do networking like anyone else. Really, the best thing we can do is maintain good documentation of how our systems are supposed to function and hire people with the right fundamentals to adapt well to the quirks.
But once we assume that an individual has a decent understanding of the system having issues, however they acquired it, we can start discussing how that person would troubleshoot those issues.
Differential Diagnosis
The first diagnostic framework to discuss is differential diagnosis. It's something that talented IT people likely already practice, but it isn't something that's taught very well. The idea is simple: take a well-defined problem and come up with a ranked list of possible causes, in order of likelihood. Then, select tests to perform that will either confirm or rule out items on that list. Keep an eye out for possible causes that are very severe; you may want to try to rule those out first.
Consider getting a report that a user can't access a file on a file share. Your list might look something like:
- User isn't connected to the network
- Someone moved the file
- Potential permissions issue
- Cryptolocker
While ransomware is really low on that list in terms of likelihood, it would have a massive impact. Therefore, your first test might be to try to access that file yourself. If you can, you've ruled out the most severe item on the list, and maybe even #2 as a bonus.
This is an iterative process, so as you check more things and obtain more information, add and remove items from that list as necessary.
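To make the file-share example above concrete, here's a minimal sketch of a differential as plain data. The likelihood and severity numbers are made up for illustration; the point is that you rank by likelihood but pick your first test to knock out the scariest item.

```python
# A differential diagnosis as data: hypothetical likelihood/severity scores
# for the file-share example above (the numbers are illustrative, not measured).
hypotheses = [
    # (possible cause, likelihood 0-1, severity 0-1)
    ("User isn't connected to the network", 0.40, 0.2),
    ("Someone moved the file",              0.30, 0.2),
    ("Permissions issue",                   0.25, 0.3),
    ("Cryptolocker / ransomware",           0.05, 1.0),
]

# Rank by likelihood to see what it probably is...
by_likelihood = sorted(hypotheses, key=lambda h: h[1], reverse=True)

# ...but choose the first test to rule out the most severe possibility,
# even though it's the least likely one on the list.
rule_out_first = max(hypotheses, key=lambda h: h[2])

print("Most likely cause:", by_likelihood[0][0])
print("Rule out first:   ", rule_out_first[0])
```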
Because the medical field has so rigorously mapped which symptoms point to which possible causes, differential diagnosis is a tested skill. One exam a medical student has to take is called Step 2: Clinical Knowledge. It features word problems that describe a set of symptoms and ask the student what the best test to perform would be. And while every test listed among the multiple-choice answers may be useful, only one is the correct next step for this patient. IT has a long way to go before it gets to that level of maturity.
The second framework I want to discuss is nothing more than a procedure that ensures you have the best information possible to build out your differential diagnosis.
HOPS: Structured Information Gathering and Testing
The next time you go to the doctor's office for an issue, you may recognize this procedure taking place:
- History: a nurse will double-check your medical background and may ask what happened or what you're experiencing. There's a good chance the doctor will repeat many of those questions.
- Observations: the doctor will look at your condition. Unusually pale? Strange mole? Large bruise? Limbs bent the wrong way? Severe bleeding?
- Palpations: the doctor will do a brief physical exam. They may check heart rate, breathing function, and blood pressure. They may poke and prod a little bit.
- Special Tests: the doctor will do a test that costs extra money and takes extra time. X-Ray, other imaging, throat swabs that go to a lab, or even a referral to a specialist are all examples of special tests.
Notice how there is a definite turn between information gathering and testing. The doctor transitions from observation to poking and prodding. That's where differential diagnosis happens, and all of the tests that follow are based on that list.
Applied to IT, HOPS requires only a couple of changes. Palpations gets reinterpreted as cheap tests: tests that can be performed without causing any additional impact to the user or the business. That could be pinging a server, checking whether a TCP port is listening with telnet, or searching the syslog server for a specific message.
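As a rough illustration, here's what a couple of those cheap tests might look like scripted in Python. The host name and port are hypothetical placeholders, and the ping flags assume a Linux-style ping.

```python
# Minimal sketch of two cheap tests: reachability and "is anything listening?"
# HOST and PORT are hypothetical placeholders, not from any real environment.
import socket
import subprocess

HOST = "app01.example.com"   # hypothetical server under suspicion
PORT = 8443                  # hypothetical port the service should listen on

def ping(host: str) -> bool:
    """Cheap test: is the host reachable at all? (Linux-style ping flags)"""
    result = subprocess.run(["ping", "-c", "1", "-W", "2", host],
                            stdout=subprocess.DEVNULL)
    return result.returncode == 0

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Cheap test: is anything listening on the TCP port? (the telnet check)"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print("reachable:", ping(HOST))
print("listening:", port_open(HOST, PORT))
```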
Special tests become expensive tests, or any test that will cause even greater impact. At a bank, closing one of the teller windows costs money. At a call center, taking an agent off the phones costs money. At any enterprise, turning off a service to restart with a different logging level costs money. These tests should not be run lightly, and ideally they should be used to confirm what the cheap tests suggest.
In my opinion, investment in monitoring infrastructure should be framed in terms of increasing the number and scope of tests that can be performed cheaply when something is going wrong. Network monitoring infrastructure can turn costly packet captures for timing analysis into cheap tests. Application monitoring solutions like AppDynamics or Dynatrace can transform the expensive operation of enabling debug mode on a server into the cheap test of digging into information that's already being collected.
When an IT professional applies HOPS, it may look like the following:
- History: are there any known issues with the system? What changes were made in the last few days? Patching? Infrastructure changes? Code releases?
- Observation: what symptoms are exhibited? What's the user experience? What does the alert in the monitoring system say? What error is displayed on the screen? Was there a stack trace printed somewhere?
- Palpations: what tests can be performed cheaply? Ping, traceroute, telnet? Are routing neighbors up? What monitoring is already in place?
- Special Tests: what else can be done to investigate, accepting the extra cost or impact?
Again, the differential diagnosis is formed after observation is completed, and it guides the tests that come after.
What Might This Look Like in Practice?
To demonstrate this system in its full form, I'll walk through an example. You just came back from lunch, sleepy from the sizable chicken parmesan sub. Ding! A new email. It says there's a new ticket assigned to you: "High Priority: Web Site is Intermittently Failing to Load Home Page." You're in such a rush to log back into the ITIL system that you almost spill the soda you brought back with you.
History
Fighting back against the carb coma, you begin by checking the ticketing system to see if there are any recorded changes over the last week that list the main web site as an impacted application. You see nothing significant.
You then check active incidents to see if there is anything else going on that could be impacting the main web site. You don't see anything related.
You engage the individual who submitted the ticket and obtain the following pieces of information: the issue has been sporadic since mid-morning, a reload of the page normally fixes things, and the error actually displayed is a generic HTTP 500.
Observation
The error displayed is a generic server-side error code without any detail. A reload usually returns the correct result.
A check on the syslog server shows that the web server logs the HTTP 500 error, but without any other helpful information.
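If the access logs are available, a quick script can turn "intermittently failing" into numbers. This is a minimal sketch: the log path is hypothetical, and it assumes the common Apache/Nginx combined log format where the status code is the ninth whitespace-separated field.

```python
# Count HTTP 500 responses per minute from a web access log.
# The path and the field positions are assumptions based on the combined log
# format (host ident user [timestamp tz] "request" status bytes ...).
from collections import Counter

errors_per_minute = Counter()

with open("/var/log/webserver/access.log") as log:   # hypothetical path
    for line in log:
        fields = line.split()
        if len(fields) > 8 and fields[8] == "500":
            # fields[3] looks like "[12/Mar/2024:13:42:07"; keep it to the minute
            errors_per_minute[fields[3][1:18]] += 1

# Intermittent failures show up as scattered low counts rather than a solid wall.
for minute, count in sorted(errors_per_minute.items()):
    print(minute, count)
```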
Differential Diagnosis
- Database deadlock - a deadlock on a specific table would normally cause a constant failure for the same database call, which a generic home page would be making on every load, but it could appear transient if the call varies by user-agent string or customer location. If the database has deadlock detection enabled, it could also be resolving the deadlocks almost as fast as they happen, which would look transient too.
- Overloaded database or middleware (including microservice) server - A returned error code means there probably wasn't a drop in communication between the client and the web server, so a transient issue points to a failed back-end call. A failed back-end call could be caused by an overloaded server that is only letting some requests through. Still, deadlock issues are more common than sizing issues.
- DDoS or similar attack - This could certainly result in failed page loads, but those would normally show up as 4xx errors unless the attack directly affects the web server itself. Severe, but less likely.
- Flaky connection between web server and back end - These sorts of connections typically have plenty of redundancy, so a single bad OSPF neighbor shouldn't cause this. And that would have generated its own incident.
Palpations
- Check firewall and web server CPU and RAM utilization. Both are reasonably low, which all but rules out a DDoS.
- Check the monitoring dashboard for the middleware servers. Look for a sharp increase in thread count, which usually indicates hung threads. You don't see anything, which makes deadlock a little less likely.
- Check the syslog server for deadlock alerts. Nothing. Combined with the stable thread counts, this effectively rules out deadlocks.
- Check the monitoring dashboard for CPU and RAM utilization on the middleware and database servers. Middleware servers look fine, but one database has massively elevated RAM usage. This particular database, ironically, is used to track user experience on the website, and isn't required for normal function.
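When there's no dashboard handy (or you want a second opinion on it), a one-off check run on the suspect host is still a cheap test. This is a minimal sketch, assuming you can run a script on the database host; psutil is a third-party package, and the 90% threshold is an arbitrary illustration, not a real alert level.

```python
# Quick, low-impact snapshot of CPU and RAM on the host you're suspicious of.
import psutil

cpu = psutil.cpu_percent(interval=1)   # CPU usage sampled over one second
mem = psutil.virtual_memory()          # system-wide memory statistics

print(f"CPU: {cpu:.0f}%")
print(f"RAM: {mem.percent:.0f}% used "
      f"({mem.used / 2**30:.1f} GiB of {mem.total / 2**30:.1f} GiB)")

# A single sample only says "right now"; trend data from monitoring is still
# the better source, but a reading this high supports the dashboard finding.
if mem.percent > 90:
    print("RAM usage is heavily elevated on this host.")
```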
Special Tests
- Take this database offline. At the code level, a direct connection failure is a quick method return that the web server can work around, as opposed to a delayed response that ends in a timeout. Over the next 10 minutes, the web server stops logging HTTP 500 responses, and the issue is resolved.
Prescription
- Increase the size of the VM running that database and bring it back online. If the HTTP 500 errors return, go through HOPS again with a more focused view.
- Recommend a code change for more granular timeouts and more graceful failures in the web server.
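For that second recommendation, here's a minimal sketch of what "granular timeouts and graceful failures" could look like in the web tier, assuming the user-experience tracking call is made over HTTP with the requests library. The function name, URL handling, and timeout values are hypothetical, not taken from the incident.

```python
# Sketch: treat the user-experience database as optional, so a slow or dead
# back end degrades the feature instead of failing the whole page with a 500.
import requests

def record_page_view(analytics_url: str, payload: dict) -> None:
    try:
        # Granular timeouts: give up on the connect after 1s and the read after
        # 2s instead of inheriting a long default that stalls page rendering.
        requests.post(analytics_url, json=payload, timeout=(1.0, 2.0))
    except requests.RequestException:
        # Graceful failure: tracking isn't required for normal function, so
        # note it and move on rather than letting the error bubble up.
        print("user-experience tracking unavailable; skipping")
```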
Conclusion
I have been applying differential diagnosis and HOPS in my own work over the past year, and I've found that my ability to resolve complex issues quickly has improved dramatically. I have more fun troubleshooting since I don't feel like I'm just banging my head against a wall. And in cases where I'm leading a troubleshooting effort, asking leading questions that guide the other participants on this path really improves the engagement and efficiency of the entire group.
So give this framework a try. See how it works for you. It's frustrating at first to be stuck considering options and making a list while everything is broken, the company is losing money, and you feel like you should be doing something. But the returns are there.