Thoughts on Workflows for Teams

In Making Work Visible, Dominica DeGrandis defines the concept of Lead Time as the time between feature request and feature delivery. In a standard Kanban board, Lead Time is the time it takes for a card to move to the Done column, starting when it is added to the To Do column.

There are also component metrics to look at that break down Lead Time. The time it takes to move from To Do to Doing is called the Backlog Wait Time. And the time it takes to move from Doing to Done is the Cycle Time.

The goal for an organization concerned with feature delivery is to improve on two aspects of Lead Time: raw Lead Time and Lead Time Variability. Features need to ship as fast as possible and as predictably as possible. To make that happen, modern companies organize developers into small teams that integrate as many skillsets as possible. In a traditional organization, the UI/UX design team may need to submit a design to the front end engineering team to be implemented. Because these are two different teams with two different backlogs of work, that design may languish in the front end engineering team's To Do column, leaving the feature in a Wait State, extending the overall Lead Time. In that scenario, pairing up a UI/UX designer with a front end engineer then makes sense; it eliminates a wait state. If we extend that same train of logic, we end up with UI/UX, front end engineering, back end engineering, database, security, and infrastructure competencies on a single team.

But that decomposition isn't always possible or desired. There are certain competencies that aren't always required to complete a feature: not every feature needs a full stack performance analysis or a new authentication integration. In a recent conference talk, even Google stated that their SRE team is largely siloed from developers. So there are many cases in which it doesn't make sense to have specialists scattered among the small teams. If we accept that some competencies will be siloed, how do we protect Lead Time and Lead Time Variability?

Backlog Management for a Siloed Team

Network engineers spend a lot of time studying queueing. Routers need to make decisions about where to send packets and those decisions can take time. When the queue of packets waiting for a routing decision begins backing up, each packet has a longer wait time before it gets forwarded. This longer queue can overwhelm the router. It only has a limited amount of memory dedicated to packet queues, and when that queue is full, no more packets are even allowed to get in line (we call that a 100% drop rate). If that happens, then every computer and application that needs to get traffic through the router breaks down. Failure begins to cascade through the entire network.

So routers implement Quality of Service (QoS) and Active Queue Management (AQM) policies. That memory space dedicated to waiting packets is divided into several queues of different priorities. The network's QoS design decides what kind of traffic belongs in each queue. For example, VoIP traffic usually gets assigned to the highest priority queue. When VoIP packets begin backing up, phone conversations suddenly start stuttering due to an increase in the variability of packet arrival times, or jitter. AQM is implemented for each queue with the goal of preventing the queue from filling up. The most common algorithm to accomplish this is Random Early Detection (RED). RED defines two thresholds: kind of busy and really busy. When that queue is kind of busy, RED will begin dropping packets at a relatively low drop rate. When the queue becomes really busy, RED increases the drop rate. The logic behind this is that applications are able to resend a dropped packet shortly afterward (usually within 5-20 milliseconds), so the actual service interruption is minimal. But the most important, time critical, data always gets through on time.

Source: https://network.jecool.net/
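
To make those two thresholds concrete, here's a minimal sketch of the RED drop decision in Python. The threshold and probability values are made up for illustration; real implementations also track a moving average of queue depth rather than the instantaneous value.

import random

def red_should_drop(queue_depth, min_threshold=50, max_threshold=150, max_drop_prob=0.10):
    """Simplified RED: decide whether to drop an arriving packet."""
    if queue_depth < min_threshold:
        return False  # queue is calm: never drop
    if queue_depth >= max_threshold:
        return True   # queue is really busy: drop everything (tail drop)
    # Between the two thresholds, the drop probability ramps up linearly.
    fill = (queue_depth - min_threshold) / (max_threshold - min_threshold)
    return random.random() < fill * max_drop_prob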

What can we learn from this example about that siloed team's To Do column? The longer the queue is, the longer any item will take to make it through the backlog. Think of this in the case of a backlog grooming meeting. If a team has to go through and re-prioritize 300 cards to decide what to work on next, what are the chances that some really important items will be lost in the shuffle? And how long will that backlog grooming meeting run over, impacting the time available to work active items? If a mechanism isn't in place to restrict queue length, Lead Time will increase due to increased Backlog Wait Time AND due to increased Cycle Time.
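
A quick back-of-the-envelope illustrates the point. Little's Law says average wait time is roughly queue length divided by completion rate; the throughput number below is an assumption purely for illustration.

backlog_cards = 300           # cards sitting in the To Do column
cards_finished_per_week = 10  # assumed team throughput

# Little's Law: average time in queue = items in queue / completion rate
wait_weeks = backlog_cards / cards_finished_per_week
print(f"A newly added card waits roughly {wait_weeks:.0f} weeks before anyone touches it")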

Does that mean teams should start turning away work at random? Probably not. But team leads, product owners, and the like can begin telling requestors that "due to current obligations, the team is unlikely to be able to work on this request until Q4." These team leads can hold deferred requests in a place that isn't visible to the rest of the team. They don't show up on the Kanban board's backlog, and thus don't take up team members' mental space unnecessarily.

Once we put conventions in place to protect the queue length, what can we do to reduce Cycle Time and Cycle Time Variability?

Decreasing Cycle Time for Siloed Teams

To protect Lead Time, it is necessary to reduce individual utilization. When someone isn't that busy, it's far easier for them to pick up an urgent work item. When someone isn't that busy, it's far easier to chase all the fine details of a particular request to completion, rather than just doing the essential work and deferring less urgent tasks like documentation (or maybe that's just me). We can make a large part of that utilization visible by enforcing Work In Progress (WIP) limits for individual and team workloads. And for the rest of that utilization, we need to examine overhead.

Application architecture has been moving towards microservices: autonomous, loosely-coupled bits of code that do one thing and one thing only. Because each functional component is isolated, the software environment is modular by design, making it really easy to maintain, update, or even replace one component of a large application without affecting too many dependencies. Before, an application server would contain a large amount of monolithic code to do several different things, interact with several different external services, and return several different pieces of data to an end user.

These monoliths were architected, in large part, because of overhead. The original servers were all physical machines. The overhead in rack space, power, and cooling meant that fewer, beefier servers made more sense than many smaller servers. There was too much overhead to justify running tiny amounts of code on a server. The monolithic design was more cost effective. When OS virtualization was created, it became possible to run several segregated virtual machines (VMs) on a single physical server. The design message going out to developers and system administrators became "only run a single significant process on each VM." Functional decomposition became limited by the overhead of replicated operating systems. Each VM still had to run a kernel, operating system, networking stack, and all the related processes that keep those things operating. Over the past decade, the industry has eliminated even more of the overhead by developing container technology. Containers only replicate software dependencies, sharing all of the OS overhead with tens or hundreds of other containers on a single system.

Right now, many knowledge workers have overhead comparable to an old server. There exists a constant barrage of emails, instant messages (IMs), meetings, tracking spreadsheets to update, websites to log into, training to complete, and dozens of other small tasks that need to exist alongside meaningful work. DeGrandis points out that project managers struggle to effectively manage a project when they get interrupted with questions about project status updates several times a day. What opportunities exist to reduce administrative work that individuals and teams need to manage? How can managers treat their teams like a nervous sysadmin treats a problematic VM that "just needs one more agent installed on it, I promise"? Reducing or consolidating meeting attendance, email and IM presence, paperwork, training, and task and time tracking are all possible methods to reduce overhead, thus reducing overall utilization and focusing attention on meaningful work.

The challenge is that this is a culture shift that goes beyond just the company. Our modern workplace culture incentivizes busyness by default. Employees feel like they're doing something wrong when there isn't that much to do. Hearing stories about companies like Amazon modeling what 100% output within a position looks like and judging individuals against that standard reinforces that anxiety. In knowledge work settings, it's always easy to satisfy that anxiety by sending more emails, responding to more IMs, and attending more meetings.

But the cost of those interactions on real work is devastating. As told in Cal Newport's A World Without Email, the company RescueTime performed a study on users of its platform and found that half of its users checked email or IM at least once every six minutes on average. In fact, most of their users never went more than an hour without checking one or the other, which means their work was interrupted at least hourly. And since it takes up to 20 minutes to turn full attention to a task, those small interruptions result in a significant blow to what any individual is able to accomplish. And the people who signed up for RescueTime are the ones who are mindful of how they spend their time! How do the rest of us stack up?

But wait! Meetings, emails, and IMs are how people communicate! How can someone get the information they need to complete some piece of work without these tools? It turns out that we can turn to the microservices architecture to offer a suggestion for this, too.

Creating consistent interfaces is key to reducing back-and-forth communication and clarification. The ubiquity of REST APIs means developers can expect the same conventions everywhere: the client keeps application state and supplies the required identifiers and inputs with every call. CRUD functions do much the same thing. A developer knows that providing create, read, update, and delete methods for an object will satisfy every valid interaction with that object. A consistent interface with a functional team acts in a very similar way: if all required information is provided up front, less back and forth is required to accomplish a goal.
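
As a software analogy, a minimal CRUD interface might look like the sketch below (the class and method names are purely illustrative). Every valid interaction maps onto one of four predictable entry points, so callers never have to negotiate what to send.

class TaskStore:
    """Illustrative CRUD interface: four predictable entry points, no surprises."""
    def __init__(self):
        self._tasks = {}

    def create(self, task_id, data):
        self._tasks[task_id] = data

    def read(self, task_id):
        return self._tasks[task_id]

    def update(self, task_id, data):
        self._tasks[task_id].update(data)

    def delete(self, task_id):
        del self._tasks[task_id]

A team with an equally predictable intake process gets the same benefit: the requestor knows exactly what to provide up front.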

That is quite a tall order: how do you guarantee you can provide every single piece of necessary information up front? A lot of knowledge work is ambiguous, and things are always changing. I've found that some teams I interact with have solved this problem quite well. I can submit a request to them and consistently expect to have a meeting scheduled on my calendar soon after. In that meeting, all of the back and forth required to collect information, clarify intentions, and set requirements happens synchronously. Asynchronous communication through email and IM is significantly reduced after that meeting.

Conclusion

If a company or organization is concerned with delivering features to customers, it needs to take a serious look at how individuals and teams manage their workloads. If Lead Time is what matters, then teams need to have the freedom to restrict backlog length to something manageable. Teams need to have enough capacity so that each individual maintains a moderate utilization. The company culture needs to emphasize completeness over busyness. Teams need to consolidate or abstract overhead so that each individual can reduce the number of interruptions in the day. And teams need to define better interfaces with each other to help reduce those small interruptions.

The companies and teams that transition to this workflow successfully will see more advantages than just protecting Lead Time. Employees will have a lower incidence of burnout and will feel more comfortable with their work/life balance. Turnover will drop, leading to retained institutional knowledge. That retained institutional knowledge will help with technical debt reduction and bug-squashing efforts, providing a more stable platform over time.

Flow Control with Exceptions in Python

My programming education in college was very computer engineery, and not at all computer sciency. Instead of "Hello, World", we started with logic gates and MOSFETs. We learned how to make single purpose circuits, and then how to design an instruction set to combine those circuits to make a programmable circuit. After writing simple programs in machine code in a simplified Instruction Set Architecture, assembly language was remarkably readable. When we started learning C, a good portion of the education was compiling simple functions to the representative assembly.

While this curriculum was very well suited to the embedded systems, digital logic design, and microprocessor design courses that followed, it wasn't the best preparation for building modern applications. I had a difficult time ignoring the layers of abstraction between the Java code I was writing and the resulting machine code. I spent so much time trying to rationalize the abstraction of simple built-in methods like String.toUpperCase() that I couldn't code productively. In fact, it took years for me to understand that I had fundamental misunderstandings about some features of higher level languages.

I vividly remember the lightbulb moment I had when I realized the utility of exceptions. Until that moment, there wasn't any difference between an exception and a segfault in my mind. When an exception isn't handled, it causes the program to halt, just like a segfault. I spent a lot of time and CPU cycles coding around exceptions. During that lightbulb moment, I was writing a function in PowerShell to make sure a given file was available to edit. I spent a couple hours trying to figure out what flag or attribute existed to tell me if I could use it or not, and finally found a StackOverflow post that told me to just try to edit it and deal with the exception if I couldn't. Struggling through that function finally taught me that exceptions are not errors; exceptions are flow control.

Learning How and When

Abraham Maslow popularized the phrase "if all you have is a hammer, everything looks like a nail." His argument, when using that phrase, was for methodological innovation in psychology research. Novel problems must be researched with novel approaches. But he acknowledges that the learning process to find the correct approach is inelegant. "What one mostly learns from such first efforts is how it should be done better the next time."

In the learning process, most of the work is learning to recognize the situations that a particular approach is well suited for, and I think an excellent way to do that is to treat that approach as the only option. In rock climbing, I refined my heel hook technique by using it for every single move until I understood when it would help and when it would hinder. In programming, I challenged myself to write a functional command line tool without a single if statement.

Exceptions in Python

Exceptions in Python and other higher level languages are really just the language trying to tell you that something unexpected happened, and they give you the opportunity to choose how to manage that unexpected thing. For example, something as simple as opening a file carries a bunch of underlying assumptions. The file must exist, the program must have the proper permissions to read the file, and the program must be able to access the directory the file is in (it's tough to get to the corporate network drive when you aren't on the VPN). Depending on which of those requirements isn't satisfied, you may want to handle things differently. If the program can't access the directory, maybe you want to doublecheck that there aren't any spelling errors in the file path. If the file doesn't exist, maybe you want to create it. Exceptions give you, the programmer, options to deal with those problems rather than allowing the program to just quit.
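
Here's a minimal sketch of that idea, using the try and except keywords explained below (the file path is hypothetical):

try:
    with open("/mnt/corp_share/notes.txt") as notes_file:  # hypothetical path
        contents = notes_file.read()
except FileNotFoundError:
    # The file doesn't exist yet; maybe we create it instead.
    contents = ""
except PermissionError:
    # We can see the file but aren't allowed to read it; time to ask for access.
    contents = None
except OSError as err:
    # The directory (or network share) itself isn't reachable.
    print(f"Could not reach the file at all: {err}")
    contents = None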

There are really only 3 elements of Python I relied on to replace if statements:

try

A try block defines a section of code where an exception might occur. On its own, a try isn't valid Python; it always needs at least one except or finally clause, which is where the next keyword comes in. It's common practice to use these only at the edges of a program, or where it interacts with something potentially unreliable like a file, a network connection, or a user.

try:
    do_thing_that_might_fail_weirdly()

except

An except block is the complement to a try block, like an else is to an if.

try:
    do_thing_that_might_fail_weirdly()
except:
    print("The thing failed weirdly")

Excepts can also look for specific exceptions. These can be defined by the programmer or just built in.

except WeirdFailure:
    print("The thing failed weirdly")

Excepts can also be chained together off a single try block, searching for different errors.

try:
    do_thing_that_might_fail_weirdly()
except WeirdFailure:
    print("The thing did the weird thing")
except OtherWeirdFailure:
    print("The thing did the other weird thing")
except:
    print("This block catches every other exception")

assert

This was my big cheating move. Assert statements are usually used in tests to make sure things are as you expect; when the condition is false, they raise an AssertionError. In this coding challenge, I heavily relied on assert statements to get around conditional statements.

try:
    assert "show" == user_input # Hammer, meet scrambled eggs
except AssertionError:
    pass

TODO List Manager Structure

The user experience I wanted for this todo list manager was a captive command line interface. I wanted to launch the program by choosing one of many different lists and then interacting with only that chosen list. I wanted to make sure I could check items off the list, give myself notes, and support sub-tasks.

Launching the Program

I used the argparse library in Python to automatically define command line flags and a help dialog:

def initiate():
    description = "Manage all your TODOs with this overcomplicated application!"
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument('todo_path', help="File path to the todo file you want to interact with.")

    args = parser.parse_args()
    return args.todo_path

Opening that todo file and reading in the JSON gave me the first opportunity to use the hammer I chose for this exercise. I needed to check two things: that the file exists, and that it is a valid JSON file. The function below will do one of three things: return the dictionary representation of the JSON file passed into it, return a blank dictionary, or close the program.

def read_todo_file(file_path):
    try:
        with open(file_path) as todo_file:
            todo = json.load(todo_file)
    except json.JSONDecodeError:
        # We don't want to corrupt a file that is in the wrong format.
        print(f"{file_path} exists, but is not a valid JSON file. Exiting")
        sys.exit()
    except:
        # If the file doesn't exist, then we create a new file
        print(f"{file_path} does not exist, starting a new todo list.")
        todo = {}
    return todo

Once the todo file is read into memory, we enter the user interface.

User Interface

I chose a very simple user interface; it's just >. Once a user enters a command, it runs through a set of expected commands, and has a catch-all clause to handle gibberish or misspellings. If you look carefully, you'll see that it's recursive, but with no way to get back up the layers of recursion. It's a dumb way to pile up stack frames forever, but that wasn't the point of this challenge.

def user_interface(todo, file_path):

    user_input = input("> ")

    try:
        try:
            assert "show" == user_input.lower()
            show(todo, file_path)
        except AssertionError:
            pass
        try:
            assert "show closed" == user_input.lower()
            show_closed(todo, file_path)
        except AssertionError:
            pass
        try:
            cmd = "show task "
            assert cmd in user_input.lower()
            show_task(todo, user_input.lower().replace(cmd,''), file_path)
        except AssertionError:
            pass
        try:
            cmd = "add notes "
            assert cmd in user_input.lower()
            add_notes(todo, user_input.lower().replace(cmd,''), file_path)
        except AssertionError:
            pass
        try:
            cmd = "add subtask "
            assert cmd in user_input.lower()
            add_subtask(todo,user_input.lower().replace(cmd,''), file_path)
        except AssertionError:
            pass
        try:
            cmd = "add "
            assert cmd in user_input.lower()
            add_task(todo, user_input.lower().replace(cmd,''), file_path)
        except AssertionError:
            pass
        try:
            cmd = "close "
            assert cmd in user_input.lower()
            close_task(todo, user_input.lower().replace(cmd,''), file_path)
        except AssertionError:
            pass
        try:
            assert "exit" == user_input.lower()
            exit(todo, file_path)
        except AssertionError:
            pass
        try:
            assert "help" == user_input.lower()
            help(todo, file_path)
        except AssertionError:
            pass
        raise SyntaxError
    except SyntaxError:
        print("Syntax Error")
        user_interface(todo, file_path)

One interesting thing I want to point out is that at the end of the upper level try block, I put in raise SyntaxError. SyntaxError is a built-in exception, and I'm manually throwing that exception because it's the best way for me to express that the user didn't type what I wanted them to. The except block at the end will deal with that by means of my horrible recursion.

Using Dictionary Exceptions

I defined the "show" command to list all open tasks. In the JSON schema, an open task is defined as a task that does not have a "completed_timestamp" element. Therefore, I can try to read that timestamp and put the code I care about in the except block.

def show(todo, file_path):
    for task in todo.keys():
        try:
            complete = todo[task]["completed_timestamp"]
        except KeyError:
            print(task)

    user_interface(todo, file_path)

The traditional way to write that would be something like the function below. I mostly turned that "not in" conditional into what I'll call an "intentionally subverted expectation."

def show(todo):
    for task in todo.keys():
        if "completed_timestamp" not in todo[task]:
            print(task)

I am particularly proud of the way I close tasks. Just like when reading in the original todo file, there were three cases to deal with: the task is open, the task is already closed, and the task doesn't exist. First, I try to get the completion timestamp. My expectation is that it does not exist and the program will throw a KeyError. However, the program will also throw a KeyError if task_name doesn't exist, so I need another try/except block inside the upper except block. Inside that, I write a line that will throw a KeyError only if the task does not exist. And once we get through all those potential exceptions, we take the useful action of creating the completed_timestamp key and giving it a value.

If the original statement executes without any exception, I know that the task is already closed. Therefore, I raise a ValueError to handle that case.

def close_task(todo, task_name, file_path):
    try:
        complete = todo[task_name]["completed_timestamp"]
        raise ValueError
    except ValueError:
        # Ironically, the error is that there is a value
        print("Task already closed. Nothing else to do")
    except KeyError:
        try:
            task = todo[task_name]
            todo[task_name]["completed_timestamp"] = str(datetime.now())
        except KeyError:
            print("No task found")

    user_interface(todo, file_path)

Conclusion

I will never use this program for any of my todo lists. I will never use most of these techniques for any useful code. But I did treat exceptions as the hammer that would solve all of my problems and found the truly useful cases. The best case for using exceptions in this program is reading in the original todo file. I can't think of a more efficient or more readable way to make sure that todo file exists and is a valid JSON file.

As usual, the complete code for this post is available on Github.

Diagnose Like a Doctor: What IT Professionals Can Learn from Medicine

Over the past few years, my girlfriend has gone through medical school. That has given me a wonderful opportunity to get some exposure to what I am convinced is one of the most rigorous and most effective training regimens to learn troubleshooting. Talking with her and many of her classmates, I've been able to see the strong parallels between medicine and IT and apply some of the procedures taught in medical school to my work (after all, isn't CPR pretty much just percussive maintenance?).

The frameworks that medical school teaches to diagnose patients are completely adaptable to IT troubleshooting, and I have enjoyed more effective troubleshooting sessions, faster problem resolution, and a more enjoyable experience troubleshooting because of them. And now, I think my understanding of that adaptation is mature enough to share. In this post, I'll talk a little more about the similarities and differences between medicine and IT, describe two diagnosis procedures, and walk through a sample scenario to apply them.

Comparing Medicine and IT

In my opinion, the training that medicine has in place and the general level of competence almost everyone exhibits are traits that the IT world should aspire to. And the reason medicine does this so well is due to the rigor of training.

Someone who wants to be a doctor will finish their undergraduate degree, attend medical school (4 years), attend residency (2-5 years), and possibly attend a fellowship (2-8 years) to enjoy a position roughly equivalent to Tier 3 helpdesk. The reason there's even content for that much education and training is because today's medicine is the culmination of something like 5,000 years of focused research on the human body, which hasn't changed all that much in that time. In comparison, computing has only been around for about 50 years, and someone released a new Javascript framework in the time it took to read this paragraph.

Medical school, in particular, really focuses on two things: the things that can go wrong with the human body and how to figure out which thing is going wrong right now. In fact, someone interested in pure research (solutions architecture?) would not even be required to go to medical school. Those developing vaccines, for example, are much more likely to have PhDs in immunology or microbiology than MDs.

Put through the lens of medical education, our current system of putting Computer Science graduates into helpdesk roles seems flawed. "Oh, you think you broke your leg? Press 1 to speak with a microbiologist."

Imagine a system like medical school for IT: all the knowledge the wizened sysadmin in the corner shouts out is taught in a classroom setting. What performance issues show up under different types of load with the default Java Virtual Machine (JVM) configuration? What does deadlock in the database look like? What's an effective response to different types of DDoS attacks?

That sort of education is impossible in the current state of IT. Things change so fast, and companies' implementations differ so greatly, that there is no way to keep coursework relevant and up-to-date. Imagine learning all about network troubleshooting and then going to work for Google: oh SNAP, they don't do networking like anyone else. Really, the best thing we can do is maintain good documentation of how our systems are supposed to function and hire people with the right fundamentals to adapt well to the quirks.

But once we assume that an individual has a decent understanding of a system having issues, whatever the method, we can start discussing how that person would troubleshoot those issues.

Differential Diagnosis

The first diagnostic framework to discuss is differential diagnosis. It's something that talented IT people likely already implement, but it isn't something taught very well. The idea is very simple. Take a well defined problem and come up with a ranked list of possible causes, in order of likeliness. Then, select tests to perform to either circle or cross out items on that list. Keep an eye out for possible causes that are very severe; you may want to try to rule those out first.

Consider getting a report that a user can't access a file on a file share. Your list might look something like:

  1. User isn't connected to the network
  2. Someone moved the file
  3. Potential permissions issue
  4. Cryptolocker

While ransomware is really low on that list in terms of likelihood, it would have a massive impact. Therefore, your first test might be to try to access that file yourself. If you can, you end up ruling out the most severe item on the list. And maybe even #2 as a bonus.

This is an iterative process, so as you check more things and obtain more information, add and remove items from that list as necessary.

Because the medical field has mapped symptoms to possible causes so rigorously, differential diagnosis is a tested skill. One exam a medical student has to take is called Step 2: Clinical Knowledge. This exam features word problems that describe a set of symptoms and ask the student what the best test to perform would be. And while each test listed in the multiple choice may be useful, there is only one that is the correct next step for this patient. IT has a long way to go before it gets to that level of maturity.

The second framework I want to discuss is nothing more than a procedure that ensures you have the best information possible to build out your differential diagnosis.

HOPS: Structured Information Gathering and Testing

The next time you go to the doctor's office for an issue, you may recognize this procedure taking place:

  • History: a nurse will doublecheck your medical background and may ask what happened or what you're experiencing. There's a good chance the doctor will repeat many of those questions.
  • Observations: the doctor will look at your condition. Unusually pale? Strange mole? Large bruise? Limbs bent the wrong way? Severe bleeding?
  • Palpations: the doctor will do a brief physical exam. They may check heart rate, breathing function, and blood pressure. They may poke and prod a little bit.
  • Special Tests: the doctor will do a test that costs extra money and takes extra time. X-Ray, other imaging, throat swabs that go to a lab, or even a referral to a specialist are all examples of special tests.

Notice how there is a definite turn between information gathering and testing. The doctor transitions from observation to poking and prodding. That's where differential diagnosis happens, and all of the tests that follow are based on that list.

Applied to IT, HOPS requires only a couple of changes. Palpation gets reinterpreted as cheap tests: tests that can be performed without causing more impact to the user or the business. This could be pinging a server, testing whether a TCP port is listening with telnet, or checking the syslog server for a specific message.
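
For example, a TCP listening check takes only a few lines of Python; the host and port below are placeholders.

import socket

def port_is_listening(host, port, timeout=2):
    """Cheap test: can we complete a TCP handshake to this host and port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

print(port_is_listening("appserver01.example.com", 8443))  # placeholder host and port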

Special tests become expensive tests, or any test that will cause even greater impact. At a bank, closing one of the teller windows costs money. At a call center, taking an agent off the phones costs money. At any enterprise, turning off a service to restart with a different logging level costs money. These tests should not be run lightly, and ideally they should be used to confirm what the cheap tests suggest.

In my opinion, investment in monitoring infrastructure should be framed in terms of increasing the number and scope of tests that can be performed cheaply when something is going wrong. Network monitoring infrastructure can turn costly packet captures for timing analysis into cheap tests. Application monitoring solutions like AppDynamics or Dynatrace can transform the expensive operation to enable debug mode on a server into a cheap test to dig into the information already being collected.

When an IT professional applies HOPS, it may look like the following:

  • History: are there any known issues with the system? What changes were made in the last few days? Patching? Infrastructure changes? Code releases?
  • Observation: what symptoms are exhibited? What's the user experience? What does the alert in the monitoring system say? What error is displayed on the screen? Was there a stack trace printed somewhere?
  • Palpations: what tests can be performed cheaply? Ping, traceroute, telnet? Are routing neighbors up? What monitoring is already in place?
  • Special Tests: What else can be done to investigate?

Again, the differential diagnosis is formed after observation is completed, and it guides the tests that come after.

What Might This Look Like in Practice?

To demonstrate this system in its full form, I'll walk through an example. You just came back from lunch, sleepy from the sizable chicken parmesan sub. Ding! A new email. It says there's a new ticket assigned to you: "High Priority: Web Site is Intermittently Failing to Load Home Page." You're in such a rush to log back into the ITIL system that you almost spill the soda you brought back with you.

History

Fighting back against the carb coma, you begin by checking the ticketing system to see if there are any recorded changes over the last week that list the main web site as an impacted application. You see nothing significant.

You then check active incidents to see if there is anything else going on that could be impacting the main web site. You don't see anything related.

You engage the individual who submitted the ticket and obtain the following pieces of information: the issue has been sporadic since mid-morning, a reload of the page normally fixes things, and the error actually displayed is a generic HTTP 500.

Observation

The error displayed is a generic server-side error code without any detail. A reload usually returns the correct result.

A check on the syslog server shows that the web server logs the HTTP 500 error, but without any other helpful information.

Differential Diagnosis

  1. Database deadlock - deadlock on a specific database table would normally cause a constant issue for the same database call, which a generic web site home page would be requesting, but it could appear transient if the database call is specific to the user-agent string or customer location. If deadlock protection is enabled in the specific database, then it could be fixing the deadlocks almost as fast as they happen, causing a transient issue.
  2. Overloaded database, or middleware (including microservice) server - A returned error code means there probably wasn't a drop in communication between the client and the web server. A transient issue could indicate a failed back-end call. A failed back-end call could be caused by an overloaded server that is only letting some calls through. It's more common to see deadlock issues than sizing issues.
  3. DDoS, or similar attack - This could certainly result in failed web page loads, but those would normally be 4xx errors. Unless this is an attack that directly affects the web server. Severe case, but less likely.
  4. Flaky connection between web server and back end - These sorts of connections typically have plenty of redundancy, so a single bad OSPF neighbor shouldn't cause this. And that would have generated its own incident.

Palpations

  1. Check firewall and web server CPU and RAM utilization. These are both reasonably low, so that rules out a DDoS.
  2. Check the monitoring dashboard for middleware servers. Look for evidence of a sharp increase in thread count, which is usually evidence of hung threads. You don't see anything, which makes deadlock a little less likely.
  3. Check the syslog server for deadlock alerts. Nothing. This completely rules out deadlocks.
  4. Check the monitoring dashboard for CPU and RAM utilization on the middleware and database servers. Middleware servers look fine, but one database has massively elevated RAM usage. This particular database, ironically, is used to track user experience on the website, and isn't required for normal function.

Special Tests

  1. Take this database offline. At the code level, direct connection failure will be a quick method return that the web server can work around, as opposed to a delayed return that results in a timeout. Over the next 10 minutes, the web server stops logging HTTP 500 responses, and the issue is resolved.

Prescription

  • Increase the size of the VM running that database and bring it back online. If the HTTP 500 errors return, go through HOPS again, but with a little more focused view.
  • Recommend a code change for more granular timeouts and more graceful failures in the web server.

Conclusion

I have been applying differential diagnosis and HOPS in my own work over the past year, and I've found that my ability to resolve complex issues quickly has improved dramatically. I have more fun troubleshooting since I don't feel like I'm just banging my head against a wall. And in cases where I'm leading a troubleshooting effort, asking leading questions that guide the other participants on this path really improves the engagement and efficiency of the entire group.

So give this framework a try. See how it works for you. It's frustrating at first to be stuck considering options and making a list while everything is broken, the company is losing money, and you feel like you should be doing something. But the returns are there.

I've Gone Mad for Mad Libs

It all started so innocently. I had heard so much about the Jinja2 templating engine for network automation, and I decided to come up with a quick presentation to show it off while giving myself an excuse to learn it. In short, Jinja is a templating engine originally used for dynamically rendering HTML in Python that has been adopted by the network automation community to spit out customized configuration flat files.

Due to the diverse background of my audience, I didn't want to have the caucus of DBAs or the gaggle of Unix sysadmins suffer through a Cisco IOS configuration just because that's the format I'm more comfortable with. I came up with the idea of templating Mad Libs as a fun way to level the playing field.

Basics of Jinja2 for Mad Libs

When using Jinja, you have a template file along with your script. The template (saved as template.txt in this example) might look like the below. Variables to be substituted reside within double-curly brackets.

Hello {{ name }}, I'm the Jinja Ninja!

Your code to render and print the text to console would be:

from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader(searchpath='/path/to/templates'))
jinja_ninja_template = env.get_template('template.txt')

# Two Ways to Format the Rendering. I'll Show Both.
my_name = "Josh"
named_arg_render = jinja_ninja_template.render(name = my_name)

variable_dict = {'name': my_name}
dictionary_render = jinja_ninja_template.render(variable_dict)

print(named_arg_render)
# Would print "Hello Josh, I'm the Jinja Ninja!"
print(dictionary_render)
# Would also print "Hello Josh, I'm the Jinja Ninja!"

Advanced Fancy Templating for Mad Libs

I adapted some classic literature to a mad libs format because I didn't want to write things from scratch. Poems especially make some pretty weird and funny mad libs. And my template for Stopping by Woods on a Snowy Evening by Robert Frost allowed me to demonstrate some really interesting bits of Jinja's templating.

Stopping by {{ plural_noun_1|capitalize }} on a {{ noun_3|capitalize ~ 'y' }} Evening

by Robert Frost and {{ your_name }}

Whose {{ plural_noun_1 }} these are I think I know.   
His {{ noun_1 }} is in the {{ noun_2 }} though;   
He will not see me stopping here   
To watch his {{ plural_noun_1 }} fill up with {{ noun_3 }}.   

My little {{ noun_4 }} must think it queer   
To stop without a {{ noun_5 }} near   
Between the {{ plural_noun_1 }} and {{ adjective_1 }} lake   
The {{ superlative_adjective_1 }} evening of the year.   

He gives his {{ noun_6 }} a shake   
To {{ verb_1 }} if there is some mistake.   
The only other sound’s the sweep   
Of {{ adjective_2 }} wind and downy {{ noun_7 }}.   

The {{ plural_noun_1 }} are {{ adjective_3 }}, dark and deep,   
But I have {{ plural_noun_2 }} to keep,
{% for i in range(2) %}   
And {{ plural_noun_3 }} to {{ verb_2 }} before I {{ verb_3 }}{% if loop.last %}.{% else %},{% endif %}
{% endfor %}
  • Whereas you might do something like string_variable_name.capitalize() in Python to capitalize a string, you use the pipe operator to perform "filtering" in the template.
  • String concatenation is done via the tilde ~
  • Looping is absolutely allowed using typical Python for loops. Code blocks in the template use the {% %} format. There is a dictionary called loop that is exposed in a for loop (documentation here) to summarize useful information about the loop.
  • You can use part of that loop dictionary as a condition in an if statement. Notice in the last line of the template how I have an if/else block nested inside the for loop. Its entire purpose is to determine whether to put a comma or period at the end of the line that repeats.
  • The documentation and error handling for Jinja is incredible. The official documentation website and a little experimentation were completely sufficient to work through everything in this project.

What if You Don't Know What Variables Are In the Template?

To create a generic script to interpret the Jinja templates, it needs to be able to handle a varying number of verbs, nouns, and other parts of speech, as well as more unusual stuff like proper nouns, collections of words that rhyme, or whatever else is there. So, we need to make our script discover what's in the template. Here's how:

from jinja2 import Environment, FileSystemLoader, meta

env = Environment(loader = FileSystemLoader(searchpath = '/path/to/templates'))
template_name = 'woods.txt' # whichever story template we're rendering

# This is the part where we figure out what we need
template_src = env.loader.get_source(env, template_name)[0] # get_source returns (source, filename, uptodate)
template_parsed = env.parse(template_src)
parts_of_speech = meta.find_undeclared_variables(template_parsed) # Returns a set
user_prompts = list(parts_of_speech) # A list gives a stable order and a remove() that raises ValueError
try:
    # If you use range() to make a for loop,
    # Jinja2 will return 'range' as an
    # undeclared variable. You have to ignore it.
    user_prompts.remove("range")
except ValueError:
    pass

# This is where we prompt the user
dictionary_for_rendering = {}
for prompt in user_prompts:
    dictionary_for_rendering[prompt] = input(f'{prompt}: ')

# And now we render
template = env.get_template(template_name)
print(template.render(dictionary_for_rendering))

This Was The Initial Presentation

I successfully presented a command line script that would discover variables in a Jinja template, prompt the user to fill them in, and display the result. There was wild applause in my own head.

How Did This Get Out Of Hand?

A few things happened to transform this effort from a fun and silly presentation to a truly wacky side project. First, I brought my laptop to a coffee shop one Saturday morning and everyone around me got really excited to screw up classic literature via random word choices. Second, I completed my fireplace project and learned how truly easy Flask is to use. Third, I began planning a presentation that would require some sample network data (working title is Did the Infrastructure or the Application Mess Up: Fault Domain Isolation Through Packet Capture Timing Analysis).

Let's Adapt This For A Browser

This was the perfect opportunity to use Jinja for what it was actually designed for: dynamically rendered HTML!

I knew I wanted three webpages: one to present the story options, one to prompt the user for the parts of speech, and one to show the result. This was plenty to get the stub going:

from flask import Flask, request
from jinja2 import Environment, FileSystemLoader

app = Flask(__name__) # the application object needs to live at module (global) scope
web_env = Environment(
            loader = FileSystemLoader(searchpath = r'./html_templates')
        )
story_env = Environment(
            loader = FileSystemLoader(searchpath = r'./story_templates')
        )

@app.route('/')
def home_page():
    pass

@app.route('/show_form')
def madlib_form():
    pass

@app.route('/show_output', methods=['POST'])
def present_madlib():
    pass

if __name__ == '__main__':
    app.run(debug = True) # debug mode dynamically reloads the web server upon code changes

Because I wanted this to become a little more appropriate for the real world, I decided to change the undeclared variable search to just a list in a file. On an actual web server, it would save a decent amount of processing, comparatively. So, I offloaded all the story attributes to a JSON file.

{
    "stories": [
        {
            "id": 1,
            "name": "The Woods",
            "template": "woods.txt",
            "portrait": "/portraits/robert_frost.jpg",
            "attributes": [
                {"type": "Your Name", "var_name": "your_name"},
                {"type": "Plural Noun", "var_name": "plural_noun_1"},
                {"type": "Plural Noun", "var_name": "plural_noun_2"},
                {"type": "Plural Noun", "var_name": "plural_noun_3"},
                {"type": "Noun", "var_name": "noun_1"},
                {"type": "Noun", "var_name": "noun_2"},
                {"type": "Noun", "var_name": "noun_3"},
                {"type": "Noun", "var_name": "noun_4"},
                {"type": "Noun", "var_name": "noun_5"},
                {"type": "Noun", "var_name": "noun_6"},
                {"type": "Noun", "var_name": "noun_7"},
                {"type": "Adjective", "var_name": "adjective_1"},
                {"type": "Adjective", "var_name": "adjective_2"},
                {"type": "Adjective", "var_name": "adjective_3"},
                {"type": "Superlative Adjective", "var_name": "superlative_adjective_1"},
                {"type": "Verb", "var_name": "verb_1"},
                {"type": "Verb", "var_name": "verb_2"},
                {"type": "Verb", "var_name": "verb_3"}
            ]
        }
    ]
}

Don't worry, we'll get to that portraits value a little further in the descent into madness. For now, this gives us plenty of information to start rendering some web pages.

def home_page():
    home_page_template = web_env.get_template('home.html')
    stories = madlibs_helpers.get_story_list() 
    # I offloaded local file operations to a helper file.
    # If in the future I take this from JSON files to a
    # database, there's no need to mess with the Flask code.
    return home_page_template.render(story_list = stories)

This then renders the following HTML templates:

home.html

{% extends "base.html" %}
{% block body %}
    <h2>Choose from the following mad libs!</h2>
    {% for story in story_list %}
    <p><a href="/show_form?id={{ story.id }}">{{ story.name }}</a></p>
    {% endfor %}
{% endblock %}

base.html - Notice how I fill in the body block in home.html

<!DOCTYPE html>
<head>
    {% block head %}
    <title>Josh's Mad Libs!</title>
    {% endblock %}
</head>
<body>
    {% block body %}
    {% endblock %}
</body>

The other two pages went just as easily, and I was able to start capturing data for my other presentation.

Not Enough Data

The problem with making a really easy, lightweight web application is that when you need to use it to demonstrate issues that can happen with slightly heavier web applications, it doesn't generate enough traffic. I didn't even fill full packets with those HTML responses!

So, what takes a decent amount of data? Pictures! Why not put a picture of the original author of these selections of classic literature along with the filled-in mad lib? Why not find pictures where the author looks truly disappointed at how the user has ruined their life's work?

(Image: thoreau_disappointed.PNG)

That's where that portrait attribute comes into play.

Where Are We Now?

So now, I have a web application that generates a decent amount of traffic that I can use to model network issues, server issues, backend access issues, and other things for my timing analysis presentation. And since I put so much effort into it, I decided I should just release it to the world!

I stuck it on a lightweight Droplet from DigitalOcean. You can check it out at madlibs.je-clark.com. All the code (minus some config work to get things going on the Droplet) is on Github.

Feel free to let me know if there are any other selections from classic literature that would make good (or truly horrible) Mad Libs. I've discovered that things between 100 and 150 words work pretty well.

Have fun! I know I have.

I Can REST Easy Because My Fireplace is API Controlled Now

I'm not sure why this is a thing, but my last couple apartments have had non-functional fireplaces. At best, it's a decent-looking mantle to put some pictures on; at worst, it's just a waste of space. But I endeavor to elevate everything in my apartment! I want a fire!

The idea is to stick an old monitor inside the fireplace, connect it to a Raspberry Pi, and come up with some way to start playing a video of a lovely crackling fire.

How Do I Start the Video?

Because the Raspberry Pi runs Linux, my first thought was "just do a one line bash script to start VLC Player." My second thought was "Nah, I need that Python street cred," and the decision was made.

It turns out there are several ways to run system commands with Python, and it took some experimentation to find the proper way.

The simplest way is just to run it as a system command:

import os
os.system("cvlc -f fireplace_vid.mp4")

However, it seems that my Python installation tries to run this command as root, and VLC doesn't like being run as root. So, I moved to the subprocess library:

import subprocess
subprocess.run(['cvlc','-f','fireplace_vid.mp4'])

This is the newfangled wrapper on top of Popen that tries to make things a little simpler. However, the problem I found is that this command doesn't return until the process is finished. That doesn't really work for a 10 hour video. In light of that, I finally went to the good ol' Popen:

import subprocess
subprocess.Popen(['cvlc','-f','fireplace_vid.mp4'])

This did exactly what I wanted. It ran the call, which started the video, and returned immediately.

How Do I Stop the Video?

Like I mentioned, this is a 10 hour video, and I didn't want to start it and have no way to turn it off other than yanking power to my Raspberry Pi. I ended up going with the simplest approach, which ended up being very useful with the eventual API implementation:

import os
os.system('killall vlc') # Burn it all down!

What's This Flask Thing?

At this point, I started thinking about how to turn it on and off. I didn't want to have to SSH in and run commands every time I wanted a fire, so I started thinking about using an API. It would be so easy, just hit a link in my browser bookmarks and BOOM! Fire!

One of my buddies mentioned an idea of using the Flask package to guide a presentation on HTTP APIs, so I decided to try it out. I found a wonderful tutorial that laid out everything I needed. I had almost all my code laid out minutes later.

from flask import Flask
import os, subprocess

app = Flask(__name__)

@app.route('/fire', methods=['GET'])
def start_fire():
    subprocess.Popen(['cvlc','-f','fireplace_vid.mp4']) # Popen returns immediately, so no shell-style '&' is needed
    return 'Fire Started!'

@app.route('/extinguish', methods=['GET'])
def stop_fire():
    os.system('killall vlc')
    return 'No Mo Fire'

if __name__ == '__main__':
    app.run(host='0.0.0.0')

This clearly isn't production-ready code, but it's totally fine for my home use. The decorators just above each function create the endpoint on the web server and call the correct function when an HTTP request comes in.

Can I Turn the Monitor Off, Too?

Now that I'm already really fancy, can I have the monitor go black when there's no fire playing? And can I wake it up just before I start the video? It turns out that Raspbian has a command called tvservice. All I need to do is add that into each of my functions.

from flask import Flask
import os, subprocess

app = Flask(__name__)

@app.route('/fire', methods=['GET'])
def start_fire():
    os.system('tvservice --preferred') # Starts the monitor with its preferred settings
    subprocess.Popen(['cvlc','-f','fireplace_vid.mp4'])
    return 'Fire Started!'

@app.route('/extinguish', methods=['GET'])
def stop_fire():
    os.system('killall vlc')
    os.system('tvservice --off')
    return 'No Mo Fire'

if __name__ == '__main__':
    app.run(host='0.0.0.0')

Is That It?

It is! 15 lines of code to run a smartphone controlled digital fireplace.

Python Web Scraping to Detect PCI Errors

Right now, I am dealing with a little bit of instability within the packet capture environment at work. It turns out that when you completely overload one of these devices, it logs a PCI error and then stops responding to everything. It can be quite high effort to recover the device, requiring in at least one case a complete rebuild of the filesystem. The situation brings up imagery of a clogged toilet. And much like a clogged toilet, even clearing it out can still leave some issues down the line. I want to determine the true root cause.

The method to do that is to force a massive amount of traffic to one of the probes in our development environment, let it choke, and react quickly (within a day) to recover the probe, grab crash data, and send it to the vendor. Crash detection is most accurately performed by monitoring for a PCI error. These are visible in the IPMI console's event log (think ILO). However, I really don't want to take the couple minutes every day to log in and check it. So, let's automate it!

Python Web Scraping Options

  • Requests - This is the de facto standard Python library for HTTP interaction. I will likely be using this to perform the GETs and POSTs I need.
  • Beautiful Soup (BS4) - This module is built for decoding HTML. If the information I need is in a <div> with a specific name, BS4 is the best way to get it.
  • Selenium - This is a bit of a long shot module. If the information is embedded in a Java web applet or something else that I can't interact with any other way, Selenium remotely controls a web browser. It's a little annoying to work with for simpler web interaction, so I really want to avoid this.

Investigation to Determine Scraping Strategy

To determine which approach to take, I like to start by performing my normal browsing with the browser's developer mode turned on. As you can see below, this bit of navigation involved 80 HTTP requests/responses.

Sloppily redacted browser window

As I look through all this information, there are a number of things that jump out as very relevant. First, I see a POST to /cgi/login.cgi. Nothing about the headers looks all that important, but the Params tab shows that there's a form with username and password. So, authentication will require me to pass a username and password to /cgi/login.cgi.

Next, I need to see how the information I care about is being delivered. One of the many POSTs to /cgi/ipmi.cgi returns a bunch of XML with timestamps that match up with the event log table. However, the actual values aren't human readable.

<?xml version="1.0"?>
<IPMI>
    <SEL_INFO>
        <SEL TOTAL_NUMBER="0011"/>
        <SEL TIME="2019/04/27 11:44:30" SEL_RD="0100029e40c45c20000405516ff0ffff" SENSOR_ID="Chassis Intru   " ERTYPE="6f"/>
        <SEL TIME="2019/06/12 19:01:38" SEL_RD="020002124c015d33000413006fa58012" SENSOR_ID="NO Sensor String" ERTYPE="FF"/>
        <SEL TIME="2019/06/14 18:10:39" SEL_RD="0300021fe3035d200004c8ff6fa0ffff" SENSOR_ID="NO Sensor String" ERTYPE="FF"/>
        <SEL TIME="2019/06/14 18:14:23" SEL_RD="040002ffe3035d20000405516ff0ffff" SENSOR_ID="Chassis Intru   " ERTYPE="6f"/>
        <SEL TIME="2019/06/14 19:20:16" SEL_RD="05000270f3035d33000413006fa58012" SENSOR_ID="NO Sensor String" ERTYPE="FF"/>
        <SEL TIME="2019/06/17 18:30:51" SEL_RD="0600025bdc075d200004c8ff6fa0ffff" SENSOR_ID="NO Sensor String" ERTYPE="FF"/>
        <SEL TIME="2019/06/17 18:34:34" SEL_RD="0700023add075d20000405516ff0ffff" SENSOR_ID="Chassis Intru   " ERTYPE="6f"/>
        <SEL TIME="2019/06/17 19:17:55" SEL_RD="08000263e7075d33000413006fa58012" SENSOR_ID="NO Sensor String" ERTYPE="FF"/>
        <SEL TIME="2019/06/17 19:46:05" SEL_RD="090002fded075d200004c8ff6fa0ffff" SENSOR_ID="NO Sensor String" ERTYPE="FF"/>
        <SEL TIME="2019/06/17 19:49:53" SEL_RD="0a0002e1ee075d20000405516ff0ffff" SENSOR_ID="Chassis Intru   " ERTYPE="6f"/>
        <SEL TIME="2019/06/18 13:04:42" SEL_RD="0b00026ae1085d200004c8ff6fa0ffff" SENSOR_ID="NO Sensor String" ERTYPE="FF"/>
        <SEL TIME="2019/06/18 13:08:28" SEL_RD="0c00024ce2085d20000405516ff0ffff" SENSOR_ID="Chassis Intru   " ERTYPE="6f"/>
        <SEL TIME="2019/06/18 15:25:47" SEL_RD="0d00027b02095d200004c8ff6fa0ffff" SENSOR_ID="NO Sensor String" ERTYPE="FF"/>
        <SEL TIME="2019/06/18 15:29:31" SEL_RD="0e00025b03095d20000405516ff0ffff" SENSOR_ID="Chassis Intru   " ERTYPE="6f"/>
        <SEL TIME="2019/06/18 17:59:54" SEL_RD="0f00029a26095d200004c8ff6fa0ffff" SENSOR_ID="NO Sensor String" ERTYPE="FF"/>
        <SEL TIME="2019/06/18 18:03:41" SEL_RD="1000027d27095d20000405516ff0ffff" SENSOR_ID="Chassis Intru   " ERTYPE="6f"/>
        <SEL TIME="2019/06/18 20:38:09" SEL_RD="110002b14b095d33000413006fa58012" SENSOR_ID="NO Sensor String" ERTYPE="FF"/>
    </SEL_INFO>
</IPMI>

It's also important to note that the POST includes a form that requests:

SEL_INFO.XML : (1, c0)

So, now we know how to authenticate ourselves and how to get the data we need, but we don't yet have a way to interpret it. After more searching around, the HTML file returned from a GET to /cgi/url_redirect.cgi?url_name=servh_event includes a block of Javascript code that decodes that XML. While I probably could somehow run the XML through that Javascript code, I'm just going to understand how it works and recreate the parts I need.

Based on this investigation, it looks to me like I can get everything I need using only Requests. No need to involve those other modules.

Interpreting the XML Values

Through careful analysis of the Javascript code (and a lot of Ctrl-Fs), it looks like there are two really relevant pieces of code. The first is a switch on a value called tmp_sensor_type. Apparently, if this variable has the value 0x13 that means there's a PCI error. Also, I don't know who Linda is, but I appreciate that she made this whole thing possible.

case 0x13: //Linda added PCI error

    //alert("PCI ERR message detected! value = "+ sel_traveler[10]);
    var PCI_errtype = sel_traveler[10].substr(7,1);
    PCI_errtype = "0x" + PCI_errtype;
    var bus_id = sel_traveler[10].substr(8,2);
    var fnc_id = sel_traveler[10].substr(10,2);

    sensor_name_str = "Bus" + bus_id.toString(16).toUpperCase();
    sensor_name_str += "(DevFn" + fnc_id.toString(16).toUpperCase() + ")";

    if(PCI_errtype == 0x4)
        tmp_str = "PCI PERR";
    else if(PCI_errtype == 0x5)
        tmp_str = "PCI SERR";
    else if(PCI_errtype == 0x7)
        tmp_str = "PCI-e Correctable Error";
    else if(PCI_errtype == 0x8)
        tmp_str = "PCI-e Non-Fatal Error";
    else if(PCI_errtype == 0xa)
        tmp_str = "PCI-e Fatal Error";
    else
        tmp_str = "PCI ERR";

break;

The second relevant piece of code shows how tmp_sensor_type is determined. If it were in one discrete block, that would be nice. Instead, it's spread across a few hundred lines. Tracing through the variable assignments, I can see that:

tmp_sensor_type = sel_traveler[3];

I can also see:

sel_traveler = sel_buf[j];

It also appears that:

sel_buf[i-1][3] = stype;

And if I look at how we get stype, I find a line with an extremely helpful comment:

stype = parseInt(ch[5],16); // take 11th byte

I don't know anything about Javascript, but I can find the 11th byte of something. And sure enough, if I go back to SEL_RD from my XML output, the 11th byte of all the PCI errors is 0x13.
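
As a quick sanity check in Python (using one of the PCI error SEL_RD values from the XML above):

# One of the PCI error events from the XML output above
sel_rd = '020002124c015d33000413006fa58012'

# Each byte is two hex characters, so the 11th byte (index 10) is characters 20-21
print(sel_rd[10 * 2:10 * 2 + 2])  # prints '13'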

Structure of the Python Script

Now that my investigation is done, I can identify what I need my Python script to do:

  1. Get the IPMI URL or IP address from the command line.
  2. Authenticate against /cgi/login.cgi.
  3. POST to /cgi/ipmi.cgi with the appropriate form.
  4. Navigate the resulting XML to find my SEL_RD values.
  5. See if the 11th byte is 0x13.
  6. Do something.

Implementation

Get the URL from the Command Line

This is pretty easy. It doesn't matter whether the argument is a URL or an IP address, and it's the only argument I'm taking.

import sys

url = sys.argv[1]

Authenticate against /cgi/login.cgi

I'm doing a couple noteworthy things here. First, I have my authentication details in a separate file called auth.py. Second, I'm using a Requests Session, which automatically keeps track of cookies and other session information for me.

import requests
from auth import ipmi_username, ipmi_pwd

auth_form = {'name': ipmi_username, 'pwd': ipmi_pwd}

ipmi_sesh = requests.Session()
ipmi_sesh.post(f'http://{url}/cgi/login.cgi', data=auth_form)

Request the Event Log Data

I already have my HTTP session set up, so I just need to ask for what I need.

event_log_form = {'SEL_INFO.XML': '(1,c0)'}

response = ipmi_sesh.post(f'http://{url}/cgi/ipmi.cgi', data=event_log_form)

Get SEL_RD from the Returned XML

Python includes some resources that let you interpret XML relatively easily. Each element within the hierarchy can be iterated over like a list of its children.

import xml.etree.ElementTree as ET

ipmi_events = ET.fromstring(response.text)
events = []
for SEL_INFO in ipmi_events:
    for SEL in SEL_INFO:
        events.append(SEL.attrib)

At this point, I have a list of dictionaries, where each dictionary includes the attributes of each SEL from the XML.

for event in events[1:]:  # first value is the total number of events, so we can skip it
    print(event.get('SEL_RD'))

I'm not actively printing SEL_RD in the final script, but that's how I isolate it.

Check the 11th Byte

Slicing byte arrays and strings in Python still doesn't come intuitively to me. It took some trial and error to find the [20:22].

if event.get('SEL_RD')[20:22] == '13':
    # do something

Do Something

For this example, my 'do something' is just printing the timestamp, but we can record it in a log, send an email, or do whatever else we want.

print(f"Found PCI SERR at {event.get('TIME')}")
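
If I wanted something more durable than a print, a logging-based variant could look like this (a sketch, not part of the final script):

import logging

logging.basicConfig(filename='pci_errors.log', level=logging.INFO)

# Hypothetical replacement for the print above
logging.info('Found PCI error at %s on %s', event.get('TIME'), url)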

Conclusion

Python web scraping is surprisingly easy, and it would be a whole lot easier if I didn't have to learn to read Javascript to make my example work.

You can check out the full script on Github.

Audit TLS Version of Devices Connecting to Your Gear with a PCAP

I had a request come across my desk one day: we want to disable SSLv3 and TLS 1.0 on a set of 20+ servers, but would rather not rely on a scream test to see who will be affected. Can you tell us who is connecting to our gear with unsafe versions of these protocols?

I did some research to see if that information might be buried in the Windows Event Log somewhere, but it isn’t. I asked if it might be in an application log somewhere, but it isn’t. I took a quick look through our monitoring systems to see if there would be an easy answer, but there isn’t. Unfortunately, the best way to perform this audit, at least with my skillset, is to perform a packet capture on each server and extract the negotiated SSL/TLS version from each session.

So, I set up a packet capture (with the capture filter set to specific TCP ports and headers only) on each one of these servers and hoped that I could come up with an analysis script by the time I stopped those captures. I didn’t quite get there.

Main Points of This Post

These are the main things I wanted to do when writing this script, and the main points I want to make in this post:

  1. Using enums to avoid the dreaded “magic number” confusion

  2. Working with Binary Files in Python

  3. Being bold in the interests of high performance

Primary Structure of the Script

To help orient you (and myself), this is a brief description of how I structured the script.

First, I take in the pcap or directory of pcaps, and pass one of them into the meat of the script.

Next, I look into the global header of the file to determine the endianness of the file (thus ensuring my binary interpretations are accurate later) and to ensure the file contains Ethernet frames (to make sure my offsets work right).

I then pull a packet from the file into a byte array. I discovered that it is much faster to work with a byte array in memory than a file pointer in a multi-gigabyte capture file.

At this point, most packet dissection tools would extract all the header information from the packet and store it in some struct format. I can save a little bit of compute time by first checking to see if the packet is a TLS handshake, specifically a Client or Server hello.

If I find out that I do care about the packet, then I'll grab the IP and TCP headers, extract the relevant details (5-tuple and TLS version), and save that in a list, which gets written to a CSV file after I've gone through the entire file.
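
As a rough sketch of that early check (simplified here by assuming 20-byte IP and TCP headers; the real script computes these lengths from the IHL and data offset fields):

ETH_HDR = 14   # Ethernet header length
IP_HDR = 20    # assumed: no IP options
TCP_HDR = 20   # assumed: no TCP options
TLS_START = ETH_HDR + IP_HDR + TCP_HDR

def looks_like_tls_hello(packet):
    # TLS record content type 0x16 is a handshake;
    # handshake types 0x01 and 0x02 are Client Hello and Server Hello
    return (len(packet) > TLS_START + 5
            and packet[TLS_START] == 0x16
            and packet[TLS_START + 5] in (0x01, 0x02))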

Using Enums in Python

I had an embedded systems professor in college who always stressed "NO MAGIC NUMBERS IN YOUR CODE!" If he or a TA spotted a number in your code anywhere, it was points off. This sort of script, where I'm working at a relatively low level with binary files, seemed like a good application to try avoiding hard-to-read numbers.

The structure in Python to do this seemed to be the Enum. My list for this script is as follows:

from enum import IntEnum

class Constant(IntEnum):
  ETH_HDR = 14
  IHL_MASK = int('00001111',2)
  TCP_DATA_OFFSET_MASK = int('11110000',2)
  WORDS_TO_BYTES = 4
  DATA_OFFSET_BITWISE_SHIFT = 4
  TLS_RCD_LYR_LEN_OFFSET = 3
  TLS_RCD_LYR_LEN = 5
  IP_SRC_OFFSET = 12
  IP_DST_OFFSET = 16
  IP_ADDR_LEN = 4
  TCP_DST_OFFSET = 2
  TCP_PORT_LEN = 2
  TLS_HANDSHAKE_VER_OFFSET = 9
  TLS_VER_LEN = 2

To use these, I just had to reference Constant.ETH_HDR.

It was a good idea, but there were also lines in this script where it didn't contribute that much to readability. Below is one excellent example of that, where I used a ton of Enums to perform array slicing, bitwise comparisons, bitwise shifts, and mathematical operations in a single line. And there are somehow still magic numbers in there.

tcp_len = ((int.from_bytes(packet[Constant.ETH_HDR + ip_len + 12: Constant.ETH_HDR + ip_len + 12 + 1], byteorder = 'big') & 
            Constant.TCP_DATA_OFFSET_MASK) >> Constant.DATA_OFFSET_BITWISE_SHIFT) * Constant.WORDS_TO_BYTES

Binary Files In Python

Python is surprisingly good at dealing with binary files. All you need to do is add a b to the mode string in the second argument of the open() function, and things are really nice. Example below:

file_handle = open(<file_name>,'rb+')

Once you do this, you have a pointer that you can move around however you want using the seek() method. The first argument of that method is the desired offset, and the second argument is the place you want to offset from (0 to reference the beginning of the file, and 1 to reference the current location).

A perfect example of how to use this is interpreting the headers in the PCAP file format. The global header (I used this site as a reference) has the following format:

typedef struct pcap_hdr_s {
    guint32 magic_number;   /* magic number */
    guint16 version_major;  /* major version number */
    guint16 version_minor;  /* minor version number */
    gint32  thiszone;       /* GMT to local correction */
    guint32 sigfigs;        /* accuracy of timestamps */
    guint32 snaplen;        /* max length of captured packets, in octets */
    guint32 network;        /* data link type */
} pcap_hdr_t;

From here, I care about the magic number and the data link type. The magic number tells me the endianness of the system, which is required to interpret any further data, and the data link type makes sure I'm dealing with Ethernet frames. If I'm not, all of my offsets are screwed up and I might as well abort now.

To read this information, I use the following code:

def report_global_header(file_ptr):
    # assumes file_ptr is at head
    # of pcap file
    # returns byte order
    #     big
    #     little
    # confirms ethernet frame
    #     true if ethernet frame

    magic_num = file_ptr.read(4)
    magic_test = b'\xa1\xb2\xc3\xd4'
    if magic_num == magic_test:
        order = 'big'
    else:
        order = 'little'

    file_ptr.seek(16, 1)
    link_layer_type = file_ptr.read(4)
    ethernet = int.from_bytes(link_layer_type, byteorder = order)
    is_ethernet = ethernet == 1

    return order, is_ethernet

The really important bits here are the read() and seek() methods, because these move the file pointer.

I start by reading the magic number, which moves the pointer 4 bytes forward. Then I seek() 16 bytes ahead of the current pointer location, bringing me to the network variable.

I perform similar operations to interpret the packet header and read the packet to a byte array.
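
For completeness, here is a sketch of that per-packet step (read_packet is a name I'm using for illustration; the 16-byte record header layout comes from the same pcap format reference):

def read_packet(file_ptr, order):
    # Each pcap record header is 16 bytes: ts_sec, ts_usec, incl_len, orig_len (4 bytes each)
    record_header = file_ptr.read(16)
    if len(record_header) < 16:
        return None  # end of file
    incl_len = int.from_bytes(record_header[8:12], byteorder=order)
    # Pull the whole packet into a byte array so later slicing happens in memory
    return bytearray(file_ptr.read(incl_len))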

Being Bold for High Performance

I'll go ahead and qualify this entire section by pointing out that if I was truly trying to be bold for high performance, I would have written this in C.

For this script, I chose not to use an existing Python packet dissection tool like Scapy because of performance. Scapy, in particular, is slow and uses a ton of memory. And I wasn't able to find another pre-built tool that offered a significant increase in speed. I implemented my own packet dissection because I could have more control over what data I was pulling and when. I perform only a small number of operations on packets that aren't relevant to my investigation, and I don't interpret a single piece of information that isn't necessary for the goal of the script.

By making small performance improvements over the course of development, I was able to process over 50GB of packet capture data in about 10 minutes with negligible impact on my laptop's CPU or RAM. Here are a few of the tactics I used to increase performance:

  • Pulling the entire packet into a byte array, instead of working with it in the file.
  • Reorganizing the dissection to skip over the IP and TCP headers and pull the TLS content type first.
  • Push interpretation as far back in the script as possible. If I can make my comparison against the binary representation of my desired value, I save cycles.

Conclusion

I was able to develop a script (available here) that could interpret a pcap-based packet capture to audit the SSL/TLS version that other systems use to connect to it. This gave my colleague a nice list of application owners to talk into upgrading TLS before turning off support on his side. He successfully disabled support for the unsafe protocol versions on all of his servers without impacting any applications that relied on them.

In developing this script, I was able to mess with binary files and work with binary representations of data in Python, discovering that Python makes it surprisingly easy. I was also able to implement the wise advice of an old college professor and (mostly) avoid magic numbers in my code. It was a fun project!

Efficient Administration of Network Appliances

Beyond troubleshooting and packet analysis, part of my job is administration of the network monitoring tools we use to perform troubleshooting and packet analysis. One vendor we rely on quite heavily is Endace for their line of packet capture appliances. As part of my work with their gear, I’ve learned how to script things out for efficiency.

For capacity planning purposes, I’ve automated the process to obtain and format file system and packet capture information. Additionally, Endace recently released a hotfix for all of their devices, and I scripted things out to deploy it concurrently.

Overview of Endace Devices

The Endace architecture is relatively simple. Probes distributed throughout the network capture and store packets, and they are all managed by a Central Management Server (CMS).

The probes store databases full of packet data within an abstraction they call rotation files. These act like a ring buffer, refreshing the oldest 4GB block when the buffer fills up.

For most tasks on the system administration side, the CMS can distribute commands to all probes, groups of probes, or individual probes.

However, there are some cases where CMS orchestration doesn’t work, and installation of hotfixes is one of those. Hotfixes require the device to be in maintenance mode, which kills most of the client processes on the system, including the RabbitMQ client, the CMS orchestration mechanism. If I have 20+ devices on which to install a hotfix, requiring 2-3 minutes per device, not including waiting for a reboot to complete, I really don’t want to do that all in serial (as opposed to, and including, over serial) or manually. That’s where Python and netmiko come into play.

Netmiko Support for Endace

For those unfamiliar, netmiko is a Python library for network scripting. It adds a layer of abstraction on top of paramiko, the Python ssh library, to dumb things down enough for network engineers like me to work effectively. It allows us to send commands and view output without worrying about buffering input or output, figuring out when we can send more information, or writing regex to figure out if we’re in enable mode or configuration mode.

Since Endace is a relatively small vendor (as compared to Cisco or Juniper), netmiko doesn’t support it by default. Fortunately, Kirk Byers, the creator of netmiko, makes it really easy to add support for new operating systems. It took me a couple of days to build in support (available here). It took me a few more days to figure out how to make threading integrate with it to perform tasks on multiple devices in parallel.

At the end of that journey, I had a system to effectively distribute commands and collect information. For example, I wrote a quick script to extract information about the file systems and rotation files (basically PCAP files in a ring buffer made of 4GB blocks) to CSV files.

Extract Important Information

This script has three main parts. After initializing the CSV files, it iterates through the devices:

for device in EndaceDevices.prod_probes:
    connection = ConnectHandler(**device)
    fsInfo = getFileSystemInfo(connection)
    rotInfo = getRotFileInfo(connection)
    connection.disconnect()
    print_output_to_csv(fsInfo, fscsvName, fsTemplate)
    print_output_to_csv(rotInfo, rotcsvName, rotTemplate)

First, it connects to the device and runs the relevant show commands.

def getFileSystemInfo(connection):
    connection.enable() # Note how netmiko provides a method to enter enable mode
    output = connection.find_prompt() # This is just here to output the device name
    output += "\n"
    output += connection.send_command("show files system")
    return output

def getRotFileInfo(connection):
    connection.enable()
    output = connection.find_prompt()
    output += "\n"
    output += connection.send_command("show erfstream rotation-file")
    return output

Then, it uses TextFSM to parse the output. TextFSM is a Python library developed by Google that applies a template of regular expressions to blocks of text that appear in a consistent format. For example, the file systems template looks like:

Value Filldown Device (\S+)
Value Required TotalCapacity (\d+.?\d*)
Value UsedCapacity (\d+.?\d*)
Value UsableCapacity (\d+.?\d*)
Value FreeCapacity (\d+.?\d*)

Start
  ^${Device} #
  ^\s+Bytes Total\s+${TotalCapacity}
  ^\s+Bytes Used\s+${UsedCapacity}
  ^\s+Bytes Usable\s+${UsableCapacity}
  ^\s+Bytes Free\s+${FreeCapacity} -> Record

The top half of the template defines the variables I want to capture in the format Value <optional modifier> <Name> (<regex>).

The modifier Filldown copies that value to every entry in that file. In this case, it isn't very useful since I'm only extracting one partition, but if I wanted every partition, Filldown would copy the device name to the entry of every partition.

Required is an important modifier, and at least one value must be designated as Required; it marks what counts as a valid entry. If a Required value doesn't appear, TextFSM skips the entire record.

The second part of the template describes the text block as a regular expression.

Once this template is in place, all we need to do to turn a random string into a well formed list is:

import textfsm

with open(templateFilename, "r") as template:
    re_table = textfsm.TextFSM(template)
data = re_table.ParseText(output)
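
The rows in data come back in the same order as the Value declarations, and re_table.header gives the column names, so a quick sanity check (hypothetical output shown) looks like:

print(re_table.header)
# ['Device', 'TotalCapacity', 'UsedCapacity', 'UsableCapacity', 'FreeCapacity']
for device, total, used, usable, free in data:
    print(device, free, 'of', total, 'bytes free')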

Finally, I use csvwriter to push it out to a CSV file:

import csv

with open(filename, 'a', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in data:
        writer.writerow(row)
    # no explicit close needed; the with block closes the file

Deploy an Endace Hotfix

There are four basic steps to applying most hotfixes with Endace. First, the hotfix file needs to find its way to the device. Second, it needs to go into maintenance mode. Next, the hotfix is installed. Finally, it gets rebooted.

I use pysftp to get the hotfix file to the device:

def transfer_via_sftp(filepath, hostname, user, pw):
    opts = pysftp.CnOpts()
    opts.hostkeys = None
    conn = pysftp.Connection(host=hostname,username=user, password=pw, cnopts = opts)
    conn.chdir("/endace/packages")
    conn.put(filepath) # Upload the file
    # pysftp.exists() does an ls and compares the given string against the output #
    if conn.exists("OSm6.4.x-CumulativeHotfix-v1.end"): 
        conn.close()
        print(hostname + "tx completed successfully for host " + hostname)
        return True
    else:
        conn.close()
        print(hostname + "tx failed. Upload manually for host " + hostname)
        return False

I didn’t format my code exceptionally well, so the next three steps are a little muddled:

def enter_maintenance_mode(connection):
    connection.config_mode()
    output = connection.send_command("maintenance-mode noconfirm", strip_prompt = False)
    # In maintenance mode, the cli prompt looks like:
    # (maintenance) <device_name> (configuration) #
    if "(maintenance)" in output:
        return True
    else:
        print(output)
        print("issue obtaining maintenance mode")
        return False

def install_hotfix(hotfix_name_full, hfn_installed, device):
    conn = ConnectHandler(**device)
    if not enter_maintenance_mode(conn): # Step 2: enter maintenance mode
        print(device["host"] + "could not enter maintenance mode. Failing gracefully")
        conn.close()
        return False
    conn.send_command("package install " + hotfix_name_full) # Step 3: install hotfix
    output = conn.send_command("show packages")
    if hfn_installed in output:
        print(device["host"] + "successfully installed on host " + device["host"] + ". rebooting")
        conn.save_config()
        # send_command_timing() doesn't wait for the cli prompt to return #
        conn.send_command_timing("reload") # Step 4: reboot
        return True
    else:
        print(device["host"] + "possible issue with install. Troubleshoot host " + device["host"])
        return False
    pass

Executing in Parallel

The above code snippets perform all the required tasks to install the hotfix, but I want to go one step farther and execute those steps in parallel across many devices. To do this, I use the threading library:

thread_list = []
devices = EndaceDevices.all_devices

for dev in devices:
    thread_list.append(threading.Thread(target=transfer_via_sftp, args=(path, dev["host"], dev['username'], dev['password'])))

for thread in thread_list:
    thread.start()

for thread in thread_list:
    thread.join()

print("File transferred everywhere")
thread_list = []

for dev in devices:
    thread_list.append(threading.Thread(target=install_hotfix, args=(hf_name_full, hf_name_as_installed,dev)))

for thread in thread_list:
    thread.start()

for thread in thread_list:
    thread.join()

print("Hotfix installed everyone. Monitor for devices coming back up")

First, I create all the threads. Then, I start them. Finally, I wait for them all to return. And then I can repeat for the next function I want to parallelize.
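
If I end up repeating that create/start/join pattern often, a small helper keeps it tidy (a sketch; run_in_parallel is not part of the original script):

import threading

def run_in_parallel(target, args_list):
    # One thread per argument tuple: create them all, start them all, then wait for all of them
    threads = [threading.Thread(target=target, args=args) for args in args_list]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

# Hypothetical usage mirroring the loops above:
# run_in_parallel(transfer_via_sftp, [(path, d["host"], d["username"], d["password"]) for d in devices])
# run_in_parallel(install_hotfix, [(hf_name_full, hf_name_as_installed, d) for d in devices])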

The full script is available on Github.

Conclusion

Python, with netmiko, is a worthwhile tool to increase the efficiency of collecting information or pushing patches or configuration. If netmiko doesn’t support your device, it’s very easy to extend it. TextFSM is a very useful tool to process strings into useful, structured information.

Fast Filtering of Load Balanced or Proxied Connections

A common problem exists when investigating network issues through packet analysis. Web scale architecture provides ways to distribute workloads, and those mechanisms tend to abstract away silly things like the 5-tuple frequently used to identify individual conversations. Both proxies and load balancers can obscure a client’s IP address and source TCP port, and that makes it difficult to isolate a specific conversation across multiple networking devices.

The goal of this post is to describe an automated way to see if traffic is making it past the proxy, which can be extended to isolate the 5-tuple for the part of the conversation behind the proxy (source IP, source port, destination IP, destination port, and protocol: TCP or UDP).

Architecture for this Scenario

To demonstrate how this can happen, a simple sample architecture can be used that shows a basic load balanced website.

Sample architecture depicting a user, a load balancer, and a pair of web servers

This assumes a few configuration options:

  1. The client connects via HTTPS to the load balancer.
  2. The load balancer performs SSL offload and the traffic within the data center is unencrypted.
  3. The load balancer replaces the client's IP address with its own Virtual IP Address (VIP).

These conditions typically mean that the load balancer will pass along a client's IP address within an X-Forwarded-For field within the HTTP header. This means that if we have firewall or load balancer logs that show incoming connections, or we have customers calling in who can give us their IP address, it is possible to isolate their traffic behind the load balancer.

Significance of a Novel Technique

If we can go through and find the traffic, why am I even writing this post?

Most packet analysis tools (including Wireshark, Endace, and some of the various Riverbed tools) are really good at quickly filtering through protocol headers because they’re optimized for parsing that binary data. They are less efficient at filtering through an ASCII-formatted HTTP header. Even Scapy, which is a Python library made for messing with packets, isn’t well optimized for looking through packet payloads.

Proposed Technique

For this task, it turns out that simplicity is key. To begin, we need a known client IP address and a broad packet capture behind the load balancer or proxy (it is really easy for a 5-10 minute packet capture to reach 100+ GB in an enterprise environment).

Step 1: Confirm the traffic exists

It doesn’t help to expend time and compute power searching through the details of a massive packet capture file if we don’t know the traffic under investigation even exists. I’ve found that the fastest way to do so is to ignore all the protocol headers. It takes time to process binary, especially if Scapy is doing it, so we can save a lot of time by reading in the PCAP as a regular text file.

To save on system memory (because I don’t have 100GB of RAM in my computer), we read line-by-line.

with open(pcap_fn, mode="rb") as pcap:
    print "PCAP Loaded"
    for line in pcap:
        iterate_basic_ip_check(line,target_ip)

In the HTTP header, the different fields are delimited by a newline character, so the X-Forwarded-For field we’re looking for appears on its own line using this technique, which allows us to match an IP address with some really simple regex.

def iterate_basic_ip_check(line, target):
    match = re.match(r'X-Forwarded-For: (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})', line)
    if match:
        if match.group(1) == target:
            print "Target IP %s found" % target

All put together, this script (available on Github) runs blazingly fast. I estimated 200+ MB/sec on my machine, and it’s possible to parallelize this workload to take advantage of multiple cores.

Step 2: Isolate the 5-tuple

Once we know the correct traffic exists, we can re-iterate using Scapy to identify the 5-tuple, or potentially multiple 5-tuples, used. This is left as an exercise for the reader (you can thank my engineering textbooks for teaching me this wonderful and horribly frustrating phrase).
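
If you'd rather have a starting point than a pure exercise, here is a rough sketch (the function name is mine, and it assumes the HTTP payloads behind the load balancer are cleartext, as in the architecture above):

from scapy.all import PcapReader, IP, TCP, Raw

def find_five_tuples(pcap_fn, target_ip):
    # PcapReader iterates lazily, so the whole capture never has to fit in memory
    tuples = set()
    for pkt in PcapReader(pcap_fn):
        if not (pkt.haslayer(IP) and pkt.haslayer(TCP) and pkt.haslayer(Raw)):
            continue
        # Only keep packets whose payload carries the X-Forwarded-For value we care about
        if ('X-Forwarded-For: ' + target_ip).encode() in pkt[Raw].load:
            tuples.add((pkt[IP].src, pkt[TCP].sport, pkt[IP].dst, pkt[TCP].dport, 'TCP'))
    return tuples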

Conclusion

If you find yourself in the position where an expected IP address is disappearing behind a proxy or load balancer, it is possible to process a fairly large amount of data to isolate the conversation in the next segment of the network as long as the HTTP header is exposed.

Dissecting DNS Responses With Scapy

Introduction

Packet analysis to support troubleshooting is a big part of my job. In a company with hundreds of discrete applications, it is not reasonable to memorize IP addresses, or even to try to maintain a cheat sheet of IPs. Therefore, when analyzing network traffic in Wireshark, the “Resolve Network Addresses” view option is a lifesaver. At least, it is most of the time.

ResolveNames.PNG

Wireshark resolves those network addresses by performing a reverse zone lookup through DNS. If you try to inspect a capture file on an offline computer, or one not on the corporate network, network address resolution will fail. In addition, this lookup will only return the name associated with the A record, which means that if that address was resolved through SRV or CNAME records, the returned name may not be very helpful.

A perfect example I came across was a client computer attempting to find a server to receive LDAP traffic. The initial DNS query from the client was _ldap._tcp.windowslogon.domain.test, which returned SRV records connecting that service to srv1.domain.test on port 389 and A records connecting srv1.domain.test to an IP address. Using Wireshark’s name resolution, that IP address resolves to a random server address, and I don’t get the clue that it’s an LDAP connection used for Windows logon. This is especially confusing if the TCP ports used are nonstandard.

Script Requirements

I wanted a solution that would let me take the actual, in situ, DNS queries from the client displayed in the capture and connect those to the IP addresses that show up. Therefore, my script must parse DNS responses that showed up in the packet capture and connect the initial query through any chaining to the final IP address.

To accomplish this, I chose Scapy, the “Python-based interactive packet manipulation program & library,” based on a few blog posts I found. It’s important to note that packet dissection and analysis is not the primary goal for this library; it’s primarily meant for packet crafting. In fact, most of what you can find on StackOverflow or Google about Scapy revolves around using it to perform Man in the Middle attacks, ARP or DNS poisoning attacks, or other attacks built on packet manipulation. Because of this, the method by which Scapy stores packets, and the way it wants you to refer to different parts of each packet, is kind of strange.

Scapy’s Peculiarities

Scapy uses a nesting approach to storing packets, which does an admirable job matching the encapsulation that most networking protocols use. If you refer to packet[TCP], the returned data will include the TCP header and everything TCP encapsulates. However, it's not very useful to simply look at a packet with Scapy, because there is no output formatting by default.

samplePacket.PNG

In general, Scapy uses angular brackets (< and >) to denote the beginning and end of different sections, with specific fields separated by spaces, and displayed as field_name = field_value. Given this storage method, the best way to display a field in the packet is to refer to the section and field name. For example, the sequence number in a captured frame can be returned using packet[TCP].seq. For Scapy’s returned values to make any sense for packet analysis, it’s very important to refer to, and return, individual fields rather than entire headers.
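
As a quick illustration of that referencing style (a sketch that assumes a file named capture.pcap containing at least one TCP packet):

from scapy.all import rdpcap, TCP

packets = rdpcap('capture.pcap')  # hypothetical capture file
for packet in packets:
    if packet.haslayer(TCP):
        # Refer to the section and the field name to get a readable value
        print(packet[TCP].seq, packet[TCP].sport, packet[TCP].dport)
        break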

The point at which this becomes very confusing is in DNS responses. A DNS response packet has four primary sections: queries, answers, authoritative nameservers, and additional records. Not all of these are always populated, and each one of those sections can have multiple records in it. In fact, the DNS response header has fields that tell you how many values each one of those sections contains.

Based on how Scapy nests different protocols, you would expect that packet[DNS] will return the entire DNS section of the packet, and you should see fields that include qd (query), an (answer), ns (nameserver), and ar (additional record). Each one of those fields should contain an array (or list) of records. However, Scapy actually stores them nested, as shown for the nameserver section below:

ns=
    <DNSRR  
        rrname='ns.domain.test.' 
        type=NS 
        rclass=IN 
        ttl=3600 
        rdata='ns1.domain.test.' |
        <DNSRR  
            rrname='ns.domain.test.' 
            type=NS 
            rclass=IN 
            ttl=3600 
            rdata='ns2.domain.test.' |
            <DNSRR  
                rrname='ns.domain.test.' 
                type=NS 
                rclass=IN
                ttl=3600 
                rdata='ns3.domain.test.' 
                <DNSRR  
                    rrname='ns.domain.test.' 
                    type=NS 
                    rclass=IN 
                    ttl=3600 
                    rdata='ns4.domain.test.' |
                    <DNSRR  
                        rrname='ns.domain.test.' 
                        type=NS 
                        rclass=IN 
                        ttl=3600 
                        rdata='ns5.domain.test.' |
                    >
                >
            >
        >
    >

This means, somewhat unbelievably, that packet[DNS].ns[0] will return all the nameserver records, and packet[DNS].ns[4] will only return the last one. Confusing things even further, the section names are standardized to the record type and not the field, so the DNSRR (DNS response record) section name doesn’t consistently match with response records. A response that includes a SRV record will have a section name of DNSSRV. So, despite every other application of Scapy making it very easy to reference fields by packet[section_name].field_name, DNS responses completely break that mold.

Consistently Dissecting DNS Responses

My method to dissect DNS responses consistently makes heavy use of indices rather than alphanumeric section names. Because the DNS header reports the length of each of the four major sections, use those values to iterate through the information you need.

To iterate through all the records in the answers section, use:

for x in range(packet[DNS].ancount):

To then connect an IP address to the original query, use:

packet[DNS].an[x].rdata    # to return the IP address
packet[DNS].an[x].rrname   # to return the response record name
packet[DNS].qd.qname       # to return the original query name

Similar references can be used to iterate through the nameservers and additional records.

Building a dictionary of all DNS Responses

While my full script can be seen on Github, my general process to building a full dictionary mapping IP addresses to A records to DNS queries is as follows:

# For a given DNS packet, handle the case for an A record
# (ip_address_pattern is a regex for a dotted-quad IP address, defined earlier in the full script)
if packet[DNS].qd.qtype == 1:
    for x in range(packet[DNS].ancount):
        if re.match(ip_address_pattern, packet[DNS].an[x].rdata) is None:
            continue
        temp_dict = {packet[DNS].an[x].rdata: [packet[DNS].an[x].rrname, packet[DNS].qd.qname]}
# And repeat the same process for the additional records by substituting ar for an

The process for a SRV record (designated by packet[DNS].qd.qtype == 33) is identical, except I don’t even bother with the answers section; a sketch of that branch is below.
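
Here is what that SRV branch might look like (a sketch under the same assumptions as the A-record snippet above, with ip_address_pattern and temp_dict handling coming from the full script):

if packet[DNS].qd.qtype == 33:
    for x in range(packet[DNS].arcount):
        # Only additional records that resolve to an IP address matter here
        if re.match(ip_address_pattern, packet[DNS].ar[x].rdata) is None:
            continue
        temp_dict = {packet[DNS].ar[x].rdata: [packet[DNS].ar[x].rrname, packet[DNS].qd.qname]}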

Conclusion

Automated packet dissection is a real possibility with Scapy, provided you are willing to spend the time learning how Scapy stores data and effective ways of working around some of its limitations. This example of mapping DNS responses is an excellent introduction to Scapy itself, and I’m excited to see what I can do in the future if I can bake in other libraries that can give me statistical measurements, timing details, or even correlation between multiple packet captures showing the same conversations.