Efficient Administration of Network Appliances

Beyond troubleshooting and packet analysis, part of my job is administration of the network monitoring tools we use to perform troubleshooting and packet analysis. One vendor we rely on quite heavily is Endace for their line of packet capture appliances. As part of my work with their gear, I’ve learned how to script things out for efficiency.

For capacity planning purposes, I’ve automated the process of obtaining and formatting file system and packet capture information. Additionally, Endace recently released a hotfix for all of their devices, and I scripted the deployment so it could run across every device concurrently.

Overview of Endace Devices

The Endace architecture is relatively simple. There are probes distributed throughout the network that capture and store packets that are all managed by a Central Management Server (CMS).

The probes store databases full of packet data within an abstraction they call rotation files. These act like a ring buffer, overwriting the oldest 4GB block when the buffer fills up.

For most tasks on the system administration side, the CMS can distribute commands to all probes, groups of probes, or individual probes.

However, there are some cases where CMS orchestration doesn’t work, and hotfix installation is one of them. Hotfixes require the device to be in maintenance mode, which kills most of the client processes on the system, including the RabbitMQ client that the CMS orchestration mechanism depends on. If I have 20+ devices to patch, each requiring 2-3 minutes of work, not counting the wait for a reboot to complete, I really don’t want to do that all in serial (as opposed to, and including, over serial) or manually. That’s where Python and netmiko come into play.

Netmiko Support for Endace

For those unfamiliar, netmiko is a Python library for network scripting. It adds a layer of abstraction on top of paramiko, the Python SSH library, to dumb things down enough for network engineers like me to work effectively. It allows us to send commands and view output without worrying about buffering input or output, figuring out when we can send more information, or writing regex to figure out whether we’re in enable mode or configuration mode.

Since Endace is a relatively small vendor (as compared to Cisco or Juniper), netmiko doesn’t support it by default. Fortunately, Kirk Byers, the creator of netmiko, makes it really easy to add support for new operating systems. It took me a couple of days to build in support (available here). It took me a few more days to figure out how to make threading integrate with it to perform tasks on multiple devices in parallel.

At the end of that journey, I had a system to effectively distribute commands and collect information. For example, I wrote a quick script to extract information about the file systems and rotation files to CSV files.

Extract Important Information

This script has three main parts. After initializing the CSV files, it iterates through the devices:

from netmiko import ConnectHandler
import EndaceDevices  # Local module holding the netmiko device dictionaries

for device in EndaceDevices.prod_probes:
    connection = ConnectHandler(**device)
    fsInfo = getFileSystemInfo(connection)
    rotInfo = getRotFileInfo(connection)
    connection.disconnect()
    print_output_to_csv(fsInfo, fscsvName, fsTemplate)
    print_output_to_csv(rotInfo, rotcsvName, rotTemplate)

First, it connects to the device and runs the relevant show commands.

def getFileSystemInfo(connection):
    connection.enable() # Note how netmiko provides a method to enter enable mode
    output = connection.find_prompt() # This is just here to output the device name
    output += "\n"
    output += connection.send_command("show files system")
    return output

def getRotFileInfo(connection):
    connection.enable()
    output = connection.find_prompt()
    output += "\n"
    output += connection.send_command("show erfstream rotation-file")
    return output

Then, it uses TextFSM to parse the output. TextFSM is a Python library developed by Google that applies a template of regular expressions to text that appears in a consistent format and returns structured records. For example, the file systems template looks like:

Value Filldown Device (\S+)
Value Required TotalCapacity (\d+\.?\d*)
Value UsedCapacity (\d+\.?\d*)
Value UsableCapacity (\d+\.?\d*)
Value FreeCapacity (\d+\.?\d*)

Start
  ^${Device} #
  ^\s+Bytes Total\s+${TotalCapacity}
  ^\s+Bytes Used\s+${UsedCapacity}
  ^\s+Bytes Usable\s+${UsableCapacity}
  ^\s+Bytes Free\s+${FreeCapacity} -> Record

The top half of the template defines the variables I want to capture in the format Value <optional modifier> <Name> (<regex>).

The modifier Filldown copies that value to every entry in that file. In this case, it isn't very useful since I'm only extracting one partition, but if I wanted every partition, Filldown would copy the device name to the entry of every partition.

Required is an important modifier: at least one value should be designated Required, because it marks what constitutes a valid entry. If that value doesn't appear in a block of text, TextFSM skips the entire record.

The second part of the template describes the text block as a regular expression.
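To make the template concrete, here is a rough plain-Python approximation of what TextFSM does with it, using only the re module. The sample CLI output below is invented for illustration; real "show files system" output may differ.

```python
import re

# Invented sample of what the collected output might look like
sample = """probe1 #
  Bytes Total   1000.5
  Bytes Used    600.25
  Bytes Usable  900.0
  Bytes Free    400.25
"""

patterns = {
    "Device": re.compile(r"^(\S+) #"),
    "TotalCapacity": re.compile(r"^\s+Bytes Total\s+(\d+\.?\d*)"),
    "UsedCapacity": re.compile(r"^\s+Bytes Used\s+(\d+\.?\d*)"),
    "UsableCapacity": re.compile(r"^\s+Bytes Usable\s+(\d+\.?\d*)"),
    "FreeCapacity": re.compile(r"^\s+Bytes Free\s+(\d+\.?\d*)"),
}

record = {}
rows = []
for line in sample.splitlines():
    for name, pattern in patterns.items():
        m = pattern.match(line)
        if m:
            record[name] = m.group(1)
    # "-> Record" in the template: emit a row once the last value is seen
    if "FreeCapacity" in record:
        rows.append([record.get(k, "") for k in patterns])
        record = {"Device": record["Device"]}  # Filldown keeps the device name

print(rows)  # [['probe1', '1000.5', '600.25', '900.0', '400.25']]
```

This is only a sketch of the behavior; TextFSM's real state machine handles multiple states, list values, and other modifiers the hand-rolled loop ignores.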

Once this template is in place, all we need to do to turn a raw output string into a well-formed list is:

import textfsm

with open(templateFilename, "r") as template:
    re_table = textfsm.TextFSM(template)
data = re_table.ParseText(output)

Finally, I use csv.writer to push it out to a CSV file:

import csv

with open(filename, 'a', newline='') as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in data:
        writer.writerow(row)
# The with block closes the file automatically
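As a quick sanity check of the CSV output, the same pattern can be exercised end to end with made-up rows and a temporary file:

```python
import csv
import os
import tempfile

# Made-up parsed rows standing in for TextFSM's output
data = [
    ["probe1", "1000.5", "600.25", "900.0", "400.25"],
    ["probe2", "2000.0", "150.0", "1800.0", "1850.0"],
]

fd, filename = tempfile.mkstemp(suffix=".csv")
os.close(fd)

# Append rows, exactly as in the script above
with open(filename, "a", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in data:
        writer.writerow(row)

# Read them back to confirm the round trip
with open(filename, newline="") as csvfile:
    rows = list(csv.reader(csvfile))

os.remove(filename)
print(rows == data)  # True
```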

Deploy an Endace Hotfix

There are four basic steps to applying most hotfixes with Endace. First, the hotfix file needs to get onto the device. Second, the device goes into maintenance mode. Next, the hotfix is installed. Finally, the device reboots.

I use pysftp to get the hotfix file to the device:

import pysftp

def transfer_via_sftp(filepath, hostname, user, pw):
    opts = pysftp.CnOpts()
    opts.hostkeys = None  # Skip host key verification
    conn = pysftp.Connection(host=hostname, username=user, password=pw, cnopts=opts)
    conn.chdir("/endace/packages")
    conn.put(filepath)  # Upload the file
    # pysftp's exists() does an ls and compares the given string against the output
    if conn.exists("OSm6.4.x-CumulativeHotfix-v1.end"):
        conn.close()
        print("tx completed successfully for host " + hostname)
        return True
    else:
        conn.close()
        print("tx failed. Upload manually for host " + hostname)
        return False
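The exists() check only confirms that a file with the right name landed, not that it arrived intact. A sturdier verification would compare checksums. Below is a sketch of the local half of that idea; whether and how the Endace CLI reports a file checksum is not something I'm assuming here, so the remote side is left out.

```python
import hashlib
import os
import tempfile

def sha256_of_file(filepath, chunk_size=65536):
    """Hash the hotfix file in chunks, without reading it all into memory."""
    digest = hashlib.sha256()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Demo with a stand-in file; in practice this would be the hotfix package
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"pretend hotfix contents")

local_digest = sha256_of_file(path)
os.remove(path)

# The remote digest would come from a checksum command on the appliance;
# the transfer is good when the two hex strings match.
print(len(local_digest))  # 64 hex characters
```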

I didn’t format my code exceptionally well, so the next three steps are a little muddled:

def enter_maintenance_mode(connection):
    connection.config_mode()
    output = connection.send_command("maintenance-mode noconfirm", strip_prompt=False)
    # In maintenance mode, the cli prompt looks like:
    # (maintenance) <device_name> (configuration) #
    if "(maintenance)" in output:
        return True
    else:
        print(output)
        print("issue obtaining maintenance mode")
        return False

def install_hotfix(hotfix_name_full, hfn_installed, device):
    conn = ConnectHandler(**device)
    if not enter_maintenance_mode(conn):  # Step 2: enter maintenance mode
        print("could not enter maintenance mode on host " + device["host"] + ". Failing gracefully")
        conn.close()
        return False
    conn.send_command("package install " + hotfix_name_full)  # Step 3: install hotfix
    output = conn.send_command("show packages")
    if hfn_installed in output:
        print("successfully installed on host " + device["host"] + ". rebooting")
        conn.save_config()
        # send_command_timing() doesn't wait for the cli prompt to return
        conn.send_command_timing("reload")  # Step 4: reboot
        return True
    else:
        print("possible issue with install. Troubleshoot host " + device["host"])
        return False

Executing in Parallel

The above code snippets perform all the required tasks to install the hotfix, but I want to go one step farther and execute those steps in parallel across many devices. To do this, I use the threading library:

import threading

thread_list = []
devices = EndaceDevices.all_devices

for dev in devices:
    thread_list.append(threading.Thread(target=transfer_via_sftp, args=(path, dev["host"], dev["username"], dev["password"])))

for thread in thread_list:
    thread.start()

for thread in thread_list:
    thread.join()

print("File transferred everywhere")
thread_list = []

for dev in devices:
    thread_list.append(threading.Thread(target=install_hotfix, args=(hf_name_full, hf_name_as_installed, dev)))

for thread in thread_list:
    thread.start()

for thread in thread_list:
    thread.join()

print("Hotfix installed everywhere. Monitor for devices coming back up")

First, I create all the threads. Then, I start them. Finally, I wait for them all to return. And then I can repeat for the next function I want to parallelize.
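One limitation of this plain threading approach is that Thread objects discard each function's return value, so a failed transfer or install is visible only in the printed output. As a possible refinement, concurrent.futures can collect those booleans; here is a sketch with a stand-in worker function in place of install_hotfix:

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for install_hotfix(); returns whether the device succeeded
def fake_install(host):
    return host != "probe3"  # Pretend one device fails

hosts = ["probe1", "probe2", "probe3", "probe4"]

# map() preserves input order, so zip pairs each host with its result
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(zip(hosts, pool.map(fake_install, hosts)))

failed = [h for h, ok in results.items() if not ok]
print(failed)  # ['probe3']
```

With the real functions plugged in, the failed list becomes the punch list of devices to patch manually.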

The full script is available on GitHub.

Conclusion

Python, with netmiko, is a worthwhile tool to increase the efficiency of collecting information or pushing patches or configuration. If netmiko doesn’t support your device, it’s very easy to extend it. TextFSM is a very useful tool to process strings into useful, structured information.

Dissecting DNS Responses With Scapy

Introduction

Packet analysis to support troubleshooting is a big part of my job. In a company with hundreds of discrete applications, it is not reasonable to memorize IP addresses, or even to try to maintain a cheat sheet of IPs. Therefore, when analyzing network traffic in Wireshark, the “Resolve Network Addresses” view option is a lifesaver. At least, it is most of the time.

[Image: ResolveNames.PNG]

Wireshark resolves those network addresses by performing a reverse zone lookup through DNS. If you try to inspect a capture file on an offline computer, or one not on the corporate network, network address resolution will fail. In addition, this lookup will only return the name associated with the A record, which means that if that address was resolved through SRV or CNAME records, the returned name may not be very helpful.

A perfect example I came across was a client computer attempting to find a server to receive LDAP traffic. The initial DNS query from the client was _ldap._tcp.windowslogon.domain.test, which returned SRV records connecting that service to srv1.domain.test on port 389, and A records connecting srv1.domain.test to an IP address. Using Wireshark’s name resolution, that IP address resolves to a random server address, and I don’t get the clue that it’s an LDAP connection used for Windows logon. This is especially confusing if the TCP ports used are nonstandard.

Script Requirements

I wanted a solution that would let me take the actual, in situ DNS queries from the client displayed in the capture and connect those to the IP addresses that appear. Therefore, my script must parse the DNS responses in the packet capture and connect the initial query, through any chaining, to the final IP address.

To accomplish this, I chose Scapy, the “Python-based interactive packet manipulation program & library,” based on a few blog posts I found. It’s important to note that packet dissection and analysis is not the primary goal of this library; it’s primarily meant for packet crafting. In fact, most of what you can find on StackOverflow or Google about Scapy revolves around using it to perform man-in-the-middle attacks, ARP or DNS poisoning attacks, or other attacks built on packet manipulation. Because of this, the method by which Scapy stores packets, and the way it wants you to refer to different parts of each packet, is kind of strange.

Scapy’s Peculiarities

Scapy uses a nesting approach to storing packets, which does an admirable job matching the encapsulation that most networking protocols use. If you refer to packet[TCP], the returned data will include the TCP header and everything TCP encapsulates. However, it's not very useful to simply look at a packet with Scapy, because there is no output formatting by default.

[Image: samplePacket.PNG]

In general, Scapy uses angle brackets (< and >) to denote the beginning and end of different sections, with specific fields separated by spaces and displayed as field_name = field_value. Given this storage method, the best way to display a field in the packet is to refer to the section and field name. For example, the sequence number in a captured frame can be returned using packet[TCP].seq. For Scapy’s returned values to make any sense for packet analysis, it’s very important to refer to, and return, individual fields rather than entire headers.

The point at which this becomes very confusing is in DNS responses. A DNS response packet has four primary sections: queries, answers, authoritative nameservers, and additional records. Not all of these are always populated, and each one of those sections can have multiple records in it. In fact, the DNS response header has fields that tell you how many values each one of those sections contains.

Based on how Scapy nests different protocols, you would expect that packet[DNS] will return the entire DNS section of the packet, and you should see fields that include qd (query), an (answer), ns (nameserver), and ar (additional record). Each one of those fields should contain an array (or list) of records. However, Scapy actually stores them nested, as shown for the nameserver section below:

ns=
    <DNSRR  
        rrname='ns.domain.test.' 
        type=NS 
        rclass=IN 
        ttl=3600 
        rdata='ns1.domain.test.' |
        <DNSRR  
            rrname='ns.domain.test.' 
            type=NS 
            rclass=IN 
            ttl=3600 
            rdata='ns2.domain.test.' |
            <DNSRR  
                rrname='ns.domain.test.' 
                type=NS 
                rclass=IN
                ttl=3600 
                rdata='ns3.domain.test.' |
                <DNSRR  
                    rrname='ns.domain.test.' 
                    type=NS 
                    rclass=IN 
                    ttl=3600 
                    rdata='ns4.domain.test.' |
                    <DNSRR  
                        rrname='ns.domain.test.' 
                        type=NS 
                        rclass=IN 
                        ttl=3600 
                        rdata='ns5.domain.test.' |
                    >
                >
            >
        >
    >

This means, somewhat unbelievably, that packet[DNS].ns[0] will return all the nameserver records, and packet[DNS].ns[4] will return only the last one. Confusing things even further, the section names are standardized to the record type rather than to the field, so the DNSRR (DNS resource record) section name doesn’t consistently match response records. A response that includes an SRV record will have a section name of DNSSRV. So, despite every other application of Scapy making it very easy to reference fields by packet[section_name].field_name, DNS responses completely break that mold.
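The indexing behavior is easier to picture if you think of each record as carrying the rest of the chain as its payload. The following toy class is not Scapy's actual implementation, just an analogy for why indexing from position 0 returns the whole remaining chain:

```python
class Record:
    """Toy analogy: each record holds the rest of the chain as its payload."""
    def __init__(self, rdata, payload=None):
        self.rdata = rdata
        self.payload = payload

    def __getitem__(self, i):
        # Indexing walks the chain: chain[0] is the entire remaining chain
        node = self
        for _ in range(i):
            node = node.payload
        return node

# Build the chain so ns5 is innermost and ns1 outermost, like the dump above
chain = None
for name in ["ns5", "ns4", "ns3", "ns2", "ns1"]:
    chain = Record(name + ".domain.test.", chain)

print(chain[0].rdata)  # ns1.domain.test. (with everything nested beneath it)
print(chain[4].rdata)  # ns5.domain.test. (the last record alone)
```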

Consistently Dissecting DNS Responses

My method to dissect DNS responses consistently makes heavy use of indices rather than alphanumeric section names. Because the DNS header reports the length of each of the four major sections, use those values to iterate through the information you need.

To iterate through all the records in the answers section, use:

for x in range(packet[DNS].ancount):

To then connect an IP address to the original query, use:

packet[DNS].an[x].rdata    # to return the IP address
packet[DNS].an[x].rrname   # to return the response record name
packet[DNS].qd.qname       # to return the original query name

Similar references can be used to iterate through the nameservers and additional records.

Building a dictionary of all DNS Responses

While my full script can be seen on GitHub, my general process for building a full dictionary mapping IP addresses to A records to DNS queries is as follows:

# For a given DNS packet, handle the case for an A record (qtype 1)
if packet[DNS].qd.qtype == 1:
    for x in range(packet[DNS].ancount):
        if re.match(ip_address_pattern, packet[DNS].an[x].rdata) is None:
            continue  # Skip records whose rdata isn't an IP address
        temp_dict = {packet[DNS].an[x].rdata: [packet[DNS].an[x].rrname, packet[DNS].qd.qname]}
# And repeat the same process for the additional records by substituting ar for an

The process for an SRV record (designated by packet[DNS].qd.qtype == 33) is identical, except I don’t even bother with the answers section.
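Stubbed out with plain objects in place of Scapy's packet layers, the mapping logic can be exercised on its own. The record values and the IP regex below are invented for illustration:

```python
import re
from types import SimpleNamespace as NS

# Simple pattern for dotted-quad rdata (illustrative, not a strict validator)
ip_address_pattern = r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$"

# Mocked stand-ins for the packet[DNS] fields used above
dns = NS(
    qd=NS(qname="srv1.domain.test.", qtype=1),
    ancount=2,
    an=[
        NS(rrname="srv1.domain.test.", rdata="alias.domain.test."),  # not an IP; skipped
        NS(rrname="srv1.domain.test.", rdata="10.0.0.5"),
    ],
)

name_map = {}
for x in range(dns.ancount):
    if re.match(ip_address_pattern, dns.an[x].rdata) is None:
        continue
    # Map IP -> [response record name, original query name]
    name_map[dns.an[x].rdata] = [dns.an[x].rrname, dns.qd.qname]

print(name_map)  # {'10.0.0.5': ['srv1.domain.test.', 'srv1.domain.test.']}
```

Merging one of these per-packet dictionaries into a capture-wide map is then a matter of dict.update() calls across all the DNS responses in the file.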

Conclusion

Automated packet dissection is a real possibility with Scapy, provided you are willing to spend the time learning how Scapy stores data and effective ways of working around some of its limitations. This example of mapping DNS responses is an excellent introduction to Scapy itself, and I’m excited to see what I can do in the future if I can bake in other libraries that provide statistical measurements, timing details, or even correlation between multiple packet captures showing the same conversations.