Analysis — Posts — Josh Clark

A common problem exists when investigating network issues through packet analysis. Web scale architecture provides ways to distribute workloads, and those mechanisms tend to abstract away silly things like the 5-tuple frequently used to identify individual conversations. Both proxies and load balancers can obscure a client’s IP address and source TCP port, and that makes it difficult to isolate a specific conversation across multiple networking devices.

The goal of this post is to describe an automated way to see if traffic is making it past the proxy, which can be extended to isolate the 5-tuple for the part of the conversation behind the proxy (source IP, source port, destination IP, destination port, TCP/UDP).

Architecture for this Scenario

To demonstrate how this can happen, consider a simple sample architecture for a basic load-balanced website.

Sample architecture depicting a user, a load balancer, and a pair of web servers

This assumes a few configuration options:

The client connects via HTTPS to the load balancer.
The load balancer performs SSL offload and the traffic within the data center is unencrypted.
The load balancer replaces the client's IP address with its own Virtual IP Address (VIP).

These conditions typically mean that the load balancer will pass along a client's IP address within an X-Forwarded-For field within the HTTP header. This means that if we have firewall or load balancer logs that show incoming connections, or we have customers calling in who can give us their IP address, it is possible to isolate their traffic behind the load balancer.

Significance of a Novel Technique

If we can go through and find the traffic, why am I even writing this post?

Most packet analysis tools (including Wireshark, Endace, and some of the various Riverbed tools) are really good at quickly filtering through protocol headers because they’re optimized for parsing that binary data. They are less efficient at filtering through an ASCII-formatted HTTP header. Even Scapy, which is a Python library made for messing with packets, isn’t well optimized for looking through packet payloads.

Proposed Technique

For this task, it turns out that simplicity is key. To begin, we need a known client IP address and a broad packet capture behind the load balancer or proxy (it is really easy for a 5-10 minute packet capture to reach 100+ GB in an enterprise environment).

Step 1: Confirm the traffic exists

It doesn’t help to expend time and compute power searching through the details of a massive packet capture file if we don’t know the traffic under investigation even exists. I’ve found that the fastest way to do so is to ignore all the protocol headers. It takes time to process binary, especially if Scapy is doing it, so we can save a lot of time by reading in the PCAP as a regular text file.

To save on system memory (because I don’t have 100GB of RAM in my computer), we read line-by-line.

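# pcap_fn is the path to the capture; target_ip is the client address under investigation
with open(pcap_fn, mode="rb") as pcap:
    print("PCAP Loaded")
    # Iterating over the file object reads one line at a time instead of loading the whole capture
    for line in pcap:
        iterate_basic_ip_check(line, target_ip)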

In the HTTP header, the fields are delimited by line breaks, so the X-Forwarded-For field we’re looking for appears on its own line using this technique, which allows us to match an IP address with some really simple regex.

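import re

# The capture is read as bytes, so the pattern is a bytes literal; the dots are escaped to match literally
xff_pattern = re.compile(br'X-Forwarded-For: (\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})')

def iterate_basic_ip_check(line, target):
    # match() anchors at the start of the line, which is where the header name sits
    match = xff_pattern.match(line)
    if match and match.group(1).decode("ascii") == target:
        print("Target IP %s found" % target)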

All put together, this script (available on GitHub) runs blazingly fast. I estimated 200+ MB/sec on my machine, and it’s possible to parallelize this workload to take advantage of multiple cores.
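As a rough illustration of how the two fragments fit together, here is a minimal, self-contained Python 3 sketch. It is not the script from the repository: the function name `count_forwarded_hits` and its return value are assumptions made for this example. It also assumes the header may carry a comma-separated list of addresses, which happens when a request passes through more than one proxy.

```python
import re

# Header name at the start of a line; the group captures the whole header value.
XFF_LINE = re.compile(rb"X-Forwarded-For:[ \t]*([^\r\n]*)", re.IGNORECASE)


def count_forwarded_hits(pcap_path, target_ip):
    """Count lines whose X-Forwarded-For list contains target_ip.

    Hypothetical helper: the capture is treated as a plain byte stream,
    with no protocol parsing at all.
    """
    target = target_ip.encode("ascii")
    hits = 0
    with open(pcap_path, "rb") as pcap:
        for line in pcap:  # streamed line by line, so memory use stays small
            match = XFF_LINE.match(line)
            if match is None:
                continue
            # The value may list several addresses separated by commas.
            addresses = [a.strip() for a in match.group(1).split(b",")]
            if target in addresses:
                hits += 1
    return hits
```

A non-zero result from `count_forwarded_hits("capture.pcap", "203.0.113.7")` would suggest the traffic made it past the load balancer; the file name and address here are placeholders.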

Step 2: Isolate the 5-tuple

Once we know the correct traffic exists, we can re-iterate using Scapy to identify the 5-tuple, or potentially multiple 5-tuples, used. This is left as an exercise for the reader (you can thank my engineering textbooks for teaching me this wonderful and horribly frustrating phrase).
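For readers who want a starting point, the sketch below shows one possible shape for this step. It deliberately swaps Scapy for the standard library so that it runs anywhere, and it rests on several assumptions: a classic pcap file (not pcapng), Ethernet framing without VLAN tags, IPv4, TCP, and a header that fits inside a single segment. The names `parse_frame` and `five_tuples_for_client` are made up for this example.

```python
import re
import socket
import struct

# Header name at the start of a line, capturing the comma-separated value.
XFF = re.compile(rb"^X-Forwarded-For:[ \t]*([^\r\n]*)",
                 re.MULTILINE | re.IGNORECASE)


def parse_frame(frame):
    """Return ((src, sport, dst, dport, proto), payload) for an
    Ethernet/IPv4/TCP frame, or None for anything else."""
    if len(frame) < 34 or frame[12:14] != b"\x08\x00":  # EtherType for IPv4
        return None
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4  # IPv4 header length in bytes
    if ihl < 20 or ip[9] != 6 or len(ip) < ihl + 20:  # protocol 6 is TCP
        return None
    tcp = ip[ihl:]
    sport, dport = struct.unpack("!HH", tcp[:4])
    data_offset = (tcp[12] >> 4) * 4  # TCP header length in bytes
    five_tuple = (socket.inet_ntoa(ip[12:16]), sport,
                  socket.inet_ntoa(ip[16:20]), dport, "TCP")
    return five_tuple, tcp[data_offset:]


def five_tuples_for_client(pcap_path, target_ip):
    """Collect 5-tuples of packets whose X-Forwarded-For names target_ip."""
    target = target_ip.encode("ascii")
    found = set()
    with open(pcap_path, "rb") as fh:
        header = fh.read(24)  # classic pcap global header
        if len(header) < 24:
            raise ValueError("file too short to be a pcap")
        if header[:4] in (b"\xd4\xc3\xb2\xa1", b"\x4d\x3c\xb2\xa1"):
            endian = "<"
        elif header[:4] in (b"\xa1\xb2\xc3\xd4", b"\xa1\xb2\x3c\x4d"):
            endian = ">"
        else:
            raise ValueError("not a classic pcap file")
        if struct.unpack(endian + "I", header[20:24])[0] != 1:
            raise ValueError("only Ethernet captures are handled")
        while True:
            record = fh.read(16)  # per-packet record header
            if len(record) < 16:
                break
            incl_len = struct.unpack(endian + "IIII", record)[2]
            parsed = parse_frame(fh.read(incl_len))
            if parsed is None:
                continue
            five_tuple, payload = parsed
            for match in XFF.finditer(payload):
                addresses = [a.strip() for a in match.group(1).split(b",")]
                if target in addresses:
                    found.add(five_tuple)
    return found
```

Each tuple in the returned set can then be turned into an ordinary display filter on addresses and ports, which is the kind of filtering the packet analysis tools handle quickly.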

Conclusion

If you find yourself in the position where an expected IP address is disappearing behind a proxy or load balancer, it is possible to process a fairly large amount of data to isolate the conversation in the next segment of the network as long as the HTTP header is exposed.

Introduction

Packet analysis to support troubleshooting is a big part of my job. In a company with hundreds of discrete applications, it is not reasonable to memorize IP addresses, or even to try to maintain a cheat sheet of IPs. Therefore, when analyzing network traffic in Wireshark, the “Resolve Network Addresses” view option is a lifesaver. At least, it is most of the time.

Wireshark resolves those network addresses by performing a reverse lookup through DNS. If you try to inspect a capture file on an offline computer, or one not on the corporate network, network address resolution will fail. In addition, this lookup will only return the host name that maps directly to the address (the name in the A record), which means that if the client reached that address through SRV or CNAME records, the returned name may not be very helpful.

A perfect example I came across was a client computer attempting to find a server to receive LDAP traffic. The initial DNS query from the client was _ldap._tcp.windowslogon.domain.test, which returned SRV records connecting that service to srv1.domain.test on port 389 and A records connecting srv1.domain.test to an IP address. Using Wireshark’s name resolution, that IP address resolves to a random server address, and I don’t get the clue that it’s an LDAP connection used for Windows logon. This is especially confusing if the TCP ports used are nonstandard.

Script Requirements

I wanted a solution that would let me take the actual, in situ, DNS queries from the client displayed in the capture and connect those to the IP addresses that show up. Therefore, my script must parse DNS responses that showed up in the packet capture and connect the initial query through any chaining to the final IP address.

To accomplish this, I chose Scapy, the “Python-based interactive packet manipulation program & library,” based on a few blog posts I found. It’s important to note that packet dissection and analysis is not the primary goal for this library; it’s primarily meant for packet crafting. In fact, most of what you can find on StackOverflow or Google about Scapy revolves around using it to perform Man in the Middle attacks, ARP or DNS poisoning attacks, or other attacks based on packet manipulation. Because of this, the method by which Scapy stores packets, and the way it wants you to refer to different parts of each packet, is kind of strange.

Scapy’s Peculiarities

Scapy uses a nesting approach to storing packets, which does an admirable job matching the encapsulation that most networking protocols use. If you refer to packet[TCP], the returned data will include the TCP header and everything TCP encapsulates. However, it's not very useful to simply look at a packet with Scapy, because there is no output formatting by default.

In general, Scapy uses angle brackets (< and >) to denote the beginning and end of different sections, with specific fields separated by spaces, and displayed as field_name = field_value. Given this storage method, the best way to display a field in the packet is to refer to the section and field name. For example, the sequence number in a captured frame can be returned using packet[TCP].seq. For Scapy’s returned values to make any sense for packet analysis, it’s very important to refer to, and return, individual fields rather than entire headers.

The point at which this becomes very confusing is in DNS responses. A DNS response packet has four primary sections: queries, answers, authoritative nameservers, and additional records. Not all of these are always populated, and each one of those sections can have multiple records in it. In fact, the DNS response header has fields that tell you how many values each one of those sections contains.
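For reference, those counts live in the fixed 12-byte header at the start of every DNS message: an ID, a flags word, and then four 16-bit counts in network byte order. The sketch below reads them with the standard library only. The function name is hypothetical, and it assumes the bytes passed in are a raw DNS message as carried over UDP (DNS over TCP adds a two-byte length prefix that would need to be skipped first).

```python
import struct


def dns_section_counts(message):
    """Return the four section counts from the 12-byte DNS header."""
    if len(message) < 12:
        raise ValueError("a DNS message starts with a 12-byte header")
    # ID, flags, then the question, answer, authority, and additional counts.
    _ident, _flags, qdcount, ancount, nscount, arcount = struct.unpack(
        "!HHHHHH", message[:12])
    return {"qd": qdcount, "an": ancount, "ns": nscount, "ar": arcount}
```

The dictionary keys mirror the section names Scapy uses (qd, an, ns, ar), so the counts line up with the fields discussed in this post.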

Based on how Scapy nests different protocols, you would expect that packet[DNS] will return the entire DNS section of the packet, and you should see fields that include qd (query), an (answer), ns (nameserver), and ar (additional record). Each one of those fields should contain an array (or list) of records. However, Scapy actually stores them nested, as shown for the nameserver section below:

ns=
    <DNSRR  
        rrname='ns.domain.test.' 
        type=NS 
        rclass=IN 
        ttl=3600 
        rdata='ns1.domain.test.' |
        <DNSRR  
            rrname='ns.domain.test.' 
            type=NS 
            rclass=IN 
            ttl=3600 
            rdata='ns2.domain.test.' |
            <DNSRR  
                rrname='ns.domain.test.' 
                type=NS 
                rclass=IN
                ttl=3600 
                rdata='ns3.domain.test.' |
                <DNSRR  
                    rrname='ns.domain.test.' 
                    type=NS 
                    rclass=IN 
                    ttl=3600 
                    rdata='ns4.domain.test.' |
                    <DNSRR  
                        rrname='ns.domain.test.' 
                        type=NS 
                        rclass=IN 
                        ttl=3600 
                        rdata='ns5.domain.test.' |
                    >
                >
            >
        >
    >

This means, somewhat unbelievably, that packet[DNS].ns[0] will return all the nameserver records, and packet[DNS].ns[4] will only return the last one. Confusing matters even further, the section names for these are standardized to the record type and not the field, so the DNSRR (DNS response record) section name doesn’t consistently match with response records. A response that includes a SRV record will have a section name of DNSSRV. So, despite every other application of Scapy making it very easy to reference fields by packet[section_name].field_name, DNS responses completely break that mold.
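The indexing behavior is easier to see with a toy model. The `Record` class below is a stand-in written for this example, not a Scapy class, and it only mimics the nesting described above: each record carries the next one as its payload, and indexing walks down the chain.

```python
class Record:
    """Toy stand-in for a chained DNS record (not a Scapy class)."""

    def __init__(self, rdata, payload=None):
        self.rdata = rdata
        self.payload = payload  # next record in the chain, or None

    def __getitem__(self, index):
        # Walk `index` steps down the chain. The record returned still
        # carries everything nested beneath it.
        layer = self
        for _ in range(index):
            if layer.payload is None:
                raise IndexError("no record at that depth")
            layer = layer.payload
        return layer

    def names(self):
        """Collect rdata for this record and every record nested below it."""
        collected, layer = [], self
        while layer is not None:
            collected.append(layer.rdata)
            layer = layer.payload
        return collected


# Build the five nested nameserver records from the listing, innermost first.
ns = None
for number in (5, 4, 3, 2, 1):
    ns = Record("ns%d.domain.test." % number, ns)
```

In this model, `ns[0].names()` lists all five nameservers while `ns[4].names()` lists only the last, and `ns[2].rdata` still picks out the single value for the third record. That last form is why the field-level references used in the rest of this post behave predictably.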

Consistently Dissecting DNS Responses

My method to dissect DNS responses consistently makes heavy use of indices rather than alphanumeric section names. Because the DNS header reports the length of each of the four major sections, use those values to iterate through the information you need.

To iterate through all the records in the answers section, use:

for x in range(packet[DNS].ancount):

To then connect an IP address to the original query, use:

packet[DNS].an[x].rdata    # to return the IP address
packet[DNS].an[x].rrname   # to return the response record name
packet[DNS].qd.qname       # to return the original query name

Similar references can be used to iterate through the nameservers and additional records.
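One way to avoid repeating that loop for each section is a small generator keyed on the section name. The sketch below is an assumption-laden illustration: `section_records` is a made-up helper, and it relies only on the attribute names used in this post (an, ns, ar and their matching ancount, nscount, arcount), so it can be exercised without Scapy installed. Records that do not expose an rdata field yield None for that value.

```python
def section_records(dns, section):
    """Yield (rrname, rdata) for each record in one section of a DNS layer.

    `dns` is expected to behave like packet[DNS] as described in this post;
    `section` is "an", "ns", or "ar".
    """
    count = getattr(dns, section + "count")  # e.g. ancount for "an"
    records = getattr(dns, section)
    for x in range(count):
        record = records[x]  # same indexed access as packet[DNS].an[x]
        yield record.rrname, getattr(record, "rdata", None)
```

With a parsed packet, `list(section_records(packet[DNS], "ar"))` would then cover the additional records in the same way as the answers.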

Building a dictionary of all DNS Responses

While my full script can be seen on GitHub, my general process for building a full dictionary mapping IP addresses to A records to DNS queries is as follows:

# For a given DNS packet, handle the case of an A record query (qtype 1)
if packet[DNS].qd.qtype == 1:
    for x in range(packet[DNS].ancount):
        # Skip answers whose rdata is not an IP address (for example, CNAME targets)
        if re.match(ip_address_pattern, packet[DNS].an[x].rdata) is None:
            continue
        # Map the IP address to its record name and the original query name
        temp_dict = {packet[DNS].an[x].rdata: [packet[DNS].an[x].rrname, packet[DNS].qd.qname]}
# Repeat the same process for the additional records by substituting ar for an

The process for a SRV record (designated by packet[DNS].qd.qtype == 33) is identical, except I don’t even bother with the answers section.
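The merging step can be sketched independently of Scapy. The version below is not the script from the repository: `add_response` and `is_ipv4` are hypothetical names, the records are assumed to have already been pulled out as (rrname, rdata) string pairs, and the standard `ipaddress` module stands in for the regular expression check.

```python
import ipaddress


def is_ipv4(text):
    """Return True when text parses as a dotted-quad IPv4 address."""
    try:
        ipaddress.IPv4Address(text)
        return True
    except ValueError:
        return False


def add_response(mapping, query_name, records):
    """Fold one DNS response into mapping as {ip: [record name, query name]}.

    `records` is an iterable of (rrname, rdata) string pairs taken from the
    answers section for an A query, or the additional records for a SRV query.
    """
    for rrname, rdata in records:
        if not is_ipv4(rdata):
            continue  # CNAME, NS, and similar records carry names, not addresses
        mapping[rdata] = [rrname, query_name]
    return mapping
```

Calling `add_response` once per response with the same dictionary builds up a single lookup table, so an address seen later in the capture can be traced back to the query that produced it.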

Conclusion

Automated packet dissection is a real possibility with Scapy, provided you are willing to spend the time learning how Scapy stores data and effective ways of working around some of its limitations. This example of mapping DNS responses is an excellent introduction to Scapy itself, and I’m excited to see what I can do in the future if I can bake in other libraries that can give me statistical measurements, timing details, or even correlation between multiple packet captures showing the same conversations.

Fast Filtering of Load Balanced or Proxied Connections

Dissecting DNS Responses With Scapy