COMPSCI 590K: Advanced Digital Forensics Systems | Spring 2020

16: BitTorrent on the Wire

(Note: I’ve disabled a bunch of stuff on purpose here to avoid getting into the protocol weeds. The BitTorrent spec has lots of options due to its evolution over the years, and some of them are interesting, but this is not a networking class, so I’m going to skip the details.)

Trackers

So let’s start with a BitTorrent client talking to a tracker, asking for peers (the spec calls this request an “announce”; I’ll loosely call it a “scrape”). Remember from last lecture (and look it up in the spec if you want all the details) that this is an HTTP GET transaction, where the request URL has a specific form that encodes the infohash (which in turn uniquely identifies the content described by the .torrent). The body of the HTTP response is a bencoded dictionary with various fields. Let’s dig it out of an example here.

Note there are some extraneous packets here. But there’s one HTTP request that looks like a scrape. If we use Wireshark to view the packet capture, we can spot it pretty easily. Then we can “follow” the HTTP transaction to see the relevant data.
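To help recognize that request in a capture, here is a minimal sketch of how a client might assemble the GET URL. The query parameter names (info_hash, peer_id, port, uploaded, downloaded, left, compact) come from the spec; the function name build_announce_url, the tracker URL, and the sample hash and peer ID are hypothetical and only there for illustration.

```python
from urllib.parse import quote, urlencode


def build_announce_url(tracker_url: str, info_hash: bytes, peer_id: bytes,
                       port: int = 51413, left: int = 0) -> str:
    """Build a tracker GET URL (hypothetical helper, not from any library)."""
    if len(info_hash) != 20 or len(peer_id) != 20:
        raise ValueError("info_hash and peer_id must each be 20 bytes")
    params = {
        "info_hash": info_hash,  # raw SHA-1 bytes, percent-encoded below
        "peer_id": peer_id,
        "port": port,
        "uploaded": 0,
        "downloaded": 0,
        "left": left,
        "compact": 1,            # ask for the 6-bytes-per-peer binary format
    }
    # quote_via=quote percent-encodes every byte outside the unreserved set.
    return tracker_url + "?" + urlencode(params, quote_via=quote)


# Made-up values, purely for illustration.
example_url = build_announce_url(
    "http://tracker.example/announce",
    info_hash=bytes(range(20)),
    peer_id=b"-XX0000-abcdefghijkl",
)
```

In a capture, the long run of percent-escapes after info_hash= is usually the easiest thing to spot.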

Now that we see it, how do we examine it?

One relatively straightforward way to get it out of Wireshark is to select it under “Line-based text data”, and then copy it as an “escaped string”. We can’t just copy it directly because the peers field is encoded as raw bytes. Having your system interpret it as text will result in it being translated as UTF-8 (or who knows what, it’s system dependent), and then you won’t be able to parse it without translating it back to binary the same way.
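To make that concrete, here is a small sketch of what goes wrong. The six sample bytes stand in for one compact peer entry; decoding with errors="replace" is only an assumption about what a text-handling layer might do, since real clipboards and terminals vary.

```python
raw = b"\x60\xec\x71\x6d\xc8\xd5"  # six raw bytes: IPv4 address plus port

# Pretend some layer treated the bytes as UTF-8 text. These bytes are not
# valid UTF-8, so the invalid ones are swapped for U+FFFD.
as_text = raw.decode("utf-8", errors="replace")
round_trip = as_text.encode("utf-8")

# Latin-1 maps every byte value to exactly one code point, so it happens
# to survive a round trip. Nothing guarantees your system picks it.
latin1_round_trip = raw.decode("latin-1").encode("latin-1")

print(round_trip == raw)         # False: the original bytes are gone
print(latin1_round_trip == raw)  # True
```

The point is that the round trip only works if both directions use the same byte-preserving encoding, which is exactly what the escaped-string copy sidesteps.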

Now, I’ll use a Python bencode library to decode it, but you can use almost any bencode library in almost any language to do this. Notice that the copy/pasted string is in a pretty generic format that’s valid Python (and many other languages).

x = "\x64\x38\x3a\x63\x6f\x6d\x70\x6c\x65\x74\x65\x69\x31\x65\x31\x30" \
"\x3a\x64\x6f\x77\x6e\x6c\x6f\x61\x64\x65\x64\x69\x32\x65\x31\x30" \
"\x3a\x69\x6e\x63\x6f\x6d\x70\x6c\x65\x74\x65\x69\x31\x65\x38\x3a" \
"\x69\x6e\x74\x65\x72\x76\x61\x6c\x69\x31\x36\x38\x38\x65\x31\x32" \
"\x3a\x6d\x69\x6e\x20\x69\x6e\x74\x65\x72\x76\x61\x6c\x69\x38\x34" \
"\x34\x65\x35\x3a\x70\x65\x65\x72\x73\x31\x32\x3a\x60\xec\x71\x6d" \
"\xc8\xd5\x6d\xca\x6f\xec\xc8\xd5\x65"
import bencode
bencode.bdecode(x)

Uh-oh. Let’s look at x:

d8:completei1e10:downloadedi2e10:incompletei1e8:intervali1688e12:min intervali844e5:peers12:`ìqmÈÕmÊoìÈÕe

If you read the bencode spec, you’ll see that’s mostly a valid bencoded dictionary, but the values associated with peers have been Unicode-ified and are thus corrupt. Oops. Let’s modify the copy/paste to make sure the escaped values are stored as a byte literal and not translated into a string (which is what Python was doing for us before). In particular, I’m going to prefix each part of the escaped “string” with a 'b' to indicate that it’s to be interpreted as bytes, not a string:

x = b"\x64\x38\x3a\x63\x6f\x6d\x70\x6c\x65\x74\x65\x69\x31\x65\x31\x30" \
b"\x3a\x64\x6f\x77\x6e\x6c\x6f\x61\x64\x65\x64\x69\x32\x65\x31\x30" \
b"\x3a\x69\x6e\x63\x6f\x6d\x70\x6c\x65\x74\x65\x69\x31\x65\x38\x3a" \
b"\x69\x6e\x74\x65\x72\x76\x61\x6c\x69\x31\x36\x38\x38\x65\x31\x32" \
b"\x3a\x6d\x69\x6e\x20\x69\x6e\x74\x65\x72\x76\x61\x6c\x69\x38\x34" \
b"\x34\x65\x35\x3a\x70\x65\x65\x72\x73\x31\x32\x3a\x60\xec\x71\x6d" \
b"\xc8\xd5\x6d\xca\x6f\xec\xc8\xd5\x65"
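Before handing the bytes to a library, it may help to see how little machinery bencode needs. The following is a minimal sketch of a decoder covering the four bencode forms (integers, byte strings, lists, dictionaries). The name bdecode_min is hypothetical, error handling is thin, and this is not how any particular library implements it.

```python
def bdecode_min(data: bytes, pos: int = 0):
    """Decode one bencoded value at data[pos]; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":                       # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                       # list: l<item>...e
        items, pos = [], pos + 1
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode_min(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                       # dictionary: d<key><value>...e
        result, pos = {}, pos + 1
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode_min(data, pos)
            value, pos = bdecode_min(data, pos)
            result[key] = value
        return result, pos + 1
    if lead.isdigit():                     # byte string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError(f"unexpected byte at offset {pos}")


# A shortened, made-up response in the same shape as the one above.
sample = b"d8:completei1e5:peers6:\x60\xec\x71\x6d\xc8\xd5e"
decoded, end = bdecode_min(sample)
```

Because byte strings are length-prefixed, the decoder never has to inspect the raw peer bytes, which is why they can contain anything, including a byte that looks like the terminating 'e'.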

Now we can decode the bytes:

bencode.bdecode(x)

{b'complete': 1,
 b'downloaded': 2,
 b'incomplete': 1,
 b'interval': 1688,
 b'min interval': 844,
 b'peers': b'`\xecqm\xc8\xd5m\xcao\xec\xc8\xd5'}

It’s a dictionary, where the keys are bytestrings (note the leading b). We want the peers key’s value in particular. According to the spec, it’s “a string consisting of multiples of 6 bytes. First 4 bytes are the IP address and last 2 bytes are the port number. All in network (big endian) notation.”

result = bencode.bdecode(x)
peers = result[b'peers']
len(peers)

12 bytes, so two peers, each represented as six bytes. Let’s slice out the first six bytes and interpret them. We’ll use a ready-made function from socket and another from struct:

import socket
import struct

p = peers[0:6]                 # first peer: 4-byte IPv4 address + 2-byte port
socket.inet_ntoa(p[0:4])       # -> returns '96.236.113.109'
struct.unpack('>H', p[4:6])    # -> returns a tuple (51413,)
struct.unpack('>H', p[4:6])[0] # -> returns just 51413

Try it on the second IP/port yourself; you should get 109.202.111.236 on the same port. Note that this port is the default that Transmission (a BitTorrent client) uses if you don’t have port randomization on.
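The same slicing generalizes to a peers value of any length. Here is a minimal sketch that walks the value six bytes at a time; the function name parse_compact_peers is hypothetical, and it assumes IPv4-only compact entries as described in the spec text quoted above.

```python
import socket
import struct


def parse_compact_peers(peers: bytes):
    """Split a compact peers value into a list of (ip, port) tuples."""
    if len(peers) % 6 != 0:
        raise ValueError("compact peers value must be a multiple of 6 bytes")
    result = []
    for offset in range(0, len(peers), 6):
        ip = socket.inet_ntoa(peers[offset:offset + 4])
        # '>H' is an unsigned 16-bit integer in big-endian (network) order.
        (port,) = struct.unpack(">H", peers[offset + 4:offset + 6])
        result.append((ip, port))
    return result


peer_list = parse_compact_peers(b"`\xecqm\xc8\xd5m\xcao\xec\xc8\xd5")
# -> [('96.236.113.109', 51413), ('109.202.111.236', 51413)]
```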

Peers

Parsing messages between peers is a little hairier. In the default BitTorrent protocol, each side must send the other a handshake message (exactly once). Then they send a sequence of as many other messages as they like that can take different forms.

This is complicated by the fact that there are several extensions to BitTorrent that define, in essence, their own protocols on top of this (at least one of them even switches to UDP, leaving the TCP stream entirely). We’re just going to very briefly look at the default BitTorrent messages here; a full forensic tool would need to consider all the common extensions.

Let’s briefly discuss the formats of these messages. When inside the .torrent files and when talking to the tracker, most messages are bencoded, but in the peer-to-peer messages, things are generally in a raw binary format. If you’ve dealt with parsing binary values before you should know how to do this in your preferred programming language. If you were writing a real parser you would probably take a more sophisticated approach – at least wrapping things up into logical functions, or maybe using a library like Construct, or maybe even something like Kaitai. I’ll just show things using Wireshark, which does an OK job of parsing BitTorrent messages.
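As a taste of what “wrapping things up into logical functions” might look like, here is a minimal sketch that splits a peer byte stream into messages. It assumes the spec’s framing for post-handshake messages (a 4-byte big-endian length prefix, then a 1-byte message ID, then the payload; a length of zero is a keep-alive), that the handshake has already been removed, and that the input is the reassembled TCP payload from one direction. The name split_messages is hypothetical, and payloads are left unparsed.

```python
import struct

# Message IDs from the base protocol, plus 20 for extension messages.
MESSAGE_NAMES = {
    0: "choke", 1: "unchoke", 2: "interested", 3: "not interested",
    4: "have", 5: "bitfield", 6: "request", 7: "piece", 8: "cancel",
    9: "port", 20: "extended",
}


def split_messages(stream: bytes):
    """Yield (name, payload) for each complete message in the stream."""
    pos = 0
    while pos + 4 <= len(stream):
        (length,) = struct.unpack(">I", stream[pos:pos + 4])
        pos += 4
        if length == 0:                  # keep-alive: no ID, no payload
            yield ("keep-alive", b"")
            continue
        if pos + length > len(stream):   # truncated capture: stop quietly
            break
        msg_id = stream[pos]
        payload = stream[pos + 1:pos + length]
        yield (MESSAGE_NAMES.get(msg_id, f"unknown({msg_id})"), payload)
        pos += length


# Hand-built example: keep-alive, interested, have(piece index 7).
example = (b"\x00\x00\x00\x00"
           b"\x00\x00\x00\x01\x02"
           b"\x00\x00\x00\x05\x04\x00\x00\x00\x07")
messages = list(split_messages(example))
```

A real tool would also have to cope with encrypted streams and the extension protocols mentioned above, which this sketch ignores.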

Handshakes

So according to the spec the handshake looks like this:

<pstrlen><pstr><reserved><info_hash><peer_id>

where:

  • pstrlen: string length of <pstr>, as a single raw byte
  • pstr: string identifier of the protocol
  • reserved: eight (8) reserved bytes. All current implementations use all zeroes. Each bit in these bytes can be used to change the behavior of the protocol. An email from Bram suggests that trailing bits should be used first, so that leading bits may be used to change the meaning of trailing bits.
  • info_hash: 20-byte SHA1 hash of the info key in the metainfo file. This is the same info_hash that is transmitted in tracker requests.
  • peer_id: 20-byte string used as a unique ID for the client. This is usually the same peer_id that is transmitted in tracker requests (but not always e.g. an anonymity option in Azureus).

In version 1.0 of the BitTorrent protocol, pstrlen = 19, and pstr = “BitTorrent protocol”.
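If you wanted to pull these fields out by hand rather than in Wireshark, the layout above translates almost directly into a struct format string. This is a minimal sketch; the Handshake tuple and parse_handshake function are hypothetical names, and the sample info_hash and peer_id are made up.

```python
import struct
from typing import NamedTuple


class Handshake(NamedTuple):
    pstr: bytes
    reserved: bytes
    info_hash: bytes
    peer_id: bytes


def parse_handshake(data: bytes) -> Handshake:
    """Parse <pstrlen><pstr><reserved><info_hash><peer_id> from raw bytes."""
    if not data:
        raise ValueError("empty handshake")
    pstrlen = data[0]                    # single raw byte
    total = 1 + pstrlen + 8 + 20 + 20    # 68 bytes when pstrlen is 19
    if len(data) < total:
        raise ValueError("handshake is truncated")
    fields = struct.unpack(f"!{pstrlen}s8s20s20s", data[1:total])
    return Handshake(*fields)


# Hand-built example with made-up hash and peer ID.
raw_handshake = (bytes([19]) + b"BitTorrent protocol" + bytes(8)
                 + bytes(range(20)) + b"-XX0000-abcdefghijkl")
hs = parse_handshake(raw_handshake)
```

For forensic purposes the info_hash is usually the interesting field, since it ties the connection back to a specific .torrent.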

And you can see Wireshark is able to parse these out from a packet capture. The same goes for most other messages, though you’ll see some that are pretty opaque, for example, messages of type “Extended”. These are protocol extensions to the BitTorrent wire protocol, and it looks like my version of Wireshark doesn’t know how to parse them.
