COMPSCI 590K: Advanced Digital Forensics Systems | Spring 2020

15: BitTorrent

Stuff we didn’t quite get to last class:

Gov’t actors are bound by law. 4A requirements; fruit of poisonous tree. Leads/hearsay vs evidence. Wiretapping has a higher standard than searches!

Back to BitTorrent

No built-in searches; it’s just for file distribution.

Collections of files are described by .torrent files. At a minimum, a .torrent includes file names, sizes, and SHA-1 hash values for power-of-two-sized pieces of the concatenated file set, plus URLs for trackers.

A .torrent is identified by its infohash: the SHA-1 hash of the fixed fields within the torrent that identify the files being distributed: the filenames, sizes, piece size, and piece hashes.

A peer tells the tracker that it wants to download this “infohash”, and gets back (a subset of) the other peers currently involved in distributing the same torrent. IPs and GUIDs are used, but BitTorrent GUIDs are very transient.

To down/upload a file, the peer then connects to other peers. They exchange a list of the pieces each possesses, then request pieces from each other (and send them). It's tit-for-tat, so you can only “leech” if the other peer doesn’t have better options – otherwise you get “choked”.
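The "list of pieces each possesses" is sent as a bitfield. A minimal sketch of reading one, assuming the peer wire protocol's bit order (the high bit of the first byte is piece 0); the function name and the example bitfields are made up for illustration:

```python
def pieces_from_bitfield(bitfield, num_pieces):
    """Return the set of piece indices whose bit is set in the bitfield."""
    have = set()
    for index in range(num_pieces):
        byte = bitfield[index // 8]
        # The most significant bit of each byte is the lowest-numbered piece.
        if byte & (0x80 >> (index % 8)):
            have.add(index)
    return have


# Hypothetical 4-piece torrent: we hold pieces 0 and 2, the peer holds 0, 1, 3.
mine = pieces_from_bitfield(bytes([0b10100000]), 4)
theirs = pieces_from_bitfield(bytes([0b11010000]), 4)

# Pieces worth requesting from this peer: ones they have and we lack.
wanted = sorted(theirs - mine)
print(wanted)
```

Whether the peer actually serves those requests depends on the choking decisions described above.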

Other details. DHT: a distributed way to avoid the reliance on trackers (used by so-called “magnet links”). Peer exchange: during download, peers can share IPs of other peers that have the same file.

.torrent files

First, let’s look at the structure of a .torrent file.

Everything in a torrent file is more-or-less encoded in a format called “bencoding”. There are many libraries to bencode/bdecode; I have the “official” one installed for Python 3.

Bencoding handles strings, ints, lists, and dicts; this suffices for both .torrent files and some of the on-the-wire BitTorrent protocol.
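To make the four types concrete, here is a minimal decoder sketch. It is not the library's implementation: it does no validation beyond the basics, returns all strings as raw bytes, and the sample input (including the tracker URL) is invented.

```python
def bdecode(data, pos=0):
    """Decode one bencoded value starting at data[pos].

    Returns (value, position just past the value).
    """
    lead = data[pos:pos + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                      # list: l<values>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                      # dict: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            result[key], pos = bdecode(data, pos)
        return result, pos + 1
    if lead.isdigit():                    # string: <length>:<bytes>
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError(f"unexpected byte at offset {pos}")


sample = b"d8:announce14:http://t.test/4:infod6:lengthi12e4:name5:a.txtee"
decoded, end = bdecode(sample)
print(decoded)
```

Strings are length-prefixed rather than delimited, which is why binary values such as the piece hashes can be stored without escaping.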

A .torrent file (in the bittorrent spec, a “metainfo” file, but we’ll just call it a torrent file) is just a bencoded dictionary.

There are only two required keys: announce, whose value is the URL of the tracker, and info, whose value is a dictionary describing the content being distributed. There are some optional keys as well (see spec).

info dictionary

In the simplest case, the torrent describes a single file. Then the info dictionary has four mandatory key/value pairs:

  • name: the filename
  • length: length of the file in bytes (integer)
  • piece length: number of bytes in each piece
  • pieces: string consisting of the concatenation of all 20-byte SHA1 hash values, one per piece

If the torrent instead describes multiple files, then instead of the length key there is a key called files, whose value is a list of dictionaries (name is still present, and is used as the directory name). Each dictionary in this list has two keys, path and length, giving the path and the length in bytes of one file. The piece hashes are the hashes of the concatenation of all of these files, in list order.
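A small sketch of how the pieces value could be computed for a multi-file torrent. The two "files" and the 8-byte piece length are toy values chosen so the boundary effect is visible; a real tool would stream from disk rather than join everything in memory.

```python
import hashlib


def piece_hashes(file_contents, piece_length):
    """Concatenate the files in order, then SHA-1 each piece_length-sized piece.

    Returns the 20-byte digests joined into one byte string, as in 'pieces'.
    """
    stream = b"".join(file_contents)
    digests = []
    for start in range(0, len(stream), piece_length):
        digests.append(hashlib.sha1(stream[start:start + piece_length]).digest())
    return b"".join(digests)


# Two hypothetical 10-byte files, hashed in 8-byte pieces.
files = [b"A" * 10, b"B" * 10]
pieces = piece_hashes(files, piece_length=8)

# 20 bytes / 8 gives 3 pieces; the last one is short (4 bytes).
print(len(pieces) // 20)
```

Piece 1 here covers the last two bytes of the first file and the first six of the second, so piece boundaries do not line up with file boundaries.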

Some examples:

import binascii
import bencode  # the pip package bencode.py
# bencoded data is binary, so read the file in 'rb' mode
d = open('/Users/liberato/Desktop/590K/ubuntu-18.04.4-desktop-amd64.iso.torrent', 'rb').read()
torrent = bencode.bdecode(d)  # decode once and reuse the result
torrent.keys()
info = torrent['info']
info.keys()
info['name']
info['length']
info['piece length']
pieces = info['pieces']
pieces[0:20]  # the first piece's SHA-1 hash, as 20 raw bytes
binascii.hexlify(pieces[0:20])  # the same hash in hex

So, assuming you can get a bdecoder working, it’s pretty easy to examine torrent files. (This is a pip-installable package called bencode.py for Python 3.)
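With a decoded torrent in hand, the infohash is the SHA-1 of the bencoded info dictionary. The sketch below uses its own tiny encoder so that it stands alone; the info values are invented. One assumption to flag: re-encoding a decoded dictionary only reproduces the original bytes if the torrent was encoded canonically, so a careful tool would hash the raw byte span of info from the file instead.

```python
import hashlib


def bencode(obj):
    """Minimal bencoder for ints, str/bytes, lists, and dicts (sketch only)."""
    if isinstance(obj, int) and not isinstance(obj, bool):
        return b"i" + str(obj).encode("ascii") + b"e"
    if isinstance(obj, str):
        obj = obj.encode("utf-8")
    if isinstance(obj, bytes):
        return str(len(obj)).encode("ascii") + b":" + obj
    if isinstance(obj, list):
        return b"l" + b"".join(bencode(x) for x in obj) + b"e"
    if isinstance(obj, dict):
        # Dictionary keys are byte strings and must appear in sorted order.
        items = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in obj.items()),
            key=lambda kv: kv[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(obj).__name__}")


def infohash(info):
    """Hex SHA-1 of the bencoded info dictionary."""
    return hashlib.sha1(bencode(info)).hexdigest()


# Hypothetical single-file info dictionary (one 12-byte file, one piece).
info = {
    "name": "a.txt",
    "length": 12,
    "piece length": 16384,
    "pieces": hashlib.sha1(b"hello world\n").digest(),
}
print(infohash(info))
```

Because only info is hashed, changing the announce URL leaves the infohash alone, while changing a filename or the piece size produces a different torrent identity.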

Now, recall how we handle multi-file torrents. How might we find “pieces of interest” given a large library of known files of interest (FOIs), and a torrent under consideration? Consider the fact that offsets of pieces don’t match offsets of files. What can you do?

As hinted at in the paper, you can compute hashes of all possible first pieces of the file: for each offset smaller than the piece size, hash the piece-sized run of bytes that starts that many bytes into the file. Wherever the file happens to start relative to a piece boundary, the first piece lying wholly inside it matches one of these. (on board) So you grow your hash set of FOIs by a large (but constant) factor in order to be able to quickly identify torrents that contain FOIs.

Hmm, what data structure might let you do a first-pass check for membership in this set of hashes scalably (which you can then verify against a larger, on-disk database)?
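One candidate answer is a Bloom filter: it never misses a member, occasionally reports a false positive, and fits in memory. The sketch below is an illustration, not the paper's method; the filter parameters, the 64-byte piece size, and the synthetic "file of interest" are all made up, and a real hash set would need entries for each piece size in use.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive several bit positions by salting SHA-256 with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))


def offset_piece_hashes(content, piece_length):
    """Yield SHA-1 of content[k:k+piece_length] for each offset k < piece_length."""
    for k in range(piece_length):
        if k + piece_length > len(content):
            break
        yield hashlib.sha1(content[k:k + piece_length]).digest()


PIECE = 64
# Synthetic 512-byte file of interest.
foi = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(16))

bloom = BloomFilter(num_bits=1 << 16, num_hashes=4)
for digest in offset_piece_hashes(foi, PIECE):
    bloom.add(digest)

# A torrent whose concatenated stream has 37 unrelated bytes before the FOI.
stream = b"x" * 37 + foi
hits = [i for i in range(len(stream) // PIECE)
        if hashlib.sha1(stream[i * PIECE:(i + 1) * PIECE]).digest() in bloom]
print(hits)
```

Piece 1 of the stream is the FOI's bytes 27 through 90, which is one of the precomputed offsets, so it is flagged; any hit would then be confirmed against the full on-disk database.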

On the wire

How does a user get a list of peers? One way is through a tracker.
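A sketch of the HTTP tracker exchange, under the assumption that the tracker speaks the standard announce interface and returns a compact peer list (6 bytes per peer: 4-byte IPv4 address, 2-byte big-endian port). The tracker URL, peer ID, and sample reply are placeholders, and no network request is made here.

```python
import ipaddress
import struct
from urllib.parse import quote, urlencode


def announce_url(tracker, info_hash, peer_id, port, left):
    """Build the announce GET URL; assumes tracker has no query string yet."""
    params = {
        "info_hash": info_hash,  # raw 20 bytes, percent-escaped below
        "peer_id": peer_id,      # 20 bytes chosen by the client
        "port": port,
        "uploaded": 0,
        "downloaded": 0,
        "left": left,            # bytes still needed
        "compact": 1,            # ask for the 6-bytes-per-peer format
    }
    return tracker + "?" + urlencode(params, quote_via=quote)


def parse_compact_peers(blob):
    """Split a compact peer string into (ip, port) tuples."""
    peers = []
    usable = len(blob) - len(blob) % 6  # ignore any trailing partial entry
    for i in range(0, usable, 6):
        ip, port = struct.unpack("!4sH", blob[i:i + 6])
        peers.append((str(ipaddress.IPv4Address(ip)), port))
    return peers


url = announce_url("http://tracker.example/announce", bytes(range(20)),
                   b"-XX0001-123456789012", 6881, left=100)
print(url)
print(parse_compact_peers(bytes([10, 0, 0, 1, 0x1A, 0xE1])))
```

The tracker's reply is itself a bencoded dictionary; the compact string above would be the value of its peers key.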

How do you get a piece? You talk to a peer. There’s a fixed-length handshake, and then an exchange of length-prefixed messages.
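A sketch of the byte layouts involved, assuming the standard peer wire protocol: a 68-byte handshake, then messages that each carry a 4-byte big-endian length, a 1-byte message ID, and a payload (a zero length is a keep-alive). The function names are made up, and socket handling is left out.

```python
import struct

PSTR = b"BitTorrent protocol"


def build_handshake(info_hash, peer_id):
    """<pstrlen><pstr><8 reserved bytes><20-byte infohash><20-byte peer id>."""
    return bytes([len(PSTR)]) + PSTR + bytes(8) + info_hash + peer_id


def build_request(index, begin, length):
    """Request message (ID 6): ask for `length` bytes at `begin` within a piece."""
    return struct.pack("!IBIII", 13, 6, index, begin, length)


def split_messages(buffer):
    """Split received bytes into complete (msg_id, payload) messages.

    Returns (messages, leftover); leftover holds any incomplete trailing
    message. Keep-alives appear as (None, b"").
    """
    messages = []
    pos = 0
    while len(buffer) - pos >= 4:
        (length,) = struct.unpack_from("!I", buffer, pos)
        if len(buffer) - pos - 4 < length:
            break  # the rest of this message has not arrived yet
        body = buffer[pos + 4:pos + 4 + length]
        if length == 0:
            messages.append((None, b""))
        else:
            messages.append((body[0], body[1:]))
        pos += 4 + length
    return messages, buffer[pos:]


handshake = build_handshake(bytes(20), b"-XX0001-123456789012")
received = b"\x00\x00\x00\x00" + b"\x00\x00\x00\x01\x01" + b"\x00\x00"
messages, leftover = split_messages(received)
print(len(handshake), messages, leftover)
```

The infohash in the handshake is what lets an observer tie a peer connection to a specific torrent.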
