This assignment is due by 9pm on Tuesday, April 14th. It must be submitted through Moodle.
(15 points) In lecture, we talked about how you might be able to identify multi-file .torrent files that contain files of interest, assuming you already have access to the file-of-interest list. Recall this task is not entirely trivial, since piece boundaries within the .torrent are generally not aligned to file boundaries.
Suppose that you have a single file of interest you wish to be able to detect in .torrents without having to download the associated files (that is, you have the .torrent but don’t want to join the swarm). Further suppose false positives are OK – in other words, you’re going to be using a Bloom filter. Matching just the first piece of the file is sufficient – you don’t need to match multiple such pieces for the file.
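To make the alignment issue concrete, here is a minimal sketch (not a reference solution) of how candidate piece hashes could be generated. It assumes the file of interest may begin at any byte offset within a piece, so the first piece lying wholly inside the file may start at any of several offsets into the file. The function name `candidate_piece_hashes` and its parameters are hypothetical, chosen only for illustration.

```python
import hashlib


def candidate_piece_hashes(file_bytes, piece_len):
    """Return the SHA-1 digest of one piece-sized window per possible alignment.

    Assumption: the file can start anywhere inside a piece, so the first
    piece that lies wholly within the file starts `offset` bytes into the
    file, for some offset in [0, piece_len).
    """
    digests = []
    for offset in range(piece_len):
        window = file_bytes[offset:offset + piece_len]
        if len(window) < piece_len:
            # File too short to hold a full piece at this alignment.
            break
        digests.append(hashlib.sha1(window).digest())
    return digests
```

Each digest returned is a value you would then insert into the Bloom filter. Trying this on a small buffer with a tiny `piece_len` is a quick way to check your reasoning before scaling up to real piece sizes.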
a) If you assume that pieces are 256 KB (and that the file of interest length is at least double the piece length), how many hashes of piece-sized data will you need to generate and insert into a Bloom filter to detect this file? In other words, you are determining how many 256 KB values you need to take the SHA-1 of. In a production system, these SHA-1s would then be inserted into the Bloom filter (which would require computing k hashes of their values, and so on).
b) The most common piece sizes are 256 KB, 512 KB, and 1 MB. How many hashes, total, will you need to generate and insert into a Bloom filter to detect this file?
c) Suppose you have a library of 10,000 such files of interest and you wish to be able to detect them using a single Bloom filter, with an expected FPR of less than 1%. Further suppose you wish to have as small a Bloom filter as possible. What parameters must the filter have? (You do not need to completely minimize the Bloom filter’s size, but make a reasonable attempt – don’t tell me you need a 100 TB Bloom filter.)
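For sizing, the standard textbook approximations are m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions, where n is the number of elements inserted (not the number of files) and p is the target false positive rate. The sketch below just evaluates those formulas; the function names are hypothetical, and the formulas assume independent, uniformly distributed hash functions.

```python
import math


def bloom_parameters(n_items, target_fpr):
    """Return (m_bits, k_hashes) from the standard Bloom filter approximations."""
    m_bits = math.ceil(-n_items * math.log(target_fpr) / (math.log(2) ** 2))
    k_hashes = max(1, round(m_bits / n_items * math.log(2)))
    return m_bits, k_hashes


def expected_fpr(n_items, m_bits, k_hashes):
    """Approximate false positive rate after inserting n_items elements."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes
```

Because k must be an integer, the rate that `expected_fpr` reports for the rounded parameters can land slightly above the target. If it does, add some bits to m and check again.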
(10 points) Suppose you are given a pre-populated Bloom filter, representing pieces of files of interest, as in the previous question. One such piece hash is
Identify the file that is identified by this hash by the name of the .torrent, the number of pieces into the .torrent, and the name of the corresponding file described in the .torrent.
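As a starting point, here is a rough sketch of how you might search one .torrent for a given 20-byte piece hash and map the matching piece index back to the file(s) it overlaps. It includes a minimal bencode decoder with no error handling. The names `bdecode` and `locate_piece` are hypothetical, and the sketch assumes the standard metainfo keys (`info`, `piece length`, `pieces`, `files`, `path`, `length`, `name`).

```python
def bdecode(data, i=0):
    """Decode one bencoded value at data[i]; return (value, next_index)."""
    c = data[i:i + 1]
    if c == b"i":  # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i + 1:end]), end + 1
    if c == b"l":  # list: l<items>e
        i += 1
        items = []
        while data[i:i + 1] != b"e":
            item, i = bdecode(data, i)
            items.append(item)
        return items, i + 1
    if c == b"d":  # dictionary: d<key><value>...e
        i += 1
        result = {}
        while data[i:i + 1] != b"e":
            key, i = bdecode(data, i)
            value, i = bdecode(data, i)
            result[key] = value
        return result, i + 1
    # byte string: <length>:<bytes>
    colon = data.index(b":", i)
    length = int(data[i:colon])
    start = colon + 1
    return data[start:start + length], start + length


def locate_piece(torrent_bytes, target_hex):
    """Return (torrent name, piece index, overlapping file names) or None."""
    meta, _ = bdecode(torrent_bytes)
    info = meta[b"info"]
    piece_len = info[b"piece length"]
    pieces = info[b"pieces"]
    hashes = [pieces[j:j + 20] for j in range(0, len(pieces), 20)]
    target = bytes.fromhex(target_hex)
    if target not in hashes:
        return None
    index = hashes.index(target)
    start, end = index * piece_len, (index + 1) * piece_len

    name = info[b"name"].decode("utf-8", "replace")
    if b"files" in info:  # multi-file torrent
        files = [(b"/".join(f[b"path"]).decode("utf-8", "replace"), f[b"length"])
                 for f in info[b"files"]]
    else:  # single-file torrent
        files = [(name, info[b"length"])]

    overlapping, offset = [], 0
    for file_name, length in files:
        # Keep files whose byte range intersects the piece's byte range.
        if offset < end and offset + length > start:
            overlapping.append(file_name)
        offset += length
    return name, index, overlapping
```

You would call `locate_piece` on the raw bytes of each .torrent in turn. A piece can straddle a file boundary, so the result may list more than one file.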
(Marc, why aren’t you giving us an actual pre-populated Bloom filter? Because I spent over two hours fiddling with various available Bloom filter implementations in Python, and was unable to find one that would serialize data to disk in a way that could be read back into memory in a portable way that didn’t also require installation of Redis or Postgres or the like. Several I tested purported to provide this functionality, but did not in practice!)
Finally, note that this assignment is synthetic! I make no claims about the actual files described by these .torrents: I have not downloaded them or otherwise interacted with them, other than to post the .torrent files themselves here. Nor do I recommend you download them. In fact, please do not, as they may refer to copyrighted material.
(25 points) Here is a short packet capture from the start of a download of a (legal) .torrent file. Find the TCP session connecting to a tracker and answer the following questions:
- Identify the .torrent by name and/or infohash.
- What peers (IP/port) were provided by the tracker?
- Which of these peers were successfully connected to?
- (Optional, advanced: What information did these peers provide? Which pieces did they possess? What peers did they advertise using PEX? And so on.)
- For everyone: Which items of evidence above are most forensically valid? And which might be less so? Explain your answer.
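Two pieces of the tracker exchange can be decoded by hand or with a few lines of code. The sketch below assumes an HTTP tracker: the announce request carries the infohash as a URL-encoded `info_hash` query parameter, and the response may carry peers in the compact form (6 bytes per peer: a 4-byte IPv4 address, then a 2-byte big-endian port). The function names are hypothetical. If the capture uses a non-compact peer list or a UDP tracker, this does not apply as written.

```python
import socket
import struct
from urllib.parse import urlsplit, unquote_to_bytes


def info_hash_from_announce(request_target):
    """Extract the infohash (as hex) from an announce request target."""
    query = urlsplit(request_target).query
    for pair in query.split("&"):
        key, _, value = pair.partition("=")
        if key == "info_hash":
            # The value is 20 raw bytes, percent-encoded where necessary.
            return unquote_to_bytes(value).hex()
    return None


def decode_compact_peers(blob):
    """Decode a compact peer string into a list of (ip, port) tuples."""
    peers = []
    usable = len(blob) - len(blob) % 6  # ignore any trailing partial entry
    for i in range(0, usable, 6):
        ip = socket.inet_ntoa(blob[i:i + 4])
        (port,) = struct.unpack("!H", blob[i + 4:i + 6])
        peers.append((ip, port))
    return peers
```

The request target is the path and query string on the `GET` line of the announce request. The compact blob is the byte string stored under the `peers` key of the bencoded tracker response.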
(20 points) Choose another protocol over which files could be exchanged and give a forensic analysis of what data it exposes to the client, to the server, and to an observer in the middle of the data flow. On the easier end of the spectrum, you might choose unencrypted HTTP (probably 1.1) or FTP. On the harder end, you might pick a p2p protocol (like the µTP protocol that BitTorrent uses).
(Optional, advanced: packet capture a small exchange of data using the protocol you chose, and annotate the forensically interesting information.)
(10 points) Consider the simple timing attack described by Bissias et al. on OneSwarm. Is there a feasible setting of the tunable parameters that stops the attack? Give one such setting, or explain why such a setting is not possible.
Your submission should consist of your written answers, programs, and any other required files. We expect you to put it all into a reasonable archive format (.zip, .tar.gz) and upload it through Moodle.
Reminder: Group work is permitted (so long as you clearly indicate group members). But if you work in groups, we will generally expect a higher level of performance on the work.