21: Introduction to reverse engineering

Today we’re going to do a brief overview of how you might reverse engineer an executable file. There are many nuances to this process, and it requires some background that not everyone will have and/or have fresh in their minds. So I’ll start with an overview and review of things you might need to know, then walk through some simple uses of command line tools to do some reverse engineering.

Side note: everything here is in the context of Linux, but as with memory forensics, there are similar tools and techniques that will work on executables compiled for other OSes.

Why might you want to do this? Suppose you have an executable for which you don’t have the source code, and you want to know what it does, or how, or why. You can of course just run it, but in order to really understand it you might need to look more carefully at it. You can do what’s called static analysis, where you examine the executable itself, or dynamic analysis, where you run it and examine/manipulate the running process, or (usually) some combination of the two.

Executable format

Linux executables are in ELF format. ELF is (yet another) binary file format. I won’t overwhelm you with the fine details (read the spec or the Wikipedia article for those), but I will give you the highlights.

It starts with an ELF header, which describes the executable at a high level: a magic number, 32- vs. 64-bit, target OS, instruction set, pointers to the other parts of the executable, and so on.

Then there’s a program header table, which describes the memory segments in the executable. These are (some of) the same regions that we talked about in the memory class: things like the code and data of the executable.

Then the contents themselves follow, organized into named sections. Ones of interest include:

  • .bss: This section holds uninitialized data that contribute to the program’s memory image.
  • .data: This section holds initialized data that contribute to the program’s memory image.
  • .dynamic: This section holds dynamic linking information.
  • .init: This section holds executable instructions that contribute to the process initialization code. That is, when a program starts to run, the system arranges to execute the code in this section before calling the main program entry point (called main for C programs).
  • .rodata and .rodata1: These sections hold read-only data that typically contribute to a non-writable segment in the process image.
  • .text: This section holds the “text,” or executable instructions, of a program.

Most of the tools we use to run and examine programs parse these sections for us automatically. More general tools (like the ever-present strings) can be used, for example, to look for data embedded in the .data or .rodata sections, but this is a very coarse technique.

Architecture and Assembly

OK, so what does program code look like?

You (kids today!) probably learned to code in a higher-level language, like Java, JavaScript, Python, or the like. These languages are great, in that they hide many of the low-level details of a machine from you. You don’t have to worry too much about memory management, and you don’t have to worry basically at all about the underlying CPU and how it actually does stuff. That’s all great news.

The bad news is that if you want to reverse engineer an executable, you need to understand at least a little of this stuff to get started.

So, CPUs present a very simple interface to the programmer via assembly language. (CPUs used to actually be about this simple, but now there’s hardware that provides this abstraction – take a modern architecture class to learn more if you’re curious…) Generally, you have short (three or four letter) mnemonic names for each low-level instruction the CPU supports. These are typically things like arithmetic operations (ADD, SUB, MUL, DIV), comparison operators of various flavors like CMP, and “flow control” (really, just branch instructions) like JE or JNE.

Most of these operations usually take one, two, or (on some architectures and for some operations) three arguments, for the source(s) and destination of the operation. Some architectures let you directly target memory addresses with these operations. But accessing memory is slow. CPUs have their own dedicated chunks of storage, called registers, that are better (certainly faster) sources and targets for most operations. They also might be the implicit target of certain operations. There are various operations to move data into or out of registers, too, usually named MOV or the like.

There are other details too. Most CPU instruction sets directly support the “stack pointer” which, you know, points to the top of the stack in memory (since it’s so useful in function calls, which happen a lot it turns out!). They also support an instruction pointer of some sort, containing the address of the current (or, on some architectures, the next) instruction in memory, which is helpful when jumping around, for example.

Some simple sample reversing

So the usual way people learn reverse engineering is not by, say, loading up the most complicated executable they know of (like, Chrome or something) and trying to figure out what it does. Instead, they start with small, simple exercises to cut down on the number of things they have to consider. These exercises are sometimes called “crackmes”, and the goal is to figure out how to make the executable do something specific (like, say, exit with status code 0) by analyzing it.

So let’s do that on a couple of examples. (See schedule page for links!)

crackme01

Normally you don’t look at the program source first (if you’re challenging yourself) but hey this is a learning environment, so:

#include <stdio.h>
#include <string.h>

// A very simple crackme which stores the correct password in program memory
// and uses the builtin string comparison function to check it.

int main(int argc, char** argv) {

    if (argc != 2) {
        printf("Need exactly one argument.\n");
        return -1;
    }

    char* correct = "password1";

    if (strncmp(argv[1], correct, strlen(correct))) {
        printf("No, %s is not correct.\n", argv[1]);
        return 1;
    } else {
        printf("Yes, %s is correct!\n", argv[1]);
        return 0;
    }

}

Here’s a program that wants a password, which is hardcoded into the program itself.

Suppose we didn’t have the source.

If we run it and don’t know the password, we’d have to guess repeatedly to figure it out. But maybe we can do better?

One option in a simple case like this is to just run strings. A hardcoded string literal like this password ends up in one of the executable’s data sections (typically .rodata for a constant like this one), and we should be able to find it there.

(demo)

crackme02

We traced through the execution of crackme02. After I did the lecture, someone asked me to send them the links. I realized that the GitHub crackme repo I used was actually linked to…a tutorial on reverse engineering! The link to the entire tutorial is in the readings now if you want a much more detailed writeup than my original notes.