Developing PE file packer step-by-step. Step 6. TLS

Previous step is here.

It's time to deal with an important thing: Thread Local Storage (TLS). What is it? It is a small structure that tells the PE loader where to find the data that must be allocated for each thread. The loader also calls TlsAlloc and stores the return value (the so-called index) at the address specified in this structure. In addition, the structure may contain the address of an array of callbacks (function addresses), which the loader calls when the file is loaded into memory or when a new thread is created in the process.

To be honest, working with TLS will be somewhat more hardcore than it was with other things, so get prepared and strain your brain. My old packer, which I mentioned one or two steps ago, doesn't support TLS callbacks: it cowardly reports that they exist and leaves them unprocessed. Basically, this is reasonable behaviour, as TLS callbacks are found mainly in rather weird files that use them as an anti-debugging trick. Mainstream linkers such as Borland's or Microsoft's do not offer a simple, built-in way to create TLS callbacks. However, we will support them to make our packer cool.

Let's start step by step. As always, we will edit the packed_file_info structure (structures.h file of the simple_pe_packer project). This time we will add four fields to it:

These fields will store the TLS-related values required by the packer. I will explain callbacks separately. The AddressOfCallBacks field of the IMAGE_TLS_DIRECTORY structure points to an array of absolute virtual addresses stored one after another, each of which points to a callback function. The last element of this array is null. The loader calls all the functions in this array one by one on the following events: process creation, thread creation, thread termination and process termination. The first call happens even before the process entry point runs.

To let the loader know that our packed file has TLS callbacks (if the original had them, of course), we will not set the AddressOfCallBacks field to null. Instead, we will point it to an array containing a single empty callback (not a null entry, but a real function that does nothing). When the packed image is loaded, this callback is processed, and from that moment the loader knows that the file has TLS callbacks. If we wrote null or a pointer to an empty array to the AddressOfCallBacks field, we would have no way to tell the loader that callbacks exist. The contents of the callback array, however, can be changed later, because the loader reads the array every time it needs it.
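The shape of such an array can be illustrated with a short sketch. It is not the packer's code: the helper name count_tls_callbacks, the va32 alias and the address 0x00401005 are all made up for illustration, and the addresses are treated as plain 32-bit numbers rather than real pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// In a 32-bit image each array slot holds a 4-byte absolute virtual address.
using va32 = std::uint32_t;

// Counts the entries of a null-terminated callback array.
// `callbacks` must point to an array whose last element is 0.
std::size_t count_tls_callbacks(const va32* callbacks)
{
    assert(callbacks != nullptr);
    std::size_t count = 0;
    while (callbacks[count] != 0)
        ++count;
    return count;
}

// Shape of the array that the packed file shows to the loader:
// one do-nothing callback followed by the terminating null.
// 0x00401005 is a hypothetical address of the empty callback.
const va32 fake_callbacks[] = { 0x00401005, 0 };
```

Counting the entries this way is also what the packer needs in order to know how much room the original callbacks will require later.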

We move the TLS index and the data used to initialize thread memory to our own section (the one we called coderpub, remember?) so that they are not overwritten during unpacking. In the unpacker, we write the index provided by the loader directly to the location where the original file expects it. We will also write the IMAGE_TLS_DIRECTORY structure itself (in fact, IMAGE_TLS_DIRECTORY32, because we pack 32-bit binaries) to our section, together with our fake callback array consisting of a single empty callback, if the original file has callbacks.

Let's begin the development. After this code block:

write the following:

Now we add a little code after the place, where we rebuild the resources:

Here we simply save the required information about the original TLS to our structure that describes the original file. In addition, the loader will save the TLS index to this structure, and the unpacker will copy it to the place where it should be located.

Next, we work with the "coderpub" section, which previously contained only the unpacker body. First, we make it writable by changing the line:


We also change the line:


because we will work with this section.

Now we turn to the unpacker project (unpacker) for a while. Here is how we will process TLS callbacks in detail. We have already stored all the original TLS addresses. Right after unpacking the file, we manually execute all callbacks of the original file, because the loader will not do this: the only callback it knows about is our empty one. We also overwrite the callback array we created with the addresses of all the original functions. From this moment the loader is in charge of TLS callbacks, and we have nothing more to do. So our current task is to make an empty TLS callback in the unpacker. To avoid creating extra functions, we simply modify the unpacker_main function prologue:

So, the unpacker starts executing from the jmp next instruction, which jumps directly to its main body. The empty callback we need is just ret 0xC, and we will store a pointer to this instruction in the callback array. This instruction returns control to the caller and removes 0xC = 12 bytes of arguments from the stack. In case you don't know, the TLS callback prototype looks like this:

It uses the stdcall calling convention and takes three four-byte parameters, so in total we have to remove 3 * 4 = 12 bytes from the stack. The callback does not return a value, so there is no need to modify the eax register in its body.
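For reference, the Windows headers declare this type as PIMAGE_TLS_CALLBACK, with the parameters (PVOID DllHandle, DWORD Reason, PVOID Reserved). The sketch below mirrors that declaration with stand-in types so that the 12-byte arithmetic can be checked by the compiler. The names pvoid32, dword, tls_callback_t, TLS_CALLBACK_CC and empty_tls_callback are illustrative and do not come from the packer sources.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(_MSC_VER) && defined(_M_IX86)
#define TLS_CALLBACK_CC __stdcall   // the callee pops its own arguments
#else
#define TLS_CALLBACK_CC             // placeholder so the sketch builds elsewhere
#endif

// Stand-ins for PVOID and DWORD as they look inside a 32-bit image.
using pvoid32 = std::uint32_t;
using dword   = std::uint32_t;

// Same parameter order as PIMAGE_TLS_CALLBACK: DllHandle, Reason, Reserved.
using tls_callback_t =
    void (TLS_CALLBACK_CC*)(pvoid32 dll_handle, dword reason, pvoid32 reserved);

// Number of argument bytes a stdcall callee has to pop on return.
constexpr std::size_t tls_callback_stack_bytes =
    sizeof(pvoid32) + sizeof(dword) + sizeof(pvoid32);

static_assert(tls_callback_stack_bytes == 0xC,
              "ret 0xC pops exactly the three four-byte arguments");

// C++ equivalent of the empty callback: it does nothing and returns.
void TLS_CALLBACK_CC empty_tls_callback(pvoid32, dword, pvoid32) {}
```

When built as 32-bit code with a compiler that honours stdcall, the body of such an empty function typically ends in the same ret 0xC that the unpacker prologue provides by hand.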

Now we replace all these lines:

with a single memcpy, because the amount of redundant code is getting too big:

Let's update all uses of the variables listed above accordingly, for example, replacing original_import_directory_rva with info_copy.original_import_directory_rva.
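The idea of copying the whole structure at once can be sketched as follows. The structure here is a shortened, hypothetical stand-in: only original_import_directory_rva and info_copy are names taken from the text, while the other two fields and the copy_info helper are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical, shortened stand-in for packed_file_info.
#pragma pack(push, 1)
struct packed_file_info_sketch
{
    std::uint32_t original_import_directory_rva;
    std::uint32_t original_entry_point;     // illustrative field
    std::uint32_t original_tls_index_rva;   // illustrative field
};
#pragma pack(pop)

// Copies the whole structure out of the packed image with one memcpy
// instead of one assignment per field.
packed_file_info_sketch copy_info(const void* info_in_image)
{
    assert(info_in_image != nullptr);
    packed_file_info_sketch info_copy;
    std::memcpy(&info_copy, info_in_image, sizeof(info_copy));
    return info_copy;
}
```

After the copy, every former local variable becomes a member access on info_copy, so adding a new field later requires no extra copying code.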

Let's update the parameters.h file: the offsets required by the packer have changed, and one more has been added:

The last offset into the unpacker code (empty_tls_callback_offset) is the offset of the ret instruction that returns from the TLS callback.

Let's go further. Unlike the import and resource directories, we will not restore the TLS directory: there is no point, as the loader will not read it again anyway. Now we turn to TLS processing. We will place this code in the unpacker after the part where we fix imports. To begin with, we copy the index provided by the loader to the memory address where it has to be located:
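A minimal sketch of that index copy, assuming the unpacked image is modelled as a byte buffer and the target location is given as an RVA. The function store_tls_index and its parameter names are hypothetical; the real unpacker works on the mapped image directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Writes the loader-provided TLS index to the place where the original
// file expects to find it. `image` stands in for the unpacked image in
// memory, `original_tls_index_rva` is the offset from the image base.
void store_tls_index(std::vector<unsigned char>& image,
                     std::uint32_t original_tls_index_rva,
                     std::uint32_t tls_index)
{
    // The four bytes of the index must lie inside the image.
    assert(original_tls_index_rva <= image.size() &&
           image.size() - original_tls_index_rva >= sizeof(tls_index));
    std::memcpy(image.data() + original_tls_index_rva,
                &tls_index, sizeof(tls_index));
}
```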

The following part is more complicated. Let's process TLS callbacks:

First, we walk through all callback addresses in the original array and copy them to our TLS callback array, so that the loader reads them the next time they are needed. However, the loader called only our empty callback when creating the process, while the PE file expects its callbacks to be called with the DLL_PROCESS_ATTACH reason. That's why we need the second loop, in which we call all callbacks from the original array, passing the image base address as the first parameter and DLL_PROCESS_ATTACH (= 1) as the second. The third parameter is not used, see the prototype above. Of course, we could copy the callback addresses and call them in a single loop, but what if the binary modifies itself in a callback body or expects the array to be filled before it starts? Two loops are not a universal solution either, but they are more reliable.
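The two loops can be sketched in portable C++ as shown below. This is a model, not the unpacker's code: callbacks are ordinary function pointers of native size instead of 4-byte virtual addresses, no calling convention is specified, and the names take_over_tls_callbacks, tls_callback_fn and process_attach are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t process_attach = 1;  // value of DLL_PROCESS_ATTACH

using tls_callback_fn =
    void (*)(void* image_base, std::uint32_t reason, void* reserved);

// `original` is the null-terminated array from the unpacked file.
// `ours` is the array the loader reads; it must have room for the same
// number of entries plus the terminating null.
// Returns the number of callbacks that were published and called.
std::size_t take_over_tls_callbacks(tls_callback_fn* original,
                                    tls_callback_fn* ours,
                                    void* image_base)
{
    assert(original != nullptr && ours != nullptr);

    // Loop 1: publish every original callback to the loader-visible array.
    std::size_t count = 0;
    for (; original[count] != nullptr; ++count)
        ours[count] = original[count];
    ours[count] = nullptr;

    // Loop 2: deliver the process-attach notification the loader
    // could not deliver, now that the array is completely filled.
    for (std::size_t i = 0; original[i] != nullptr; ++i)
        original[i](image_base, process_attach, nullptr);

    return count;
}
```

Keeping the loops separate means that a callback which inspects the array during its first call already sees every entry in place.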

That's all for the unpacker, so we now turn to the packer. We should place the TLS directory in the "coderpub" section and also copy there the file data used to initialize the thread-local data of new threads.

Let me describe this huge piece of code. First, we reserve memory for the IMAGE_TLS_DIRECTORY32 structure in the last section, the one with the unpacker ("coderpub"), right after the unpacker code. Then we allocate memory for the TLS callback array according to the original number of callbacks (each takes 4 bytes, plus the terminating null element). The new callback array contains a pointer to code that does nothing except adjust the stack (ret 0xC). This informs the loader that the file has callbacks. Next, we recalculate the pointers to the data that the loader will use to initialize thread-local data. We place this data after the IMAGE_TLS_DIRECTORY32 structure and the TLS callback array. Then we rebuild TLS using the PE library (basically, it just writes the IMAGE_TLS_DIRECTORY32 structure to the right place and fills it in; we turned off automatic writing of callbacks and data).

Finally, we recalculate the virtual size of the section, taking into account the SizeOfZeroFill field of the original TLS directory (we don't change this value). I can't say exactly how this field is processed (unfortunately, I could not find a good explanation on the Internet): whether the loader zeroes the data after EndAddressOfRawData right inside the section or only after initializing thread-local memory. To be on the safe side, it is better to reserve that memory inside the section. This does not affect the packed file size, because we increase the virtual size of the section, not the raw size. After that, we strip unnecessary null bytes from the end of the section (it is the last one, so we can do that, as I mentioned before) and recalculate the virtual and raw sizes of the section (in fact only the raw size can change, as we set the virtual size manually and it is greater than or equal to the raw size).
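The size arithmetic described above can be checked with a small sketch. It only computes offsets inside the section and does not touch a real PE file. The tls_layout structure and the plan_tls_layout function are hypothetical names; the 24-byte size follows from the six 4-byte fields of IMAGE_TLS_DIRECTORY32, and the storage for the TLS index itself is left out of the model.

```cpp
#include <cassert>
#include <cstdint>

// Offsets are relative to the start of the "coderpub" section.
struct tls_layout
{
    std::uint32_t directory_offset;  // IMAGE_TLS_DIRECTORY32
    std::uint32_t callbacks_offset;  // callback array, 0 if there is none
    std::uint32_t raw_data_offset;   // data used to initialize threads
    std::uint32_t virtual_size;      // section size including zero fill
};

constexpr std::uint32_t tls_directory32_size = 24;  // six 4-byte fields
constexpr std::uint32_t pointer_size = 4;           // 32-bit image

tls_layout plan_tls_layout(std::uint32_t unpacker_size,
                           std::uint32_t callback_count,
                           std::uint32_t raw_data_size,
                           std::uint32_t size_of_zero_fill)
{
    tls_layout layout = {};
    layout.directory_offset = unpacker_size;       // right after the code
    std::uint32_t pos = unpacker_size + tls_directory32_size;

    if (callback_count != 0)
    {
        layout.callbacks_offset = pos;
        pos += (callback_count + 1) * pointer_size;  // entries + null
    }

    layout.raw_data_offset = pos;
    pos += raw_data_size;

    // Zero fill is reserved in virtual memory only, not in the file.
    layout.virtual_size = pos + size_of_zero_fill;
    assert(layout.virtual_size >= pos);  // never smaller than the raw part
    return layout;
}
```

For example, with a 0x200-byte unpacker, two original callbacks, 0x10 bytes of thread data and a zero fill of 8 bytes, the directory lands at 0x200, the callback array at 0x218, the data at 0x224, and the virtual size becomes 0x23C.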

Now we remove the line that we added before:

We only have to test whether the packer works. TLS processing without callbacks can be tested on any program compiled with Borland tools. We can also build a program with any version of Microsoft Visual Studio using __declspec(thread). Creating TLS with callbacks is not that easy: I used an example from here and compiled it in Visual C++ 6.0, although I could also have built TLS with callbacks manually in MASM32. After a little testing I made sure that everything works as intended!

Honestly, I noticed one feature that applies to all packers I tried: none of them changes the TLS index address. I can't say yet why this is so, but there is probably a reason for such behavior. Comments in the UPX source code say that the TLS callback array, like the IMAGE_TLS_DIRECTORY32 structure, should be aligned to a DWORD boundary. However, I decided not to do this, because a PE file with an unaligned array worked properly even on XP.

There is also a note on the previous code. It unexpectedly turned out that Windows XP works badly if the data directories in the PE header (Data Directory) are stripped: its explorer.exe stops displaying file icons. So we have to comment out the line:

to keep the compatibility.

Full solution for this step: Own PE Packer, step 6
