It's time to deal with such an important thing as Thread Local Storage (TLS). What is it? In a PE file, TLS is described by a small structure, the TLS directory, which tells the PE loader where to find the data that must be copied into each thread's private storage. The loader also allocates a TLS index for the module (conceptually similar to what TlsAlloc does) and writes it to the address specified in this structure (the AddressOfIndex field); this value is called the index. Besides that, the structure may contain the address of an array of callbacks (function addresses), which the loader calls when the file is loaded into memory, when a new thread is created in the process, and when threads or the process terminate.
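To make the structure concrete, here is a portable sketch of the IMAGE_TLS_DIRECTORY32 layout. The field names follow <winnt.h>; the struct name TlsDirectory32 and the use of std::uint32_t instead of DWORD are assumptions made so the snippet compiles outside Windows. In a 32-bit PE file all address fields hold absolute virtual addresses (VAs), not RVAs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable mirror of IMAGE_TLS_DIRECTORY32 (hypothetical name, same layout).
// Every address field is an absolute VA in a PE32 image.
struct TlsDirectory32 {
    std::uint32_t StartAddressOfRawData; // VA of the first byte of the TLS template
    std::uint32_t EndAddressOfRawData;   // VA just past the last template byte
    std::uint32_t AddressOfIndex;        // VA of the DWORD that receives the TLS index
    std::uint32_t AddressOfCallBacks;    // VA of a null-terminated array of callback VAs
    std::uint32_t SizeOfZeroFill;        // zero bytes that follow the template
    std::uint32_t Characteristics;       // alignment/reserved bits, usually 0
};

// Six DWORDs, no padding: 24 bytes, as in winnt.h.
static_assert(sizeof(TlsDirectory32) == 24, "IMAGE_TLS_DIRECTORY32 is 24 bytes");
static_assert(offsetof(TlsDirectory32, AddressOfIndex) == 8, "index VA at +8");
static_assert(offsetof(TlsDirectory32, AddressOfCallBacks) == 12, "callbacks VA at +12");
```

The packer below never builds this structure by hand (the PE library does it), but knowing the offsets helps when checking the result in a hex editor.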
To be honest, working with TLS is somewhat more hardcore than the other things we have done so far, so get ready to strain your brain. The old packer I mentioned a step or two ago doesn't support TLS callbacks: it merely reports that they exist and leaves them unprocessed. Basically, this is reasonable behaviour, as TLS callbacks appear mostly in rather unusual files that use them as an anti-debugging trick. Common linkers such as Borland's or Microsoft's don't offer a convenient, documented way to create TLS callbacks. However, we will support them anyway to make our packer cool.
Let's go step by step. As always, we start by editing the packed_file_info structure (structures.h in the simple_pe_packer project). This time we add four fields to it:
```cpp
//Loader writes TLS index here
DWORD tls_index;
//Relative address (RVA) of the TLS index in the original file
DWORD original_tls_index_rva;
//RVA of the TLS callback array in the original file
DWORD original_rva_of_tls_callbacks;
//RVA of the TLS callback array in the file
//after our modification
DWORD new_rva_of_tls_callbacks;
```
These fields store the TLS-related values the unpacker will need. Callbacks deserve a separate explanation. The AddressOfCallBacks field of the IMAGE_TLS_DIRECTORY structure points to an array of absolute virtual addresses stored one after another, each pointing to a callback function. The last element of this array is null. The loader calls every function in this array, in order, on the following events: process start, thread creation, thread exit, and process exit. The first call happens even before execution reaches the entry point. To let the loader know that our packed file has TLS callbacks (if, of course, the original file had them), we do the following: instead of setting AddressOfCallBacks to null, we point it to an array that contains only one empty callback (not null, but a real callback that does nothing). When the packed image is loaded into memory, this callback is invoked, and from that moment the loader knows that the file has TLS callbacks. If we wrote null or a pointer to an empty array to AddressOfCallBacks, we could not tell the loader that callbacks exist. The contents of the array, however, can be changed later, because the loader reads the array each time it needs to call the callbacks.
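The array handling described above can be sketched with plain integers standing in for 32-bit VAs. The helper names count_callbacks and make_fake_callback_array are hypothetical, and the sketch assumes the packer reserves one slot per original callback plus the terminating null, which is what the packer code later in this step does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts entries in a null-terminated array of callback VAs,
// walking it the same way the loader walks AddressOfCallBacks.
std::size_t count_callbacks(const std::uint32_t* callbacks) {
    std::size_t n = 0;
    while (callbacks[n] != 0)
        ++n;
    return n;
}

// Builds the fake array: one slot per original callback plus a null terminator.
// Only slot 0 is filled, with the VA of the do-nothing callback; the other slots
// stay zero until the unpacker copies the original addresses in.
// With no original callbacks the result is just {0}; the packer then sets
// AddressOfCallBacks to null instead of using such an array.
std::vector<std::uint32_t> make_fake_callback_array(std::uint32_t empty_callback_va,
                                                    std::size_t original_count) {
    std::vector<std::uint32_t> slots(original_count + 1, 0);
    if (original_count != 0)
        slots[0] = empty_callback_va;
    return slots;
}
```

For an original array {0x401000, 0x401050, 0}, the fake array is {empty_callback_va, 0, 0}: the loader sees exactly one callback, yet there is room for both originals and the terminator.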
We move the TLS index and the data used to initialize each thread's memory into our own section (the one we called coderpub, remember?) so that they are not overwritten during unpacking. The unpacker then writes the index provided by the loader directly to the location where the original file expects it. We write the IMAGE_TLS_DIRECTORY structure itself (more precisely, IMAGE_TLS_DIRECTORY32, since we pack 32-bit binaries) to the same section. If the original file has callbacks, we also write our fake callback array there, containing only the empty callback.
Let's begin the development. After this code block:
```cpp
//Check if file became smaller really
if(out_buf.size() >= src_length)
{
    std::cout << "File is incompressible!" << std::endl;
    return -1;
}
```
write the following:
```cpp
//If the file has TLS, we get information about it
//(std::unique_ptr from <memory>; std::auto_ptr also works on pre-C++17 compilers)
std::unique_ptr<pe_base::tls_info> tls;
if(image.has_tls())
{
    std::cout << "Reading TLS..." << std::endl;
    tls.reset(new pe_base::tls_info(image.get_tls_info()));
}
```
Now we add a little code after the place where we rebuild the resources:
```cpp
//If the file has TLS
if(tls.get())
{
    //Pointer to our packer information structure
    //This structure is at the beginning of the newly added section,
    //we added it a bit earlier
    packed_file_info* info = reinterpret_cast<packed_file_info*>(&added_section.get_raw_data()[0]);

    //Save the relative virtual address
    //of the original TLS index
    info->original_tls_index_rva = tls->get_index_rva();

    //If we have TLS callbacks, write the
    //relative virtual address of their array in the original file
    //to the structure
    if(!tls->get_tls_callbacks().empty())
        info->original_rva_of_tls_callbacks = tls->get_callbacks_rva();

    //Now the relative virtual address of the TLS index
    //will change - we make the loader write the index to the tls_index field
    //of the packed_file_info structure
    tls->set_index_rva(pe_base::rva_from_section_offset(added_section, offsetof(packed_file_info, tls_index)));
}
```
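The address arithmetic behind rva_from_section_offset is simple: a byte at offset X inside a section lives at the section's VirtualAddress plus X, and adding the image base turns an RVA into a VA. The sketch below assumes that is what the PE library helper computes; packed_file_info_tls_part is a hypothetical struct holding only this step's four fields, so its offsets differ from those of the real packed_file_info.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in: only the TLS fields added in this step,
// so the offsets below are illustrative, not the real ones.
struct packed_file_info_tls_part {
    std::uint32_t tls_index;
    std::uint32_t original_tls_index_rva;
    std::uint32_t original_rva_of_tls_callbacks;
    std::uint32_t new_rva_of_tls_callbacks;
};

// RVA of a byte at `offset` inside a section that starts at `section_virtual_address`
// (assumed equivalent to pe_base::rva_from_section_offset).
std::uint32_t rva_from_section_offset(std::uint32_t section_virtual_address, std::uint32_t offset) {
    return section_virtual_address + offset;
}

// VA = image base + RVA; the unpacker does the same with original_image_base.
std::uint32_t rva_to_va(std::uint32_t image_base, std::uint32_t rva) {
    return image_base + rva;
}
```

For example, if the added section starts at RVA 0x5000, the field at offset 0xC in it has RVA 0x500C and, with an image base of 0x400000, VA 0x40500C.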
Here we simply save the required information about the original TLS to the structure that describes the original file. Besides that, the loader will write the TLS index into this structure, and the unpacker will copy it to the place where it should be located.
Next we work with the "coderpub" section, which previously contained only the unpacker body. First, we make it writable by changing the line:
```cpp
unpacker_section.readable(true).executable(true);
```
to
```cpp
//Accessible for writing, reading, and execution
unpacker_section.readable(true).executable(true).writeable(true);
```
We also change the line:
```cpp
const pe_base::section& unpacker_added_section = image.add_section(unpacker_section);
```
to
```cpp
pe_base::section& unpacker_added_section = image.add_section(unpacker_section);
```
because we are going to modify this section.
Now let's switch to the unpacker project (unpacker) for a while. I will describe in detail how we are going to process TLS callbacks. We save the original TLS addresses (this is already done). Then, right after unpacking the file, we manually execute all callbacks of the original file, because the loader will not do this, for an obvious reason: as far as it knows, there is only one empty callback. After that we fill the callback array we created with all the original function addresses, and from this moment the loader takes over TLS callback handling; we have nothing more to do. So our current task is to create an empty TLS callback in the unpacker. To avoid creating extra functions, we simply modify the prologue of the unpacker_main function:
```cpp
//Create prologue manually
__asm
{
    jmp next;
    ret 0xC;
next:
    push ebp;
    mov ebp, esp;
    sub esp, 256;
}
```
So the unpacker starts executing from the jmp next instruction, which jumps straight to its main body. The empty callback we need is the ret 0xC instruction, and a pointer to this instruction is what we store in the callback array. This instruction simply returns to the caller, removing 0xC = 12 bytes of arguments from the stack. In case you don't know, the TLS callback prototype looks like this:
```cpp
typedef VOID
(NTAPI *PIMAGE_TLS_CALLBACK) (
    PVOID DllHandle,
    DWORD Reason,
    PVOID Reserved
    );
```
It uses the stdcall calling convention (NTAPI) and takes three four-byte parameters, and under stdcall the callee removes its arguments, so we have to remove 3 * 4 = 12 bytes from the stack. The callback returns no value, so there is no need to touch the eax register in its body.
Now we replace all these lines:
```cpp
//Import directory relative virtual address
DWORD original_import_directory_rva;
//Import directory virtual address
DWORD original_import_directory_size;
//Resource directory relative virtual address
DWORD original_resource_directory_rva;
//Resource directory virtual address
DWORD original_resource_directory_size;
//Original entry point
DWORD original_entry_point;
//Total size of all file sections
DWORD total_virtual_size_of_sections;
//Number of original file sections
BYTE number_of_sections;

//Copy these values from packed_file_info structure,
//which was written by packer
original_import_directory_rva = info->original_import_directory_rva;
original_import_directory_size = info->original_import_directory_size;
original_resource_directory_rva = info->original_resource_directory_rva;
original_resource_directory_size = info->original_resource_directory_size;
total_virtual_size_of_sections = info->total_virtual_size_of_sections;
number_of_sections = info->number_of_sections;
```
with a single memcpy, because the amount of redundant code is getting too large:
```cpp
//Copy all packed_file_info structure fields, because
//we will need them further, but we will overwrite the structure at "info" pointer soon
packed_file_info info_copy;
memcpy(&info_copy, info, sizeof(info_copy));
```
Now replace every use of the variables listed above accordingly, for example original_import_directory_rva with info_copy.original_import_directory_rva.
Let's update the parameters.h file: the offsets the packer needs have changed, and one more has been added:
```cpp
#pragma once

static const unsigned int original_image_base_offset = 0x11;
static const unsigned int rva_of_first_section_offset = 0x1B;
static const unsigned int empty_tls_callback_offset = 0x2;
```
The last one, empty_tls_callback_offset, is the offset of the ret 0xC instruction in the unpacker code, i.e. the offset of our empty TLS callback.
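Where does 0x2 come from? A short jmp is 2 bytes (EB rel8) and ret imm16 is 3 bytes (C2 imm16), so ret 0xC starts at offset 2 and the next: label at offset 5. The sketch below checks this on a hand-written byte listing; it assumes that unpacker_main is declared naked, so the compiler emits nothing before the __asm block, and that the assembler picks these common encodings. The byte array and helper names are hypothetical; compare against a disassembly of your own build.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Mirrors the value from parameters.h.
const std::uint32_t empty_tls_callback_offset = 0x2;

// Assumed encoding of the hand-written prologue.
const std::uint8_t prologue[] = {
    0xEB, 0x03,                         // +0: jmp short next (skip the 3-byte ret)
    0xC2, 0x0C, 0x00,                   // +2: ret 0x0C  <- the empty TLS callback
    0x55,                               // +5: next: push ebp
    0x8B, 0xEC,                         // +6: mov ebp, esp
    0x81, 0xEC, 0x00, 0x01, 0x00, 0x00  // +8: sub esp, 256 (imm32, 256 does not fit imm8)
};

// Target of a short jmp at code[0]: end of the 2-byte instruction plus signed rel8.
std::ptrdiff_t jmp_short_target(const std::uint8_t* code) {
    return 2 + static_cast<std::int8_t>(code[1]);
}

// Bytes popped by `ret imm16` (opcode C2): little-endian 16-bit immediate.
std::uint16_t ret_pop_bytes(const std::uint8_t* ret) {
    return static_cast<std::uint16_t>(ret[1] | (ret[2] << 8));
}
```

The callback pops 0x0C bytes, exactly the 3 * 4 bytes of the three stdcall parameters, and the jmp lands on push ebp, so normal execution never touches the ret.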
Let's move on. Unlike the import and resource directories, we will not restore the TLS directory: there is no point, because the loader will not read it again anyway. Now to the TLS processing itself. We place this code in the unpacker after the part where we fix imports. First, we copy the index provided by the loader to the memory address where the original file expects it:
```cpp
//Copy TLS index
if(info_copy.original_tls_index_rva)
    *reinterpret_cast<DWORD*>(info_copy.original_tls_index_rva + original_image_base) = info_copy.tls_index;
```
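The same step can be simulated on an ordinary byte buffer standing in for the loaded image, so an RVA is just an offset into the buffer (in the real unpacker the base is original_image_base). The function names are hypothetical; memcpy is used so the sketch does not rely on unaligned DWORD stores on the machine running it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Throws if a DWORD at `rva` would not fit inside the simulated image.
void check_dword_range(const std::vector<std::uint8_t>& image, std::uint32_t rva) {
    if (static_cast<std::size_t>(rva) + sizeof(std::uint32_t) > image.size())
        throw std::out_of_range("RVA outside the simulated image");
}

// Writes the loader-provided index where the original file expects it.
// An RVA of 0 means the original file had no TLS index to fill.
void write_tls_index(std::vector<std::uint8_t>& image, std::uint32_t original_tls_index_rva,
                     std::uint32_t tls_index) {
    if (original_tls_index_rva == 0)
        return;
    check_dword_range(image, original_tls_index_rva);
    std::memcpy(&image[original_tls_index_rva], &tls_index, sizeof(tls_index));
}

// Reads a DWORD back from the simulated image.
std::uint32_t read_dword(const std::vector<std::uint8_t>& image, std::uint32_t rva) {
    check_dword_range(image, rva);
    std::uint32_t value;
    std::memcpy(&value, &image[rva], sizeof(value));
    return value;
}
```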
The following part is more complicated. Let's process TLS callbacks:
```cpp
if(info_copy.original_rva_of_tls_callbacks)
{
    //If TLS has callbacks
    PIMAGE_TLS_CALLBACK* tls_callback_address;
    //Pointer to first callback of an original array
    tls_callback_address = reinterpret_cast<PIMAGE_TLS_CALLBACK*>(info_copy.original_rva_of_tls_callbacks + original_image_base);
    //Offset relative to the beginning of original TLS callbacks array
    DWORD offset = 0;

    while(true)
    {
        //If callback is null - this is the end of array
        if(!*tls_callback_address)
            break;

        //Copy the address of original one
        //to our callbacks array
        *reinterpret_cast<PIMAGE_TLS_CALLBACK*>(info_copy.new_rva_of_tls_callbacks + original_image_base + offset) = *tls_callback_address;

        //Move to next callback
        ++tls_callback_address;
        offset += sizeof(DWORD);
    }

    //Return to the beginning of the new array
    tls_callback_address = reinterpret_cast<PIMAGE_TLS_CALLBACK*>(info_copy.new_rva_of_tls_callbacks + original_image_base);
    while(true)
    {
        //If callback is null - this is the end of array
        if(!*tls_callback_address)
            break;

        //Execute callback
        (*tls_callback_address)(reinterpret_cast<PVOID>(original_image_base), DLL_PROCESS_ATTACH, 0);

        //Move to next callback
        ++tls_callback_address;
    }
}
```
First we go through all callback addresses in the original array and copy them to our TLS callback array, so that the loader can read them the next time it needs them. However, when the process was created the loader called only our empty callback, while the original program expects its callbacks to have been called with DLL_PROCESS_ATTACH. That's why we need the second loop, in which we call every original callback, passing the image base address as the first parameter and DLL_PROCESS_ATTACH (=1) as the second. The third parameter is not used here (see the prototype above), so we pass 0. Of course, we could copy the addresses and call the callbacks in a single loop, but what if the binary modifies itself inside a callback, or expects the array to be completely filled before the first callback runs? Two loops are not a universal solution either, but they are more reliable.
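Here is a portable sketch of the same two-loop scheme with real function pointers. TlsCallback is a hypothetical stand-in for PIMAGE_TLS_CALLBACK (the real type is NTAPI, i.e. __stdcall, which only matters on 32-bit Windows), and restore_and_run_callbacks is a hypothetical name. The sketch assumes, as in the packer, that the new array has room for every original callback plus a null terminator.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Portable stand-in for PIMAGE_TLS_CALLBACK.
using TlsCallback = void (*)(void* dll_handle, std::uint32_t reason, void* reserved);

const std::uint32_t kDllProcessAttach = 1; // value of DLL_PROCESS_ATTACH

// Loop 1: copy the original null-terminated array over the fake one
//         (this overwrites the empty callback in slot 0).
// Loop 2: walk the now-complete array and call each callback with DLL_PROCESS_ATTACH.
void restore_and_run_callbacks(const TlsCallback* original, TlsCallback* new_array,
                               void* image_base) {
    for (std::size_t i = 0; original[i] != nullptr; ++i)
        new_array[i] = original[i];
    // The terminating null was reserved (and left zero) by the packer.
    for (TlsCallback* cb = new_array; *cb != nullptr; ++cb)
        (*cb)(image_base, kDllProcessAttach, nullptr);
}
```

Because every address is in place before the first call, a callback that inspects the array sees the final layout, which is the point of splitting the work into two loops.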
That's all for the unpacker; now back to the packer. We have to place the TLS directory in the "coderpub" section and also copy there the file data used to initialize the thread-local data of new threads:
```cpp
//If file has TLS
if(tls.get())
{
    std::cout << "Rebuilding TLS..." << std::endl;

    //Reference to unpacker raw section data
    //Only unpacker body is located there yet
    std::string& data = unpacker_added_section.get_raw_data();

    //Change unpacker section data size
    //to a number of bytes in unpacker body
    //(in case if null bytes from the end were stripped
    //by PE library)
    data.resize(sizeof(unpacker_data));

    //Calculate the position to write IMAGE_TLS_DIRECTORY32 structure
    DWORD directory_pos = data.size();
    //Allocate space for this structure
    //sizeof(DWORD) is required for alignment, because
    //IMAGE_TLS_DIRECTORY32 must be aligned to 4-byte boundary
    data.resize(data.size() + sizeof(IMAGE_TLS_DIRECTORY32) + sizeof(DWORD));

    //If TLS has callbacks...
    if(!tls->get_tls_callbacks().empty())
    {
        //It is necessary to reserve memory
        //for original TLS callback addresses
        //Plus 1 cell for null DWORD
        DWORD first_callback_offset = data.size();
        data.resize(data.size() + sizeof(DWORD) * (tls->get_tls_callbacks().size() + 1));

        //First callback will be our empty one (ret 0xC),
        //Write its address
        *reinterpret_cast<DWORD*>(&data[first_callback_offset]) =
            image.rva_to_va_32(pe_base::rva_from_section_offset(unpacker_added_section, empty_tls_callback_offset));

        //Write relative virtual address
        //of new TLS callbacks table
        tls->set_callbacks_rva(pe_base::rva_from_section_offset(unpacker_added_section, first_callback_offset));

        //Now write new callbacks table relative address
        //to packed_file_info structure, which is placed at
        //the beginning of first section
        reinterpret_cast<packed_file_info*>(&image.get_image_sections().at(0).get_raw_data()[0])->new_rva_of_tls_callbacks = tls->get_callbacks_rva();
    }
    else
    {
        //If no callbacks exist, let's set address to null just in case
        tls->set_callbacks_rva(0);
    }

    //Clean up callback array, we don't need them anymore
    //We created them manually
    tls->clear_tls_callbacks();

    //Set new relative address
    //of data used to initialize thread local memory
    tls->set_raw_data_start_rva(pe_base::rva_from_section_offset(unpacker_added_section, data.size()));
    //Recalculate the address of these data end
    tls->recalc_raw_data_end_rva();

    //Rebuild TLS
    //Notify the rebuilder, that it is not needed to write data and callbacks
    //We do this manually (callbacks are already written to the right places)
    //Also we indicate that we don't need to strip null bytes at the end of the section
    image.rebuild_tls(*tls, unpacker_added_section, directory_pos, false, false, pe_base::tls_data_expand_raw, true, false);

    //Add data used to initialize local thread memory to the section
    unpacker_added_section.get_raw_data() += tls->get_raw_data();
    //Now set "coderpub" section virtual size
    //taking into account SizeOfZeroFill of TLS field
    image.set_section_virtual_size(unpacker_added_section, data.size() + tls->get_size_of_zero_fill());
    //At last, strip unnecessary null bytes at the end of the section
    pe_base::strip_nullbytes(unpacker_added_section.get_raw_data());
    //and recalculate its sizes (raw and virtual)
    image.prepare_section(unpacker_added_section);
}
```
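The offsets this listing produces inside "coderpub" can be checked with plain arithmetic. The sketch below is a hypothetical helper that follows the listing step by step: 24 bytes for IMAGE_TLS_DIRECTORY32 plus one spare DWORD, then (only if there are callbacks) 4 bytes per callback plus the null terminator, then the TLS template, then SizeOfZeroFill counted in the virtual size only. It ignores the later section alignment done by prepare_section.

```cpp
#include <cassert>
#include <cstdint>

// Offsets inside "coderpub" as laid out by the listing (hypothetical helper).
struct CoderpubLayout {
    std::uint32_t directory_pos;         // space reserved for IMAGE_TLS_DIRECTORY32
    std::uint32_t first_callback_offset; // 0 when the file has no callbacks
    std::uint32_t raw_data_start;        // start of the copied TLS template
    std::uint32_t virtual_size;          // template end plus SizeOfZeroFill (before alignment)
};

CoderpubLayout compute_layout(std::uint32_t unpacker_size, std::uint32_t callback_count,
                              std::uint32_t tls_raw_data_size, std::uint32_t size_of_zero_fill) {
    const std::uint32_t tls_directory_size = 24; // sizeof(IMAGE_TLS_DIRECTORY32)
    const std::uint32_t dword = 4;
    CoderpubLayout layout{};
    layout.directory_pos = unpacker_size;
    // Directory plus one spare DWORD reserved for alignment.
    std::uint32_t pos = unpacker_size + tls_directory_size + dword;
    if (callback_count != 0) {
        layout.first_callback_offset = pos;
        pos += dword * (callback_count + 1); // callbacks plus null terminator
    }
    layout.raw_data_start = pos;
    layout.virtual_size = pos + tls_raw_data_size + size_of_zero_fill;
    return layout;
}
```

With a 0x100-byte unpacker, two callbacks, a 0x10-byte template and SizeOfZeroFill = 0x20, the directory space starts at 0x100, the callback array at 0x11C, the template at 0x128, and the virtual size becomes 0x158, while the raw data ends at 0x138.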
Let me walk through this large piece of code. First, we reserve space for the IMAGE_TLS_DIRECTORY32 structure in the last section, the one with the unpacker ("coderpub"), right after the unpacker code, and then allocate memory for the TLS callback array according to the original number of callbacks (4 bytes each, plus the terminating null element). The new callback array contains a pointer to code that does nothing except adjust the stack (ret 0xC); this tells the loader that the file has callbacks. Next, we recalculate the pointers to the data the loader uses to initialize thread-local data. We place this data after the IMAGE_TLS_DIRECTORY32 structure and the TLS callback array. Then we rebuild TLS using the PE library (basically, it just writes the IMAGE_TLS_DIRECTORY32 structure to the right place and fills it in; we turned off automatic writing of callbacks and data). Finally, we recalculate the virtual size of the section, taking into account the SizeOfZeroFill field of the original file's TLS (we keep this value unchanged). I can't say exactly how this field is processed (unfortunately, I could not find a good explanation online): whether the loader zeroes the memory after EndAddressOfRawData right inside the section, or only in each thread's freshly initialized local memory. The PE/COFF specification describes SizeOfZeroFill as the size of the zero-initialized part of the template that follows the initialized data, but to be on the safe side we still reserve that memory inside the section. This does not affect the packed file size, because we increase the virtual size of the section, not its raw size. Last, we strip unnecessary null bytes from the end of the section (it is the last section, so we can do that, as I mentioned before) and recalculate its virtual and raw sizes (in fact only the raw size can change, since we set the virtual size manually and it is greater than or equal to the raw size).
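To illustrate what SizeOfZeroFill means, here is a simplified model of a thread's TLS block as the PE/COFF specification describes it: a copy of the template bytes between StartAddressOfRawData and EndAddressOfRawData, followed by SizeOfZeroFill zero bytes. This is an assumption-level model of the layout, not how ntdll actually allocates TLS, and make_thread_tls_block is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified per-thread TLS block: the initialized template followed by
// size_of_zero_fill zero bytes.
std::vector<std::uint8_t> make_thread_tls_block(const std::vector<std::uint8_t>& tmpl,
                                                std::uint32_t size_of_zero_fill) {
    std::vector<std::uint8_t> block(tmpl);               // copy the initialized part
    block.resize(tmpl.size() + size_of_zero_fill, 0);    // append the zero fill
    return block;
}
```

In this model the zero fill never has to exist in the file or in the section, which is why reserving it in the section's virtual size (and not its raw size) costs nothing in the packed file.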
Now remove the line we added in an earlier step:
```cpp
image.remove_directory(IMAGE_DIRECTORY_ENTRY_TLS);
```
All that's left is to test that the packer works. TLS without callbacks can be tested on any program compiled with Borland tools. You can also build a program that uses __declspec(thread) with any version of Microsoft Visual Studio. Creating TLS with callbacks is harder: I used an example from here and compiled it with Visual C++ 6.0, although I could also have built TLS with callbacks manually with MASM32. After some testing I made sure that everything works as intended!
By the way, I noticed one peculiarity shared by all the packers I tried: none of them changes the TLS index address. I can't say yet why, but there is probably a reason for this behavior. Comments in the UPX source code say that the TLS callback array, like the IMAGE_TLS_DIRECTORY32 structure, should be aligned to a DWORD boundary; however, I decided not to do this, because even on XP a PE file with an unaligned array worked properly.
One more note on the code from previous steps. It turned out that Windows XP misbehaves when the data directories in the PE header (Data Directory) are truncated: its explorer.exe stops displaying file icons. So we have to comment out the line:
```cpp
//image.strip_data_directories(16 - 4); //Commented because of incompatibility with Windows XP
```
to preserve compatibility.
Full solution for this step: Own PE Packer, step 6