Decompiling P-Code in Your Mind's Eye. The Subtleties of Researching Commands of the VB Virtual Machine.

Introduction


Many programmers whose work involves studying the internals of binary files sooner or later encounter programs compiled to p-code. Even if you are a reverse engineer who has never seen Visual Basic™ and its compiler, you have surely run into pseudo code at least once in your professional career. Unlike standard machine code, which is executed directly by the CPU, p-code is a set of virtual machine instructions executed by the msvbvmXX.dll engine. In this case, I guess, a debugger won't be of much help (though you might disagree with me), and a disassembler is even less useful. What you need is either a decompiler or an ability to analyze information. Hopefully, you do have the latter! Assuming that, I'll provide some code listings from the decompiler for illustrative purposes only, and focus on decompiling by hand, for which I'll be using a hex editor as my sole tool.

A Few Words about Structures


As for the pseudo code, we'll have to locate it within the .exe file, which is not as easy as it seems. To do that, we must understand at least some of the Visual Basic™ structures incorporated into the .exe file. Let's start with taking a look at the original entry point into the program. To go to that point from the hex editor (I'm using Hiew), I only need to open the .exe file in the editor and press the following keys: Enter, Enter, F8, and F5. (If you are an advanced Hiew user, you already know how to reduce this operation to one command line.) What we'll see is the following:

push 0004042E8 ;'VB5!'
call ThunRTMain ;MSVBVM60

The whole purpose of this ultra-complex line is to call the ThunRTMain function of MSVBVM60. A pointer to VB Header is passed as a parameter of that function. Here's how it may look:

VBHeader Structure


Field                    Type         Description
Signature                String * 4   "VB5!" signature
RuntimeBuild             Integer      Runtime flag
LanguageDLL              String * 14  Language DLL
BackupLanguageDLL        String * 14  Backup language DLL
RuntimeDLLVersion        Integer      Version of the runtime DLL
LanguageID               Long         Application language
BackupLanguageID         Long         Used with backup language DLL
aSubMain                 Long         Procedure to start after the application is launched
aProjectInfo             Long         Pointer to ProjectInfo
fMDLIntObjs              Long
fMDLIntObjs2             Long
ThreadFlags              Long         Thread flags
ThreadCount              Long         Number of threads (the meaning of this field is unclear, as VB doesn't let you make multithreaded applications)
FormCount                Integer      Number of forms in this application
ExternalComponentCount   Integer      Number of external OCX components
ThunkCount               Long
aGUITable                Long         Pointer to GUITable
aExternalComponentTable  Long         Pointer to ExternalComponentTable
aComRegisterData         Long         Pointer to ComRegisterData
oProjectExename          Long         Pointer to the string containing EXE filename
oProjectTitle            Long         Pointer to the string containing project's title
oHelpFile                Long         Pointer to the string containing name of the Help file
oProjectName             Long         Pointer to the string containing project's name
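
If you'd rather not count offsets by hand, the fields listed above can be unpacked in a few lines. This is only a sketch under my own assumptions: String * N is taken as N raw bytes, Integer as a 16-bit word, Long as a 32-bit dword, everything little-endian and without padding. The names VB_HEADER_FORMAT and parse_vb_header are made up for this illustration.

```python
import struct
from collections import namedtuple

# Assumed mapping of the table: String * N -> "Ns", Integer -> "H", Long -> "I".
VB_HEADER_FORMAT = "<4sH14s14sH8I2H8I"
VBHeader = namedtuple(
    "VBHeader",
    "Signature RuntimeBuild LanguageDLL BackupLanguageDLL RuntimeDLLVersion "
    "LanguageID BackupLanguageID aSubMain aProjectInfo fMDLIntObjs fMDLIntObjs2 "
    "ThreadFlags ThreadCount FormCount ExternalComponentCount ThunkCount "
    "aGUITable aExternalComponentTable aComRegisterData oProjectExename "
    "oProjectTitle oHelpFile oProjectName",
)

def parse_vb_header(data, offset=0):
    """Unpack a VBHeader from raw file bytes at the given file offset."""
    header = VBHeader._make(struct.unpack_from(VB_HEADER_FORMAT, data, offset))
    if header.Signature != b"VB5!":
        raise ValueError("no VB5! signature at this offset")
    return header
```

With this layout the header occupies struct.calcsize(VB_HEADER_FORMAT) = 104 bytes. Keep in mind that aSubMain and aProjectInfo come out as virtual addresses, so they still have to be translated into file offsets before you can follow them in a hex editor.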

ThunRTMain takes from this structure all the data and the addresses of the other structures required to set up and launch the .exe file. If the aSubMain field is not equal to 0, the ProcCallEngine function will launch the function located at the address specified in that field; that is the main function in the project, and the field is nonzero only if there is a module with a "Main" function in the .exe file. If there is no such function, then the first form will load, and control will be passed either to the Form_Initialize function (if it exists) or to the Form_Load function (if the former doesn't exist). All we need from this structure is aSubMain and the pointer to the ProjectInfo structure. The latter looks like this:

ProjectInfo Structure


Field                 Type  Description
lTemplateVersion      Long  VB compatible version
aObjectTable          Long  Pointer to ObjectTable
lNull1                Long
aStartOfCode          Long  Pointer to the start of some Assembly listing
aEndOfCode            Long  Pointer to the end of some Assembly listing
lDataBufferSize       Long  Size of Data Buffer
aThreadSpace          Long
aVBAExceptionhandler  Long  Pointer to VBA Exception Handler
aNativeCode           Long  Pointer to the start of RAW Native Code
uIncludeID(527)       Byte  VB Include Internal Project Identifier
aExternalTable        Long  Pointer to API Imports Table
lExternalCount        Long  Number of API's imported

Now I'm at the crossroads: it only makes sense to continue if aNativeCode is equal to 0. That would mean that all functions in the program are written in pseudo code and that it's worthwhile to look into them. I'm going to get all useful data from aObjectTable, which points to the structure of objects (forms, modules, etc.). Among other data, I can find there addresses of some procedures or events.
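
The crossroads check can be sketched like this. I'm assuming here that uIncludeID(527) is a VB-style array with indexes 0 to 527, that is 528 bytes, and that the Longs are little-endian dwords; parse_project_info and is_pcode are names I made up for the illustration.

```python
import struct

# Nine Longs, the 528-byte uIncludeID array (assumption: indexes 0..527), two Longs.
PROJECT_INFO_FORMAT = "<9I528s2I"
PROJECT_INFO_FIELDS = (
    "lTemplateVersion", "aObjectTable", "lNull1", "aStartOfCode", "aEndOfCode",
    "lDataBufferSize", "aThreadSpace", "aVBAExceptionhandler", "aNativeCode",
    "uIncludeID", "aExternalTable", "lExternalCount",
)

def parse_project_info(data, offset=0):
    """Unpack ProjectInfo from raw bytes into a dict keyed by field name."""
    values = struct.unpack_from(PROJECT_INFO_FORMAT, data, offset)
    return dict(zip(PROJECT_INFO_FIELDS, values))

def is_pcode(project_info):
    """True when aNativeCode is 0, i.e. the program is compiled to p-code."""
    return project_info["aNativeCode"] == 0
```

If is_pcode returns False, the functions are native x86 code and an ordinary disassembler is the right tool for them.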

ObjectTable Structure


lNull1 Long
aExecProj Long
aProjectInfo2 Long
lConst1 Long
lNull2 Long
aProjectObject Long
uuidObjectTable Long
Flag2 Long
Flag3 Long
Flag4 Long
fCompileType Integer
iObjectsCount Integer Count of objects
iCompiledObjects Integer
iObjectsInUse Integer
aObjectsArray Long Pointer to objects array
lNull3 Long
lNull4 Long
lNull5 Long
aNTSProjectName Long
lLcID1 Long
lLcID2 Long
lNull6 Long
lTemplateVersion Long

Now, after having considered a table of objects, let's take a closer look at an array of these objects. ObjectsArray is an array of TObject structures, and the number of elements in the array is defined by the iObjectsCount field. Here's the TObject structure and its derivative, TObjectInfo:

TObject Structure


Field             Type  Description
aObjectInfo       Long  Pointer to ObjectInfo
lConst1           Long
aPublicBytes      Long  Pointer to Public Variable Size integers
aStaticBytes      Long  Pointer to Static Variables Struct
aModulePublic     Long  Memory Pointer to Public Variables
aModuleStatic     Long  Pointer to Static Variables
aNTSObjectName    Long  Pointer to Object Name
lMethodCount      Long  Number of methods
aMethodNameTable  Long  Pointer to method names array
oStaticVars       Long  Offset to Static Vars from aModuleStatic
lObjectType       Long  Flags defining this object behaviour
lNull2            Long

As you can see, this structure looks more interesting! The aNTSObjectName field indicates the type of module.
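
Walking the objects array could look roughly like the sketch below. The offsets are simply counted from the tables above (iObjectsCount at offset 42 and aObjectsArray at offset 48 in ObjectTable, twelve Longs per TObject). The read(va, size) callable is hypothetical: it stands for whatever you use to fetch bytes by virtual address, for example a wrapper that maps addresses to file offsets through the PE section table.

```python
import struct

TOBJECT_SIZE = 48  # twelve Long fields, per the TObject table

def iter_objects(read, object_table_va):
    """Yield selected TObject fields for every entry of ObjectsArray.

    read(va, size) is a hypothetical helper returning raw bytes at a virtual address.
    """
    table = read(object_table_va, 52)
    count, = struct.unpack_from("<H", table, 42)     # iObjectsCount
    array_va, = struct.unpack_from("<I", table, 48)  # aObjectsArray
    for index in range(count):
        fields = struct.unpack("<12I", read(array_va + index * TOBJECT_SIZE, TOBJECT_SIZE))
        yield {
            "aObjectInfo": fields[0],
            "aNTSObjectName": fields[6],
            "lMethodCount": fields[7],
            "lObjectType": fields[10],
        }
```

Each yielded aObjectInfo is the address of the TObjectInfo structure described next.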

TObjectInfo Structure


Field              Type     Description
iConst1            Integer
iObjectIndex       Integer  Index of the Object in the Project
aObjectTable       Long     Pointer to Object Table
lNull1             Long
aObjectDescriptor  Long     Set to -1 if module after compile
lConst2            Long
lNull2             Long
aObjectHeader      Long     Pointer to Object Header
aObjectData        Long     Pointer to in-memory Object Data
(the next members are only 100% valid on P-Code executables)
iMethodCount       Integer  Number of Methods
iNull3             Integer
aMethodTable       Long     Pointer to Method Table
iConstantsCount    Integer  Number of Constants
iMaxConstants      Integer  Maximum Constants to allocate
lNull4             Long
lFlag1             Long
aConstantTable     Long     Pointer to Constants Table
If I disregard all object events on the form and only consider user-created functions, I can find their addresses in the table to which aMethodTable points. Of course, each module or form has its own table. Therefore, if I need to get access to all functions, I need to go over all forms and modules (and also class modules, user controls, etc.). Let's take a look at what is located at the addresses that I'm taking from the methods table. Surprisingly, what I've found there is not pseudo code or assembler commands, but yet another structure! The developers of Visual Basic™ definitely are in love with structures, eh?
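
Collecting those addresses for one object might be sketched as follows. The offsets 20h (iMethodCount) and 24h (aMethodTable) are counted from the TObjectInfo table; that the methods table is a plain array of 32-bit addresses is my assumption, and read(va, size) is again a hypothetical helper that returns bytes by virtual address.

```python
import struct

def method_addresses(read, object_info_va):
    """Return the addresses stored in the methods table of one TObjectInfo.

    Assumes the table is an array of iMethodCount little-endian Longs.
    """
    info = read(object_info_va, 0x28)
    count, = struct.unpack_from("<H", info, 0x20)     # iMethodCount
    table_va, = struct.unpack_from("<I", info, 0x24)  # aMethodTable
    if count == 0:
        return []
    return list(struct.unpack("<%dI" % count, read(table_va, 4 * count)))
```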

ProcDscInfo Structure


ProcTable Long
field_4 Integer
FrameSize Integer
ProcSize Integer

I have reduced this structure deliberately, as we only need ProcSize and the address to ProcTable. As for ProcTable, we only need the address to the data block. Let's consider this block in detail, if you don't mind. All op-codes of p-code that handle some data (such as strings, API functions, and so on) refer to it not by absolute but by relative address, which is counted from the beginning of the data block. Unless we have that address, we cannot decompile anything. So let's consider the following structure located at the ProcTable address:

ProcTable Structure


SomeTemp String*52
DataConst Long

SomeTemp is just a bunch of fields that I don't need, so I've combined them to save space on the page. All we need is the DataConst address. The question is, what have I got so far? Only the size of the p-code block and the address of the data block. But where is the procedure itself? Well, though its address is not specified anywhere, I can easily calculate it by subtracting ProcDscInfo.ProcSize from the address of the beginning of the ProcDscInfo structure. As you can see, the p-code directly precedes the ProcDscInfo structure. Now that we know how to get the methods, we can start studying the p-code.
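
The same calculation as a short sketch. It assumes the reduced ProcDscInfo layout shown above (Long, Integer, Integer, Integer) and DataConst at offset 52 of ProcTable; locate_pcode and the read(va, size) helper are hypothetical names used only for this illustration.

```python
import struct

def locate_pcode(read, proc_dsc_va):
    """Return (p-code start address, p-code size, DataConst) for one procedure.

    read(va, size) is a hypothetical helper returning raw bytes at a virtual address.
    """
    proc_table_va, _field_4, _frame_size, proc_size = struct.unpack(
        "<IHHH", read(proc_dsc_va, 10))
    data_const, = struct.unpack("<I", read(proc_table_va + 52, 4))
    # The p-code directly precedes ProcDscInfo, so step back by ProcSize.
    return proc_dsc_va - proc_size, proc_size, data_const
```

For example, with ProcDscInfo at 404080h and ProcSize equal to 39h, the p-code would start at 404047h.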

How to Research


Now I know the starting address of the p-code block. To better understand the p-code's internals, I'll need a table of op-codes. You can find an open source version on the Web, but it is incomplete (keep in mind that it's unofficial) and not free from errors (though there are very few of them). The table uses the following format:

<opcode>tab<size>tab<description>

Each op-code consists of one or two bytes. All op-codes below FBh are one-byte ones, and the values from FBh up start two-byte ones. So, if you see FEh, then the following byte belongs to the same op-code. <size> in the table is the size of the op-code's parameters in bytes. Now let's take a look at a sample p-code block:

00 00 00 00-00 00 00 00-00 00 00 00-F4 00 2B 6E
FF F5 00 00-00 00 F5 00-00 00 00 1B-0C 00 04 70
FF 34 6C 70-FF 1B 0D 00-04 74 FF 34-6C 74 FF F5
00 00 00 00-59 78 FF 0A-0E 00 18 00-3C 32 04 00
74 FF 70 FF-13 00 00 00

The p-code in this block begins with F4. Let's use our table and check out what this op-code means. According to the table, it's LitI2_Byte, which always takes one byte of parameters. Lit always means "push" (you might want to memorize this fact), and I2 means a two-byte integer. Here's a table of all possible type suffixes, which hopefully will make our further research a bit simpler:

Data types

UI1 Byte
Bool Boolean
I2 Integer
I4 Long
R4 Single
R8 Double
Cy Currency
Var Variant
Str String

Because F4 is followed by 0, it's reasonable to assume that it is a "push 0" command. Let's continue, shall we? 2B is PopTmpLdAd2. Read it as follows: Pop from stack to Temp variable and Load Address to stack. So this command only saves the contents of the stack's upper cell into a temporary variable; we don't need that as we are only decompiling the code, not executing it. The next two bytes, 6E FF, are the little-endian word FF6E, which identifies that temporary variable. If I convert that number from negative to positive, I'll get 92h. In this case, VB Decompiler will show var_92. If that number were positive, it would be not a local variable but a function's argument. I guess by now you understand everything! :) Let's proceed. F5 is LitI4. The next four bytes are the value to push, so I've got another push 0. :) Then again, push 0. Now goes 1B, or LitStr. This op-code puts a string onto the stack. I told you about the address of the data block for a reason: it's the point from which the two bytes in the parameters are counted:
string address = ProcTable.DataConst + 2 bytes following the command
That's a Unicode string, which ends in two zeros. Suppose it contains the word "Test" - then we have the command push "Test". 04 is FLdRfVar (push variable). The variable is obtained as above: FF70 = var_90. 34 is CStr2Ansi, which reads two elements from the stack and assigns the second element to the first one. That is, now the stack contains the var_90 variable and the "Test" string, so you can read this command as var_90 = "Test". This op-code doesn't have any parameters, so next goes another op-code: 6C, or ILdRf (push variable). The following two bytes, FF70, are decompiled as var_90 and used for making the command push var_90. 1B puts another string onto the stack, let it be "Test2" (sorry, I'm not very creative). 04 creates a variable again and puts it onto the stack (FF74 is var_8C), and 34 assigns "Test2" to that variable. 6C puts this variable onto the stack as described above. Then goes F5, which I've encountered before, and which puts the following four bytes onto the stack. After that, we see something more interesting: 59, or PopTmpLdAdStr. It saves the contents of the stack's upper cell in the var_88 variable (FF78) and also puts the same onto the top of the stack so that this data stays available. 0A is ImpAdCallFPR4. I already can smell something delicious! What we are looking at is a plain Call to a function, whose address is, as usual, counted from the data block. An important note: the two bytes that follow the op-code (0E 00 here) are first multiplied by four, and only then added to the address of the data block, so in our case the offset is 0Eh * 4 = 38h. It's a law, you may say. I don't know why the VM developers implemented it like that, but rules must be followed. So, what do we have in the data block? There is a virtual address of the following stub:

A1A0634000 mov eax,[004063A0]
0BC0 or eax,eax
7402 je .000402A77
FFE0 jmp eax
68542A4000 push 000402A54
B870104000 mov eax,000401070
FFD0 call eax

In this code, we need the push 402A54 line. The 402A54 virtual address points to a structure that looks like this:

CallAPI Structure


strLibraryName Long
strFunctionName Long

One of these two virtual addresses points to the name of the DLL; the other, to the name of the function that must be called.
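
The whole chain from the op-code's parameter to the two names can be sketched as below. This is only an illustration under several assumptions of mine: every stub has the same layout as the one listed above (so the push op-code 68h sits at offset 11), the two names are zero-terminated ANSI strings, and they come in the order of the CallAPI fields. The helpers read, read_cstring and resolve_import are hypothetical names.

```python
import struct

def read_cstring(read, va):
    """Read a zero-terminated ANSI string byte by byte (assumed string format)."""
    out = bytearray()
    while True:
        byte = read(va + len(out), 1)
        if byte in (b"", b"\x00"):
            break
        out += byte
    return out.decode("ascii", "replace")

def resolve_import(read, data_const, index):
    """Follow data block entry -> stub -> CallAPI and return (library, function)."""
    # The two-byte parameter is multiplied by four before it is added to DataConst.
    stub_va, = struct.unpack("<I", read(data_const + index * 4, 4))
    stub = read(stub_va, 23)
    if stub[11] != 0x68:  # assumed position of "push <CallAPI address>"
        raise ValueError("unexpected stub layout")
    call_api_va, = struct.unpack_from("<I", stub, 12)
    library_va, function_va = struct.unpack("<II", read(call_api_va, 8))
    return read_cstring(read, library_va), read_cstring(read, function_va)
```

For the sample block the call would be resolve_import(read, data_const, 0x0E), which reads the stub address from DataConst + 38h.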

"Crossing" to API


Now we know why VB is so slow: calling any API function involves passing through a whole lot of structures and stubs, which surely consumes CPU cycles and reduces overall performance! As you can see, there is no such thing as an import table. We only deal with the names of the DLL and the function, which is called dynamically by the DllFunctionCall function, exported from our beloved msvbvm60.dll file. Therefore, the code from the above block:

B870104000 mov eax,000401070
FFD0 call eax

gets the address to the "crossing":

jmp DllFunctionCall

and calls it via call eax. I guess it's clear enough what the DllFunctionCall function does.

In our case, LibraryName is shell32.dll, and FunctionName is ShellExecute. Now it is quite clear why this p-code is so much about push commands and saving pieces of data: it's only preparation for calling ShellExecute. If we abridge what we have obtained, we'll get something like this:

loc_4043F1: var_90 = "Test"
loc_4043FB: var_8C = "Test2"
loc_404407: ShellExecute(0, var_8C, var_90, 0, 0, 0)

Let's continue our little research, if you don't mind. 3C is SetLastSystemError. Everything's clear: It's only a "crossing" to API. We won't need it, because it is used by the virtual machine for tracing errors and cannot provide any important data for analyzing our p-code. 32 is FFreeStr, a very interesting command! The size of its parameters is variable and defined by their first two bytes. In this case, the first two bytes are 4, so the following four bytes are two two-byte variables: FF74 and FF70, as you can see. The FFreeStr command frees the string variables var_8C and var_90. In a normal language, it looks like this:

var_8C = ""
var_90 = ""
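
Turning those parameter bytes into names is mechanical, so here is a small sketch of it. The var_ naming follows the listings in this article; the arg_ prefix for positive offsets and both function names are my own labels, not something taken from VB Decompiler.

```python
import struct

def variable_name(raw):
    """Name a two-byte stack offset: negative values are locals, positive ones arguments."""
    offset, = struct.unpack("<h", raw)  # signed 16-bit, little-endian
    if offset < 0:
        return "var_%X" % -offset
    return "arg_%X" % offset            # hypothetical label for arguments

def ffreestr_variables(params):
    """Decode FFreeStr parameters: a two-byte size followed by that many bytes of offsets."""
    size, = struct.unpack_from("<H", params, 0)
    return [variable_name(params[2 + i:4 + i]) for i in range(0, size, 2)]

print(ffreestr_variables(bytes.fromhex("04 00 74 FF 70 FF")))  # ['var_8C', 'var_90']
```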

The next (and the last) op-code is 13, ExitProcHresult. As you can guess by its name, it completes the procedure. So it must be followed by the ProcDscInfo structure, which is already familiar to us.
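
To sum up the walk, here is a toy linear disassembler that knows only the op-codes met in this block. It is a sketch, not a piece of VB Decompiler: the parameter sizes are the ones used in the walk above, except the 4 bytes for ImpAdCallFPR4, which I inferred from the dump (0E 00 18 00 precedes 3C). Two-byte op-codes are not handled, and an unknown byte simply raises KeyError.

```python
import struct

# name and parameter size in bytes; None marks FFreeStr, whose size is
# stored in the first two bytes of its parameters.
OPCODES = {
    0x04: ("FLdRfVar", 2),
    0x0A: ("ImpAdCallFPR4", 4),   # size inferred from the sample dump (assumption)
    0x13: ("ExitProcHresult", 0),
    0x1B: ("LitStr", 2),
    0x2B: ("PopTmpLdAd2", 2),
    0x32: ("FFreeStr", None),
    0x34: ("CStr2Ansi", 0),
    0x3C: ("SetLastSystemError", 0),
    0x59: ("PopTmpLdAdStr", 2),
    0x6C: ("ILdRf", 2),
    0xF4: ("LitI2_Byte", 1),
    0xF5: ("LitI4", 4),
}

def disassemble(code):
    """Return a list of (mnemonic, parameter bytes as hex) up to ExitProcHresult."""
    pos = 0
    result = []
    while pos < len(code):
        name, size = OPCODES[code[pos]]
        pos += 1
        if size is None:
            size = 2 + struct.unpack_from("<H", code, pos)[0]
        result.append((name, code[pos:pos + size].hex()))
        pos += size
        if name == "ExitProcHresult":
            break
    return result

# The sample block from this article, starting at the F4 byte.
SAMPLE = bytes.fromhex(
    "F4 00 2B 6E FF F5 00 00 00 00 F5 00 00 00 00 1B 0C 00 04 70 FF 34 "
    "6C 70 FF 1B 0D 00 04 74 FF 34 6C 74 FF F5 00 00 00 00 59 78 FF "
    "0A 0E 00 18 00 3C 32 04 00 74 FF 70 FF 13"
)

for mnemonic, params in disassemble(SAMPLE):
    print(mnemonic, params)
```

Run on the sample, it lists 18 commands, from LitI2_Byte 00 to ExitProcHresult, in the same order as they were discussed above.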

Well, you have learned to decompile p-code in your mind's eye! Actually, there's no need to do it manually - you can always use VB Decompiler!

(C) Sergey Chubchenko, VB Decompiler's main developer




* Microsoft, Windows, and Visual Basic are registered trademarks of Microsoft Corporation.
The originals of most structures in this article are hosted at the vb-decompiler.com forum and are copyrighted by their authors.




