The goal of this document is to show how clever optimization can tremendously increase code performance. Our goal is to implement a basic attack against the (exportable) stream cipher RC4. We will assume that the reader has a basic understanding of cryptography and a good knowledge of PC assembly language.
RC4 is a variable-key-size stream cipher designed in 1987 by Ron Rivest and is proprietary to RSA Data Securities Inc. A stream cipher is an algorithm which produces pseudo-random bytes from a given cryptovariable (or key). These bytes are then xored with plaintext to produce ciphertext. Decryption is straightforward, xoring the ciphertext with the same pseudo-random byte sequence will reveal the plaintext. The exportable version of RC4 has a key length of 40 bits, or 5 bytes.
The key is used to initialize a 256-element permutation which will be used to generate the pseudo-random bytes. The permutation, which we will denote by S, is first set to identity, and two counters, i and j, are set to zero. Then, if we denote by K the five bytes of key, S is modified as follow:
For i = 0 to 255 do j = (j + S_{i} + K_{i mod 5}) mod 256 Swap S_{i} and S_{j}Counters i and j are then reset and the pseudo-random sequence R of bytes is generated by
For n = 0, 1, 2, ... do i = (i + 1) mod 256 j = (j + S_{i}) mod 256 Swap S_{i} and S_{j} t = (S_{i} + S_{j}) mod 256 R_{n} = S_{t}As already mentionned, plaintext P and sequence R are combined to produce ciphertext C:
C_{n} = P_{n} xor R_{n}
Given an RC4 cipher and the first few bytes of the underlying plaintext, one is able to deduce the first few bytes of the pseudo-random sequence R. Those bytes can then be used to try all possible five byte keys until one that produces the same few bytes of R is found. This key can then be used to decrypt the whole cipher. Keep in mind that the number of possible keys is 256^{5}, or a thousand billion. Thus testing possible keys should be done as fast as possible.
For each five-byte key we will try, the bulk of the work will be spent initializing the permutation S. Indeed, once this permutation is initialized, generating two bytes of R will be enough to reject 99.998% of the five byte keys. Hence in this document I will only consider the code which initializes S. I will leave the generation of the first few bytes of R to interested readers...
Testing billions of keys can be quite time consuming. Since it is highly impractical to have a function that can take days (or years) to execute, we will program a function which will test 256 keys, by trying all possibility for the fifth byte given the value of the first four. We could also have written a function to test a single key, but this approach is too slow; there is too much overhead associated with function calls and we will see that more than one key can be tried simultaneously.
In C/C++, the function to optimize is then
void Test256Keys(unsigned char Key[5]) { unsigned int I, J, A, B, S[256]; Key[4] = 0x00; do { for (I = 0; I < 256; I++) S[I] = I; for (I = 0, J = 0; I < 256; I++) { A = S[I]; J = (J + A + Key[I % 5]) % 256; B = S[J]; S[J] = A; S[I] = B; } } while (Key[4]++ != 0xFF); }Note that this piece of code does not include the generation of the first few bytes of R.
Very precise timings can be obtained using the Pentium instruction rdtsc
.
The following C-callable assembly function will return the exact number of clock cyles taken to execute the function Test256Keys
defined above:
mov ecx, [esp + 4] rdtsc push edx push eax push ecx call Test256Keys pop ecx rdtsc pop ecx sub eax, ecx pop ecx sbb edx, ecx sub eax, 18 sbb edx, 0 retWe subtract 18 from the count since calling an empty function takes about 18 clock cycles. If you are using Microsoft Visual C++, then the function above may be included with your C code as follow:
unsigned long int __declspec(naked) TimeIt(unsigned char Key[5]) { __asm { mov ecx, [esp + 4] ... ret } }Do not forget in the prototype to specify the C calling convention:
unsigned int __cdecl TimeIt(unsigned char Key[5]);Declaration of assembly functions should be similar for other compilers. You may need an
extern "C"
in front of the prototype if the assembly function is located in a separate file and you are compiling in C++.
If your compiler does not recognize instruction rdtsc
, it can be defined with bytes 0x0F
and 0x31
.
The number of clock cycles per key is of course obtained by dividing the total count by the number of key tried, which is 256 in our case. The actual time elapsed is obtained by dividing the number of clock cycles by the speed of your machine, for example 200,000,000 for a 200MHz. Finally, cache misses may slow down considerably the first few calls to the function. Hence I usually call it a few times and take the minimum clock cycle count.
void Test256Keys(unsigned char Key[5]) { unsigned int I, K; unsigned int Key1[5], J1, A1, B1, S1[256]; unsigned int Key2[5], J2, A2, B2, S2[256]; Key1[0] = Key[0]; Key2[0] = Key[0]; Key1[1] = Key[1]; Key2[1] = Key[1]; Key1[2] = Key[2]; Key2[2] = Key[2]; Key1[3] = Key[3]; Key2[3] = Key[3]; Key1[4] = 0x00; Key2[4] = 0x01; for ( ; Key1[4] < 256; Key1[4] += 2, Key2[4] += 2) { for (I = 0; I < 256; I++) { S1[I] = I; S2[I] = I; } for (I = 0, J1 = 0, J2 = 0, K = 0; I < 256; I++) { A1 = S1[I]; A2 = S2[I]; J1 += Key1[K]; J2 += Key2[K]; J1 += A1; J2 += A2; J1 &= 0xFF; J2 &= 0xFF; B1 = S1[J1]; B2 = S2[J2]; S1[J1] = A1; S2[J2] = A2; S1[I] = B1; S2[I] = B2; if (++K > 4) K = 0; } } }Note how the modulo 5 operation was removed and the complex computation of the new value of j was split in many simpler computations. This still very simple procedure takes only 2860 cycles per key, or about 70,000 keys per second. This function is more than 5 times faster than our original one! Many people tell me that compilers are very good at optimizing; these people are probably right in some sense, but I find that most of the time compilers need much human help!
Although we can probably still improve our C code, we will switch to assembly. Highly optimized C code can be as difficult to understand as assembly hence I see no reason why not to use the full power of assembly. Of course assembly is not portable, but so is highly optimized C in most cases.
As in our second C version, we will process two keys simultaneously. Let me call those keys A and B. Moreover, as we are processing those two keys, we will initialize two permutations to identity for the next pair of keys. Refer to the following three tables to better understand the assembly code:
Register | Content | Key | |
---|---|---|---|
EAX, AL |
j | A | |
EBX, BL |
j | B | |
ECX |
CL |
S_{i} or S_{j} | A |
CH |
S_{i} or S_{j} | B | |
EDX |
DL |
K_{i mod 5} if i = 0, 1, 2 or 3 mod 5 | Both |
K_{i mod 5} if i = 4 mod 5 | A | ||
DH |
K_{i mod 5} if i = 4 mod 5 | B | |
EDI |
Pointer to S. See table below. | Both | |
ESI |
Pointer to S_{i}. See table below. | ||
EBP |
Pointer to K. See second table below. | ||
ESP |
Pointer into permutation array for next pair of keys. See code. |
Offset | Content | Key |
---|---|---|
EDI + 4i + 0, ESI + 0 |
S_{i} | A |
EDI + 4i + 1, ESI + 1 |
S_{i} | B |
EDI + 4i + 2, ESI + 2 |
Unused | |
EDI + 4i + 3, ESI + 3 |
Offset | Content | Key |
---|---|---|
EBP + 0 |
K_{0} | Both |
EBP + 1 |
K_{1} | |
EBP + 2 |
K_{2} | |
EBP + 3 |
K_{3} | |
EBP + 4 |
K_{4} | A |
EBP + 5 |
K_{4} | B |
EBP + 6 |
Unused | |
EBP + 7 |
The first step in our code is to initialize some registers.
After saving some registers to comply with the compiler, the first four bytes of key are loaded in edx
and the key array is initialized:
mov ecx, [esp + 0x04] push ebp push ebx push edi push esi mov edx, [ecx] push 0x0000FEFF push edx mov ebp, esp xor eax, eax xor ebx, ebxRegisters
eax
, ebx
, edx
and ebp
have been initialized according to the first table above.
The next step is to initialize the permutation array with two identity permutations.
The fastest way I found to do that should take 128 clock cycles:
push 0x0000FFFF push 0x0000FEFE push 0x0000FDFD ... push 0x00000101 push 0x00000000We are now ready to do the 256 iterations that will modify those two permutations according to the keys A and B. In the code that will follow, never forget that we are processing two keys simultaneously!
The value of j after the first iteration is simply equal to the first byte of key:
FirstIterationAB: mov al, dl mov bl, dlWe then save S_{j} into S_{i} and initialize registers
edi
and esi
.
At this stage, we know that i = 0 and S_{j} = K_{0}, hence the code is very simple:
mov [esp], dl mov edi, esp mov [esp + 1], dl mov esi, espWe also save S_{i}, which is zero, in S_{j}. Since we can not do two memory writes simultaneously, I have inserted a
nop
:
mov [edi + 4 * eax], ah nop mov [edi + 4 * ebx + 1], bhWe can still issue one instruction with the previous memory write so why not prepare for the next iteration and add K_{1} to j:
add al, dh push 0x0000FFFF add bl, dhNote that at the same time we started initializing another permutation array with two identity permutations. This array will be located in the stack, below the current permutation array. The above 12 lines of code should take 6 clock cycles to execute.
To prevent looping 255 times, we will write a single iteration as a macro and simply call it 255 times.
The macro takes a single argument, I
, ranging from 1 to 255.
First we load S_{i}, remembering that esi
still points to S_{i - 1}:
mov ecx, [esi + 4]We continue the initialization of the next permutation array:
push (0xFFFF - (I * 0x0101))To j we now add S_{i}:
add al, cl add bl, chWe now load the next byte of key, for the next iteration. Remember that we already added to j the value of K_{i mod 5} in the previous iteration.
mov dl, [ebp + ((I + 1) % 5)]If i + 1 = 4 mod 5, then we must load the fifth byte of key, which is different for key A and key B. In that case, the above line is replaced by this one:
mov edx, [ebp + 4]Then I ran out of registers! After saving
ecx
, we load S_{j}:
push ecx mov cl, [edi + 4 * eax] add esi, 4 mov ch, [edi + 4 * ebx + 1] nopWe can then save S_{j} in S_{i}:
mov [esi], ecx pop ecxFinally, we save S_{i} in S_{j} to complete the swap. We also add K_{(i + 1) mod 5} to j to prepare for the next iteration:
mov [edi + 4 * eax], cl add al, dl mov [edi + 4 * ebx + 1], ch add bl, dlAgain, If i + 1 = 4 mod 5, the last line is replaced by
add bl, dhThis completes the macro for one loop iteration. Note that we have been extremely careful to prevent AGI (Address Generation Interlocks) so that the above 16 lines of code should take exactly 8 clock cycles to execute.
After the both permutations S have been initialized, we should generate a few bytes of keystream to check whether we have found the correct key.
As mentionned in section 3, I skip this step and go directly to test two new keys.
The first thing we must do is to change the value of the fifth byte of key and reload the first two bytes in edx
:
sub [ebp + 4], 0x0202 mov edx, [ebp] js DoneNow, remember that while we tested the previous two keys, we generated two brand new identity permutations, hence we can start right away with the first iteration, described in section 6.2! However, if we do that, we will end up using at least 128k of stack space (1k per key pair). The solution is to use only two arrays for the permutation by resetting
esp
once every four keys:
mov ecx, ebp sub ecx, esp cmp ecx, 1024 je FirstIterationAB FirstIterationCD:The block of code for
FirstIterationCD
is exactly the same as the code for FirstIterationAB
except that nop
is replaced by
add esp, 2048
By this time you may be somewhat confused about what goes where. Here is the full function:
mov ecx, [esp + 0x04] push ebp push ebx push edi push esi mov edx, [ecx] push 0x0000FEFF push edx mov ebp, esp xor eax, eax xor ebx, ebx push 0x0000FFFF push 0x0000EEEE ... push 0x00000101 push 0x00000000 FirstIterationAB: mov al, dl mov bl, dl mov [esp], dl mov edi, esp mov [esp + 1], dl mov esi, esp mov [edi + 4 * eax], ah nop mov [edi + 4 * ebx + 1], bh add al, dh push 0x0000FFFF add bl, dh AllIterations: MacroSingleIteration(1) MacroSingleIteration(2) MacroSingleIteration(3) ... MacroSingleIteration(253) MacroSingleIteration(254) MacroSingleIteration(255) sub [ebp + 4], 0x0202 mov edx, [ebp] js Done mov ecx, ebp sub ecx, esp cmp ecx, 1024 je FirstIterationAB FirstIterationCD: mov al, dl mov bl, dl mov [esp], dl mov edi, esp mov [esp + 1], dl mov esi, esp mov [edi + 4 * eax], ah add esp, 2048 mov [edi + 4 * ebx + 1], bh add al, dh push 0x0000FFFF add bl, dh jmp AllIterations Done: add esp, 1024 add esp, 8 pop esi pop edi pop ebx pop ebp retMacro
MacroSingleIteration(I)
is defined as
mov ecx, [esi + 4] push (0xFFFF - (I * 0x0101)) add al, cl add bl, ch mov dl, [ebp + ((I + 1) % 5)] push ecx mov cl, [edi + 4 * eax] add esi, 4 mov ch, [edi + 4 * ebx + 1] nop mov [esi], ecx pop ecx mov [edi + 4 * eax], cl add al, dl mov [edi + 4 * ebx + 1], ch add bl, dlwhen
I
is not equal to 3 modulo 5 and as
mov ecx, [esi + 4] push (0xFFFF - (I * 0x0101)) add al, cl add bl, ch mov edx, [ebp + 4] push ecx mov cl, [edi + 4 * eax] add esi, 4 mov ch, [edi + 4 * ebx + 1] nop mov [esi], ecx pop ecx mov [edi + 4 * eax], cl add al, dl mov [edi + 4 * ebx + 1], ch add bl, dhotherwise. Simple, is it not?
From the code above, the minimum time the function should take is 128 times 255*8 + 6 plus some overhead of at least 128 * 8 cycles, for a total of 262,912 clock cyles, or a minimum 1027 cycles per key. The actual timing is not far, 1076 clock cycles per key, a value which translates into more than 185,800 keys per second, two times and a half faster than our second C version. I would be surprised if someone can produce a C program which runs at a speed near the assembly code.
There is still room for improvement in the assembly version, especially if we write self-modifiable code. I will be happy to learn of any significant faster versions (i.e. 7 cycles per iteration).
Of course, the best optimization one can do is not to improve the speed of this function by another 50% but rather to find a known plaintext attack against 40-bit RC4 which as an average complexity of less than 2^{39}...
[1] Abrash, Michael, Zen of Code Optimization, Coriolis Group Books, ©1994.
[2] Fog, Agner, How to Optimize for the Pentium Family of Microprocessors, WWW page (Lost), ©1996-98.
[3] Schneier, Bruce, Applied Cryptography, 2^{nd} edition, John Wiley & Sons, ©1996.