Pentium Assembly Language Optimization of RC4

Étienne Dupuis, July 1999

lestourtereaux@free.fr

1. Introduction

The goal of this document is to show how clever optimization can tremendously increase code performance. Our goal is to implement a basic attack against the (exportable) stream cipher RC4. We will assume that the reader has a basic understanding of cryptography and a good knowledge of PC assembly language.

2. RC4 Description

RC4 is a variable-key-size stream cipher designed in 1987 by Ron Rivest and is proprietary to RSA Data Securities Inc. A stream cipher is an algorithm which produces pseudo-random bytes from a given cryptovariable (or key). These bytes are then xored with plaintext to produce ciphertext. Decryption is straightforward, xoring the ciphertext with the same pseudo-random byte sequence will reveal the plaintext. The exportable version of RC4 has a key length of 40 bits, or 5 bytes.

The key is used to initialize a 256-element permutation which will be used to generate the pseudo-random bytes. The permutation, which we will denote by S, is first set to identity, and two counters, i and j, are set to zero. Then, if we denote by K the five bytes of key, S is modified as follow:

	For i = 0 to 255 do
		j = (j + S_i + K_{i mod 5}) mod 256
		Swap S_i and S_j

Counters i and j are then reset and the pseudo-random sequence R of bytes is generated by For n = 0, 1, 2, ... do i = (i + 1) mod 256 j = (j + S_i) mod 256 Swap S_i and S_j t = (S_i + S_j) mod 256 R_n = S_t
As already mentionned, plaintext P and sequence R are combined to produce ciphertext C: C_n = P_n xor R_n

Given an RC4 cipher and the first few bytes of the underlying plaintext, one is able to deduce the first few bytes of the pseudo-random sequence R. Those bytes can then be used to try all possible five byte keys until one that produces the same few bytes of R is found. This key can then be used to decrypt the whole cipher. Keep in mind that the number of possible keys is 256⁵, or a thousand billion. Thus testing possible keys should be done as fast as possible.

3. Algorithm to Optimize

For each five-byte key we will try, the bulk of the work will be spent initializing the permutation S. Indeed, once this permutation is initialized, generating two bytes of R will be enough to reject 99.998% of the five byte keys. Hence in this document I will only consider the code which initializes S. I will leave the generation of the first few bytes of R to interested readers...

Testing billions of keys can be quite time consuming. Since it is highly impractical to have a function that can take days (or years) to execute, we will program a function which will test 256 keys, by trying all possibility for the fifth byte given the value of the first four. We could also have written a function to test a single key, but this approach is too slow; there is too much overhead associated with function calls and we will see that more than one key can be tried simultaneously.

In C/C++, the function to optimize is then

	void Test256Keys(unsigned char Key[5])
	{
		unsigned int I, J, A, B, S[256];
		Key[4] = 0x00;	

		do {
			for (I = 0; I < 256; I++)
				S[I] = I;

			for (I = 0, J = 0; I < 256; I++) {
				A = S[I];
				J = (J + A + Key[I % 5]) % 256;
				B = S[J];
				S[J] = A;
				S[I] = B;
			}

		} while (Key[4]++ != 0xFF);
	}

Note that this piece of code does not include the generation of the first few bytes of R.

4. Code Timing

Very precise timings can be obtained using the Pentium instruction rdtsc. The following C-callable assembly function will return the exact number of clock cyles taken to execute the function Test256Keys defined above:

	mov	ecx, [esp + 4]	
	rdtsc
	push	edx
	push	eax
	push	ecx
	call	Test256Keys
	pop	ecx
	rdtsc
	pop	ecx
	sub	eax, ecx
	pop	ecx
	sbb	edx, ecx
	sub	eax, 18
	sbb	edx, 0		
	ret

We subtract 18 from the count since calling an empty function takes about 18 clock cycles. If you are using Microsoft Visual C++, then the function above may be included with your C code as follow: unsigned long int __declspec(naked) TimeIt(unsigned char Key[5]) { __asm { mov ecx, [esp + 4] ... ret } }
Do not forget in the prototype to specify the C calling convention: unsigned int __cdecl TimeIt(unsigned char Key[5]);
Declaration of assembly functions should be similar for other compilers. You may need an extern "C" in front of the prototype if the assembly function is located in a separate file and you are compiling in C++. If your compiler does not recognize instruction rdtsc, it can be defined with bytes 0x0F and 0x31.

The number of clock cycles per key is of course obtained by dividing the total count by the number of key tried, which is 256 in our case. The actual time elapsed is obtained by dividing the number of clock cycles by the speed of your machine, for example 200,000,000 for a 200MHz. Finally, cache misses may slow down considerably the first few calls to the function. Hence I usually call it a few times and take the minimum clock cycle count.

5. C Timings

On a 200Mhz Pentium MMX, using Microsoft Visual C++ 5.0 Enterprise Edition, the function listed in section 3 takes over 14,000 clock cycles per key, which is equivalent to just under 13,700 keys per second. This function can of course be improved. The modulo 5, a terribly slow operation, can easily be replaced by a mod 5 counter. Moreover, the Pentium chip is able to execute two instructions simultaneously if the input to the second is not the output of the first. Looking at the inner loop, we see that each variable is modified just before it is used, hence preventing the execution of two simultaneous instructions. The easiest solution is to try two different keys simultaneously: void Test256Keys(unsigned char Key[5]) { unsigned int I, K; unsigned int Key1[5], J1, A1, B1, S1[256]; unsigned int Key2[5], J2, A2, B2, S2[256]; Key1[0] = Key[0]; Key2[0] = Key[0]; Key1[1] = Key[1]; Key2[1] = Key[1]; Key1[2] = Key[2]; Key2[2] = Key[2]; Key1[3] = Key[3]; Key2[3] = Key[3]; Key1[4] = 0x00; Key2[4] = 0x01; for ( ; Key1[4] < 256; Key1[4] += 2, Key2[4] += 2) { for (I = 0; I < 256; I++) { S1[I] = I; S2[I] = I; } for (I = 0, J1 = 0, J2 = 0, K = 0; I < 256; I++) { A1 = S1[I]; A2 = S2[I]; J1 += Key1[K]; J2 += Key2[K]; J1 += A1; J2 += A2; J1 &= 0xFF; J2 &= 0xFF; B1 = S1[J1]; B2 = S2[J2]; S1[J1] = A1; S2[J2] = A2; S1[I] = B1; S2[I] = B2; if (++K > 4) K = 0; } } }
Note how the modulo 5 operation was removed and the complex computation of the new value of j was split in many simpler computations. This still very simple procedure takes only 2860 cycles per key, or about 70,000 keys per second. This function is more than 5 times faster than our original one! Many people tell me that compilers are very good at optimizing; these people are probably right in some sense, but I find that most of the time compilers need much human help!

6. Assembly Version

Although we can probably still improve our C code, we will switch to assembly. Highly optimized C code can be as difficult to understand as assembly hence I see no reason why not to use the full power of assembly. Of course assembly is not portable, but so is highly optimized C in most cases.

As in our second C version, we will process two keys simultaneously. Let me call those keys A and B. Moreover, as we are processing those two keys, we will initialize two permutations to identity for the next pair of keys. Refer to the following three tables to better understand the assembly code:

**Register usage**
Register		Content	Key
`EAX, AL`		j	A
`EBX, BL`		j	B
`ECX`	`CL`	S_i or S_j	A
`ECX`	`CH`	S_i or S_j	B
`EDX`	`DL`	K_{i mod 5} if i = 0, 1, 2 or 3 mod 5	Both
	`DL`	K_{i mod 5} if i = 4 mod 5	A
	`DH`	K_{i mod 5} if i = 4 mod 5	B
`EDI`		Pointer to S. See table below.	Both
`ESI`		Pointer to S_i. See table below.
`EBP`		Pointer to K. See second table below.
`ESP`		Pointer into permutation array for next pair of keys. See code.

**Permutation array**
Offset	Content	Key
`EDI + 4i + 0, ESI + 0`	S_i	A
`EDI + 4i + 1, ESI + 1`	S_i	B
`EDI + 4i + 2, ESI + 2`	Unused
`EDI + 4i + 3, ESI + 3`	Unused

**Key array**
Offset	Content	Key
`EBP + 0`	K₀	Both
`EBP + 1`	K₁
`EBP + 2`	K₂
`EBP + 3`	K₃
`EBP + 4`	K₄	A
`EBP + 5`	K₄	B
`EBP + 6`	Unused
`EBP + 7`	Unused

6.1 Initialization

The first step in our code is to initialize some registers. After saving some registers to comply with the compiler, the first four bytes of key are loaded in edx and the key array is initialized:

	mov	ecx, [esp + 0x04]

	push	ebp
	push	ebx
	push	edi
	push	esi

	mov	edx, [ecx]

	push	0x0000FEFF
	push	edx

	mov	ebp, esp

	xor	eax, eax
	xor	ebx, ebx

Registers eax, ebx, edx and ebp have been initialized according to the first table above. The next step is to initialize the permutation array with two identity permutations. The fastest way I found to do that should take 128 clock cycles: push 0x0000FFFF push 0x0000FEFE push 0x0000FDFD ... push 0x00000101 push 0x00000000
We are now ready to do the 256 iterations that will modify those two permutations according to the keys A and B. In the code that will follow, never forget that we are processing two keys simultaneously!

6.2 First iteration

The value of j after the first iteration is simply equal to the first byte of key:

FirstIterationAB:
	mov	al, dl
	mov	bl, dl

We then save S_j into S_i and initialize registers edi and esi. At this stage, we know that i = 0 and S_j = K₀, hence the code is very simple: mov [esp], dl mov edi, esp mov [esp + 1], dl mov esi, esp
We also save S_i, which is zero, in S_j. Since we can not do two memory writes simultaneously, I have inserted a nop: mov [edi + 4 * eax], ah nop mov [edi + 4 * ebx + 1], bh
We can still issue one instruction with the previous memory write so why not prepare for the next iteration and add K₁ to j: add al, dh push 0x0000FFFF add bl, dh
Note that at the same time we started initializing another permutation array with two identity permutations. This array will be located in the stack, below the current permutation array. The above 12 lines of code should take 6 clock cycles to execute.

6.3 A Single Iteration

To prevent looping 255 times, we will write a single iteration as a macro and simply call it 255 times. The macro takes a single argument, I, ranging from 1 to 255. First we load S_i, remembering that esi still points to S_{i - 1}:

	mov	ecx, [esi + 4]

We continue the initialization of the next permutation array: push (0xFFFF - (I * 0x0101))
To j we now add S_i: add al, cl add bl, ch
We now load the next byte of key, for the next iteration. Remember that we already added to j the value of K_{i mod 5} in the previous iteration. mov dl, [ebp + ((I + 1) % 5)]
If i + 1 = 4 mod 5, then we must load the fifth byte of key, which is different for key A and key B. In that case, the above line is replaced by this one: mov edx, [ebp + 4]
Then I ran out of registers! After saving ecx, we load S_j: push ecx mov cl, [edi + 4 * eax] add esi, 4 mov ch, [edi + 4 * ebx + 1] nop
We can then save S_j in S_i: mov [esi], ecx pop ecx
Finally, we save S_i in S_j to complete the swap. We also add K_{(i + 1) mod 5} to j to prepare for the next iteration: mov [edi + 4 * eax], cl add al, dl mov [edi + 4 * ebx + 1], ch add bl, dl
Again, If i + 1 = 4 mod 5, the last line is replaced by add bl, dh
This completes the macro for one loop iteration. Note that we have been extremely careful to prevent AGI (Address Generation Interlocks) so that the above 16 lines of code should take exactly 8 clock cycles to execute.

6.4 Toward the Next Key Pair

After the both permutations S have been initialized, we should generate a few bytes of keystream to check whether we have found the correct key. As mentionned in section 3, I skip this step and go directly to test two new keys. The first thing we must do is to change the value of the fifth byte of key and reload the first two bytes in edx:

	sub	[ebp + 4], 0x0202
	mov	edx, [ebp]
	js	Done

Now, remember that while we tested the previous two keys, we generated two brand new identity permutations, hence we can start right away with the first iteration, described in section 6.2! However, if we do that, we will end up using at least 128k of stack space (1k per key pair). The solution is to use only two arrays for the permutation by resetting esp once every four keys: mov ecx, ebp sub ecx, esp cmp ecx, 1024 je FirstIterationAB FirstIterationCD:
The block of code for FirstIterationCD is exactly the same as the code for FirstIterationAB except that nop is replaced by add esp, 2048

6.5 Wrapping Up

By this time you may be somewhat confused about what goes where. Here is the full function:

	mov	ecx, [esp + 0x04]
	push	ebp
	push	ebx
	push	edi
	push	esi
	mov	edx, [ecx]
	push	0x0000FEFF
	push	edx
	mov	ebp, esp
	xor	eax, eax
	xor	ebx, ebx

	push	0x0000FFFF
	push	0x0000EEEE
	...
	push	0x00000101
	push	0x00000000

FirstIterationAB:
	mov	al, dl
	mov	bl, dl
	mov	[esp], dl
	mov	edi, esp
	mov	[esp + 1], dl
	mov	esi, esp
	mov	[edi + 4 * eax], ah
	nop
	mov	[edi + 4 * ebx + 1], bh
	add	al, dh
	push	0x0000FFFF
	add	bl, dh

AllIterations:
	MacroSingleIteration(1)
	MacroSingleIteration(2)
	MacroSingleIteration(3)
	...
	MacroSingleIteration(253)
	MacroSingleIteration(254)
	MacroSingleIteration(255)
	
	sub	[ebp + 4], 0x0202
	mov	edx, [ebp]
	js	Done
	mov	ecx, ebp
	sub	ecx, esp
	cmp	ecx, 1024
	je	FirstIterationAB
	
FirstIterationCD:
	mov	al, dl
	mov	bl, dl
	mov	[esp], dl
	mov	edi, esp
	mov	[esp + 1], dl
	mov	esi, esp
	mov	[edi + 4 * eax], ah
	add	esp, 2048
	mov	[edi + 4 * ebx + 1], bh
	add	al, dh
	push	0x0000FFFF
	add	bl, dh
	jmp	AllIterations

Done:
	add	esp, 1024
	add	esp, 8
	
	pop	esi
	pop	edi
	pop	ebx
	pop	ebp
	ret

Macro MacroSingleIteration(I) is defined as mov ecx, [esi + 4] push (0xFFFF - (I * 0x0101)) add al, cl add bl, ch mov dl, [ebp + ((I + 1) % 5)] push ecx mov cl, [edi + 4 * eax] add esi, 4 mov ch, [edi + 4 * ebx + 1] nop mov [esi], ecx pop ecx mov [edi + 4 * eax], cl add al, dl mov [edi + 4 * ebx + 1], ch add bl, dl
when I is not equal to 3 modulo 5 and as mov ecx, [esi + 4] push (0xFFFF - (I * 0x0101)) add al, cl add bl, ch mov edx, [ebp + 4] push ecx mov cl, [edi + 4 * eax] add esi, 4 mov ch, [edi + 4 * ebx + 1] nop mov [esi], ecx pop ecx mov [edi + 4 * eax], cl add al, dl mov [edi + 4 * ebx + 1], ch add bl, dh
otherwise. Simple, is it not?

7. Conclusion

From the code above, the minimum time the function should take is 128 times 255*8 + 6 plus some overhead of at least 128 * 8 cycles, for a total of 262,912 clock cyles, or a minimum 1027 cycles per key. The actual timing is not far, 1076 clock cycles per key, a value which translates into more than 185,800 keys per second, two times and a half faster than our second C version. I would be surprised if someone can produce a C program which runs at a speed near the assembly code.

There is still room for improvement in the assembly version, especially if we write self-modifiable code. I will be happy to learn of any significant faster versions (i.e. 7 cycles per iteration).

Of course, the best optimization one can do is not to improve the speed of this function by another 50% but rather to find a known plaintext attack against 40-bit RC4 which as an average complexity of less than 2³⁹...