Intel x86 Assembly Language & Microarchitecture

Optimization

Introduction#

The x86 family has been around for a long time, and as such there are many tricks and techniques that have been discovered and developed that are public knowledge - or maybe not so public.

Most of these tricks take advantage of the fact that many instructions effectively do the same thing - but different versions are quicker, or save memory, or don’t affect the Flags.

Herein are a number of tricks that have been discovered. Each have their Pros and Cons, so should be listed.

Remarks#

When in doubt, you can always refer to the pretty comprehensive Intel 64 and IA-32 Architectures Optimization Reference Manual, which is a great resource from the company behind the x86 architecture itsself.

Zeroing a register

The obvious way to zero a register is to MOV in a 0—for example:

B8 00 00 00 00    MOV eax, 0

Notice that this is a 5-byte instruction.

If you are willing to clobber the flags (MOV never affects the flags), you can use the XOR instruction to bitwise-XOR the register with itself:

33 C0             XOR eax, eax

This instruction requires only 2 bytes and executes faster on all processors.

Moving Carry flag into a register

Background

If the Carry (C) flag holds a value that you want to put into a register, the naïve way is to do something like this:

    mov  al, 1
    jc   NotZero
    mov  al, 0
NotZero:

Use ‘sbb’

A more direct way, avoiding the jump, is to use “Subtract with Borrow”:

    sbb  al,al    ; Move Carry to al

If C is zero, then al will be zero. Otherwise it will be 0xFF (-1). If you need it to be 0x01, add:

    and  al, 0x01 ; Mask down to 1 or 0

Pros

  • About the same size
  • Two or one fewer instructions
  • No expensive jump

Cons

  • It’s opaque to a reader unfamiliar with the technique
  • It alters other Flags

Test a register for 0

Background

To find out if a register holds a zero, the naïve technique is to do this:

    cmp   eax, 0

But if you look at the opcode for this, you get this:

83 F8 00      cmp   eax, 0

Use test

    test   eax, eax      ; Equal to zero?

Examine the opcode you get:

85 c0         test   eax, eax

Pros

  • Only two bytes!

Cons

  • Opaque to a reader unfamiliar with the technique

You can also have a look into the Q&A Question on this technique.

Linux system calls with less bloat

In 32-bit Linux, system calls are usually done by using the sysenter instruction (I say usually because older programs use the now deprecated int 0x80) however, this can take up quite alot of space in a program and so there are ways that one can cut corners in order to shorten and speed things up.
This is usually the layout of a system call on 32-bit Linux:

mov eax, <System call number>
mov ebx, <Argument 1> ;If applicable
mov ecx, <Argument 2> ;If applicable
mov edx, <Argument 3> ;If applicable
push <label to jump to after the syscall>
push ecx
push edx
push ebp
mov ebp, esp
sysenter

That’s massive right! But there are a few tricks we can pull to avoid this mess.
The first is to set ebp to the value of esp decreased by the size of 3 32-bit registers, that is, 12 bytes. This is great so long as you are ok with overwriting ebp, edx and ecx with garbage (such as when you will be moving a value into those registers directly after anyway), we can do this using the LEA instruction so that we do not need to affect the value of ESP itself.

mov eax, <System call number>
mov ebx, <Argument 1>
mov ecx, <Argument 2>
mov edx, <Argument 3>
push <label to jump to after the syscall>
lea ebp, [esp-12]
sysenter

However, we’re not done, if the system call is sys_exit we can get away with not pushing anything at all to the stack!

mov eax, 1
xor ebx, ebx ;Set the exit status to 0
mov ebp, esp
sysenter

Multiply by 3 or 5

Background

To get the product of a register and a constant and store it in another register, the naïve way is to do this:

    imul ecx, 3      ; Set ecx to 5 times its previous value
    imul edx, eax, 5 ; Store 5 times the contend of eax in edx

Use lea

Multiplications are expensive operations. It’s faster to use a combination of shifts and adds. For the particular case of muliplying the contend of a 32 or 64 bit register that isn’t esp or rsp by 3 or 5, you can use the lea instruction. This uses the address calculation circuit to calculate the product quickly.

    lea ecx, [2*ecx+ecx] ; Load 2*ecx+ecx = 3*ecx into ecx
    lea edx, [4*edx+edx] ; Load 4*edx+edx = 5*edx into edx

Many assemblers will also understand

    lea ecx, [3*ecx]
    lea edx, [5*edx]

For all possible multiplicands other them ebp or rbp, the resulting instruction lengh is the same as with using imul.

Pros

  • Executes much faster

Cons

  • If your multiplicand is ebp or rbp it takes one byte more them using imul
  • More to type if your assembler dosn’t support the shortcuts
  • Opaque to a reader unfamiliar with the technique

This modified text is an extract of the original Stack Overflow Documentation created by the contributors and released under CC BY-SA 3.0 This website is not affiliated with Stack Overflow