Intel x86 Assembly Language & Microarchitecture

Assemblers

Microsoft Assembler - MASM

Given that the 8086/8088 was used in the IBM PC, and the Operating System on that was most often from Microsoft, Microsoft’s assembler MASM was the de facto standard for many years. It followed Intel’s syntax closely, but permitted some convenient but “loose” syntax that (in hindsight) only caused confusion and errors in code.

A perfect example is as follows:

MaxSize      EQU     16          ; Define a constant
Symbol       DW      0x1234      ; Define a 16-bit WORD called Symbol to hold 0x1234

             MOV     AX, 10      ; AX now holds 10
             MOV     BX, MaxSize ; BX now holds 16
             MOV     CX, Symbol  ; ????

Does the last MOV instruction put the contents of Symbol into CX, or the address of Symbol into CX? Does CX end up with 0x1234 or 0x0102 (or whatever)? It turns out that CX ends up with 0x1234 - if you want the address, you need to use the OFFSET specifier

             MOV     AX, [Symbol]      ; Contents of Symbol
             MOV     CX, OFFSET Symbol ; Address of Symbol

Intel Assembler

Intel wrote the specification of the 8086 assembly language, a derivative of the earlier 8080, 8008 and 4004 processors. As such, the assembler they wrote followed their own syntax precisely. However, this assembler wasn’t used very widely.

Intel defined their opcodes to have either zero, one or two operands. The two-operand instructions were defined to be in the dest, source order, which was different from other assemblers at the time. But some instructions used implicit registers as operands - you just had to know what they were. Intel also used the concept of “prefix” opcodes - one opcode would affect the next instruction.

; Zero operand examples
NOP             ; No parameters
CBW             ; Convert byte in AL into word in AX
MOVSB           ; Move byte pointed to by DS:SI to byte pointed to by ES:DI
                ; SI and DI are incremented or decremented according to D bit

; Prefix examples
REP   MOVSB     ; Move number of bytes in CX from DS:SI to ES:DI
                ; SI and DI are incremented or decremented according to D bit

; One operand examples
NOT      AX     ; Replace AX with its one's complement
MUL      CX     ; Multiply AX by CX and put 32-bit result in DX:AX

; Two operand examples
MOV      AL, [0x1234] ; Copy the contents of memory location DS:0x1234 into AL register

Intel also broke a convention used by other assemblers: for each opcode, a different mnemonic was invented. This required subtly- or distinctly-different names for similar operations: e.g. LDM for “Load from Memory” and LDI for “Load Immediate”. Intel used the one mnemonic MOV - and expected the assembler to work out which opcode to use from context. That caused many pitfalls and errors for programmers in the future when the assembler couldn’t intuit what the programmer actually wanted…

AT&T assembler - as

Although the 8086 was most used in IBM PCs along with Microsoft, there were a number of other computers and Operating Systems that used it too: most notably Unix. That was a product of AT&T, and it already had Unix running on a number of other architectures. Those architectures used more conventional assembly syntax - especially that two-operand instructions specified them in source, dest order.

So AT&T assembler conventions overrode the conventions dictated by Intel, and a whole new dialect was introduced for the x86 range:

%al, %bx etc.

Immediate values were prefied by $:

$4

Operands were in source, dest order
Opcodes included their operand sizes:

movw $4, %ax ; Move word 4 into AX

Borland’s Turbo Assembler - TASM

Borland started out with a Pascal compiler that they called “Turbo Pascal”. This was followed by compilers for other languages: C/C++, Prolog and Fortran. They also produced an assembler called “Turbo Assembler”, which, following Microsoft’s naming convention, they called “TASM”.

TASM tried to fix some of the problems of writing code using MASM (see above), by providing a more strict interpretation of the source code under a specified IDEAL mode. By default it assumed MASM mode, so it could assemble MASM source directly - but then Borland found that they had to be bug-for-bug compatible with MASM’s more “quirky” idiosyncracies - so they also added a QUIRKS mode.

Since TASM was (much) cheaper than MASM, it had a large user base - but not many people used IDEAL mode, despite its touted advantages.

GNU assembler - gas

When the GNU project needed an assembler for the x86 family, they went with the AT&T version (and its syntax) that was associated with Unix rather than the Intel/Microsoft version.

Netwide Assembler - NASM

NASM is by far the most ported assembler for the x86 architecture - it’s available for practically every Operating System based on the x86 (even being included with MacOS), and is available as a cross-platform assembler on other platforms.

This assembler uses Intel syntax, but it is different from others because it focuses heavily on its own “macro” language - this permits the programmer to build up more complex expressions using simpler definitions, allowing new “instructions” to be created.

Unfortunately this powerful feature comes at a cost: the type of the data gets in the way of generalised instructions, so data typing is not enforced.

response:    db       'Y'     ; Character that user typed

             cmp      response, 'N' ; *** Error! Unknown size!
             cmp byte response, 'N' ; That's better!
             cmp      response, ax  ; No error!

However, NASM introduced one feature that others lacked: scoped symbol names. When you define a symbol in other assemblers, that name is available throughout the rest of the code - but that “uses up” that name, “polluting” the global name space with symbols.

For example (using NASM syntax):

       STRUC     Point
X      resw      1
Y      resw      1
       ENDSTRUC

After this definition, X and Y are forevermore defined. To avoid “using up” the names X and Y, you needed to use more definite names:

       STRUC     Point
Pt_X   resw      1
Pt_Y   resw      1
       ENDSTRUC

But NASM offers an alternative. By leveraging its “local variable” concept, you can define structure fields that require you to nominate the containing structure in future references:

       STRUC      Point
.X     resw       1
.Y     resw       1
       ENDSTRUC

Cursor ISTRUC     Point
       ENDISTRUC

       mov        ax,[Cursor+Point.X]
       mov        dx,[Cursor+Point.Y]

Unfortunately, because NASM doesn’t keep track of types, you can’t use the more natural syntax:

       mov        ax,[Cursor.X]
       mov        dx,[Cursor.Y]

Yet Another Assembler - YASM

YASM is a complete rewrite of NASM, but is compatible with both Intel and AT&T syntaxes.