My First Experiences with x86 Assembly + Some speed tests

By Prithvi Vishak

Created December 26th, 2021

I've been into computers and coding for a couple years now. I can manage some Python, Go, and C++ (Arduino), configure our home network, and maintain this website. One thing that I had been meaning to explore was some of the more low-level stuff. I chose to learn assembly because I heard it was the way to get the most out of your hardware, it teaches you a lot about the workings of your computer, and... all the cool kids are doing it.

This article talks about how I started and what I've learnt. This is not a tutorial, but simply a chronicle of my experiences for the amusement of readers.

How I started

I searched the internet for getting started with x86 assembly. A good number of tutorials I found simply gave you the program for hello world, maybe explained each line, and left you. I eventually found this tutorial from SecureIdeas and this paper from Yale. The first was a good introductory tutorial, and the second a useful reference while writing my own programs.

Understanding the hello world program was fairly straightforward, and I felt confident enough to do something myself. I decided to write a program that would print an integer out as a string to STDOUT. I planned to do this by repeatedly dividing the number in question by 10 to get the next least significant digit as the remainder. Printing the ASCII character corresponding to the remainder simply meant adding 48 to it. Simple enough, right?

I am still exploring different ways to do things, and the stack sounded like something fun to use. So, I decided to do all the division in one loop, pushing the digit values onto the stack one by one, then popping them and printing them in another.

Here's how my code came out:

.data
n:
	.long 192
digs:
	.long 0
char:
	.ascii ""

.global _start

.text


printInt:
	movl $10,%ecx
	xor %edx,%edx # CLEAR REGISTERS. Learnt this the hard way.
	idivl %ecx
	addl $48,%edx
	push %edx
	incl digs
	movl %eax,n

	cmpl $0,%eax
	ja printInt

printChars:
	pop char

	mov $4,%eax
	mov $1,%ebx
	mov $char,%ecx
	mov $1,%edx
	int $0x80

	decl digs
	movl digs,%eax
	cmpl $0,%eax
	ja printChars
	
	mov $4,%eax
	mov $1,%ebx
	mov $0xa,%ecx
	mov $1,%edx
	int $0x80

	ret

exit:
	mov $1,%eax
	mov $0,%ebx
	int $0x80
	ret

_start:
	movl n,%eax
	call printInt
	call exit

It most definitely did work, but it took me the better part of a day. Why? Random floating point exceptions, segmentation faults, and infinite loops. I could have used a debugger, but setting it up would probably have taken longer than my method of just moving the exit syscall until I found which line was erroring out.

Going through the official Intel documentation was actually quite helpful, contrary to what some people online say. If only there was less cruft in x86 so I wouldn't have to spend time figuring out gems like jnbe (seriously, "Jump if not below or equal to"? It even has the same opcode as 'ja').

Since I eventually got this working, I thought it would be interesting to see exactly how much faster an assembly program would be against an equivalent C and Python program. I decided on bubble-sorting 1000 integers.

Here's the Python:

l = [77, 55, 91 ... ]
sorted = False

while not sorted:
    sorted = True
    for i in range(len(l)-1):
        if l[i] > l[i+1]:
            l[i], l[i+1] = l[i+1], l[i]
            sorted = False

print(l)

And the C:

#include <stdio.h>
#include <stdbool.h>

#define size 1000

int arr[size] = {77, 55, 91 ... }

int main() {
	bool sorted = false;
	while (!sorted) {
		sorted = true;
		for (int i=0; i<size-1; i++) {
			if (arr[i] > arr[i+1]) {
				int temp = arr[i+1];
				arr[i+1] = arr[i];
				arr[i] = temp;
				sorted = false;
			}
		}
	}

	for (int i=0; i<size; i++) {
		printf("%d\n", arr[i]);
	}
	return 0;
}

And finally the assembly:

.data

array:
	.long 72, 55, 91 ...
len:
	.long 0
sorted:
	.byte 0

.global _start

.text
.include "intToChar.asm"

_start:
	lea len,%eax
	subl $array,%eax
	subl $4,%eax
	movl %eax,len

while:
	lea array,%eax
	movl $1,sorted
for:
	movl %eax,%ebx
	addl $4,%ebx
	movl (%eax),%ecx
	movl (%ebx),%edx
	cmpl %edx,%ecx
	jbe afterSwap

	movl $0,sorted	
	movl %ecx,(%ebx)
	movl %edx,(%eax)
afterSwap:
	addl $4,%eax
	mov %eax,%ecx
	lea array,%ebx
	subl %ebx,%ecx
	cmpl len,%ecx	
	jne for

	mov sorted,%ebx
	cmpl $0,%ebx
	jz while

	lea array,%ebx
printFinal:
	movl (%ebx),%eax
	push %ebx
	call printInt
	pop %ebx

	movl %ebx,%ecx
	lea array,%eax
	subl %eax,%ecx
	addl $4,%ebx
	cmp %ecx,len
	ja printFinal

	call exit

As you may have noticed, there are small differences in the programs (like calculation of length, or process of printing result), but they seemed similar enough to get a ballpark estimate of their speeds.

Python took around 270 milliseconds each run.

~$ time python3 bubbleSort.py > /dev/null
________________________________________________________
Executed in  272.48 millis    fish           external
   usr time  268.60 millis  995.00 micros  267.60 millis
   sys time    3.99 millis    0.00 micros    3.99 millis

Unsurprising, considering that it has to start an interpreter first. The fact that python runs, well, slow doesn't help.

C compiled for 32-bit took around 5-8 milliseconds on average. Compiling for 64-bit made it take 4-8 milliseconds.

~$ gcc -m32 bubbleSort.c -o bubbleSortInC
~$ time ./bubbleSortInC > /dev/null
________________________________________________________
Executed in    5.70 millis    fish           external
   usr time    5.63 millis  367.00 micros    5.26 millis
   sys time    0.14 millis  136.00 micros    0.00 millis

Finally, assembly took about the same amount of time or longer than vanilla C, at around 5-10 milliseconds.

~$ as --march=i386 --32 bubbleSort.asm -o bubbleSort.o && ld -m elf_i386 bubbleSort.o -o bubbleSort
~$ time ./bubbleSort > /dev/null
________________________________________________________
Executed in    5.74 millis    fish           external
   usr time    3.01 millis  334.00 micros    2.68 millis
   sys time    2.80 millis  125.00 micros    2.68 millis

I hypothesized that the uncompetitive times from assembly were due to the number of syscalls I was making while printing the result. Sure enough, I was right. Sorting without printing the results took 2-4 milliseconds in assembly. To compare, sorting without printing still took 5-8 milliseconds in C.

Interesting.

Anyway, that's all I have for today. Will I continue to use assembly? Probably. Will I continue to have hair? At this rate, probably not.