My First Experiences with x86 Assembly + Some speed tests
By Prithvi Vishak
Created December 26th, 2021
I've been into computers and coding for a couple years now. I can manage some Python, Go, and C++ (Arduino), configure our home network, and maintain this website. One thing that I had been meaning to explore was some of the more low-level stuff. I chose to learn assembly because I heard it was the way to get the most out of your hardware, it teaches you a lot about the workings of your computer, and... all the cool kids are doing it.
This article talks about how I started and what I've learnt. This is not a tutorial, but simply a chronicle of my experiences for the amusement of readers.
How I started
I searched the internet for guides to getting started with x86 assembly. A good number of the tutorials I found simply gave you the program for hello world, maybe explained each line, and left it at that. I eventually found this tutorial from SecureIdeas and this paper from Yale. The first was a good introductory tutorial, and the second a useful reference while writing my own programs.
Understanding the hello world program was fairly straightforward, and I felt confident enough to do something myself.
I decided to write a program that would print an integer out as a string to STDOUT. I planned to do this by repeatedly dividing the number in question by 10 to get the next least significant digit as the remainder. Printing the ASCII character corresponding to the remainder simply meant adding 48 to it. Simple enough, right?
I am still exploring different ways to do things, and the stack sounded like something fun to use. So, I decided to do all the division in one loop, pushing the digit values onto the stack one by one, then popping them and printing them in another.
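The two-loop plan is easier to see in a high-level language first. The following is only an illustrative Python sketch of the idea, not the program itself; the function name `int_to_digits` is made up for this example, and a plain list stands in for the hardware stack.

```python
def int_to_digits(n):
    """Return the decimal characters of a non-negative integer n."""
    stack = []
    # First loop: each division by 10 yields the next least significant digit.
    while True:
        n, remainder = divmod(n, 10)
        stack.append(chr(remainder + 48))  # 48 is the ASCII code for '0'
        if n == 0:
            break
    # Second loop: popping reverses the order, so the most significant digit comes out first.
    out = []
    while stack:
        out.append(stack.pop())
    return "".join(out)


print(int_to_digits(192))  # prints 192
```

The digits come off the division in reverse order, which is exactly why a last-in, first-out structure is a convenient fit here.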
Here's how my code came out:
.data
n:
.long 192
digs:
.long 0
char:
.long 0 # 4 bytes of scratch space: "pop char" stores a full 32-bit word here
.global _start
.text
printInt:
movl $10,%ecx
xor %edx,%edx # CLEAR REGISTERS. Learnt this the hard way.
idivl %ecx
addl $48,%edx
push %edx
incl digs
movl %eax,n
cmpl $0,%eax
ja printInt
printChars:
pop char
mov $4,%eax
mov $1,%ebx
mov $char,%ecx
mov $1,%edx
int $0x80
decl digs
movl digs,%eax
cmpl $0,%eax
ja printChars
movl $0xa,char # write takes a pointer in %ecx, so store the newline in char first
mov $4,%eax
mov $1,%ebx
mov $char,%ecx
mov $1,%edx
int $0x80
ret
exit:
mov $1,%eax
mov $0,%ebx
int $0x80
ret
_start:
movl n,%eax
call printInt
call exit
It most definitely did work, but it took me the better part of a day. Why? Random floating point exceptions, segmentation faults, and infinite loops. I could have used a debugger, but setting it up would probably have taken longer than my method of just moving the exit syscall until I found which line was erroring out.
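A likely culprit for the floating point exceptions, judging by the comment on the xor line, is that 32-bit idivl does not divide %eax alone: it divides the 64-bit value held in %edx:%eax, and the CPU raises a divide error if the quotient does not fit back into %eax. Linux reports that error as a "floating point exception". The Python model below is a rough sketch of that behaviour; the function name and the use of ArithmeticError are choices made for this illustration.

```python
def idivl(edx, eax, divisor):
    """Model 32-bit idivl: divide the signed 64-bit value EDX:EAX by divisor.

    Returns (quotient, remainder) as signed integers, or raises
    ArithmeticError where the CPU would raise a divide error.
    """
    if divisor == 0:
        raise ArithmeticError("divide error: divisor is zero")
    dividend = ((edx & 0xFFFFFFFF) << 32) | (eax & 0xFFFFFFFF)
    if dividend >= 1 << 63:  # reinterpret the 64 bits as a signed value
        dividend -= 1 << 64
    quotient = abs(dividend) // abs(divisor)
    if (dividend < 0) != (divisor < 0):
        quotient = -quotient  # idiv truncates toward zero
    remainder = dividend - quotient * divisor
    if not -(1 << 31) <= quotient < (1 << 31):
        raise ArithmeticError("divide error: quotient does not fit in EAX")
    return quotient, remainder


print(idivl(0, 192, 10))  # prints (19, 2)
```

With %edx cleared, 192 / 10 gives quotient 19 and remainder 2. If %edx still holds the previous digit character (2 + 48 = 50) on the next pass, the dividend becomes 50 * 2^32 + 19, the quotient no longer fits in 32 bits, and `idivl(50, 19, 10)` raises.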
Going through the official Intel documentation was actually quite helpful, contrary to what some people online say. If only there were less cruft in x86, so I wouldn't have to spend time figuring out gems like jnbe (seriously, "jump if not below or equal"? It even has the same opcode as ja).
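The two mnemonics are the same instruction because they test the same flags: both jump when the carry flag and the zero flag are clear. Here is a small Python sketch of that equivalence, with helper names invented for the illustration. It models only the two flags relevant to unsigned comparisons.

```python
def cmp_flags(dst, src):
    """Carry and zero flags after a 32-bit AT&T-style `cmp src, dst` (dst - src)."""
    dst &= 0xFFFFFFFF
    src &= 0xFFFFFFFF
    carry = dst < src  # an unsigned borrow sets the carry flag
    zero = dst == src
    return carry, zero


def ja(carry, zero):
    # "above": no borrow and not equal
    return not carry and not zero


def jnbe(carry, zero):
    # "not (below or equal)"
    return not (carry or zero)


samples = [0, 1, 2, 0x7FFFFFFF, 0x80000000, 0xFFFFFFFF]
print(all(ja(*cmp_flags(a, b)) == jnbe(*cmp_flags(a, b))
          for a in samples for b in samples))  # prints True
```

Both conditions are unsigned, so a register holding 0xFFFFFFFF counts as "above" 0 here, whereas the signed jump jg would read the same bits as -1.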
Since I eventually got this working, I thought it would be interesting to see exactly how much faster an assembly program would be than equivalent C and Python programs. I decided on bubble-sorting 1000 integers.
Here's the Python:
l = [77, 55, 91 ... ]
sorted = False
while not sorted:
    sorted = True
    for i in range(len(l)-1):
        if l[i] > l[i+1]:
            l[i], l[i+1] = l[i+1], l[i]
            sorted = False
print(l)
And the C:
#include <stdio.h>
#include <stdbool.h>
#define size 1000
int arr[size] = {77, 55, 91 ... };
int main() {
    bool sorted = false;
    while (!sorted) {
        sorted = true;
        for (int i=0; i<size-1; i++) {
            if (arr[i] > arr[i+1]) {
                int temp = arr[i+1];
                arr[i+1] = arr[i];
                arr[i] = temp;
                sorted = false;
            }
        }
    }
    for (int i=0; i<size; i++) {
        printf("%d\n", arr[i]);
    }
    return 0;
}
And finally the assembly:
.data
array:
.long 77, 55, 91 ...
len:
.long 0
sorted:
.long 0 # 32-bit flag: it is written with movl and read with a 32-bit mov
.global _start
.text
.include "intToChar.asm"
_start:
lea len,%eax
subl $array,%eax
subl $4,%eax
movl %eax,len
while:
lea array,%eax
movl $1,sorted
for:
movl %eax,%ebx
addl $4,%ebx
movl (%eax),%ecx
movl (%ebx),%edx
cmpl %edx,%ecx
jbe afterSwap
movl $0,sorted
movl %ecx,(%ebx)
movl %edx,(%eax)
afterSwap:
addl $4,%eax
mov %eax,%ecx
lea array,%ebx
subl %ebx,%ecx
cmpl len,%ecx
jne for
mov sorted,%ebx
cmpl $0,%ebx
jz while
lea array,%ebx
printFinal:
movl (%ebx),%eax
push %ebx
call printInt
pop %ebx
movl %ebx,%ecx
lea array,%eax
subl %eax,%ecx
addl $4,%ebx
cmp %ecx,len
ja printFinal
call exit
As you may have noticed, there are small differences between the programs (like how the length is calculated, or how the result is printed), but they seemed similar enough to get a ballpark estimate of their speeds.
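The length difference is the biggest one: the assembly version never counts elements. It subtracts the address of array from the address of len, takes off another 4, and ends up with the byte offset of the last element. The inner loop then walks byte offsets until it reaches that value. A Python sketch of the same bookkeeping follows; the name `bubble_sort_by_offset` is made up for this example, and the guard for short lists is an addition the assembly does not have.

```python
WORD = 4  # bytes per .long


def bubble_sort_by_offset(words):
    """Bubble sort that steps through byte offsets, as the assembly does."""
    if len(words) < 2:
        return words  # nothing to compare
    last = len(words) * WORD - WORD  # byte offset of the final element
    is_sorted = False
    while not is_sorted:
        is_sorted = True
        offset = 0
        while offset != last:
            i = offset // WORD  # translate the byte offset back into an index
            if words[i] > words[i + 1]:
                words[i], words[i + 1] = words[i + 1], words[i]
                is_sorted = False
            offset += WORD
    return words


print(bubble_sort_by_offset([77, 55, 91]))  # prints [55, 77, 91]
```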
Python took around 270 milliseconds each run.
~$ time python3 bubbleSort.py > /dev/null
________________________________________________________
Executed in 272.48 millis fish external
usr time 268.60 millis 995.00 micros 267.60 millis
sys time 3.99 millis 0.00 micros 3.99 millis
Unsurprising, considering that it has to start an interpreter first. The fact that Python runs, well, slow doesn't help.
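One way to tell interpreter start-up apart from the sort itself is to time the loop from inside the process. The following is a minimal sketch, assuming 1000 random integers as a stand-in for the list used above (which is not reproduced in full here); the seed and value range are arbitrary, and the reported time will differ from machine to machine.

```python
import random
import time


def bubble_sort(values):
    """Same algorithm as the script above, wrapped in a function."""
    is_sorted = False
    while not is_sorted:
        is_sorted = True
        for i in range(len(values) - 1):
            if values[i] > values[i + 1]:
                values[i], values[i + 1] = values[i + 1], values[i]
                is_sorted = False
    return values


random.seed(0)  # arbitrary seed so repeated runs sort the same data
data = [random.randrange(100) for _ in range(1000)]  # placeholder input

start = time.perf_counter()
result = bubble_sort(list(data))
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"sort only: {elapsed_ms:.2f} ms")
```

Comparing that figure with the wall-clock time from `time` gives a rough idea of how much of the total is start-up and printing rather than sorting.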
C compiled for 32-bit took around 5-8 milliseconds on average. Compiling for 64-bit made it take 4-8 milliseconds.
~$ gcc -m32 bubbleSort.c -o bubbleSortInC
~$ time ./bubbleSortInC > /dev/null
________________________________________________________
Executed in 5.70 millis fish external
usr time 5.63 millis 367.00 micros 5.26 millis
sys time 0.14 millis 136.00 micros 0.00 millis
Finally, assembly took about as long as vanilla C, or longer, at around 5-10 milliseconds.
~$ as --march=i386 --32 bubbleSort.asm -o bubbleSort.o && ld -m elf_i386 bubbleSort.o -o bubbleSort
~$ time ./bubbleSort > /dev/null
________________________________________________________
Executed in 5.74 millis fish external
usr time 3.01 millis 334.00 micros 2.68 millis
sys time 2.80 millis 125.00 micros 2.68 millis
I hypothesized that the uncompetitive times from assembly were due to the number of syscalls I was making while printing the result. Sure enough, I was right. Sorting without printing the results took 2-4 milliseconds in assembly. To compare, sorting without printing still took 5-8 milliseconds in C.
Interesting.
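To get a feel for the scale of the syscall difference: the assembly routine issues one write per character plus one per newline, while C's printf normally collects output in a buffer and writes it in large chunks when stdout is not a terminal. The sketch below just counts calls under those assumptions. The function names are invented for this example, and the 4096-byte buffer is a placeholder, since the real size depends on the C library and on where the output goes.

```python
def writes_per_character(values):
    """Write calls when every digit and every newline gets its own syscall."""
    return sum(len(str(v)) + 1 for v in values)


def writes_buffered(values, buffer_size=4096):
    """Write calls when output is collected in a buffer of buffer_size bytes."""
    total_bytes = sum(len(str(v)) + 1 for v in values)
    return -(-total_bytes // buffer_size)  # ceiling division


two_digit_values = [10] * 1000  # hypothetical input: 1000 two-digit numbers
print(writes_per_character(two_digit_values))  # prints 3000
print(writes_buffered(two_digit_values))       # prints 1
```

Both helpers assume non-negative integers, like the assembly routine itself.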
Anyway, that's all I have for today. Will I continue to use assembly? Probably. Will I continue to have hair? At this rate, probably not.