Mysore Linux Users' Group

A website for Linux users and geeks in general to express their thoughts and spread information


My First Experiences with x86 Assembly + Some speed tests

By Prithvi Vishak

Created December 26th, 2021


I've been into computers and coding for a couple years now. I can manage some Python, Go, and C++ (Arduino), configure our home network, and maintain this website. One thing that I had been meaning to explore was some of the more low-level stuff. I chose to learn assembly because I heard it was the way to get the most out of your hardware, it teaches you a lot about the workings of your computer, and... all the cool kids are doing it.

This article talks about how I started and what I've learnt. This is not a tutorial, but simply a chronicle of my experiences for the amusement of readers.

How I started

I searched the internet for getting started with x86 assembly. A good number of tutorials I found simply gave you the program for hello world, maybe explained each line, and left you. I eventually found this tutorial from SecureIdeas and this paper from Yale. The first was a good introductory tutorial, and the second a useful reference while writing my own programs.

Understanding the hello world program was fairly straightforward, and I felt confident enough to do something myself. I decided to write a program that would print an integer out as a string to STDOUT. I planned to do this by repeatedly dividing the number in question by 10 to get the next least significant digit as the remainder. Printing the ASCII character corresponding to the remainder simply meant adding 48 to it. Simple enough, right?

I am still exploring different ways to do things, and the stack sounded like something fun to use. So, I decided to do all the division in one loop, pushing the digit values onto the stack one by one, then popping them and printing them in another.

Here's how my code came out:

.data n: .long 192 digs: .long 0 char: .ascii "" .global _start .text printInt: movl $10,%ecx xor %edx,%edx # CLEAR REGISTERS. Learnt this the hard way. idivl %ecx addl $48,%edx push %edx incl digs movl %eax,n cmpl $0,%eax ja printInt printChars: pop char mov $4,%eax mov $1,%ebx mov $char,%ecx mov $1,%edx int $0x80 decl digs movl digs,%eax cmpl $0,%eax ja printChars mov $4,%eax mov $1,%ebx mov $0xa,%ecx mov $1,%edx int $0x80 ret exit: mov $1,%eax mov $0,%ebx int $0x80 ret _start: movl n,%eax call printInt call exit

It most definitely did work, but it took me the better part of a day. Why? Random floating point exceptions, segmentation faults, and infinite loops. I could have used a debugger, but setting it up would probably have taken longer than my method of just moving the exit syscall until I found which line was erroring out.

Going through the official Intel documentation was actually quite helpful, contrary to what some people online say. If only there was less cruft in x86 so I wouldn't have to spend time figuring out gems like jnbe (seriously, "Jump if not below or equal to"? It even has the same opcode as 'ja').

Since I eventually got this working, I thought it would be interesting to see exactly how much faster an assembly program would be against an equivalent C and Python program. I decided on bubble-sorting 1000 integers.

Here's the Python:

l = [77, 55, 91 ... ] sorted = False while not sorted: sorted = True for i in range(len(l)-1): if l[i] > l[i+1]: l[i], l[i+1] = l[i+1], l[i] sorted = False print(l)

And the C:

#include <stdio.h> #include <stdbool.h> #define size 1000 int arr[size] = {77, 55, 91 ... } int main() { bool sorted = false; while (!sorted) { sorted = true; for (int i=0; i<size-1; i++) { if (arr[i] > arr[i+1]) { int temp = arr[i+1]; arr[i+1] = arr[i]; arr[i] = temp; sorted = false; } } } for (int i=0; i<size; i++) { printf("%d\n", arr[i]); } return 0; }

And finally the assembly:

.data array: .long 72, 55, 91 ... len: .long 0 sorted: .byte 0 .global _start .text .include "intToChar.asm" _start: lea len,%eax subl $array,%eax subl $4,%eax movl %eax,len while: lea array,%eax movl $1,sorted for: movl %eax,%ebx addl $4,%ebx movl (%eax),%ecx movl (%ebx),%edx cmpl %edx,%ecx jbe afterSwap movl $0,sorted movl %ecx,(%ebx) movl %edx,(%eax) afterSwap: addl $4,%eax mov %eax,%ecx lea array,%ebx subl %ebx,%ecx cmpl len,%ecx jne for mov sorted,%ebx cmpl $0,%ebx jz while lea array,%ebx printFinal: movl (%ebx),%eax push %ebx call printInt pop %ebx movl %ebx,%ecx lea array,%eax subl %eax,%ecx addl $4,%ebx cmp %ecx,len ja printFinal call exit

As you may have noticed, there are small differences in the programs (like calculation of length, or process of printing result), but they seemed similar enough to get a ballpark estimate of their speeds.

Python took around 270 milliseconds each run.

~$ time python3 bubbleSort.py > /dev/null ________________________________________________________ Executed in 272.48 millis fish external usr time 268.60 millis 995.00 micros 267.60 millis sys time 3.99 millis 0.00 micros 3.99 millis

Unsurprising, considering that it has to start an interpreter first. The fact that python runs, well, slow doesn't help.

C compiled for 32-bit took around 5-8 milliseconds on average. Compiling for 64-bit made it take 4-8 milliseconds.

~$ gcc -m32 bubbleSort.c -o bubbleSortInC ~$ time ./bubbleSortInC > /dev/null ________________________________________________________ Executed in 5.70 millis fish external usr time 5.63 millis 367.00 micros 5.26 millis sys time 0.14 millis 136.00 micros 0.00 millis

Finally, assembly took about the same amount of time or longer than vanilla C, at around 5-10 milliseconds.

~$ as --march=i386 --32 bubbleSort.asm -o bubbleSort.o && ld -m elf_i386 bubbleSort.o -o bubbleSort ~$ time ./bubbleSort > /dev/null ________________________________________________________ Executed in 5.74 millis fish external usr time 3.01 millis 334.00 micros 2.68 millis sys time 2.80 millis 125.00 micros 2.68 millis

I hypothesized that the uncompetitive times from assembly were due to the number of syscalls I was making while printing the result. Sure enough, I was right. Sorting without printing the results took 2-4 milliseconds in assembly. To compare, sorting without printing still took 5-8 milliseconds in C.

Interesting.

Anyway, that's all I have for today. Will I continue to use assembly? Probably. Will I continue to have hair? At this rate, probably not.