The 2000x Speed-Up

(Or, advice for biology students optimizing their code.)

The full source code for all examples and benchmarks is available on GitHub.

I recently found myself watching a talk from games programmer Casey Muratori, "Where Does Bad Code Come From?". Early in the talk, he makes the point:

As an industry, we haven't done that poorly on algorithms.... we're having a problem writing the code that glues all these algorithms together. 1

This remark instantly transported me back to an incident during my undergraduate days. This is the story of the worst code I've ever seen. Or, phrase it more optimistically, the story of the 2000x speed-up.

I was in the lab one summer when my mentor asked me to look over a work-in-progress program from another student. The program was working as expected, I was told, and he wanted to see how well it would work on our data.

So, I spun it up on our supercomputer cluster and let it chew through some data. And it worked! But it was unbelievably slow.

I dug around a bit and found some Python gluing together a lot of the usual suspects like `bowtie` and `samtools`. With well-optimized programs like these doing all of the heavy lifting, what could be making things so slow? The trouble came when I arrived at the code that the programmer had rolled themself. As an example, here's a paraphrased version of their function for reading a FASTA-formatted DNA sequence and outputting its complement to a new FASTA file. 2

``````def complement_fasta(in_name, out_name):
# read FASTA with name [in_name], complement, and write to [out_name]
in_file = open(in_name)
in_comment = False
while base:
out_f = open(out_name, 'a')
if in_comment:
if base == "\n":
in_comment = False
out_f.write(base)
else:
if base == ">":
in_comment = True
out_f.write(base)
elif base == "A":
out_f.write("T")
elif base == "C":
out_f.write("G")
elif base == "G":
out_f.write("C")
elif base == "T":
out_f.write("A")
else:
out_f.write(base)
``````

I don't claim to be an expert programmer now, and I was even less of one back then. Yet, even then, I could immediately pick up on some bad smells from this code. (What's going on here may be obvious depending on your experience level, but just bear with the rest of us.)

See if you can spot the difference here:

``````def complement_fasta2(in_name, out_name):
# read FASTA with name [in_name], complement, and write to [out_name]
in_file = open(in_name)
out_f = open(out_name, 'w')
in_comment = False
while base:
if in_comment:
if base == "\n":
in_comment = False
out_f.write(base)
else:
if base == ">":
in_comment = True
out_f.write(base)
elif base == "A":
out_f.write("T")
elif base == "C":
out_f.write("G")
elif base == "G":
out_f.write("C")
elif base == "T":
out_f.write("A")
else:
out_f.write(base)
``````

Did you catch it? Let's run some benchmarks to see how this tweak affected performance. For our benchmarks, we'll use a simple E. coli genome from NCBI. At 4.5 MB, this FASTA should be big enough to get an accurate read our on function's performance without being so big that the benchmarks take too long. Here are the results of our benchmarks (all times in milliseconds):

Function Mean Std. Dev. Optimization
`complement_fasta` 72,181.3 1,357.67 -
`complement_fasta2` 1,093.7 25.08 ???

This small change made our function over 70 times faster! So what's going on here? Well, to cut to the chase, the original `complement_fasta` code opens and closes the output file with every iteration of the loop. So for every base pair, Python has to make a new call to the underlying operating system to open the file again.

You could say that this is like opening and closing your notebook every time you write a single character, but this metaphor doesn't even begin to capture how inefficient this is. According to some quick-and-dirty benchmarks, it takes an average of 250 ns (nanoseconds) to write a single base (byte) to a file, while it takes an average of 20,000 ns to open a file.3 This isn't like opening and closing a notebook every time you write a letter, this is like writing a single character, closing the notebook, and then scrolling through TikTok for about a minute before you pick up your pencil again.

To fix this in `complement_fasta2`, we open the file before we enter the while loop. That way, we avoid opening it (and implicitly closing it) for each read and write of a byte.

Some Slightly Better Code

We can take this even further.

Rather than iterate over individual bytes, we can iterate over the lines of the file directly.4 This allows us to clean up some of the logic that deals with the FASTA format's comments (lines starting with '>').

``````def complement_fasta3(in_name, out_name):
# read FASTA with name [in_name], complement, and write to [out_name]
in_file = open(in_name)
out_f = open(out_name, 'w')
for line in in_file:
if line.startswith(">"):
out_f.write(line)
else:
for base in line:
if base == "A":
out_f.write("T")
elif base == "C":
out_f.write("G")
elif base == "G":
out_f.write("C")
elif base == "T":
out_f.write("A")
else:
out_f.write(base)
``````

And here's the change in performance (all values in ms).

Function Mean Std. Dev. Optimization
`complement_fasta` 72,181.3 1,357.67 -
`complement_fasta2` 1,093.7 25.08 Open output file only once
`complement_fasta3` 594.9 38.99 Iterate line-by-line, not byte-by-byte

Not only does this result in a slight performance improvement, but it also makes the handling of the FASTA comments a little more obvious. Furthermore, it sets up for our final improvement.

That ladder of if-elif-else comparisons is a bit harsh on the eyes.5 Beyond aesthetics, this approach may only require one comparison for "A", but for "T" we have to perform four comparisons. This doesn't scale particularly well.6

Fortunately, Python has a built-in function, `str.translate`, for this exact task of remapping the characters in a string. Here it is in action:

``````def complement_fasta4(in_name, out_name):
# read FASTA with name [in_name], complement, and write to [out_name]
in_file = open(in_name)
out_f = open(out_name, 'w')

# create a translation table for each base
base_complements = str.maketrans({"A": "T", "C": "G", "G": "C", "T": "A"})
for line in in_file:
if line.startswith(">"):
out_f.write(line)
else:
line = line.translate(base_complements)
out_f.write(line)
``````

And here's the final change in performance (values still in ms):

Function Mean Std. Dev. Optimization
`complement_fasta` 72,181.3 1,357.67 -
`complement_fasta2` 1,093.7 25.08 Open output file only once
`complement_fasta3` 594.9 38.99 Iterate line-by-line, not byte-by-byte
`complement_fasta4` 34.1 0.88 Use built-in `str.translate`

With this last change, we're able to enjoy another 10x speedup, making this final function over 2000 times faster than the original. The real magic here comes from the fact that `str.translate`—like many Python standard library functions—is implemented in C, a lower-level and much faster language in Python.

Popular libraries are typically written by experienced programmers that have identified common problems and written very optimal solutions for them. Like `str.translate`, they often implement their solutions in C, which simply gives a performance you can never achieve in pure Python. One of Python's biggest advantages is its large, well-documented standard library and broader ecosystem of packages like numpy. In general, if there's a library function that can solve your problem, you should use it!

Lessons Learned

With relatively straightforward tweaks, we've made a simple Python function 2,000 times faster. To put this into more human terms: if the first function took a full 24 hours to run, then this final function would run in less than a minute.

We could push this even further7 but this would require some less intuitive tweaks. There's also some more clean-up we can do here, like handling the files with a context manager and adding proper docstrings. But for demonstration purposes, I think this is sufficient for now.

So, to summarize our lessons learned here:

• Minimize repetitive system calls. (Opening and closing a file, reading and writing, network operations, etc.)8
• Avoid copying large data structures (strings, lists, etc.) as much as possible.
• Use optimized library functions whenever you can.
• Most importantly: benchmark your code. Don't just guess if your tweaks are making a difference. timeit and pytest-benchmark are good options here.

One final observation: notice how the code became simpler as we preceded. This is often a theme in optimization. Simple code usually runs more quickly than complex code that introduces redundant abstractions.9

So why did I write this blog post, so I could just dump on one of my undergrad peers? No! I wrote this because I too am guilty of making all of the mistakes above.

For my senior project in computer science, I used Rust, a language I had only recently started learning. As the semester drew to a close, I rushed and started taking shortcuts. If you look at my codebase, there's string copying operations everywhere, I'm not properly buffering my reads at all, there's `// TODO: fix this` comments all over the place... it's a mess. Maybe I'll try to fix it up sometime, but that's a blog post for another day.

Point is, there are inherent trade-offs between optimization and developer time.10. When you're up against the clock, all you care about is whether or not your code works.

To give the credit to its creator, the original "bad code" worked. Moreover, it was algorithmically optimal, with linear time and space requirements. The issue here wasn't that the creator didn't grasp the conceptual nature of the problem, the problem is that they didn't leverage the hardware to its full extent. And that's totally fine. Most academic code is written for one-time or personal use anyway, and every minute spent tinkering in a terminal is time that could be spent at the bench.

However, if you want other people to run your code and want to optimize it a bit more, consider the tips above. With just a bit of benchmarking and consideration, you would be surprised how much you can improve your performance.

Notes

1. Timestamp at 6:40. If you do any kind of programming, I think both this and some of Muratori's other talks are worth a watch. The refterm saga demonstrates just how competent (and controversial) he can be.
2. Note: this function computes the complement, not the reverse complement. That is, given sequence `"AACTGG"`, we want `"TTGACC"`, not `"CCAGTT"`.
3. Benchmarks are available in the source code linked above. The 250 ns figure is based on a benchmark with Python's built-in write buffering. (See note 4 below.) All of these numbers are highly variable depending on the speed of your hardware and how busy your operating system is from microsecond to microsecond.
4. In lower level languages, reading and writing a file byte-by-byte can be very inefficient. Just as with opening a file, each read/write requires a system call that forces the program to wait. However, behind the scenes, Python automatically buffers your reads and writes to avoid these repetitive calls. In other words, when you try to write one byte, Python will wait until you have several more bytes to write before performing one big batch write. Proper buffering can result in huge performance improvements in different scenarios, but Python's default behavior will serve most burgeoning programmers well.
5. This block was even longer originally, as it handled lowercase letters as well. I removed these for simplicity.
6. I suspect the real culprit here is that Python iteration is really slow. `str.translate` side-steps this entirely and handles the iteration in C.
7. I was able to get down to 11.5 ms. See `complement_fasta7` in the source code if you're interested.
8. Having a rough notion of how long each operation takes helps as well. These numbers from Peter Norvig's essay "Teach Yourself Programming in Ten Years" are a bit outdated but still demonstrate how different computer operations can be orders of magnitude slower.
9. To use an apocryphal Einstein quote, your code should be "as simple as possible, and no simpler".
10. No discussion of optimization would be complete without at least an acknowledgment of Amdahl's Law. TL;DR: if you work hard to make one function 10 times faster, but it only accounts for 10% of your program's execution time, then you've only made your program ~9 percent faster.