So far, WebAssembly has been used for all sorts of applications, ranging from gaming (e.g. Doom 3), to porting desktop applications to the web (e.g. Autocad and Figma). It is even used outside the browser, for example as an efficient and flexible language for serverless computing.
Note: This article delves into some advanced topics such as compiling C code, but don’t worry if you don’t have experience with that; you will still be able to follow along and get a sense for what is possible with WebAssembly.
The web app we will work with is fastq.bio, an interactive web tool that provides scientists with a quick preview of the quality of their DNA sequencing data; sequencing is the process by which we read the “letters” (i.e. nucleotides) in a DNA sample.
Here’s a screenshot of the application in action:
We won’t go into the details of the calculations, but in a nutshell, the plots above provide scientists a sense for how well the sequencing went and are used to identify data quality issues at a glance.
Although there are dozens of command line tools available to generate such quality control reports, the goal of fastq.bio is to give an interactive preview of data quality without leaving the browser. This is especially useful for scientists who are not comfortable with the command line.
The input to the app is a plain-text file that is output by the sequencing instrument and contains a list of DNA sequences and a quality score for each nucleotide in the DNA sequences. The format of that file is known as “FASTQ”, hence the name fastq.bio.
If you’re curious about the FASTQ format (not necessary to understand this article), check out the Wikipedia page for FASTQ. (Warning: The FASTQ file format is known in the field to induce facepalms.)
In the original version of fastq.bio, the user starts by selecting a FASTQ file from their computer. With the
Once the metrics are calculated for that chunk of data, we plot the results interactively with Plotly.js, and move on to the next chunk in the file. The reason for processing the file in small chunks is simply to improve the user experience: processing the whole file at once would take too long, because FASTQ files are generally in the hundreds of gigabytes. We found that a chunk size between 0.5 MB and 1 MB would make the application more seamless and would return information to the user more quickly, but this number will vary depending on the details of your application and how heavy the computations are.
The box in red is where we do the string manipulations to generate the metrics. That box is the more compute-intensive part of the application, which naturally made it a good candidate for runtime optimization with WebAssembly.
fastq.bio: The WebAssembly Implementation
To explore whether we could leverage WebAssembly to speed up our web app, we searched for an off-the-shelf tool that calculates QC metrics on FASTQ files. Specifically, we sought a tool written in C/C++/Rust so that it was amenable to porting to WebAssembly, and one that was already validated and trusted by the scientific community.
After some research, we decided to go with seqtk, a commonly-used, open-source tool written in C that can help us evaluate the quality of sequencing data (and is more generally used to manipulate those data files).
Before we compile to WebAssembly, let’s first consider how we would normally compile seqtk to binary to run it on the command line. According to the Makefile, this is the
gcc incantation you need:
# Compile to binary $ gcc seqtk.c -o seqtk -O2 -lm -lz
On the other hand, to compile seqtk to WebAssembly, we can use the Emscripten toolchain, which provides drop-in replacements for existing build tools to make working in WebAssembly easier. If you don’t have Emscripten installed, you can download a docker image we prepared on Dockerhub that has the tools you’ll need (you can also install it from scratch, but that usually takes a while):
$ docker pull robertaboukhalil/emsdk:1.38.26 $ docker run -dt --name wasm-seqtk robertaboukhalil/emsdk:1.38.26
Inside the container, we can use the
emcc compiler as a replacement for
# Compile to WebAssembly $ emcc seqtk.c -o seqtk.js -O2 -lm -s USE_ZLIB=1 -s FORCE_FILESYSTEM=1
As you can see, the differences between compiling to binary and WebAssembly are minimal:
- Instead of the output being the binary file
seqtk, we ask Emscripten to generate a
.jsthat handles instantiation of our WebAssembly module
- To support the zlib library, we use the flag
USE_ZLIB; zlib is so common that it’s already been ported to WebAssembly, and Emscripten will include it for us in our project
- We enable Emscripten’s virtual file system, which is a POSIX-like file system (source code here), except it runs in RAM within the browser and disappears when you refresh the page (unless you save its state in the browser using IndexedDB, but that’s for another article).
# On the command line $ ./seqtk fqchk data.fastq # In the browser console > Module.callMain(["fqchk", "data.fastq"])
Having access to a virtual file system is powerful because it means we don’t have to rewrite seqtk to handle string inputs instead of file paths. We can mount a chunk of data as the file
data.fastq on the virtual file system and simply call seqtk’s
main() function on it.
With seqtk compiled to WebAssembly, here’s the new fastq.bio architecture:
As shown in the diagram, instead of running the calculations in the browser’s main thread, we make use of WebWorkers, which allow us to run our calculations in a background thread, and avoid negatively affecting the responsiveness of the browser. Specifically, the WebWorker controller launches the Worker and manages communication with the main thread. On the Worker’s side, an API executes the requests it receives.
Out of the box, we already see a ~9X speedup:
This is already very good, given that it was relatively easy to achieve (that is once you understand WebAssembly!).
Next, we noticed that although seqtk outputs a lot of generally useful QC metrics, many of these metrics are not actually used or graphed by our app. By removing some of the output for the metrics we didn’t need, we were able to see an even greater speedup of 13X:
This again is a great improvement given how easy it was to achieve—by literally commenting out printf statements that were not needed.
Finally, there is one more improvement we looked into. So far, the way fastq.bio obtains the metrics of interest is by calling two different C functions, each of which calculates a different set of metrics. Specifically, one function returns information in the form of a histogram (i.e. a list of values that we bin into ranges), whereas the other function returns information as a function of DNA sequence position. Unfortunately, this means that the same chunk of file is read twice, which is unnecessary.
A Word Of Caution