# Problems with WebAssembly (WIP title)

[TODO - Abraham Simpson shaking his fist at WA]

For the past few months, I've been working at Artifex Software on porting one of their main products, [the MuPDF rendering library](https://mupdf.com/), to WebAssembly.

This is my second job where I've been hired to port pre-existing C software to wasm. My previous gig was at [VideoLabs](https://videolabs.io/), where I was tasked with porting VLC to the web for the (now abandoned AFAIK) VLC.js project. While this article is about MuPDF.js, the VLC.js project had similar dynamics.

While I'm still making progress on MuPDF.js, I'd like to take a moment to explain where the project came from, what its ambitions were, and how the reality of WebAssembly smashed some (but not all) of these ambitions to pieces.

## About MuPDF

MuPDF is a lightweight PDF, XPS, and E-book viewer. The project has a few components (command-line tools, demos), but the core of the project is a C library that can be used to render PDFs and other graphical formats.

MuPDF only includes PDF renderers as demos of the library's features. Our business model revolves around licensing the library itself to other companies who want to make their own PDF renderer without spending 3 years figuring out the PDF format from Adobe's documentation.

MuPDF is a fairly modular project: it can handle multiple file formats, input streams, output formats, and colorspaces. It has an OOP-like architecture with extensible interfaces in case you want to add your own formats/inputs/outputs, and it handles multi-threading, progressive rendering, cancellation, resource caching and other conveniences.

[TODO - Puff this part up a bit? Btw, where is mupdf used]

As a C project running graphics operations based on performance-intensive parsing of user input, MuPDF does seem like a pretty good candidate for a WebAssembly port.

## The ideal

When you own a long-running C project, WebAssembly can seem like a miracle.
Now you can run your project *on the web*! On any computer, without having to install it! You just need to share a URL with your users! And you don't even need to rewrite it in another language.

And when your project is a modular library, meant to be used in other programs, WebAssembly can feel like an easy way to reach new levels of composability.

The ideal that I've seen floating around when discussing WebAssembly projects is for libraries to become *modules*. The C ecosystem is notoriously bad when it comes to composing projects together: there's no official package manager, no official build system, build options have to be specified with preprocessor macros, etc.

So what we'd *really* like is to turn these C libraries into WebAssembly packages, and then distribute them with [npm](https://www.npmjs.com/), or [some other package manager](https://wapm.io/). Then using your library is as simple as running `npm install my_package` in your project.

This was (and mostly still is) the plan with MuPDF.js. Our goal is to wrap MuPDF's API in a JS API, so that end-users go from writing this:

```c
fz_document *doc = fz_open_document(ctx, filename);
fz_page *page = fz_load_page(ctx, doc, page_number);
fz_pixmap *pixmap = fz_new_pixmap_from_page(ctx, page, fz_scale(zoom_level), colorspace, alpha);
```

to writing this:

```javascript
const doc = mupdf.Document.openFromUrl(url);
const page = doc.loadPage(pageNumber);
const pixmap = page.toPixmap(Matrix.scale(zoomLevel), colorspace, alpha);
```

The project also includes a demo web-view we're calling `MuPDF-viewer` that should render PDFs in the browser, and also allow users to edit annotations. In principle, this view could be modular as well, to be used in React or Angular components.

We also want the library to run in Node.js, both for unit tests and so that users can create CLI utilities with it.

## Our implementation

To implement this, we compiled MuPDF to WebAssembly+JS using [emscripten](https://emscripten.org/).
Emscripten is the state of the art for compiling C to wasm, insofar as... well, there's no other option available. In theory, you could skip emscripten and use LLVM directly (and binaryen and the other components of emscripten's toolchain), but in practice emscripten handles a lot of boilerplate for you, and is the one with the most documentation and support online.

From the beginning, our implementation relied on having a WebWorker run the wasm binary. This is because PDF parsing and rendering are both CPU-intensive tasks, and we don't want to block the UI thread if we have a performance problem.

To communicate with the WebWorker, the application sent commands like `LOAD_BUFFER_TO_DOCUMENT` or `RENDER_DOCUMENT_PAGE_TO_PIXMAP` or `GET_LIST_OF_TEXT_ELEMENTS`, and the worker sent back the results of the matching operations. Most operations were implemented directly in C in a library wrapper.

The results would take the form of buffers storing either pixmaps or JSON objects. So a response might look like:

```json
{
  "LIST_OF_TEXT_ELEMENTS": [
    { "x": 10, "y": 48, "w": 30, "h": 15, "content": "hello" },
    { "x": 10, "y": 68, "w": 30, "h": 15, "content": "world" },
    "etc"
  ]
}
```

The main thread would then parse that buffer, and generate DOM nodes based on the JSON values.

This architecture had a few advantages:

- Very little state needed to be coordinated across the Worker and wasm boundaries. The application would request a page, and would only need to send the page number, and get back a buffer of pixels. This is the kind of data wasm likes to trade: numbers and byte arrays.
- Because most of the logic was written in C, that logic was inherently type-safe. While C isn't a perfect language, it will at least warn you if you try to use a `fz_document*` instead of a `fz_page*`.

It also had some minor inconveniences:

- The serializing to JSON had to be done on the C side, using printfs. C isn't really a great language for writing JSON, and this was error-prone.
- Actually, C isn't a great language for writing a document editor either, and yet all the editing logic had to be written in C too. (In practice it wasn't too bad, it just felt like reinventing the wheel.)

But the biggest drawback was modularity. We were moving away from the ideal API described above, and writing something more low-level. Someone wanting to hack on our code would need to recompile the C wrapper every time, which wasn't what we wanted.

Remember, Artifex doesn't make PDF editors, we make PDF editor *components*. And yet we'd picked an architecture that was easy for us to write, but hard for users to integrate.

So we went back to the drawing board, and wrote something closer to the ideal API described above. As I'm writing this article, two months later, we're *just now* getting to feature parity with the previous iteration.

So. What was the problem?

## The architecture: wrapping wasm in JS

MuPDF has a very object-oriented architecture. That architecture thus translates fairly well to JS. For instance, the `Document` class in JS will look something like:

```javascript
class Document {
  constructor(buffer, mimeTypeString) {
    let bufferPtr = libmupdf.malloc(buffer.byteLength);
    libmupdf.HEAPU8.set(new Uint8Array(buffer), bufferPtr);
    this.pointer = libmupdf.wasm_open_document_with_data(bufferPtr, mimeTypeString);
    libmupdf.free(bufferPtr);
  }

  free() {
    libmupdf.wasm_drop_document(this.pointer);
  }

  countPages() {
    return libmupdf.wasm_count_pages(this.pointre);
  }

  loadPage(pageNumber) {
    let pointer = libmupdf.wasm_load_page(this.pointer, pageNumber);
    return new Page(this.pointer);
  }
}
```

Now, have you spotted the problems with the above code?

You may have spotted a few, but you probably didn't spot all of them. As I wrote this API, I kept bumping into them for weeks. Let's go over them, one by one:

```javascript
this.pointer = libmupdf.wasm_open_document_with_data(bufferPtr, mimeTypeString);
```

You can't send a string directly to WebAssembly.
First you have to allocate space for the string in linear memory, then encode the string (JS uses UTF-16, most C libraries use UTF-8), then pass a pointer to the allocated string, then free the space afterwards (or not, if the function takes ownership). Passing the string directly will just silently pass a null pointer.

```javascript
libmupdf.free(bufferPtr);
```

This is wrong, because `wasm_open_document_with_data` will call `fz_open_document_with_stream`, which keeps a reference to the data passed as input. Thus, freeing the buffer immediately afterwards will lead to a use-after-free, which will silently create unpredictable errors. There is no way to anticipate that without knowing the implementation of multiple MuPDF functions.

```javascript
return libmupdf.wasm_count_pages(this.pointre);
```

There is a typo in the name. Since `this.pointre` is undefined, the C function will be called with a null pointer. Since the function only does a field access, it probably won't crash: null-adjacent pointer access is legal in WebAssembly. The function will silently return garbage data (probably just 0).

```javascript
return new Page(this.pointer);
```

I wrote `this.pointer` instead of `pointer`, and the Page constructor silently accepts it. Later on, when Page methods are called, they will be called with a `fz_document *` instead of a `fz_page *`, which will likely lead to "function signature mismatch" runtime errors (basically "you used a function pointer of the wrong type").

The takeaway from this example is that, when calling WebAssembly from JS, it's very easy to introduce small mistakes that are accepted silently, and blow up in your face unpredictably. The more different calls you have, the bigger your attack surface. And when you're trying to port an entire C library one-to-one, with each C function having a matching JS method, you have a *very* big attack surface.

Of course, none of these problems are insurmountable.
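
As a rough illustration, here is a minimal sketch of what guarding against the string and null-pointer mistakes above could look like. It runs against a fake module (a plain `Uint8Array` with a bump allocator) rather than a real emscripten build, and `fakeModule`, `allocUtf8` and `checkPointer` are hypothetical names, not part of MuPDF.js. Emscripten's runtime ships its own string helpers (such as `stringToUTF8`), which would be the more likely choice in practice.

```javascript
// Hypothetical stand-in for the emscripten module: a small linear memory
// and a trivial bump allocator, so the sketch runs without a wasm binary.
const fakeModule = {
  HEAPU8: new Uint8Array(1024),
  next: 8, // address 0 is never handed out, so it can mean "null"
  malloc(size) {
    const ptr = this.next;
    if (ptr + size > this.HEAPU8.length) throw new Error("out of memory");
    this.next += size;
    return ptr;
  },
  free(ptr) {
    // No-op: a bump allocator never reuses memory.
  },
};

// Copy a JS string into linear memory as NUL-terminated UTF-8.
// The caller must free the returned pointer, unless the callee takes ownership.
function allocUtf8(module, str) {
  if (typeof str !== "string") throw new TypeError("expected a string");
  const bytes = new TextEncoder().encode(str); // UTF-16 -> UTF-8
  const ptr = module.malloc(bytes.length + 1);
  module.HEAPU8.set(bytes, ptr);
  module.HEAPU8[ptr + bytes.length] = 0; // C string terminator
  return ptr;
}

// Reject undefined, NaN and 0 before they reach wasm as a null pointer.
function checkPointer(ptr, what) {
  if (!Number.isInteger(ptr) || ptr <= 0) {
    throw new TypeError(`invalid ${what} pointer: ${ptr}`);
  }
  return ptr;
}

// Usage: encode the MIME type, hand the pointer to wasm, then release it.
const mimePtr = allocUtf8(fakeModule, "application/pdf");
// ... a real wrapper would call the exported wasm function with mimePtr here ...
fakeModule.free(mimePtr);
```

With a helper like `checkPointer`, a call such as `checkPointer(this.pointre, "document")` would throw immediately instead of letting the C function return garbage. It does nothing for the use-after-free or for a valid pointer of the wrong type, though.
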
The problem with papercuts when coding isn't that they make the work impossible. The problem is that efficient coding is all about having a short "write code, test code, get feedback" loop. Having your code introduce subtle bugs in unpredictable places damages that loop. You end up spending an hour chasing a bug for a feature you could have written in 10 minutes; because you're afraid to bump into bugs, you test your code less often, and you trap yourself in a vicious cycle. This is *terrible* for productivity.

### Death by a thousand papercuts

- So far I've talked about wasm, but there's also WebWorkers
- Same problem of coordinating info across boundary
- emscripten has bad support for ES modules
- Type safety problems
- setjmp/longjmp
- No blocking calls
- WebWorkers + wasm
- Gets far away from the OOP ideal
- Makes for awful stack traces
- Virtually no community
- No codepen (WebAssembly Studio is defunct)
- Small functions are useless to port (because for performance it's just as good to copy-paste the code)
- embind works at runtime
- pthread stuff
- Can't use a wasm binary both for multithreading and non-multithreading