Over the last eighteen months, a specialist group within the Development Team at Market Dojo have been hard at work on optimising and updating our primary Sourcing offering.
One of the larger pieces of this work has been to improve the performance, and thereby the capacity, of our Advanced RFQ functionality. The majority of our clients require only a few hundred cells, having optimised their processes for rapid and supplier-friendly Events, but there are a significant number whose existing processes, legacy requirements, or business needs are such that this is not possible. These users expressed a need to have suppliers register interest and provide pricing information for catalogues of many hundreds of line items, far beyond the cell count which Market Dojo was at the time capable of supporting.
In response to this change in expressed needs, we set about digging into the internals of the RFQ system. This is the oldest part of Market Dojo, having been the core of the initial functionality over a decade ago, and it has grown to meet the needs of a multitude of businesses over time, leading to enormous complexity and no small amount of code that nobody currently on the team fully knew or understood. As with the majority of large software projects, the architecture had grown largely organically rather than being guided by an overarching vision, and we were forced to take stock of the implications of ten years of patches, improvements, workarounds and rapid adaptations to meet the changing needs of clients. Business, after all, rarely has the time or money to wait for the regular reconstruction a project of this size requires.
This issue of developing rapidly and then needing to go back and rework, or else suffer drag on subsequent changes, is often referred to as “technical debt”, but while pithy this doesn’t particularly capture the essence of what is going on. A better analogy, albeit far less popular, is the unhedged call option. That is, every piece of development where full rework, refactoring and cleanup is not performed is a sale of a call on development time to the codebase itself, for the premium of getting the product or feature out of the door and into the hands of clients. If the project is time-bounded, as in single-use applications or contract work, one can benefit from theta decay in the normal way; eventually the project ends, the call expires and the developers can walk away clean with the premium.
In a project such as Market Dojo, however, there is no theta to benefit from. We’ve been going a decade and see no sign of slowing down. Worse still, since we expect Market Dojo to grow in all aspects across both the business and the codebase, and we do not expect to get less busy, we are paradoxically long on the underlying (our time) on which we are selling calls. This means we can, given our priors, reasonably expect that at some point we will want to make a change and the codebase will exercise its options and call away our time, whatever the cost to us.
So it was with Advanced Lots. We set out with the challenge of bringing the system up to the task of supporting Advanced Lot RFQs with 5000 cells in a Lot, for creation and bidding, fast enough that users were able to comfortably create and take part in RFQ Events covering potentially many hundreds or thousands of Line Items.
Our first discovery was that, regardless of what we might do with the data, our rendering process was incapable of supporting what we wanted to achieve. We make heavy use of Rails partials and the collections helper for them, but due to the deeply nested structure of an Advanced Lot we found that the time taken to render the page grew exponentially as we approached the cell count we were targeting, chewing up unreasonable amounts of memory in the process.
This led to the need to initiate a subproject to rearchitect and rebuild the UI in a somewhat cleaner fashion, taking advantage of more modern technologies to allow us to process more data, faster, splitting the work between the server and the browser. Moving the rendering and display logic to React and loading much of the data asynchronously gave us about a 5x increase in speed for the maximum cell count, though the effects at low cell counts were minimal and dominated by other factors such as database IO.
Having a frontend that wasn’t using Rails partials gave us the opportunity to step away from Rails’ inbuilt model serialization, instead fetching just the data we needed from the database and serializing to JSON using a native library. This angle of attack saved us roughly an order of magnitude in the data processing time for the full 5000 cells, and in this case the effect was roughly uniform across all cell counts, though with some additional benefits at the very high end as we reduced pressure on the database with smaller queries fetching less data.
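The post doesn’t name the serializer used, so the following is only an illustrative sketch of the shape of the change: rather than instantiating full models and calling Rails’ built-in serialization, fetch just the needed columns (as a narrow query such as ActiveRecord’s `pluck` would return them) and generate JSON directly. The column names and data here are hypothetical, and Ruby’s stdlib JSON generator (itself a native extension) stands in for whichever native library was actually chosen.

```ruby
require 'json'

# Illustrative column list -- in reality this would mirror the narrow
# SELECT issued against the cells table.
COLUMNS = %i[id name price].freeze

def serialize_cells(rows)
  # Zip positional rows into plain hashes: no model instantiation and
  # no per-record as_json overhead, then one pass through the native
  # JSON generator.
  JSON.generate(rows.map { |row| COLUMNS.zip(row).to_h })
end

rows = [[1, 'Widget', '9.99'], [2, 'Gadget', '4.50']]
puts serialize_cells(rows)
```

The win comes from skipping ActiveRecord object construction entirely for read-only serialization paths, which is where most of the per-cell cost lives.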
Combined with the improvements in rendering time, and some additional work on lazy-rendering those cells which were not currently being displayed to the user, we brought the time to assemble and render a full-size Lot down from around eight minutes to roughly twenty seconds, a drop of over 95% in the best case. This still varies significantly with the content and complexity of the cells, as a price cell carries far more information than a text cell, but overall it brought these larger Event sizes into the realm of usability.
Creation and bidding, on the other hand, were still an issue. It’s all very well being able to load a page rapidly, but if the user is unable to populate it then it is simply decorative. To that end we had to optimise the pipelines for Lot creation, in particular from XLSX uploads, and for bidding, again with a focus on the very popular approach of filling out a bid sheet in Excel and then uploading to Market Dojo for processing.
Lot creation proved relatively tractable; the existing code was riddled with N+1 queries and optimised for (and tested with) a few dozen cells. This was expected; anything more complex would have thoroughly violated the principle of “you aren’t gonna need it”, since the initial specification for Advanced Lots did not imagine anything over a couple of hundred cells. We were able to entirely remove the existing code for processing Advanced Lot uploads and replace it with a far more optimised version leveraging bulk inserts and processing of data directly without going through ActiveRecord. These changes, after a few iterations, gave us a performance increase on the upload of a full-size Lot of around 98%, as far as could be measured. The initial data here was very poor simply because in the majority of cases the time taken would exceed the ten-minute timeout for our cloud environments and fail to complete, or would exceed the available memory and likewise fail.
Bidding, on the other hand, proved to be tricky. Uploading a bid sheet triggers a sequence of events including the re-ranking of every other current bid and ranked price component for that Lot across the Event. This was fundamentally complex, and despite several revisions of the approach taken for Lot uploads we were eventually forced to concede defeat with respect to implementing it in Ruby with acceptable performance. Following some discussion about possible approaches, we chose to reimplement just that portion of the process in Rust, a much stricter, compiled language which emphasises safety, speed and zero-runtime-cost abstractions. We built it as a separate Ruby gem, called into using Rust’s C-compatible ABI and Ruby’s ability to load native extensions so long as they look mostly like C, and saw roughly a 20x speedup across that area despite making only minimal changes to the logic. This was a lovely piece of synergy, with both languages providing all the tools we needed to integrate them and derive the benefits we needed. The bid processing pipeline is now dominated by data aggregation, which still takes place in Ruby, and we intend to revisit it at a later date.
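The mechanics of calling across a C-compatible ABI from Ruby can be sketched with the stdlib Fiddle library. The snippet below binds libc’s `strlen` purely as a stand-in symbol; a Rust cdylib exporting `#[no_mangle] pub extern "C"` functions would be opened and bound in exactly the same way (the actual Market Dojo gem uses Ruby’s native-extension loading rather than Fiddle, so this is a mechanism demo, not the real integration).

```ruby
require 'fiddle'

# Open the current process's symbol table; for a Rust cdylib you would
# instead pass the path to the compiled .so/.dylib to Fiddle.dlopen.
handle = Fiddle::Handle.new

# Bind a C-ABI function by name. 'strlen' stands in for a Rust-exported
# symbol such as a hypothetical 'rerank_bids'.
strlen = Fiddle::Function.new(
  handle['strlen'],
  [Fiddle::TYPE_VOIDP],   # argument: pointer to a NUL-terminated string
  Fiddle::TYPE_SIZE_T     # return: size_t
)

puts strlen.call('re-rank')  # length of "re-rank" -> 7
```

Because Rust can speak the C ABI natively and Ruby can consume it, neither side needs glue code beyond declaring the shared function signatures.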
Overall, we are pleased with how far we have come on this journey, but we fully acknowledge that we have much further to go. The astute reader will have noticed that I discussed primarily RFQs. Equivalent work for the e-auction side of Sourcing is ongoing, and even so will likely not reach the same Lot size, not least due to the nature of e-auctions as fast-paced Events where it would be unfair or unreasonable to expect suppliers to bid with any precision on many thousands of items.