Migrating from a database to a text file can be a complex challenge. In this post, we explain an approach we used to evolve the system gradually.

This is the second episode on replacing binary data with a textual representation. In the previous episode, we described the benefits of storing data in the form of a domain-specific language (DSL). Here, we describe a convenient first step in this replacement process: substituting a legacy key/value database library with a drop-in replacement that stores the data in a textual format with one record per line. Besides being an ideal stepping stone for further migration, this format also provides some basic version control functionality, which gives users in the organization time to adjust to working with version control tools.
Setting the stage
We were contracted to assist in the renovation of a large legacy system for modeling financial products. The models were all stored in an opaque binary format in a proprietary key/value database.
The content of the models was quite diverse: Some parts consisted of source code, while other parts were data structure definitions or specifications of user-interface behavior. Think of a proprietary version of MS Access or Oracle Forms.
A key goal of the renovation was enabling effective version control of the models. Peer review of changes, inspecting a previous version to look for the source of a problem, merging changes from different sources, and undoing individual changes or changesets were all extremely difficult or even impossible.
We wanted to achieve this goal by representing the model as a series of text documents in different DSLs. However, as is often the case when renovating a legacy system, a “big bang” operation was infeasible.
So, we broke up the renovation into manageable steps:
- Represent the records in the database as JSON documents, with one record per line
- Gradually identify DSLs and migrate the data from the record structure to DSL source files
In this post, we describe the first step. The second step will be the focus of a future post.
From binary data to text documents
In the first step, we introduced a textual format to store the records from the legacy database. Each file was a JSON document in which each record was represented as a single line, with the actual record data stored as a base-64 encoded string.
Version control tools tend to be line-oriented. An early advantage of a record-per-line text format is therefore that version control becomes somewhat usable right away: each changed record shows up as a changed line in a diff.
A typical textual database file looks something like this:
{ "version":"1.0", "records":[ {"key":"Customer/4C89DE000", "data":"TG9yZW0gaXBzdW0gZG9s....."}, {"key":"Contract/FFFF00000", "data":"VGlyYW1pc3UgdG9wcGlu....."}, ... ] }
This is just a standard JSON document containing a list of records with some metadata. As the data is represented as a base-64 encoded string, the actual content changes between two versions of a file cannot be seen in a diff. However, because each key starts with a readable record type, a diff does show what kinds of records have changed between two versions and how many of them have changed. In practice, this information is only really useful for technicians, not for end users.
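For illustration, producing one such record line could look roughly like the following minimal Python sketch; the function name is ours, not part of the actual implementation:

```python
import base64
import json

def record_to_line(record_type: str, record_id: str, payload: bytes) -> str:
    """Serialize one binary record to a single JSON line.

    The key combines a readable record type with the record's identifier,
    so a diff at least reveals what kind of record changed; the binary
    payload itself is base-64 encoded and remains opaque.
    """
    return json.dumps({
        "key": f"{record_type}/{record_id}",
        "data": base64.b64encode(payload).decode("ascii"),
    })

# One line of the "records" array shown above:
# record_to_line("Customer", "4C89DE000", customer_blob)
```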
Although this format supports version management, more sophisticated tasks, like reviewing changes, have to wait until the migration to DSLs. However, the format proved to be a successful midway point between the original legacy binary format and a full migration to textual DSLs, as it allowed the users to gain experience with version management of their data.
Implementation through API compatibility
API calls to the proprietary key/value database manager were scattered all over the codebase, so replacing all these calls with calls to a new textual database API was not realistic. Instead, we went for API compatibility: we created an API for our textual database that was compatible with the API of the proprietary database. This also allowed us to address both databases in parallel during testing, by introducing an API multiplexer that forwarded each call to both database APIs and verified that the results were the same.
This proved to be an excellent way of finding bugs in the implementation, especially those caused by misunderstandings of the semantics of the proprietary database API – for which no documentation or source code existed.
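A minimal sketch of such a multiplexer, assuming a simple get/put interface (the real proprietary API was larger and its method names differed):

```python
class MultiplexingDatabase:
    """Forward every call to both databases and check that they agree."""

    def __init__(self, legacy_db, textual_db):
        self.legacy_db = legacy_db
        self.textual_db = textual_db

    def get(self, key):
        legacy_result = self.legacy_db.get(key)
        textual_result = self.textual_db.get(key)
        # Any disagreement points at a bug or a semantic misunderstanding.
        assert legacy_result == textual_result, f"mismatch for key {key!r}"
        return legacy_result

    def put(self, key, value):
        self.legacy_db.put(key, value)
        self.textual_db.put(key, value)
```

Running the test suite and day-to-day workflows through such a multiplexer surfaces differences immediately, at the exact call site where they occur.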
Does it perform?
Performance degradation is a big risk when replacing a binary database with a “line per record” textual format. We used several strategies to mitigate this risk:
- Index the entire database file when it is first opened. The index contains a mapping from the key to the physical position in the file (a sketch follows this list).
- Maintain an in-memory record cache.
- If a record needs to be written but the record size does not change, update the record in place instead of rewriting the entire file.
- Split the database into multiple files. The smaller the file, the easier it is to write the entire file. However, too many small files will grind file and version control systems to a halt, so there is a tradeoff here.
- When performing batch operations, delay writing the database file(s) until the batch operation is complete.
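As an example, the first two mitigations might look roughly like this in Python; the class is a simplified sketch that assumes the line-per-record format shown earlier, and all names are illustrative:

```python
import base64
import json

class TextualDatabaseFile:
    """Sketch: index the file once on open, then cache records in memory."""

    def __init__(self, path: str):
        self.path = path
        self.index = {}   # key -> byte offset of the record's line
        self.cache = {}   # key -> decoded record data
        with open(path, "rb") as f:
            content = f.read()
        offset = 0
        for line in content.splitlines(keepends=True):
            stripped = line.strip().rstrip(b",")
            if stripped.startswith(b'{"key"'):
                record = json.loads(stripped)
                self.index[record["key"]] = offset
            offset += len(line)

    def read(self, key: str) -> bytes:
        if key not in self.cache:
            # Jump straight to the record's line instead of re-parsing the file.
            with open(self.path, "rb") as f:
                f.seek(self.index[key])
                record = json.loads(f.readline().strip().rstrip(b","))
            self.cache[key] = base64.b64decode(record["data"])
        return self.cache[key]
```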
Together with some other optimizations, this approach proved to be enough to keep the performance loss below 5%, which is more than satisfactory for the capabilities gained.
Key takeaways
- Replacing a proprietary key/value store with a textual key/value database can liberate your data by offering rudimentary version control.
- It can be a valuable first step in an incremental process towards a full DSL approach.
Get in touch
If you have an application that is in desperate need of better version control, feel free to contact us to discuss the possibilities of converting your data to a textual format.