Serving a large user base with reliable, consistent and low latency data is a very hard challenge for any backend team. At Ledger, we made the strategic choice to host our own blockchain core data services. By not relying on third parties, we can manage our clients’ data ourselves, ensuring that underlying processes adhere to our security guidelines and performance-oriented Service Level Objectives (SLO).
But this strategy brings its own set of challenges too.
Our first challenge is to migrate these core data-providing services away from cool and shiny noSQL tools. In this article, I will dive into why we made this difficult decision, the complexities we met and the benefits we reaped.
The goal of this article is to show the technical aspects that led us to choose PostgreSQL as our new baseline storage layer for blockchain data.
Deep dive into Blockchain Data
Blockchain data has several key features.
First, it is ever growing, and nothing is ever deleted from it. However in practice, although most of a blockchain is immutable, the youngest part of the blockchain may change due to conflicts that need to be resolved. Indeed, as the chain is a peer to peer network, several legitimate blocks may coexist temporarily. Usually, the older is deleted, resulting in what we call a reorganization. Long story short, the data is divided between an immutable cold tail and a rarely changing head state.
The issue we’re trying to solve is that whilst blockchains are great for having byzantine fault-tolerant data, they are less effective for slicing and dicing it over many axes. Namely, getting the list of operations that affected an account is very difficult. Even getting an account balance on a blockchain like bitcoin is a challenge when you don’t already have the list of transactions.
To overcome these challenges, Ledger Explorer Services index the whole blockchain.. It is a large, critical and performance-sensitive service fully written in Scala, using the cats-effect high performance runtime. We are over 10k rps on bitcoin, while maintaining a tail p95 latency under 100ms. We are also recruiting 😊.
A little bit of history
At the beginning of our story, way before I joined the company, the Ledger data service layer was handled by an embedded Neo4j database. Each serving box was indexing its own data and served it locally, which caused a lot of issues.
Data consistency between instances was not guaranteed, and the sheer size of the state that needed to be indexed, combined by neo4j disk and ram usage, was not scalable. This problem only worsened as the company grew, making it increasingly challenging to spawn new instances.
Cassandra was then chosen as the main driver of this new setup: it is a clustered, horizontally scalable database that is on the AP side of the CAP theorem. It solves the issues related to data sharing and allows for a clear separation between the indexing, blockchain aware component and the headless API servers.
But what is the point of having the whole historical state available if we are never going to actually read from it ?
Regarding our use case, raw historical data is rarely needed because our user’s account state can be aggregated from it. This led us to challenge the existing data storage solution which is based on the Cassandra distributed database.
The volume of data we need to store per blockchain, although in the terabyte range, is not what one can call “big data”. Moreover, the part of if that will be used to answer most of the queries (a.k.a. The hot path) is even smaller. Nowadays one can easily find commodity hardware servers with more than 16TB of NVMe SSD storage. Vertical scaling is a very powerful tool, and a relational database is too.
Finally, the main issue we had with the current cassandra setup was neither the wasteful storage model nor the poorly fitted data use case, but the lack of developer friendliness. Developing a new data based feature on cassandra has proven being unnecessarily time consuming. We strived for implementing each new axis on which we need to provide data.
Given our team’s expertise in data modeling skills and SQL proficiency, PostgreSQL was the perfect candidate. This solution is Battle-tested, robust, easy to extend, making it an ideal choice.
Why we chose SQL over NoSQL:
- Reads / Writes balances: the blockchain data use case has been strongly skewed over reads rather than writes (blockchain writes very few data at a very reasonable rate, even for a blockchain like Polygon). Cassandra has the ability to absorb a very high amount of writes – the read path is actually longer than the write path.
- Indexing support: Indices are a key component of a DBMS to answer queries and new business cases or opportunities. Cassandra has limited support for indexing. Indices are only effective if the query already specifies a way to restrain the partition on which the query will run. We pay here the cost to have an arbitrary distributed database. PostgreSQL support for indices is efficient, extensible and at the edge.
- Aggregation support: Same case for aggregation; since Cassandra does not allow multi-partition aggregation and tolerates no GROUP BY clause in its query language, its support is kind of lacking. PostgreSQL proposes an extensive aggregation support, even on exotic data types like ranges and jsonb blobs.
- Data modeling: Cassandra is very, very limiting in the way data modeling is possible. A table must be created for nearly each request you want to answer, and data has to be denormalized into large rows (using fully the wide column store aspect of C* and also the fact that writers are dirt cheap). PostgreSQL allows us to leverage the relational aspect of the blockchain (calls, transactions, blocks) and spare disk space, encouraging data reuse.
- Ad-hoc queries and auditing: Being able to use the full standard of SQL and do arbitrary queries means we can explore and search for potential bug root cause or have exploratory data for future use cases. We can really use the database as an interactive and smart tool rather than a dumb storage. Doing so on Cassandra without an extensive and costly analytics compute cluster like Presto, Spark, etc. (and as we are running on bare metal servers, we don’t have access to easily spawned distributed data analysis tools like EMR).
- Storage usage: Cassandra assumption is that storage is very cheap and that the cluster can be easily extended with new machines. That means that all the limitations on both indices and aggregations must be paid for with storage. No globally efficient indices and join support means that we have to denormalize and store a copy of the whole table for each axis we want to query. PostgreSQL spares us terabytes of storage.
- Consistency: As Cassandra is a distributed, AP-oriented database (communication is made with gossiping between nodes), consistency is only eventual in terms of writes. You can tune the consistency policy of each statement for both reads and writes, but the goal of this database was never to have strong consistency. PostgreSQL has a strong story of being used for critical missions and is highly resilient. Being centralized also means that there is no network involved in the write path.
- Transactions and MVCC:
- Transactions: Cassandra supports only lightweight transactions on DML queries. Some batching can be applied (doc) but there are numerous caveats, namely that the rows have to be in the same server (= partition) to not have horrendous performance.
- MVCC: Cassandra supports row time stamping but the full MVCC is not guaranteed. A compaction can erase stale data and there is no way to tell C* it shouldn’t (like with e.g. a transaction in PG).
- PostgreSQL supports a strong MVCC model that ensures a consistent read path for our users.
- Tooling: PostgreSQL has many more tools that are widely used to easily operate the database. Moreover, a tool like flyway ensures we maintain a strong versioning of the database schema. We already integrated it with our code base successfully. There is no equivalent with this level of maturity on Cassandra.
- Horizontal scalability: This is the key selling point of Cassandra.Just add more machines as your data expands. No equivalent for PostgreSQL as sharding and partitioning have to be made manually.
How we plan to scale
As we have seen, the only downside to using a Postgres setup is scaling both on reads and storage. What can we do to overcome this limitation?
The first effective tool we have is to segregate every protocol or blockchain we support into its own database, as thus can be scaled appropriately given the volume and traffic. Segmentation by business domain ensures a first layer of scaling.
By taking this concept further, we can also segment cold, historical data into temporal partition. The latest versions of Postgres have improved a lot the usability of partitioned tables, that could enable to move seamlessly data across a cluster of machines. For instance, we could use cheaper machines with less compute power to host the majority of historical data, while keeping beefy user-serving RAM-stacked behemoths to host aggregated tables and user’s latest operations.
This approach works very well in our use case because there are no cross-partition foreign keys in historical storage (everything is ultimately attached to the block). From the perspective of the main server, historical data could even be transparently accessed using partitioning and the postgres_fdw extension.
In order to help put all of this in place, we have also looked into the TimescaleDB extension. This extension adds a lot of functionalities to baseline postgres, and most of these are a perfect fit for our use cases:
- Automatic partitioning of tables based on a time like column (in our case, we adapt it by taking the blockchain height as our reference).
- Automatic, data type aware and column based compression of older chunks. This ensures a nearly perfect compression ratio by using state of the art algorithms on data that is very similar.
- Efficient time-bucket based aggregation to easily compute historical balances and market data graphs.
We are just at the beginning of experimentation regarding storage,and this unlocks a lot of use cases. Proof of concepts using a small amount of data (~10k blocks on the ethereum mainnet, so around 2 days of data) showed disk space reduction as high as 40%.
As we have seen, data volume, provided we use the right strategy, is not an issue. But how to scale with the size of our user base?
We already have a nice advantage here: we index the whole blockchain data. Thus, the storage needed won’t grow like the number of users, but like the total blockchain size. Storage and reads optimizations are totally orthogonal in their resolution.
This setup, combined with the very low write need in proportion to the read volume which needs to be served, is the dream setup for a classing leader-follower replica pattern. In order to enhance further performance and throughput, we also can place the postgres read replicas on the same machines as the API servers and take advantage of the UNIX domain sockets to skip the network round trips.
Here is an example of a data replication strategy that we could use to scale our reads. Light gray boxes represent single servers. We can see here that API pods are directly co-located with replicas of the hottest data to ensure a minimal transfer time between storage and users. Archive instances described precedently are not represented to not complicate the schema too much.
As a long term Cassandra user, I want to emphasize that it is a great database in its design, that suits a wide variety of applications. Unfortunately, the choice that was made at Ledger to use it was made on a data use case which never materialized.
Our team productivity was impacted, and looking forward to the challenges we have to solve we chose to bite the bullet and to not fall for the sunken cost fallacy.
In many cases, your data is not big data. Managing data distribution is not a difficult task in most cases, and the tradeoffs of a full-fledged distributed database really needs to be carefully considered. The key consideration is the developer experience as it frees up valuable time to build anything else.This is the real use case we need to heavily invest in.