Efficient Analytics of On-chain Transactions Leveraging Blockchain’s Twin Node

Wipro Tech Blogs
8 min read · May 10, 2022


By Hitarshi Meenketu Buch, Dheeraj Budhiraja

Setting the Stage

The data processed by blockchain has tremendous value and can provide deep business insights. However, the shared ledger residing on each blockchain node cannot be directly consumed for reporting and analytics purposes because of the following technical limitations:

  1. On a blockchain, the underlying transactional data is stored as key-value pairs, and the current or historical state can be queried only by its “key”. This severely limits data retrieval.
  2. Analytics and complex reporting require querying the metadata associated with a blockchain transaction, which is the object mapped as the value of the “key”. Performing analytics by slicing and dicing the state data cannot be achieved directly.
  3. For privacy reasons, it is not possible to maintain PII (Personally Identifiable Information) or confidential data on blockchain. An encrypted form or hash of such data is stored on-chain, so business applications must rely on off-chain data stores for querying and reporting purposes.

Solution implementers have relied on non-standard, peripheral techniques for storing business transaction data in off-chain repositories to overcome these shortcomings. This effectively means that the same set of data must be maintained both on blockchain and in other off-chain components, which can lead to data inconsistency and integrity issues. In the absence of a standardized mechanism, such a data replica can quickly go out of sync as the transaction volume increases.

Our Solution — BANDIT (Blockchain Analytics via Node’s Digital Immutable Twin)

Transactions and the state processed by smart contracts on blockchain are stored in the node’s LevelDB database as key-value pairs, on which rich querying and analytics cannot be performed directly. To read data from the blockchain, one must invoke the query APIs by transaction hash and then decode the result using the contract ABI to understand what happened inside that transaction.
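To make this concrete, here is a minimal sketch of that key-based retrieval and ABI-decoding flow, assuming web3.py against a local EVM node; the node URL, transaction hash, ABI file, and the Transfer event are illustrative placeholders, not part of BANDIT itself:

```python
# Hypothetical sketch: reading a transaction from an EVM node and decoding it
# with the contract ABI using web3.py.
import json
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # main node RPC endpoint

tx_hash = "0x..."  # placeholder: hash of the transaction to inspect
receipt = w3.eth.get_transaction_receipt(tx_hash)      # key-based lookup only

with open("contract_abi.json") as f:                   # placeholder ABI file
    abi = json.load(f)

contract = w3.eth.contract(address=receipt["to"], abi=abi)

# Decode the event logs emitted inside the transaction; without the ABI, the
# log payload is just opaque topics and hex data.
for event in contract.events.Transfer().process_receipt(receipt):
    print(event["args"])
```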

These technical limitations of blockchain technology can be overcome with a solution we call BANDIT (Blockchain Analytics via Node’s Digital Immutable Twin). BANDIT ingests blockchain transactions and stores the data in a read-only NoSQL datastore that supports rich querying and reporting.

Key Technology Decisions

To achieve the goal of implementing a scalable and reliable solution to enable the twin node, we considered the following architectural decisions to select the best-fit, open-source technology building blocks.

  • Transaction Relay Mechanism from Main Node to Twin Node
  • Twin Node Database
  • Analytics Engine

Technical Architecture

Our technical architecture comprises solution building blocks based on the architectural decisions above. Components are segregated into two main categories based on the functionality required.

Main Node Transaction Processor

The following components are used to ensure that the transactions occurring on the blockchain network are processed.

Web3 Transaction Listener uses the web3 SDK for EVM (Ethereum Virtual Machine) blockchain platforms to listen to block and transaction events and capture transaction receipts in real time. It can also be configured to listen to specific smart contract events. The component includes a Kafka client that tracks which transactions have already been ingested and pushes new transactions to Kafka incrementally.
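A minimal listener loop along these lines, assuming web3.py and kafka-python; the topic name, endpoints, and polling interval are illustrative:

```python
# Sketch of a block/transaction listener that relays receipts to Kafka.
import json
import time
from web3 import Web3
from kafka import KafkaProducer

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))      # main node RPC
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode())

block_filter = w3.eth.filter("latest")  # notifies on every new block
while True:
    for block_hash in block_filter.get_new_entries():
        block = w3.eth.get_block(block_hash, full_transactions=True)
        for tx in block.transactions:
            receipt = w3.eth.get_transaction_receipt(tx.hash)
            producer.send("blockchain-transactions", {   # illustrative topic
                "blockNumber": block.number,
                "txHash": tx.hash.hex(),
                "to": tx["to"],
                "status": receipt.status,
            })
    time.sleep(1)  # poll interval between filter checks
```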

Twin Node Registration provides APIs for registering the twin node authorized to subscribe to blockchain transactions from the main node. Registration uses Kafka SASL authentication (SCRAM-SHA-512) and enforces subscriber-specific, bi-directional transport layer security between the main and twin nodes. The twin node’s eligibility as a subscriber is validated, and the data is encrypted using the subscriber’s public key.

Blockchain Event Configurator sets up the types of events to be captured from the main node. The default behavior captures all blocks and the transactions they contain, with the option to filter by specific smart contract addresses provided by the decentralized application (dApp).

Kafka Message Publisher ensures that events captured from the main node are pushed to a Kafka cluster to prevent message loss. The rate at which the Kafka cluster accepts messages and the queue depth can be configured based on the expected transaction volume.
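For illustration, a kafka-python producer configured for SASL/SCRAM-SHA-512 over TLS, with acknowledgements from all in-sync replicas to guard against message loss; the broker address and credentials are placeholders:

```python
# Sketch of a loss-averse, authenticated Kafka producer configuration.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.example.com:9093",  # placeholder broker
    security_protocol="SASL_SSL",                # TLS transport + SASL auth
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username="main-node-publisher",   # placeholder credentials
    sasl_plain_password="<secret>",
    acks="all",     # wait for all in-sync replicas before acknowledging
    retries=5,      # retry transient broker failures
)
```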

Twin Node Transaction Consumer

The following components enable analytics and reporting for blockchain transactions:

Kafka Message Subscriber listens to the Kafka cluster for transactions over a secured connection and populates the twin node’s MongoDB database in real time. The transactions are stored according to the database configuration.
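A sketch of this consumer loop, assuming kafka-python and pymongo; the topic, group id, and database names are illustrative:

```python
# Sketch of the twin-node consumer that persists transactions to MongoDB.
import json
from kafka import KafkaConsumer
from pymongo import MongoClient

consumer = KafkaConsumer(
    "blockchain-transactions",                   # illustrative topic
    bootstrap_servers="kafka.example.com:9093",
    security_protocol="SASL_SSL",
    sasl_mechanism="SCRAM-SHA-512",
    sasl_plain_username="twin-node-subscriber",  # placeholder credentials
    sasl_plain_password="<secret>",
    group_id="twin-node",
    value_deserializer=lambda v: json.loads(v.decode()),
)

db = MongoClient("mongodb://localhost:27017")["twin_node"]
for message in consumer:
    # Idempotent upsert keyed on the transaction hash, so replayed Kafka
    # messages do not create duplicate documents.
    db.transactions.replace_one(
        {"txHash": message.value["txHash"]}, message.value, upsert=True
    )
```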

Database Configurator determines how collections and indexes are created in the MongoDB database. The default configuration saves all transactions relayed from the main node in a single collection. Alternatively, transactions can be stored in multiple collections based on smart contract address, block count, etc. This module also provides APIs to store a smart contract ABI (Application Binary Interface), which can be used to decode and store the transaction payload.
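One way these configuration choices might translate into pymongo calls; the per-contract collection naming scheme is an assumption for illustration:

```python
# Sketch of the configurator's collection and index choices in pymongo.
from pymongo import MongoClient, ASCENDING

db = MongoClient("mongodb://localhost:27017")["twin_node"]

# Default: one collection for all relayed transactions, indexed on the
# fields most commonly used in lookups.
db.transactions.create_index([("txHash", ASCENDING)], unique=True)
db.transactions.create_index([("blockNumber", ASCENDING)])

# Alternative: route each smart contract's transactions to its own collection.
contract_addr = "0xAbC..."  # placeholder contract address
db[f"txs_{contract_addr.lower()}"].create_index(
    [("txHash", ASCENDING)], unique=True
)
```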

Twin Node Database is a read-only data store: only the message subscriber module is allowed to write to it, and it is set up with read-only rights for all other purposes. It can comprise one or more collections, depending on the database configuration.
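A hedged sketch of how such read-only enforcement could be set up with MongoDB’s built-in roles via pymongo; the user names, passwords, and database name are placeholders:

```python
# Sketch: only the subscriber account can write; analytics principals get
# MongoDB's built-in "read" role.
from pymongo import MongoClient

admin = MongoClient("mongodb://admin:<secret>@localhost:27017")["twin_node"]

admin.command("createUser", "kafka_subscriber",
              pwd="<secret>",
              roles=[{"role": "readWrite", "db": "twin_node"}])
admin.command("createUser", "analytics_user",
              pwd="<secret>",
              roles=[{"role": "read", "db": "twin_node"}])
```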

Spark Context and Spark SQL are used for advanced querying requirements in which a large data set must be processed to generate reports.
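For example, a PySpark session reading the twin-node collection through the MongoDB Spark Connector (v10-style configuration assumed) and running a Spark SQL aggregation might look like this; the URI and credentials are placeholders:

```python
# Sketch of Spark SQL over the twin-node MongoDB collection.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("bandit-analytics")
         .config("spark.mongodb.read.connection.uri",
                 "mongodb://analytics_user:<secret>@localhost:27017/"
                 "twin_node.transactions")
         .getOrCreate())

df = spark.read.format("mongodb").load()   # collection as a DataFrame
df.createOrReplaceTempView("transactions")

# Spark SQL over the registered view: busiest blocks by transaction count.
spark.sql("""
    SELECT blockNumber, COUNT(*) AS tx_count
    FROM transactions
    GROUP BY blockNumber
    ORDER BY tx_count DESC
    LIMIT 10
""").show()
```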

An API gateway exposes query APIs that MIS (Management Information Systems) can use for analytics, reporting, and verifying twin-node transactions against the Merkle proof available on the main blockchain node, on a need-to-know basis.
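As a simplified illustration of such verification, the gateway could cross-check a twin-node record against the main node’s receipt. A full proof would verify inclusion against the block’s transactions trie; this sketch deliberately stops short of that:

```python
# Sketch: lightweight cross-check of a twin-node record against the main node.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://localhost:8545"))  # main node RPC

def verify_record(record: dict) -> bool:
    """Confirm the twin-node document matches the on-chain receipt."""
    receipt = w3.eth.get_transaction_receipt(record["txHash"])
    return (receipt.blockNumber == record["blockNumber"]
            and receipt.status == record["status"])
```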

Testing BANDIT for Scale

We tested BANDIT’s performance on two fronts: ingesting blockchain transactions in real time, and query performance. The first matters because a transaction’s capture, publishing, and storage on the twin node must be fast and reliable. The second shows how effectively large amounts of data can be queried.

Ingesting Blockchain Transactions

Based on the solution architecture, we conducted performance testing of BANDIT using the Quorum blockchain, a permissioned fork of the Ethereum blockchain. It uses the EVM as the node’s runtime engine and supports Solidity smart contracts.

[Figure: Inbound blockchain transactions on the Quorum blockchain network]

Transaction publishing to the Kafka queue was initiated after 6 million transactions had been processed on the blockchain. In live operation, the ingestion rate is bounded by the throughput the blockchain itself can support.

[Figure: Transactions published to the Kafka queue]
[Figure: Transaction consumption details on the twin node]

This BANDIT setup can be used with any EVM blockchain network, including Ethereum Mainnet. The peak transaction volume on Ethereum Mainnet is approximately 1.8 million per day (roughly 21 transactions per second on average), which the BANDIT solution can easily accommodate.

Consuming Blockchain Transactions

We performed queries on the dataset comprising six million records in MongoDB, and the results varied based on the type of query performed. Response times were anywhere from 3 to 21 seconds. Based on the use case and querying requirements, the following optimizations are recommended using MongoDB and Spark in tandem:

  • Represent the MongoDB collection as a Spark DataFrame, creating a configuration object and a SQL parser object alongside it, so the collection can be queried in 1-minute windows for efficiency.
  • Use secondary indexes on MongoDB and run Spark queries on any slice of data without requiring full collection scans. For example, in a syndicated-loan use case where the loan agreement is handled by a smart contract, the smart contract’s data can be treated as a slice to run a high-performance query for all loans disbursed in a year for a specific region (see the sketch after this list).
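A sketch of the second optimization, combining a MongoDB secondary index with Spark filter pushdown; the field names (contractAddress, disbursedYear, region) are hypothetical:

```python
# Sketch: secondary index plus pushed-down filters for slice queries.
from pymongo import MongoClient, ASCENDING
from pyspark.sql import SparkSession

# Secondary index so the Mongo-side filter is an index lookup, not a scan.
db = MongoClient("mongodb://localhost:27017")["twin_node"]
db.transactions.create_index([("contractAddress", ASCENDING)])

# SparkSession configured with the MongoDB connection URI as in the
# earlier sketch.
spark = SparkSession.builder.getOrCreate()

loans = (spark.read.format("mongodb").load()
         .filter("contractAddress = '0xAbC...'")   # pushed down to MongoDB
         .filter("disbursedYear = 2021 AND region = 'APAC'"))
print(loans.count())
```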

Synopsis

The BANDIT solution is applicable to blockchain platforms such as Ethereum and several other blockchain protocols that rely on key-value store databases. The twin node approach requires substantial storage that will continue to grow as blockchain transactions accumulate; this can be mitigated by purging and archiving twin-node data at regular intervals.

As blockchain has evolved, some private, permissioned distributed ledger platforms have tried to resolve the querying limitations by providing an alternate NoSQL or relational data store as part of the main node. This affects overall performance and can compromise the security and data integrity of the blockchain ecosystem.

Instead of relying on bespoke methods of replicating blockchain state to support querying and reporting, BANDIT provides a standardized, secure, and reliable mechanism: an event-driven, credential-based, read-only twin node, configurable for use-case-specific smart contracts and for query optimizations using MongoDB and Apache Spark.
