A blockchain index structure based on subchain query

Blockchain technology is decentralized and tamper-resistant, so it can store data safely and effectively reduce the cost of trust. However, existing blockchain systems perform poorly at data management and only support traversal queries with transaction hashes as keywords. The query method based on the account transaction trace chain (ATTC) improves the efficiency of querying an account's historical transactions, but it does not effectively improve efficiency for accounts with long transaction chains. To address the inefficiency and the single query mode of the ATTC index, we propose a subchain-based account transaction chain (SCATC) index structure. First, the account transaction chain is divided into subchains, and the last blocks of the subchains are connected by hash pointers. The block-by-block query mode of ATTC is thereby converted into a subchain-by-subchain query mode, which shortens the query path. Multiple transactions of the same account in the same block are merged and stored together, which reduces the cost of index construction and saves storage resources. Then, a construction algorithm and a query algorithm are given for the SCATC index structure. Simulation analysis shows that the SCATC index structure significantly improves query efficiency.


Introduction
Advances in computer science have promoted the development of technologies such as big data [1], blockchain [2,3], and the Internet of Things [4][5][6], and have brought many convenient services [7] to users. However, many problems remain, such as user data privacy leakage [8][9][10], low algorithm efficiency [11], and low search efficiency [12]. Since traditional centralized institutions are not completely credible, users' data may be leaked. Blockchain, with its decentralized characteristics, can store data safely and protect users' data privacy [13].
In 2008, Bitcoin was proposed by Satoshi Nakamoto in "Bitcoin: A Peer-to-Peer Electronic Cash System" [14], marking the emergence of blockchain technology. As the underlying technology of Bitcoin, blockchain has received extensive attention [15][16][17][18]. Blockchain is a distributed database technology characterized by decentralization [19][20][21], traceability, tamper resistance, collective maintenance, etc. [22]. The emergence of this technology solves a series of problems such as high cost, low efficiency, and low trust brought by centralized institutions [23]. However, the blockchain is a chain structure, so query efficiency decreases as the number of blocks grows. Take the Bitcoin blockchain as an example. As of June 7, 2021, the block height has reached 674,000, which means that when querying historical data, hundreds of thousands of blocks may be traversed. Such a query method cannot meet current query requirements.
*Correspondence: ylchen3@gzu.edu.cn. 1 State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang 550025, China. 4 Blockchain Laboratory of Agricultural Vegetables, Weifang University of Science and Technology, Weifang 262700, China. Full list of author information is available at the end of the article.
LevelDB is the mainstream database in blockchain systems and is based on the LSM-tree storage structure, which leads to lower read performance for the blockchain [24]. Besides, LevelDB only supports simple key-value queries, not relational queries [25,26].
When querying transactions, users can only traverse in block order, which further reduces query efficiency [27]. The blockchain system only supports queries with transaction hashes as keywords and does not support queries with account hashes as keywords, so only a single query mode is available. In response to this problem, some current solutions transfer on-chain data to off-chain storage [28,29] to improve query efficiency, but off-chain storage violates the decentralized nature of the blockchain. Third-party databases face trust issues and may also suffer from single points of failure, data loss, data tampering, and other problems. Off-chain storage therefore has serious security vulnerabilities [25]. Consequently, improving on-chain retrieval efficiency while ensuring security is a current research hotspot.
You et al. [30] designed a hybrid index mechanism that supports blockchain transaction traceability based on the Ethereum state tree. In this mechanism, a hash pointer is embedded in the account transaction and points to the block where the account's previous transaction is located. Through this pointer, the Account Transaction Trace Chain (ATTC) can be quickly traced. The query method based on ATTC improves the query efficiency of account transactions, but for active accounts with long transaction chains, a long chain still needs to be traversed. Besides, users do not always want to find all the historical transactions of an account, and it is still difficult to find target transactions in massive account data. In this regard, we improve the query scheme based on ATTC and propose the SCATC index structure, which effectively addresses the shortcomings of the ATTC index structure in queries. The main contributions of this paper are as follows:
1. We divide the transaction chain into subchains and connect different subchains with hash pointers to shorten the query path when querying early historical transactions. This solution does not trade space for time: while the time complexity is reduced, the space complexity does not increase significantly.
2. We design a construction algorithm and a query algorithm for the SCATC index structure. The simulation results show that the SCATC-based query is more efficient when querying the early transactions of accounts.
3. Multiple transactions of an account in the same block are merged into one, and at most one index is built within each block for the same account. This reduces the cost of index construction and storage overhead.
The paper is organized as follows. The "Related works" section introduces related work on blockchain data queries; the "Preliminaries" section introduces some preliminary knowledge of blockchain; the "SCATC index structure" section elaborates on the construction method and query algorithm of the SCATC index structure; the "Experiment and analysis" section presents the efficiency analysis and simulation experiments. The paper is summarized in the "Conclusions" section.

Related works
To improve the efficiency of blockchain queries, researchers have carried out many related studies. Xu et al. [31] proposed an Educational Certificate Blockchain (ECBC) to address the management of education certificates. ECBC builds a tree structure (MPT-Chain), which supports not only efficient queries of transactions but also queries of the historical transactions of accounts. The index structure improves the efficiency of querying account transactions.
Morishima et al. [32] propose to accelerate blockchain search by exploiting the higher computing power of GPUs. Utilizing the fact that blockchain data does not need to be updated or deleted, they introduce an array-based Patricia tree structure that is suitable for GPU processing. To study identity verification and range queries in hybrid-storage blockchains, Zhang et al. [33] used a unique gas cost model to design an authenticated data structure, GEM2-tree, that can be effectively maintained by the blockchain. It not only saves gas consumption in smart contracts but also effectively supports identity verification queries. Aiming at the inefficient queries of the Elastic-Chain [34] model, Jia et al. [35] propose the ElasticQM (elastic query model) query method based on that model. In the user layer, the model caches the user's first query result to improve the efficiency of the second query. In the data layer, the B-tree is combined with the Merkle tree to construct the B-M tree as the blockchain data storage structure. This storage structure improves the query efficiency of the data inside a block. Jiao et al. [36] propose a blockchain database system framework, which realizes data management applications on the blockchain. Combining red-black trees with Merkle trees, they propose a tamper-resistant index based on hash pointers, through which data in a block can be located quickly. Zheng et al. [37] divide the data attributes on the blockchain into discrete attributes and continuous attributes and propose a MHerkle tree index structure for the different attributes, which supports range queries. H et al. [38] proposed the Ethereum data index structure EBTree based on the B+ tree index. The EBTree index supports real-time top-k queries and range queries. In addition, EBTree only stores the identifiers of the blockchain data, so it occupies relatively little storage space and has better search and insertion performance.
Ren et al. [39] introduce a DCOMB (Dual Combination Bloom filter) scheme, which converts the computing power used for Bitcoin mining into computing power for data queries. DCOMB has higher random read performance and a lower error rate than COMB (Combination Bloom filter). The cryptographically signed tree data structure of the Merkle Block Space Index (BSI) [27] modifies the Merkle KD-tree to support fast spatio-temporal query processing. In Ethereum, when a user initiates a transaction, the system checks the state of the account. Wan et al. [40] built a Merkle Patricia tree account storage structure, GMPT (Group Merkel Patricia Tree), to speed up queries of account state. However, GMPT does not support fast queries of historical transactions. For this purpose, an index directory BKV (B-Key-Value) is constructed in combination with the B-tree index [41].

Preliminaries
Blockchain is a chain structure, as shown in Fig. 1. The internal structure of a block is divided into two parts: the block header and the block body. The block header records information such as the timestamp, the hash value of the previous block, and the Merkle Root. A Merkle tree is recorded in the block body: each user transaction is hashed to obtain the hash value of a leaf node.
The hash values of two leaf nodes are combined and hashed again to obtain a new hash value, which is used as the hash value of their parent node. By repeating this operation level by level, the hash value of the Merkle Root is finally obtained, which can be used to verify the transactions in the block. Unlike traditional linked lists, the pointers used in the blockchain are hash pointers, which store hash values instead of memory addresses. The blocks are connected into a chain by hash pointers, and the pointers point from the new block to the old block in chronological order.
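The Merkle Root computation described above can be sketched as follows. This is a minimal illustration, not the hashing scheme of any particular blockchain: it assumes a single SHA-256 per node and duplicates the last node of a level with an odd number of nodes, and the names `sha256` and `merkle_root` are hypothetical.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    """Hash helper (assumption: one round of SHA-256 per node)."""
    return hashlib.sha256(data).digest()


def merkle_root(transactions: list[bytes]) -> bytes:
    """Fold transaction hashes pairwise until a single root hash remains."""
    if not transactions:
        raise ValueError("a block needs at least one transaction")
    # Leaf level: hash every transaction.
    level = [sha256(tx) for tx in transactions]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumption: pad an odd level by repeating its last hash.
            level.append(level[-1])
        # Parent hash = hash of the two concatenated child hashes.
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]
```

Because every parent hash depends on both children, changing any transaction changes the root, which is what allows the root in the block header to verify the transactions in the block body.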
When querying transactions, users can only traverse from the new block to the old block through the hash pointers. The data in the block body is queried through the Merkle tree: the query first checks the Merkle Root and then traverses the Merkle tree from top to bottom through the hash pointers starting at the Merkle Root. The hash pointer of a leaf node locates the storage location of the transaction. If the target transaction is not found in the current block, the next block is queried, and so on until the target transaction is found. When querying early historical transactions, a long part of the blockchain must be traversed. If the transaction does not exist in the chain, the query traverses the complete blockchain. This block-by-block traversal query method is extremely inefficient.
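The cost of this block-by-block traversal can be illustrated with a small sketch. It is a simplification under stated assumptions: blocks are held in a dictionary keyed by a block hash string, the Merkle tree lookup inside a block is replaced by a set membership test, and the names `Block` and `find_transaction` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    height: int
    prev_hash: Optional[str]      # hash pointer to the older block (None for the first block)
    tx_hashes: frozenset          # stand-in for the Merkle tree lookup in the block body


def find_transaction(chain: dict, tip_hash: str, tx_hash: str):
    """Walk from the newest block to the oldest; return (height or None, blocks visited)."""
    visited = 0
    current = tip_hash
    while current is not None:
        block = chain[current]
        visited += 1
        if tx_hash in block.tx_hashes:
            return block.height, visited
        current = block.prev_hash  # follow the hash pointer to the older block
    return None, visited           # the whole chain was traversed without a match
```

In this sketch, a transaction stored in the oldest block, or one that does not exist at all, forces a visit to every block, which is the worst case described above.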

SCATC index structure
In this section, we optimize the ATTC scheme and design the SCATC index scheme on that basis. We describe the SCATC index structure in detail and then present a construction algorithm and a query algorithm for it.

Index design
Given ATTC's shortcomings in retrieval, we improve its index structure. In ATTC, the transactions of an account in different blocks are connected by hash pointers, which we call first hash pointers (FHP). In the SCATC index structure, the transaction chain is divided into subchains: every $k$ ($k > 1$) blocks form a subchain, and each subchain has a subchain number. Each transaction of the account records its position when it enters the chain. For example, $\text{Account}_{n,j}$ (Account is the account name; $n$ and $j$ are positive integers with $j \le k$) means that the transaction is in the $j$th block of the $n$th subchain of the account's transaction chain. Every time the account has participated in transactions in $k$ further blocks, another hash pointer is added to the account branch leaf node in block $\text{Account}_{n,k}$, pointing to block $\text{Account}_{n-1,k}$. This pointer, which connects the last blocks of two adjacent subchains, is called the second hash pointer (SHP). The index structure of SCATC is shown in Fig. 2, which depicts the chain structure of the blockchain. The blocks connected by FHPs constitute the transaction chain of the account, and each SHP spans a complete subchain. The FHP in SCATC is embedded not in the transaction but in the leaf nodes of the Merkle tree. When querying early historical transactions, the system can skip the account's recent transaction data directly. For accounts with low activity, the latest transaction may lie in an early part of the blockchain. In the SCATC scheme, the state tree therefore maintains not only the account balance but also the subchain number of the latest transaction. Through the state tree, users can quickly locate the block that contains the latest transaction.
The same account may generate multiple transactions in a short time, and transactions with higher transaction fees usually enter the chain first, so the same account may have multiple transactions in the same block. To simplify index construction, we merge the transactions of an account in the same block for storage, so that the account branch leaf node can directly access all the transactions of the target account in the block. The storage diagram is shown in Fig. 3.
Taking Account_A as an example, regardless of whether a transaction of Account_A is included in the latest block, the leaf node of the account branch in the latest block maintains a hash pointer to the latest transaction of the account. While maintaining the global state, the state tree also records the specific transaction records of each account whose state has changed in the block. All transaction records of the same account in the same block are combined and stored together, for example the transactions Tx_A1 and Tx_A2 of Account_A in block N, and the transactions Tx_A3, Tx_A4 and Tx_A5 in block K. Specific account transactions can be accessed through the state tree without building a separate transaction tree, which reduces the cost of index construction.
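The merging step can be sketched as a simple grouping of a block's transactions by account. This is an illustrative assumption about the data layout, not the paper's storage format: transactions are modeled as `(account, tx_id)` pairs, and the function name `merge_by_account` is hypothetical.

```python
from collections import defaultdict


def merge_by_account(block_txs):
    """Group the (account, tx_id) pairs of one block so each account has one merged record."""
    merged = defaultdict(list)
    for account, tx_id in block_txs:
        merged[account].append(tx_id)   # keep the order in which transactions appear
    return dict(merged)


# Transactions of block N from the example above, plus one of another account.
block_n = [("Account_A", "Tx_A1"), ("Account_B", "Tx_B1"), ("Account_A", "Tx_A2")]
print(merge_by_account(block_n))
```

With one merged record per account and block, at most one index entry per account has to be built for each block.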

Algorithm design
We first designed the index construction algorithm, and then designed the query algorithm according to the SCATC index structure.

Index construction algorithm
The algorithm traverses all accounts whose state has changed and checks, one by one, whether each account is a new user. If it is, both the subchain number in the transaction chain and the block number within the subchain are set to 1. If not, the algorithm checks the block number, within its subchain, of the previous block in the account transaction chain. If that number is less than k − 1, the new transaction keeps the subchain number of the previous block, and the block number within the subchain is increased by one. If it equals k − 1, the new transaction keeps the same subchain number and the block number becomes k; an SHP is then added to the account branch node of the new transaction, pointing to the kth block of the previous subchain. Through the SHP, users can directly access the data of the previous subchain. If the block number of the previous block equals k, the subchain number of the new transaction is increased by 1 and the block number is set to 1; this block is the first block of the new subchain.

Query algorithm
The algorithm first creates a list TargetAccountData to save the data of the target account that has been accessed. Lines 2-8 of the algorithm visit the latest block in the account transaction chain. If the sequence number of that block is less than k, the algorithm traverses from the latest block to the first block of its subchain. Lines 9-13 of the algorithm follow the hash pointer in the kth block to the kth block of the previous subchain, and repeat this until the kth block of the target subchain is reached. During this process, only one block is visited in each subchain. Lines 14-18 of the algorithm traverse all the blocks in the target subchain. Before the query reaches the target subchain, only one block is visited in every subchain except the latest one. The block-by-block traversal is thus transformed into a subchain-by-subchain query, which shortens the access path of the search.
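The construction and query steps described above can be sketched in a few lines. This is a minimal in-memory illustration under explicit assumptions, not the authors' implementation: object references stand in for hash pointers, a dictionary `latest` stands in for the state tree, the helper dictionary `subchain_end` (used to find the kth block of the previous subchain) is an implementation convenience, and all class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class Entry:
    """Index entry for one account in one block (its merged transactions)."""
    block_height: int
    subchain: int                     # subchain number n (1-based)
    position: int                     # block number within the subchain, 1..k
    txs: list
    fhp: Optional["Entry"] = None     # previous block in the account's transaction chain
    shp: Optional["Entry"] = None     # k-th block of the previous subchain


class ScatcIndex:
    def __init__(self, k: int):
        if k <= 1:
            raise ValueError("subchain length k must be greater than 1")
        self.k = k
        self.latest = {}         # stand-in for the state tree: account -> newest Entry
        self.subchain_end = {}   # helper (assumption): account -> last completed k-th Entry

    def add_block(self, block_height, txs_by_account):
        """Index one new block; txs_by_account maps account -> merged transaction list."""
        for account, txs in txs_by_account.items():
            prev = self.latest.get(account)
            if prev is None:                       # new user: subchain 1, block 1
                entry = Entry(block_height, 1, 1, list(txs))
            elif prev.position < self.k:           # stay in the current subchain
                entry = Entry(block_height, prev.subchain, prev.position + 1,
                              list(txs), fhp=prev)
                if entry.position == self.k:       # subchain complete: add the SHP
                    entry.shp = self.subchain_end.get(account)
                    self.subchain_end[account] = entry
            else:                                  # previous block was the k-th: new subchain
                entry = Entry(block_height, prev.subchain + 1, 1, list(txs), fhp=prev)
            self.latest[account] = entry

    def query_subchain(self, account, target):
        """Return (transactions of subchain `target`, newest first; entries visited)."""
        node = self.latest.get(account)
        if node is None or not 1 <= target <= node.subchain:
            raise ValueError("unknown account or subchain number")
        visited = 0
        # Locate the target subchain: hop by SHP where one exists, else step by FHP.
        while node.subchain != target:
            visited += 1
            node = node.shp if node.shp is not None else node.fhp
        # Traverse every block of the target subchain by FHP.
        target_account_data = []
        while node is not None and node.subchain == target:
            visited += 1
            target_account_data.extend(node.txs)
            node = node.fhp
        return target_account_data, visited
```

In this sketch, for an account with a 1000-block transaction chain and k = 10, querying subchain 1 from the latest block visits 109 entries (99 hops across the later subchains plus the 10 blocks of the target subchain) instead of all 1000.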

Efficiency analysis
The length of the subchain affects the scope and efficiency of the query. Assume that the transaction chain length of the current target account is $s$, the number of blocks in each subchain is $k$ ($k > 1$, $k \in \mathbb{Z}$), and the number of subchains is $n$ ($n > 1$, $n \in \mathbb{Z}$). When the transaction chain length $s$ is fixed, the number of subchains $n$ and the subchain length $k$ are inversely proportional.
When $k$ increases, the number of blocks accessed within a subchain increases, and the query range grows. The number of subchains $n$ decreases as $k$ increases, which reduces the frequency of SHP construction, because an SHP is constructed only once per subchain of the account's transaction chain. If $k$ decreases, the subchain becomes shorter and the query range of a single subchain is reduced. If the user wants a larger query range, the range from the initial subchain to the end subchain of the query needs to be given. In addition, the number of subchains increases as $k$ decreases, and SHPs are constructed more frequently. We define an access to a block in which a target transaction is located as a valid query, and every other block access as an invalid query. The number of invalid queries is denoted by $\psi$. The larger $\psi$ is, the lower the query efficiency and the more computing resources are wasted. In the SCATC-based query method, the number of blocks that must be accessed to query the initial subchain is $\mu$, and the number of blocks accessed in irrelevant subchains is $\psi_1$. Because the other subchains are accessed only at their last block while the query proceeds to the target subchain, fewer irrelevant blocks need to be visited to locate the target subchain. The transaction chain query method, in contrast, must access the complete transaction chain when querying the data of the initial subchain; the number of irrelevant blocks it accesses is $\psi_2$. As $s$ keeps increasing, $n$ increases monotonically. Eqs. (3) and (4) can be regarded as linear functions of $n$. In Eq. (3), the coefficient of the independent variable $n$ is 1, and in Eq. (4), the coefficient is $k - 1$ ($k > 1$). As $n$ grows, $\psi_2 \gg \psi_1$. Invalid queries based on the transaction chain grow faster, while invalid queries based on SCATC grow more slowly. The larger $n$ is, the more obvious the advantage of the SCATC-based query.
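A quick numeric check of this comparison can be sketched as follows. It assumes, for illustration only, that the transaction chain consists of n complete subchains (s = n × k), that the query targets the initial subchain, and that it starts at the last block of the latest subchain. The counts are derived from the prose description, so their constants may differ from those in Eqs. (3) and (4); the function name `invalid_queries` is hypothetical.

```python
def invalid_queries(s: int, k: int):
    """Return (SCATC, transaction-chain) counts of irrelevant block accesses."""
    if k <= 1 or s < k or s % k != 0:
        raise ValueError("this sketch assumes k > 1 and s a positive multiple of k")
    n = s // k
    psi_scatc = n - 1    # one block (the last) in each of the n - 1 non-target subchains
    psi_chain = s - k    # every block outside the target subchain
    return psi_scatc, psi_chain


for s in range(1000, 7000, 1000):
    print(s, invalid_queries(s, 10))
```

With k = 10, a chain of 1000 blocks gives 99 irrelevant accesses for SCATC against 990 for the plain transaction chain under this counting. The difference between the two, (n − 1)(k − 1), grows linearly in n.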

Simulation experiment
The simulation environment is a host computer with an Intel(R) Core(TM) i7-5500U CPU, 12 GB of memory, and the 64-bit Windows 10 Professional operating system. The SCATC index structure is implemented in Python. A blockchain requires each full node to maintain a complete ledger, so data retrieval in the simulation is performed locally. The simulation compares the query efficiency of the ATTC, MPT-Chain and SCATC query methods under different transaction chain lengths. The subchain length k is set to 10. The length of the transaction chain is set to 1000-6000 blocks, and the corresponding number of subchains is 100-600. The simulation experiments are divided into six groups according to transaction chain length, and each group is repeated eight times. To highlight the comparison, each query targets the initial subchain. All three query methods proceed from the latest block to the earliest block, so the query time of SCATC includes the time to locate the subchain. The resulting simulation data are shown in Tables 1, 2 and 3.
In Tables 1, 2 and 3, the subchain length k is 10, and AVG is the average query time. The average values of the simulation data for ATTC, MPT-Chain and SCATC are plotted as a line chart in Fig. 4. As the length of the transaction chain grows, the query time of the ATTC and MPT-Chain query methods increases steadily, whereas the query efficiency of the SCATC-based method does not change significantly. For active users in the blockchain system, the transaction chain grows quickly. In theory, whichever of the above query methods is used, a growing transaction chain means a longer chain to traverse and a downward trend in query efficiency. However, because the SCATC index structure divides the transaction chain into subchains, it greatly reduces the number of block visits. Within the limited chain lengths tested, the growth of the transaction chain does not cause a significant change in SCATC's query efficiency.
We then set k to 50 and 100, repeat the simulation experiment, and compare the query efficiency of the three query methods for the different values of k. The simulation results are shown in Figs. 5 and 6.
Compared with Fig. 4, Figs. 5 and 6 show no significant change in query efficiency. For the query methods based on ATTC and MPT-Chain, the main reason is that, however the value of k changes, these methods must traverse the complete transaction chain. For the SCATC-based query method, the reason is that, after the transaction chain is divided into subchains, the number of blocks that need to be accessed is significantly reduced, and the part of the transaction chain that must be accessed is not long enough to cause a significant drop in query efficiency.

Time complexity
The efficiency of an algorithm is related to the language in which it is implemented and to the hardware configuration of the computer. Putting aside these software and hardware factors, the efficiency of the algorithm can be considered to depend only on the scale of the problem.
In the traditional traversal query method, let the block height be $h_1$ and the number of nodes of the Merkle tree in a block be $p_1$. As users' transactions accumulate, $h_1$ keeps increasing while $p_1$ stays relatively unchanged, so $h_1$ is the scale of the problem. Since the same account may have multiple transactions in the same block, the traversal query method needs to traverse the complete tree. In addition, the system does not know whether the next block also contains a transaction of the target account, so it continues to the next block until the entire blockchain has been traversed. The number of query operations executed during the traversal is therefore $\lambda(h_1) = p_1 \times h_1$. Since the block size usually varies little, $p_1$ can be regarded as a constant, and the time complexity of the algorithm can be expressed as $O(h_1)$.
In ATTC, let the height of the transaction chain be $h_2$. As the transactions of the account accumulate, the transaction chain keeps growing, so $h_2$ is the scale of the problem. The Merkle Patricia tree in a block is keyed by the account ID, so the length of the query path, $p_2$, is fixed. Suppose the number of fields in a transaction is $g$; the number of query operations executed is then $\lambda(h_2) = p_2 \times h_2 \times g$. The number of fields in each transaction fluctuates only slightly, so $g$ can be regarded as a constant, and the time complexity of the algorithm can be expressed as $O(h_2)$.
In MPT-Chain, let the height of the MPT-Chain be $h_3$; the length of the query path $p_3$ in the Merkle Patricia tree remains unchanged. The number of query operations is $\lambda(h_3) = p_3 \times h_3$, and the time complexity can be expressed as $O(h_3)$.
In SCATC, the index is also constructed based on the Merkle Patricia tree. Let the height of the transaction chain that contains the target transaction be $h_4$, the length of the query path within a block be $p_4$, and the length of the subchain be $k$. Then the number of query operations is $\lambda(h_4) = \frac{1}{k} \times h_4 \times p_4$, and the time complexity can be expressed as $O(h_4)$.
From the above analysis, the time complexity of every query method is linear, $O(h)$, and none reaches the ideal constant order $O(1)$. With linear time complexity, the most important factor affecting query efficiency is the block height $h$. Among the query methods above, $\frac{p_4}{k} < p_3 < p_2 g < p_1$, so for the same height $\lambda(h_4) < \lambda(h_3) < \lambda(h_2) < \lambda(h_1)$. Therefore, among these schemes, SCATC performs the fewest query operations during a query and has the highest query efficiency.

Conclusions
We improve the query efficiency of the ATTC index structure and propose the SCATC index structure, which supports querying account subchain data. We divide the transaction chain into subchains and add a hash pointer to the account branch node in the last block of each subchain, so that the subchains are connected by hash pointers. Through this pointer, the query mode of traversing the transaction chain is converted into a subchain query mode, which effectively reduces access to irrelevant block data and reduces the computational overhead. All transactions of the same account in the same block are merged and stored together, which reduces the cost of index construction and the storage overhead. We also design a construction algorithm and a query algorithm for the SCATC index. Simulation experiments and analysis show that the SCATC index structure effectively improves the query efficiency of account transactions. However, the improvement applies only to accounts with long transaction chains; there is no significant improvement for accounts with short transaction chains. Moreover, this solution only optimizes retrieval over plaintext data, so the data privacy of blockchain users cannot be guaranteed. Our next step will be dedicated to optimizing ciphertext data retrieval in the blockchain.