Joint optimization of energy trading and consensus mechanism in blockchain-empowered smart grids: a reinforcement learning approach

Under the trend of green development, the traditional fossil fuel and centralized energy management models are no longer applicable, and distributed energy systems that can efficiently utilize clean energy have become the key to research in the energy field nowadays. However, there are still many problems in distributed energy trading systems, such as user privacy protection and mutual trust in trading, how to ensure the high quality and reliability of energy services, and how to motivate energy suppliers to participate in trading. To solve these problems, this paper proposes a blockchain-based smart grid system that enables efficient energy trading and consensus optimization, enabling electricity consumers to obtain high-quality, reliable energy services and electricity suppliers to receive rich rewards, and motivating all parties to actively participate in trading to maintain the balance of the system. We propose a reputation value assessment algorithm to evaluate the reputation of electricity suppliers to ensure that electricity consumers receive quality energy services. To minimize the cost, maximize the benefit for the electricity suppliers and optimize the system, we present an algorithm based on reinforcement learning DDPG to determine the power supplier, power generation capacity, and consensus mechanism between nodes to obtain power trading rights in each round. Simulation results show that the proposed energy trading scheme has good performance in terms of rewards.


Introduction
Traditional energy industries, such as the power industry, were once run by fully integrated power companies that invested in and built the transmission and distribution networks [1,2]. However, owing to increasing electrification and energy demand, as well as poor transmission and distribution networks, traditional fossil fuels and centralized utilities are increasingly unable to meet the demands of consumers and suppliers [3]. In addition, under the general trend of clean energy replacing fossil energy and renewable energy replacing non-renewable energy [4,5], electricity trading should move towards green development [6][7][8][9]. To build a new electricity trading network that can both use clean energy efficiently and maintain the balance of the energy market by providing a better Quality of Experience (QoE) and maximizing the benefits of suppliers, distributed energy resources (DERs) have become the focus of current energy research [10,11].
A DER system [12] connects users to form a distributed network, which is a new energy production-supply-consumption system. It is the product of mature new energy technology and energy storage technology [13][14][15][16], and the power balance is transferred from the demand side. At present, the energy of large-scale DER systems is mainly electric energy [17]. However, the implementation of a DER transaction system is still under study, and the issues of user privacy protection and transaction trust that may arise in decentralization need to be addressed. How to provide better QoE and reliable energy services in a transaction, while offering substantial rewards and incentives to suppliers, is also among the open issues.
In recent years, blockchain has attracted attention from all walks of life due to its decentralized and tamper-proof characteristics. In essence, it is a distributed database that is jointly maintained by all nodes. Blockchain is also used to build trust in the metaverse [18]. It is the theoretical basis for the implementation of DER transaction systems in [19][20][21]. With the help of blockchain technology, energy producers and consumers can be directly connected, thus simplifying the relations and interactions between the parties. In [22], the authors compare the blockchain-based electricity trading market with the existing electricity market. They point out that blockchain has broken the boundaries and constraints of the design logic of the contemporary energy market, and they expect it to change the traditional centralized energy system. In [23], the authors designed a blockchain-based DER energy transaction authentication mechanism for distributed energy trading, but this mechanism cannot work in practical production due to its poor throughput. In [24], the authors propose a trading model for local electric vehicle transactions in the energy trading market. Their simulation results and evaluation conclude that the blockchain platform improves the autonomy of grid participants, but the work does not address the overall benefits and rewards of the system network. In [25], the authors proposed a heterogeneous computing and resource allocation framework for wireless powered federated edge learning to investigate the performance of the system from the users' perspective. They minimize energy consumption and achieve energy harvesting by solving an optimization problem. Compared to other methods, this system can achieve efficient federated learning. The overall performance of the power energy trading network is considered and optimized to maximize transmission and minimize consumption, providing a reference for system design. In [26], the authors focus on the cost of individual participants, rather than merely optimizing the cost of the entire process as in existing works. They improved the convergence speed of federated learning by adjusting the local CPU cycle frequency and other related parameters. The experimental results show that they balance cost and fairness well. In [27], the authors analyzed the transparency of blockchain. Smart contracts make the operational rules of the entire system open and transparent, achieve information symmetry and market effectiveness, and ensure the security and reliability of the trading system. That work proposes a trusted transaction method for blockchain-based manufacturing services.
In [28], the authors propose a blockchain-based approach to manufacturing service composition. The key contributions are the study of dynamic QoS evaluation methods and consensus algorithms, and the design of resource rent-seeking and matching mechanisms based on smart contracts. That approach can adaptively complete the composition of manufacturing services while balancing the privacy, security, and openness of transaction information, which greatly enhances the trustworthiness of the cloud manufacturing service platform and the processing speed of the system. In [29], the authors propose a novel P2P energy trading system built on two separate optimization problems: one is an individual optimal charging algorithm that gives each consumer the best daily charging schedule, and the other is a P2P energy trading mechanism that reduces the total daily energy cost. However, they ignore the coupling between the two optimization problems. In [30], the authors build a trust mechanism based on blockchain technology, view the creation of digital assets as a process of evaluating behavior, design smart contracts to handle the evaluation behavior, and build a blockchain system based on the reputation values of alliance members. The system uses sidechain technology to transfer the created digital assets, which can strengthen the authenticity guarantee of the blockchain in other trading scenarios. Experimental results show that the system is characterized by low cost and a memory footprint that does not grow easily. However, that article does not evaluate the performance of the system.
Although some works have studied electricity trading systems, new challenges remain. On the one hand, how can better QoE and reliable services be achieved to meet the needs of consumers? On the other hand, how can the reward and revenue maximization of power suppliers be guaranteed? More importantly, both requirements should be met as far as possible, reducing cost and expanding revenue while keeping the best service quality. We study these questions from two aspects: a trusted reputation management system and the revenue maximization problem.
We believe that reinforcement learning is a good choice when the system needs to make decisions that balance risk and reward in complex situations [31][32][33]. The selective federated reinforcement learning (SFRL) proposed in [34] can improve the accuracy of autonomous driving models very well. In this paper, we propose a blockchain-based smart grid system that ensures efficient energy transactions by considering the situation of each node comprehensively, and that dynamically selects the consensus mechanism of the blockchain to achieve consensus optimization. The efficient implementation of the system enables electricity consumers to obtain high-quality and reliable energy services, while electricity suppliers are richly rewarded, which motivates them to participate in transactions again. To handle the real-time dynamics of the system, we design a MADDPG-based reinforcement learning algorithm to decide, in each round, the electricity supplier that obtains the power trading right, the generating power, and the consensus mechanism among the nodes. Multi-agent deep deterministic policy gradient (MADDPG) is a powerful reinforcement learning algorithm. In recent years, the leading contenders have been deep Q-learning [35], "vanilla" policy gradient methods [36], and trust region natural policy gradient methods [37,38]. However, Q-learning is not ideal for high-dimensional problems, because it suffers from the curse of dimensionality and is poorly understood, and vanilla policy gradient methods have poor data efficiency and robustness.
The contributions of this paper are summarized as follows: (i) to solve the problem of mutual trust in electricity transactions and make the system transparent, we propose a reputation evaluation system based on blockchain technology, so that power consumers can obtain reliable services with better QoE; (ii) to minimize the cost and maximize the gains of power suppliers, we optimize the transmitted power and charging power, and we also choose the consensus algorithm and the charging and discharging state so that power suppliers earn a larger profit and have an incentive to actively participate in power supply and trading; (iii) we use the MADDPG reinforcement learning algorithm to solve the problem, and the experimental results show its effectiveness and convergence.
The remainder of this paper is organized as follows. The related works are described in Section "Introduction". Section "System description and problem formulation" depicts the system model and problem formulation. Section "Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithms" presents the solution to the optimization problem. Section "Experiment result" presents the simulation results. The conclusion and future research issues are given in Section "Conclusion".

System scenario
There is an electric power system with a set of electric consumers (ECs) $\mathcal{M} = \{1, 2, 3, \dots, M\}$ and a set of electric suppliers (ESs) $\mathcal{N} = \{1, 2, 3, \dots, N\}$. ESs can generate electricity from new energy sources such as wind, solar, and tidal energy, and also from conventional energy sources such as hydropower, oil, and nuclear energy. ECs can be different power-consuming users such as factories, residential users, and charging piles. At the beginning, ECs send power order requests to ESs, the ESs monitor the transaction requests, and the N ESs compete for the transaction right of the orders at the same time. The system scenario is shown in Fig. 1. Assuming that the electronic request is an electronic order transaction, the electronic request is packaged into blocks and consensus is carried out in the blockchain. If ES n is the first to obtain the right to produce blocks, it obtains the right to trade electricity. The optimized transaction strategy between EC and ES can be executed through smart contracts, thereby ensuring correct, reliable, and transparent execution.

Trust and reputation modeling
The paper [39] proposes a reputation management scheme based on the multi-armed bandit (MAB), which can effectively select vehicles with good reputation. In our system, two kinds of trust arise when two ESs interact. Communication trust generally refers to the transmission of data, including both cooperative and non-cooperative situations. Data trust generally refers to data aggregation, including correct and incorrect transmission. In Bayesian analysis, the beta distribution is usually used as the conjugate prior distribution of the binomial distribution parameter; it is simple and flexible, and can be used to model the trust distribution.
The beta function can be described by the gamma function as follows,
$$B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}, \tag{1}$$
where a represents the number of normal cooperations and b represents the number of data transmission errors. For the prediction of an ES's behavior, the probability distribution P of the ES's reputation can be obtained by using the beta distribution. When calculating trust and reputation values, we consider both communication trust and data trust. According to the beta distribution, the reputation of ES m to ES n in time slot t is expressed as [40]
$$T_{n,m}(t) = \mathrm{Beta}(a + 1, b + 1). \tag{2}$$
Therefore, we can obtain the final reputation value of ES n in time slot t. In our reputation scheme, we classify all ESs in the system into three categories: trusted nodes, uncertain nodes, and untrustworthy nodes. Both trusted and uncertain nodes are given the opportunity to participate in the transaction, while untrustworthy nodes need to stay in the network for observation. We give all nodes an initial reputation value $T_{in} = 0.5$, which means that all nodes are initially uncertain nodes. The study of [41] points out that malicious nodes are a minority in P2P systems and that suspicion of additional nodes is one of the important reasons for the degradation of overall system performance. The formula given by [42] is used to distinguish the type of ES n. When the resulting value is less than 0.5, ES n is untrustworthy; when it equals 0.5, ES n is uncertain; when it is greater than 0.5, ES n is trustworthy. Trustworthy nodes can participate in the next round of transactions, while untrusted nodes need to stay in the network for observation. The more times a node is untrusted, the longer it will wait.
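The classification above can be illustrated with a minimal sketch. It assumes that the pairwise reputation is taken as the mean of Beta(a + 1, b + 1), which gives 0.5 when no interactions have been observed and so matches the initial value in the text, and that the final reputation is a plain average over the rating ESs. The aggregation rule and all function names are hypothetical, since the source does not show them.

```python
from statistics import mean


def pairwise_reputation(a: int, b: int) -> float:
    """Mean of Beta(a + 1, b + 1).

    a: number of normal cooperations, b: number of transmission errors.
    Using the mean as the reputation value is an assumption of this sketch.
    """
    if a < 0 or b < 0:
        raise ValueError("interaction counts must be non-negative")
    return (a + 1) / (a + b + 2)


def final_reputation(observations: list[tuple[int, int]]) -> float:
    """Assumed aggregation: average the pairwise values from every rater."""
    if not observations:
        return 0.5  # initial reputation T_in from the text
    return mean(pairwise_reputation(a, b) for a, b in observations)


def classify(reputation: float, threshold: float = 0.5) -> str:
    """Map a reputation value to the three node categories in the text."""
    if reputation > threshold:
        return "trusted"
    if reputation < threshold:
        return "untrustworthy"
    return "uncertain"


if __name__ == "__main__":
    # Two raters observed (8 successes, 2 errors) and (3 successes, 1 error).
    value = final_reputation([(8, 2), (3, 1)])
    print(value, classify(value))
```

With no history, `pairwise_reputation(0, 0)` equals 0.5, so a new node starts in the uncertain category, consistent with the scheme described here.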
Since the deployment environment of each ES cannot be determined, some problems will inevitably arise when ESs are distributed in a harsh environment. Therefore, it makes more sense to provide a second chance for untrusted nodes. Untrusted nodes should stay in the network for observation rather than be immediately excluded from the transaction. When a node is considered untrusted, we mark it as untrusted and start the clock T(t). During this period, untrusted nodes are not allowed to participate in power transactions. After this period, untrusted nodes are restored to their initial credibility. T(t) is not fixed. The more times a node is untrusted, the longer it stays in the network for observation, until it is blacklisted.
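The observation mechanism can be sketched as a small state machine. The source only says that the period grows with the number of offences and ends in blacklisting, so the doubling rule, the base period of 10 slots, and the limit of 3 offences below are all illustrative assumptions, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ObservationState:
    offences: int = 0          # how many times the node was marked untrusted
    blacklisted: bool = False  # permanently excluded once True
    release_slot: int = 0      # first slot in which trading is allowed again


def mark_untrusted(state: ObservationState, now: int,
                   base_period: int = 10, max_offences: int = 3) -> ObservationState:
    """Start (or extend) the observation clock for a node marked untrusted.

    Assumed rule: the period doubles with every offence, and the node is
    blacklisted once it reaches max_offences.
    """
    state.offences += 1
    if state.offences >= max_offences:
        state.blacklisted = True
        return state
    state.release_slot = now + base_period * 2 ** (state.offences - 1)
    return state


def may_trade(state: ObservationState, now: int) -> bool:
    """A node may trade if it is not blacklisted and its clock has expired."""
    return not state.blacklisted and now >= state.release_slot


if __name__ == "__main__":
    node = ObservationState()
    mark_untrusted(node, now=0)
    print(may_trade(node, now=5), may_trade(node, now=10))
```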
Fig. 1 The system scenario

Blockchain system in energy trading
By deploying blockchain in the energy trading market, consensus algorithms and smart contracts can be used to make the entire trading process more reliable, credible, and transparent.

Consensus mechanism
ESs using different consensus algorithms produce different block intervals and transaction throughputs. Denote $\beta(t, x) \in \{0, 1\}$ as the parameter of the consensus mechanism that shows whether consensus algorithm x is selected, where $x \in \{0, 1, 2\}$ indicates which consensus algorithm the blockchain chooses: PBFT, DPoS, or PoS. $\beta(t, x) = 1$ means consensus algorithm x is chosen in time slot t; otherwise, it is not chosen. These three commonly used consensus algorithms are described as follows.
(1) Practical Byzantine Fault Tolerance (PBFT): PBFT can tolerate not only node failures but also the existence of certain malicious nodes or Byzantine nodes. PBFT has requirements on the number of nodes in the system. As in the Byzantine Generals problem, PBFT requires that the number of nodes in the system N be no less than 3f + 1, where f is the number of "malicious nodes". A "malicious node" here can be a node that is deliberately malicious, a node that is attacked and controlled, or even a node that has stopped responding. In short, any node that behaves abnormally can be considered malicious. PBFT classifies the nodes in the system into two categories: the primary node and replica nodes. They all use a state machine mechanism to record their actions. If the operations of all nodes are consistent, their state machines will always remain consistent.
(2) Delegated Proof of Stake (DPoS): In the DPoS consensus algorithm, the normal operation of the blockchain depends on the delegates, and these delegates are completely equivalent. Votes are weighted by the proportion of stake, so that more members of the community take part. People vote to select relatively reliable nodes in order to maximize their own interests, which is more secure and decentralized. DPoS uses professionally run network servers to ensure the security and performance of the blockchain network. It does not require computing power to solve mathematical problems; instead, the stakeholders choose who will be the block producer.
(3) Proof of Stake (PoS): PoS is a consensus algorithm that distributes interest based on the amount of stake held and the time it has been held. The core logic of the PoS mechanism is that whoever holds the stake has control over the network. In the PoS mechanism, there is still mining, which requires computing power to solve a mathematical problem. However, the difficulty of the problem is related to the "coin age" of the coin holder. The longer the coin holder has held the coin, the simpler the problem and the greater the probability of mining the coin. The more stake a node has, the greater its chance of meeting the hash target and obtaining the accounting right.
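The selection indicator and the PBFT node requirement can be made concrete with a short sketch. The mapping x = 0, 1, 2 to PBFT, DPoS, and PoS follows the order given in the text; the function names are hypothetical.

```python
CONSENSUS = ("PBFT", "DPoS", "PoS")  # x = 0, 1, 2, in the order used in the text


def select_consensus(x: int) -> list[int]:
    """Return the indicator vector [beta(t, 0), beta(t, 1), beta(t, 2)].

    Exactly one entry is 1, so one consensus algorithm is active per slot.
    """
    if x not in range(len(CONSENSUS)):
        raise ValueError(f"x must be in 0..{len(CONSENSUS) - 1}")
    return [1 if i == x else 0 for i in range(len(CONSENSUS))]


def pbft_max_faulty(n_nodes: int) -> int:
    """Largest f that satisfies the PBFT requirement n_nodes >= 3f + 1."""
    if n_nodes < 1:
        raise ValueError("n_nodes must be positive")
    return (n_nodes - 1) // 3


if __name__ == "__main__":
    beta = select_consensus(1)
    print(CONSENSUS[beta.index(1)], beta)
    print(pbft_max_faulty(5))
```

For the 5 ESs used later in the simulation, `pbft_max_faulty(5)` shows that PBFT can tolerate one abnormal node.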

Generate energy trading blocks
We assume that each ES has a first-in-first-out (FIFO) data buffer to store blocks that have arrived but are not yet verified. The dynamics of the processing queue at the beginning of time slot t + 1, the total time cost of the consensus process, and the energy transaction throughput [43] are then expressed in turn, where $\chi(t)$ is the average size of transactions.

Energy trading model
After the new block is added to the blockchain, the physical transaction of electricity between ES and EC can take place in the energy market. Let the transaction power be $P = \{P_{ge}^{n}(t), P_{ch}^{m}(t)\}$, where $P_{ge}^{n}(t)$ is the generating power of ES n in time slot t, and $P_{ch}^{m}(t)$ is the charging power of EC m. Let $x_n(t) \in \{0, 1\}$ be the power supply status of ES n. Specifically, $x_n(t) = 1$ is the state of selling electricity, and $x_n(t) = 0$ means that the power supply is stopped.
In order to optimize the revenue of the ESs and incentivize each ES to participate in electricity supply in blockchain-enabled smart grids, the benefit $R_n(t)$ of ES n is defined in terms of the following quantities: $C_1(P_{ge}^{n}(t))$ is the operating cost of ES n to generate power $P_{ge}^{n}(t)$ in time slot t, $C_2(P_{ge}^{n}(t))$ is the basic maintenance cost of ES n to generate power $P_{ge}^{n}(t)$ in time slot t, and $\rho_m(t)$ is the unit payment by EC m for the charging power $P_{ch}^{m}(t)$ obtained from ES n, which varies with the reputation of the ES.

Problem formulation
To allow ECs to obtain charging services from highly reliable and high-quality ESs while ensuring the benefits of ESs, and thereby realize a virtuous circle in the energy trading market, we define a utility function in time slot t, where $\omega_1$ ($0 < \omega_1 < 1$) is a weight factor that combines the benefit $R_n(t)$ and the throughput of the blockchain $T_h(t)$, and $\omega_2$ is a mapping factor that ensures that the two functions are at the same level.
Let $A = \{P_{ge}^{n}(t), P_{ch}^{m}(t), \beta(t, x), x_n(t)\}$. We can then formulate the energy trading and consensus optimization problem (11) to maximize the benefit of electric consumers, where $P_{ge}^{min}$, $P_{ge}^{max}$, and $P_{ch}^{max}$ are the minimum generating power, the maximum generating power, and the maximum charging power, respectively. $z_{max}$ is the maximum delay requirement for block generation, and we can set the block interval as $T_b(t) = f(\beta(t, x), z_{max})$ in slot t, which is determined by the selected consensus algorithm x. In (11), $(C_1)$ and $(C_5)$ are the generating power and charging power constraints, respectively. $(C_2)$ and $(C_3)$ restrict the consensus algorithm selection. $(C_4)$ is the constraint on the total time delay of generating a block.
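A feasibility check for constraints of this kind can be sketched as follows. The exact form of $(C_1)$ to $(C_5)$ is not shown here, so the inequalities are inferred from their verbal descriptions: the lower bound of 0 on the charging power, the reading of $(C_2)$ and $(C_3)$ as a binary indicator that sums to one, and the rule that an infeasible action earns zero reward are assumptions of this sketch. Default limits are taken from the simulation settings; the function names are hypothetical.

```python
def is_feasible(p_ge: float, p_ch: float, beta: list[int], total_delay: float, *,
                p_ge_min: float = 0.2, p_ge_max: float = 2.0,
                p_ch_max: float = 1.0, z_max: float = 2.0) -> bool:
    """Check one action against assumed forms of constraints C1 to C5."""
    c1 = p_ge_min <= p_ge <= p_ge_max       # C1: generating power bounds
    c2 = all(b in (0, 1) for b in beta)     # C2: each indicator is binary
    c3 = sum(beta) == 1                     # C3: exactly one algorithm chosen
    c4 = total_delay <= z_max               # C4: block generation delay bound
    c5 = 0.0 <= p_ch <= p_ch_max            # C5: charging power bounds
    return c1 and c2 and c3 and c4 and c5


def gated_reward(utility: float, feasible: bool) -> float:
    """Assumed reward rule: the utility if all constraints hold, else 0."""
    return utility if feasible else 0.0


if __name__ == "__main__":
    ok = is_feasible(p_ge=1.0, p_ch=0.5, beta=[1, 0, 0], total_delay=1.5)
    print(ok, gated_reward(4.2, ok))
```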

Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithms
In the system, there are M ECs [44,45] and N ESs in each round; the ECs initiate the power trade requests and the ESs compete for the power trade right. To ensure real-time and reliable transactions, the system selects the status of the ESs, the generating power ($P_{ge}$) of the ESs, and the consensus mechanism of the blockchain for the competing ESs in each round. This selection problem can be modeled as a Markov decision process (MDP). Considering that the system needs to carry a huge volume of transactions, we use a DRL algorithm known as MADDPG as the solution. MADDPG is based on deep deterministic policy gradient (DDPG). MADDPG improves the actor-critic framework by adopting centralized training and decentralized execution. It provides a general and novel idea for solving multi-agent problems. First, as in RL, agents interact with the environment according to the principles of the MDP and receive rewards. The purpose is to continuously accumulate experience in order to make better decisions that adapt to the environment. Specifically, during slot t, the agent updates its action-value function according to $(S_n(t), A_n(t), r_n(t), S_n(t + 1))$ as follows:
$$Q(S_n(t), A_n(t)) \leftarrow Q(S_n(t), A_n(t)) + \alpha\sigma(t).$$
Then we expand $\alpha\sigma(t)$ as follows:
$$\alpha\sigma(t) = \alpha\left[R(t + 1) + \gamma Q(S_n(t + 1), A_n(t + 1)) - Q(S_n(t), A_n(t))\right],$$
where $\sigma(t)$ is the TD error, which should be 0 at the best Q value, $\alpha$ is the learning rate, and $\gamma$ is the discount factor, whose weight shrinks as t increases. After the expansion of the TD error, $R(t + 1) + \gamma Q(S_n(t + 1), A_n(t + 1))$ is the TD target; subtracting the current $Q(S_n(t), A_n(t))$ from it gives the TD error, which can be understood as the update applied to Q. The target value can be expanded into the sum of expected future rewards. The target is not known in slot t, but it can be calculated in a form that eliminates the states and actions after t + 1, where $Q^{\pi}$ is called the action-value function. Furthermore, the maximum value of $Q^{\pi}$ can be obtained from this recursion. Consider a Markov decision process, defined by the tuple $(\mathcal{S}, \mathcal{A}, P', G)$ representing the dynamics of the system.
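The TD update described in this paragraph can be written as a few lines of tabular code. This is a minimal single-agent sketch for illustration only: it uses a dictionary as the Q table, whereas MADDPG uses neural network critics, and the function name and default values of the learning rate and discount factor are assumptions.

```python
from collections import defaultdict


def td_update(q, state, action, reward, next_state, next_action,
              alpha: float = 0.1, gamma: float = 0.9) -> float:
    """Apply one TD update to the table q and return the TD error.

    TD target = reward + gamma * Q(next_state, next_action)
    TD error  = TD target - Q(state, action)
    Q(state, action) is moved towards the target by alpha * TD error.
    """
    td_target = reward + gamma * q[(next_state, next_action)]
    td_error = td_target - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


if __name__ == "__main__":
    q = defaultdict(float)  # unseen state-action pairs start at 0
    error = td_update(q, "s0", "a0", reward=1.0, next_state="s1", next_action="a1")
    print(error, q[("s0", "a0")])
```

When the table already holds the best Q values, the returned TD error is 0 and the update leaves the table unchanged, which is the fixed point mentioned in the text.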
State $\mathcal{S}$: the space of states of the system, which are the input of the actor network. The state of the system in time slot t is denoted by $S(t) \in \mathcal{S}$ and is defined as $S(t) = [T(t), \Phi_s(t), F(t)]$, where $T(t)$ and $\Phi_s(t)$ are the reputations and the stakes of the blockchain nodes in time slot t, respectively. Denote the set of reputations and the set of stakes by $T(t) = \{T_1(t), T_2(t), \dots, T_N(t)\}$ and $\Phi_s(t) = \{\Phi_1(t), \Phi_2(t), \dots, \Phi_N(t)\}$, respectively. $F(t)$ is the computing resources of the edge servers, which generate blocks and verify transactions. There are as many edge servers in the system as there are ESs. Denote the set of computing resources of the edge servers by $F(t) = \{F_1(t), F_2(t), \dots, F_N(t)\}$. The computing resource of edge server n in time slot t + 1 can be given by (6).
Action $\mathcal{A}$: the action in time slot t collects the power supply status, the generating power, and the consensus mechanism,
$$a(t) = [x(t), P_{ge}(t), \beta(t)]. \tag{18}$$
$P'$: a state transition probability matrix. $P'(s(t + 1) \mid s(t), a(t))$ defines the probability that the state s(t) transforms to s(t + 1) under action a(t).
$G$: the total discounted return from t to t + j can be expressed as
$$G(s(t)) = r(t) + \gamma r(t + 1) + \dots + \gamma^{j} r(t + j) = \sum_{i=0}^{j} \gamma^{i} r(t + i), \tag{19}$$
where $\gamma \in (0, 1]$ is a discount factor that encodes the importance of future rewards, and r(t) denotes the reward available in the current state s(t).

MADDPG method
The algorithm is shown in Algorithm 1. In line 4, we use $\epsilon$-greedy to select a random action. In lines 5 and 6, each Con n executes the action and receives a reward. Each agent updates its policy parameters according to its objective $J(\theta_n)$. For the deterministic policy gradient algorithm on a continuous action space, the actor outputs deterministic greedy actions according to the state, which may lead to some actions never being chosen, so a random behavior policy must be used to ensure adequate exploration when selecting actions. We should therefore use the random policy to obtain as many different actions as possible. Here we use $\epsilon$-greedy to explore actions. As training progresses, $\epsilon$ is gradually reduced to 0, so the final result is still a deterministic policy. During training, the actor updates the policy by calculating the gradient of $J(\pi_n)$ with the deterministic policy gradient formula, where $a_n = \pi_n(o_n)$, $o = \{o_1, o_2, \dots, o_n\}$ collects the local observations of the agents, and $Q^{\pi}_n(o, a_1, a_2, \dots, a_n)$ is the centralized action-value function of agent n. Each agent learns its own $Q^{\pi}_n$ independently and obtains rewards, so agents can complete the cooperative task in this model. D is an experience replay buffer composed of tuples $(o, o', a, r)$. In addition, the centralized critic updates the action-value function $Q^{\pi}_n$ by minimizing a loss function in which the target policy uses parameters that are updated with a lag. At the end of each training step, the agent obtains the learned policy parameters and updates its own actor and critic network parameters with update step $\zeta$.
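For reference, the three update rules described in this paragraph are usually written as follows in the standard MADDPG formulation. This is a sketch in the notation of this section, not necessarily the exact form used by the authors: the target policy $\pi'$, the target parameters $\theta'_n$, and the target value $y$ are assumed symbols.

```latex
% Deterministic policy gradient of agent n (actor update)
\nabla_{\theta_n} J(\pi_n) =
  \mathbb{E}_{o, a \sim D}\left[
    \nabla_{\theta_n} \pi_n(o_n)\,
    \nabla_{a_n} Q^{\pi}_n(o, a_1, \dots, a_n)\big|_{a_n = \pi_n(o_n)}
  \right]

% Centralized critic loss, with target value y computed from the target policies
L(\theta_n) =
  \mathbb{E}_{o, a, r, o'}\left[
    \left( Q^{\pi}_n(o, a_1, \dots, a_n) - y \right)^2
  \right],
\qquad
y = r_n + \gamma\, Q^{\pi'}_n(o', a'_1, \dots, a'_n)\big|_{a'_k = \pi'_k(o'_k)}

% Soft update of the lagged target parameters with update step \zeta
\theta'_n \leftarrow \zeta\, \theta_n + (1 - \zeta)\, \theta'_n
```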
The process of the MADDPG-based selection of the ES state, $P_{ge}$, and consensus mechanism is shown in Algorithm 1.
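Two building blocks of this procedure, exploration with a decaying $\epsilon$ and the lagged update of the target parameters, can be sketched in plain Python. The linear decay schedule, the representation of network parameters as a flat list of floats, and all function names are assumptions made for illustration; an actual implementation would operate on the network tensors.

```python
import random


def epsilon_at(step: int, total_steps: int, eps_start: float = 1.0) -> float:
    """Linearly anneal epsilon from eps_start to 0 (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return eps_start * (1.0 - frac)


def explore(policy_action, random_action, epsilon: float, rng=random):
    """Epsilon-greedy: take a random action with probability epsilon,
    otherwise keep the deterministic action proposed by the actor."""
    return random_action() if rng.random() < epsilon else policy_action


def soft_update(target: list[float], online: list[float], zeta: float) -> list[float]:
    """Move the target parameters a fraction zeta towards the online ones."""
    return [zeta * w + (1.0 - zeta) * w_t for w, w_t in zip(online, target)]


if __name__ == "__main__":
    eps = epsilon_at(step=50, total_steps=100)
    action = explore(0.7, lambda: random.uniform(0.2, 2.0), eps)
    print(eps, action)
    print(soft_update([0.0, 1.0], [1.0, 1.0], zeta=0.1))
```

Once $\epsilon$ reaches 0, `explore` always returns the actor's action, so the behavior policy becomes deterministic, as stated in the text.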

Experiment result
In this section, we exhibit the performance of the energy trading and consensus optimization.

Simulation parameters
We simulate the performance of the proposed scheme based on PyTorch 1.0.2 with Python 3.9 as the software environment. The settings of the simulation parameters are as follows. We consider an energy trading system consisting of 20 ECs and 5 ESs. The function of the block interval is modeled as $\log(1 + \beta(t) z_{max})$. The minimum generating power, maximum generating power, and maximum charging power are $P_{ge}^{min} = 0.2$ W, $P_{ge}^{max} = 2$ W, and $P_{ch}^{max} = 1$ W, respectively. The maximum time of the consensus algorithm is $z_{max} = 2$ s. We compare the proposed scheme with three benchmark schemes. The first is the fixed consensus algorithm scheme, in which a single consensus algorithm is selected, referred to as FCAS. The second is the scheme that does not allocate the generating power of the ESs, called FPAS. The last is single-objective optimization, in which the optimization problem only considers the benefits of the ESs, referred to as SOCA.
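The settings above can be collected in a small configuration object, together with the block interval model. The natural logarithm and the treatment of $\beta(t)$ as a scalar are assumptions, since the text specifies neither the base of the logarithm nor how the selected algorithm enters the formula; the class and function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SimConfig:
    n_consumers: int = 20   # number of ECs
    n_suppliers: int = 5    # number of ESs
    p_ge_min: float = 0.2   # minimum generating power (W)
    p_ge_max: float = 2.0   # maximum generating power (W)
    p_ch_max: float = 1.0   # maximum charging power (W)
    z_max: float = 2.0      # maximum consensus time (s)


def block_interval(beta: float, z_max: float) -> float:
    """Block interval model log(1 + beta * z_max), natural log assumed."""
    return math.log(1.0 + beta * z_max)


if __name__ == "__main__":
    cfg = SimConfig()
    print(block_interval(1, cfg.z_max))
```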

Numerical results
In Fig. 2, we show the convergence of the resource allocation scheme based on the MADDPG algorithm. The figure shows that the algorithm converges quickly. Figure 3 shows the impact of the maximum delay requirement for block generation $z_{max}$ on the total reward of the system. It can be seen that the reward increases as the maximum delay requirement increases. Meanwhile, we find that the proposed scheme has the best performance, while the SOCA scheme has the worst performance. This is because the single-objective optimization does not consider dynamic edge computing node resource changes and competition.
In Fig. 4, we show the relationship between the reward and $P_{ch}^{max}$. The trend of the graph shows that the reward grows as $P_{ch}^{max}$ increases. The proposed energy trading scheme performs the best. At the same time, we find that the growth of $P_{ch}^{max}$ has little effect on FPAS: because FPAS does not allocate power, $P_{ch}^{max}$ does not affect its performance. The impact of $P_{ch}^{max}$ on the other two schemes, namely FCAS and SOCA, is relatively small. Figure 5 shows the effect of the number of ESs on the reward. From the figure, we can see that as the number of ESs increases, the reward shows an increasing trend.
In Fig. 6, we show the relationship between the average selection ratio and the reputation value T(t). As the reputation value of an ES increases, the ES is more likely to obtain power trading rights, while ESs with low reputation values are not completely deprived of the chance to obtain power trading rights. Figure 7 shows the selection ratio of the ES for different numbers of trading rounds m and different numbers of ESs N. We find that, for a given number of transactions, as the number of ESs increases from 5 to 100, the proportion of ESs with high reputation values that obtain power trading rights gradually decreases. This is because a high reputation is not the only requirement for gaining power trading rights. The system gives ESs with low reputation values the opportunity to participate in the transaction. As the number of transaction rounds m increases, the proportion of high-reputation ESs obtaining power trading rights also increases, because by then the reputation value of each ES is relatively stable and the system has accumulated enough experience to make the best choice.

Conclusion
Although distributed energy has become a hot research topic, there are still many problems in distributed energy trading systems, such as user privacy protection and mutual trust in trading, how to ensure the high quality and reliability of energy services, and how to encourage energy suppliers to participate in transactions. To solve these problems, we propose a blockchain-based smart grid system that optimizes energy transactions and blockchain consensus, using the MADDPG reinforcement learning algorithm for power supplier selection. Through the construction of a reputation evaluation system, electricity consumers can obtain reliable and high-quality power services. In addition, the generation and charging power are optimized in this paper.
By choosing the consensus algorithm and charging and discharging states, the power supplier's revenue is maximized, thus incentivizing the power supplier to participate in the supply trading network and ensuring the long-term stability of the power resource market.Finally, we analyze the simulation results in detail and compare them with existing algorithms.The feasibility of the proposed algorithm can be demonstrated by the validity and convergence of the results.

Fig. 7 Selection ratio of the ES with different trading rounds m vs. number of ESs N
The reward is granted when the constraints, including $C_4$ and $C_5$, are satisfied, and is 0 otherwise. The resulting experience is then loaded into the replay buffer. In lines 7 to 11, each Con n updates its actor, critic, and target networks. Specifically, we let Con n, $n \in \mathcal{N}$, act as agents. We use $\theta = [\theta_1, \theta_2, \dots, \theta_n]$ to represent the policy parameters of the agents and $\pi = [\pi_1, \pi_2, \dots, \pi_n]$ to represent their policies; each agent updates its policy parameters.