Method for detection of unsafe actions in power field based on edge computing architecture

Due to the high risk factors in the electric power industry, the safety of power systems can be improved by using surveillance systems to predict and warn of operators' nonstandard and unsafe actions in real time. In this paper, aiming at the real-time and accuracy requirements of intelligent video surveillance, a method based on edge computing architecture is proposed to judge unsafe actions in electric power operations in time. In this method, the unsafe action judgment service is deployed to the edge cloud, which improves real-time performance. To identify the action being executed, the end-to-end action recognition model proposed in this paper uses a Temporal Convolutional Network (TCN) to extract local temporal features and a Gated Recurrent Unit (GRU) layer to extract global temporal features, which increases the accuracy of action fragment recognition. The result of action recognition is combined with the result of equipment target recognition based on the YOLOv3 model, and a classification rule is used to determine whether the current action is safe. Experiments show that the proposed method has better real-time performance. The proposed action recognition model is verified on the MSR Action3D dataset, where it improves the recognition accuracy of action segments. The judgment results for unsafe actions also demonstrate the effectiveness of the proposed method.


Introduction
Generally, power work is carried out in a high-voltage, large-current environment, where any nonstandard action may lead to a major safety accident. To ensure safe operation, the power industry has strict operation procedures and training to regulate the behavior of operators. However, in actual operation, operators still sometimes fail to follow the safety procedures strictly, through negligence or a lack of risk awareness, which raises great safety concerns. To strengthen the safety management and control of power operation sites, numerous important departments and workplaces are equipped with monitoring systems [1]. Using this video information to discover, remind of or stop an operator's possible violations in time will reduce the occurrence of the related safety accidents.
However, the analysis of surveillance video still relies on manual work at present. This way of working has the following disadvantages: first, the detection of abnormal behaviour depends on the skill level of the viewer, and there are few experienced experts; second, long-term watching easily causes visual fatigue, so that defects are missed and violations go uncorrected, and safety risks are not found in time [2]. Therefore, making full use of current artificial intelligence technology to realize real-time intelligent analysis and judgment of video information is an effective way to improve the safety of power system equipment operation and reduce the incidence of safety accidents.
At present, research on intelligent video monitoring systems based on artificial intelligence technology mainly covers foreign matter identification on transmission lines, line faults, abnormal state analysis of substations, illegal intrusion of personnel and so on [3][4][5][6]. However, there is little research on the real-time prediction and alarm of nonstandard and unsafe actions, based on the operation specifications of power systems in specific environments, as a means of reducing the probability of accidents.
Therefore, in view of unsafe actions that violate the operation specifications, such as unlocking without five-prevention keys or electricity testing without insulated gloves, an unsafe action recognition architecture based on edge cloud is proposed in this paper. Based on this architecture, an end-to-end unsafe action determination method is deployed to the edge, and the learning of the model is deployed to the cloud, providing faster and more accurate services. The architecture is shown in Fig. 1.
This study offers three contributions as follows: 1. An unsafe action prediction architecture based on edge cloud technology is designed. In this architecture, the decision on unsafe actions is deployed to the edge, which improves the real-time performance of the decision. At the same time, a method of model relearning and synchronous updating is proposed to realize the continuous evolution of the model, which enhances the accuracy and reliability of identification and improves the ability to identify new unsafe actions.

Edge computing
The architecture design of edge computing dates back to 2009, when the cloudlet [7] was proposed at Carnegie Mellon University. In 2011, Cisco first proposed the concept of fog computing [8]. Zhao [11] proposed an intelligent edge computing technology based on federated learning, in which the edge development board performs video monitoring identification, the enterprise server implements the model learning, and the cloud is responsible for combining the models of each enterprise and updating the model. Jia [12] discussed the application prospects of an edge computing model based on distributed data collection and processing in intelligent video detection. In reference [13], a face recognition application is moved from the cloud to the edge, which greatly reduces the response time. In reference [14], a method of constructing a face detection video monitoring system based on Mobile Edge Computing (MEC) is proposed. This method uses different detection algorithms at the edge and in the cloud, and decides whether data needs to be sent to the cloud according to the confidence of the edge detection. Wang et al. [5] proposed a transmission line online monitoring system architecture based on the ubiquitous Internet of Things by studying deep-learning-based image recognition and mobile edge computing technology. Lu et al. [6] proposed a method based on edge computing and deep learning for transmission line foreign matter detection. However, the existing edge detection models cannot relearn, so the models cannot be improved and cannot identify new classes.
Fig. 1 Prediction framework of unsafe actions in power operation based on edge cloud

Action recognition
The traditional method of action recognition is to extract the temporal and spatial features of video images manually, and then classify the actions based on these features. Klaser et al. [15] extended the Histogram of Oriented Gradient (HOG) feature of static images to the space-time dimension, and proposed the 3D HOG feature to represent actions. Wang et al. [16] proposed the dense trajectories (DT) algorithm, which extracts and uses Histograms of Oriented Optical Flow (HOF), HOG and Motion Boundary Histograms (MBH) to represent actions. Ha [17] proposed a violence detection method for surveillance video systems, which estimates the motion vector using the Combined Local-Global approach with Total Variation in the object region. These methods extract the spatio-temporal characteristics of the image manually, which is inefficient, labor-intensive and limited in recognition ability.
Action recognition based on deep learning is an end-to-end method. It uses a deep model to extract the spatial and temporal features contained in the video, and then classifies them. Simonyan et al. [18] first proposed an action recognition method based on a two-stream convolution network, which uses a spatial stream network and a temporal stream network to process the spatial and temporal information in video separately. Feichtenhofer et al. [19] explored a variety of spatial and temporal information fusion methods on the basis of the two-stream convolution network. Deep learning frameworks based on 3D convolutional neural networks can directly extract the temporal and spatial features of video [20][21][22][23][24]. The idea is to treat the video as a spatiotemporal cube, that is, the 2D convolution operation in the spatial domain is naturally extended to a 3D convolution operation in the spatiotemporal domain by adding the time dimension.
Recurrent Neural Networks (RNN), especially Long Short-Term Memory (LSTM) networks and GRU networks, have a strong ability to extract temporal features [25][26][27]. In references [28][29][30], a 2D Convolutional Neural Network (CNN) is used to extract the spatial features of video images, and LSTM or GRU, both variants of RNN, is used to extract the temporal features of actions, and a good recognition rate is obtained. To model the spatiotemporal evolution of different actions and extract different spatiotemporal features, Song et al. [31] constructed an action recognition model based on CNN using LSTM. The model used different levels of attention to learn the discriminative skeleton joints in each input frame. Kim et al. [32] proposed using a new class of models known as TCN for 3D human action recognition. TCN provides a way to explicitly learn readily interpretable spatio-temporal representations for 3D human action recognition. The methods in the above literature can only classify an action after it is completed and cannot predict future actions, so they cannot be used for early warning.

Action prediction
To support early warning, the system should not only identify the action but also predict the future action. That is to say, the system should be able to classify an action from a partial segment of it. Fragkiadaki et al. [33] proposed an Encoder-Recurrent-Decoder model based on recurrent neural networks for the recognition and prediction of human body pose in videos and motion capture. Jain et al. [34] introduced Structural-RNN, which combines high-order spatiotemporal graphs and RNN to predict actions over a short time. Martinez et al. [35] modified the standard RNN model for human motion to form a simple and extensible RNN architecture, and obtained good human motion prediction performance. Ke et al. [36] proposed a new Latent Global Network based on adversarial learning for action prediction. In this model, the skeleton is used for action prediction, which aims to identify an action from a partial skeleton sequence with incomplete action information. Kong et al. [37] adopted an adversarial learning scheme to learn action features, extract model parameters and classifier parameters, and generate optimized features for action prediction. The method achieved good prediction accuracy. RNN has a strong ability to extract temporal features, but the accuracy of RNN (LSTM, GRU, etc.) on long sequences is low. To improve the accuracy of action prediction and obtain better performance, in this paper TCN is used to extract local temporal features, and the recurrent neural network GRU is used to extract global temporal features.

Action detection architecture based on edge cloud
To judge unsafe actions based on video information, deep network models are needed for action recognition and target detection. This scheme involves a large amount of data and requires high computational performance and real-time information processing ability. The video capture terminal is a common embedded system with limited computing power, so completing the model calculation and information processing on it takes a long time. If the action is determined in the cloud, transmitting a large amount of video over the Internet takes time and may meet network congestion, so it is difficult to guarantee real-time information processing. However, a system for unsafe action recognition and early warning requires high real-time performance. Therefore, in this paper, an unsafe action decision framework based on edge cloud is designed. The overall network architecture is shown in Fig. 2.
The video capture terminal and the edge cloud are interconnected by the Local Area Network (LAN), and the edge cloud is connected to the central cloud through the Internet. The decision service for unsafe actions is deployed to the edge to improve the real-time performance of identification. The model training module is deployed in the cloud. The cloud is also responsible for sample collection and for the model relearning and updating services, which ensure the continuous learning and evolution of the model. The deployment of each functional module of the system is shown in Fig. 3.
Video image acquisition and simple preprocessing of the video images are deployed to the perception terminal. When the perception terminal detects continuous image frames with human motion, these images are standardized and scaled, and the data is sent to the local edge computing entity using the Message Queuing Telemetry Transport (MQTT) protocol.
The decision service for unsafe actions is deployed to local edge computing. When the unsafe action judgment entity at the edge receives the video image data sent by the terminal, the data is handed over to the action recognition model for action recognition and is also sent to the YOLO model for equipment target detection. Based on the identification results and detection results, the unsafe

Judgment of unsafe actions in power workplace
The power industry has strict regulations for operators working in the power field. For example, anyone entering a power workplace must wear a safety helmet, operators must wear insulating gloves for electricity testing, and the five-prevention keys must be used for unlocking. Otherwise, the action is unsafe. Therefore, the judgment of unsafe actions needs not only to identify the operator's action, but also to detect the equipment necessary for the operation.

Power operation action recognition model based on TCN+GRU temporal feature extraction
A video of an operation action is a sequence of images with a temporal relationship. It contains not only the spatial characteristics of the action, such as posture and the positional relationship with other objects, but also the temporal information of the action, such as how the posture changes. According to the related work, RNN, especially the LSTM and GRU networks, has a strong ability to extract temporal features. However, an RNN cannot be computed in parallel, and the number of time steps should not be too large, otherwise the computation takes too long. A TCN can extract the temporal features of actions while allowing parallel computation [38,39]. However, the features extracted by the TCN still form a sequence that contains temporal information. To make full use of the advantages of RNN and TCN, an end-to-end unsafe action recognition model based on TCN+GRU temporal feature extraction is presented in this paper. In this model, a multi-layer TCN is used to extract local temporal features and to downsample, which shortens the length of the time series. Then, the global temporal feature is extracted using GRU, which avoids the problem of long step counts in GRU. The ResNet50 network structure is used to extract spatial features. The overall structure of the action recognition model is shown in Fig. 5.

Spatial feature extraction based on ResNet50
The action video of a power operation is essentially a sequence of images, which contains the spatial information of the action, such as the shape of the limbs and the positional relationship with other objects. In general, a deep convolutional network is used to extract image features. However, as depth increases, the representation ability of the network is enhanced, but training also becomes more difficult. During back-propagation in a deep neural network, the repeated multiplication of gradients can easily make the gradient too small or too large, which degrades the learning ability of the deep network.
To solve this problem, the residual network ResNet50 presented in reference [40] is used to extract the spatial features of video actions. The ResNet50 network consists of several residual blocks. The output of a residual block is expressed as:
$$H(x) = F(x) + x \tag{1}$$
where x is the input of the residual block, F(x) is the output of the network in the block, and H(x) is the output of the residual block. Therefore, during forward propagation, the features of the upper layer can be reused in the next layer. At the same time, during back-propagation, the gradient is expressed as:
$$\frac{\partial H(x)}{\partial x} = \frac{\partial F(x)}{\partial x} + 1 \tag{2}$$
This ensures that when $\partial F(x)/\partial x \to 0$, the gradient of the lower layer H(x) can still be transferred directly to the upper layer for updating, thus solving the problem of learning ability degradation.
Fig. 4 The working principle framework of the whole system
ResNet50 uses residual blocks repeatedly to extract features, with 16 residual blocks and 50 layers in total. Since ResNet50 has many layers and many parameters, it is difficult to train from scratch. Therefore, following the idea of transfer learning, the trained ResNet50 model is obtained by relearning on our sample set, starting from the ResNet50 pretrained model provided by Keras, and is synchronized to the edge computing node to extract the spatial features of video.

Local temporal feature extraction based on TCN
TCN is a one-dimensional convolution along the time dimension. The nature of the convolution calculation gives TCN a very good ability to extract local features. At the same time, because the discrimination of unsafe actions is mainly for the purpose of action prediction, only the frames before the current moment can be used, and the frames after it cannot. Considering these points, in this paper a TCN based on causal convolution is used to extract local temporal features from video.
After spatial features are extracted by the ResNet50 model introduced in the previous section, the sequence of video spatial features is set as:
$$X = (x_1, x_2, \dots, x_T)$$
where $x_t$ is the feature of the frame image $img_t$ extracted by the ResNet50 model, and X is the input sequence of the expanded TCN. Let the convolution kernel

and
To better extract local features and enlarge the receptive field, this model uses two TCN layers with kernel size k = 4 for convolution, and the output of each layer is pooled with a stride of 2.
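The causal convolution and pooling steps described above can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: the left zero-padding, the use of max pooling, the single scalar feature per frame, the fixed kernel weights and the function names `causal_conv1d` and `pool1d` are all assumptions made for illustration, whereas the actual model works on learned kernels over feature vectors.

```python
def causal_conv1d(x, kernel):
    """Causal 1-D convolution: output[t] depends only on x[t-K+1 .. t].

    The sequence is left-padded with zeros (an assumption) so that the
    output has the same length as the input.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(x)
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(x))
    ]


def pool1d(x, size=2):
    """Non-overlapping max pooling with window and stride `size`."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


if __name__ == "__main__":
    seq = [float(t) for t in range(32)]   # T = 32 frames, one scalar per frame
    kernel = [0.25, 0.25, 0.25, 0.25]     # k = 4, hypothetical weights
    h = seq
    for _ in range(2):                    # two convolution + pooling stages
        h = pool1d(causal_conv1d(h, kernel))
    print(len(h))                         # sequence length 32 -> 16 -> 8
```

Because the padding is only on the left, changing a later frame never alters an earlier output, which is the property needed when only past frames are available for prediction.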

Global temporal feature extraction based on GRU
TCN convolution extracts some temporal features, but its output is still a sequence that carries temporal information. Since a recurrent neural network can describe sequence information, it can often give more accurate results. Therefore, in this paper, GRU is used to further extract global features from the results of the TCN convolution.
GRU has an update gate $z_t$ and a reset gate $r_t$, calculated as shown in formulas (4) and (5). The reset gate $r_t$ determines how much information from the previous state $h_{t-1}$ is used to calculate the candidate state $\tilde{h}_t$, as shown in formula (6). The update gate $z_t$ determines how much information the current state $h_t$ takes from the previous state $h_{t-1}$ and from the candidate state $\tilde{h}_t$, as shown in formula (7). The calculation formulas are as follows:
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) \tag{4}$$
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) \tag{5}$$
$$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t]) \tag{6}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{7}$$
where σ is the sigmoid function, which maps the data to a value in the range 0 to 1 to act as a gate control signal, $x_t$ is the input at the current time, $h_{t-1}$ is the state at the previous time and $h_t$ is the state at the current time. $W_z$, $W_r$ and $W$ are the network parameters for calculating $z_t$, $r_t$ and $\tilde{h}_t$. The output of the last step of the GRU network contains all the features of the video. Therefore, the features output in the last step can be directly classified into two categories after a fully connected layer to predict unsafe actions.
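To make the gate mechanics concrete, the following is a minimal sketch of a single GRU step in plain Python. It assumes the common formulation in which each weight matrix acts on the concatenation of the state and the input and the new state is $(1 - z_t) h_{t-1} + z_t \tilde{h}_t$; bias terms are omitted, and the names `gru_step`, `matvec`, `w_z`, `w_r` and `w_h` are illustrative rather than taken from the paper.

```python
import math


def sigmoid(v):
    """Logistic function, mapping any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def matvec(w, v):
    """Multiply matrix w (a list of rows) by vector v."""
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


def gru_step(x_t, h_prev, w_z, w_r, w_h):
    """One GRU step without biases (an assumption of this sketch).

    x_t and h_prev are lists of floats; each weight matrix has
    len(h_prev) rows and len(h_prev) + len(x_t) columns.
    """
    hx = h_prev + x_t                                   # concatenation [h, x]
    z = [sigmoid(a) for a in matvec(w_z, hx)]           # update gate
    r = [sigmoid(a) for a in matvec(w_r, hx)]           # reset gate
    rh = [ri * hi for ri, hi in zip(r, h_prev)]         # reset-scaled state
    h_cand = [math.tanh(a) for a in matvec(w_h, rh + x_t)]  # candidate state
    return [(1.0 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, h_prev, h_cand)]


if __name__ == "__main__":
    zeros = [[0.0, 0.0]]
    # With all-zero weights both gates equal 0.5 and the candidate is 0,
    # so the new state is half of the previous state.
    print(gru_step([1.0], [2.0], zeros, zeros, zeros))
```

Running `gru_step` over the TCN output sequence and keeping only the final state gives the single feature vector that is passed to the classifier.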

End to end action recognition model
The structure and parameter settings of the end-to-end deep action recognition model proposed in this paper are shown in Fig. 6. After scaling and standardization preprocessing, a power operation action video is taken as the input of the action recognition model. It is expressed as $X = (img_1, img_2, \dots, img_T)$; in this paper T = 32, so the shape of X is [32, 3, 448, 448].
For each image of shape [3, 448, 448], the spatial feature is extracted by the ResNet50 network, and the feature output for each image has 1000 dimensions. Therefore, after the input X passes through the ResNet50 network, the output shape is [32, 1000]. The 3-layer TCN network is used to extract local temporal features. The numbers of convolution kernels are set to (512, 256, 128). After the local temporal features are extracted by the TCN network, the output shape is [8, 128]. The number of GRU hidden nodes is set to 128, so the dimension is [128] after the global temporal features are extracted by GRU. Finally, a softmax classifier outputs the class of the action.
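The shape bookkeeping above can be traced with a small sketch. It assumes that the causal convolutions preserve the sequence length and that only the first two TCN layers are followed by stride-2 pooling; this is one hypothetical arrangement that yields the stated [8, 128] output from 32 frames, and the text does not confirm which layers are pooled. The function name `pipeline_shapes` and its parameters are illustrative.

```python
def pipeline_shapes(t=32, feat=1000, tcn_channels=(512, 256, 128),
                    pooled_layers=2, gru_hidden=128):
    """Trace tensor shapes through ResNet50 -> TCN -> GRU.

    Assumption: convolutions keep the sequence length, and only the first
    `pooled_layers` TCN layers are followed by stride-2 pooling.
    """
    shapes = [("input", (t, 3, 448, 448)), ("resnet50", (t, feat))]
    length = t
    for i, channels in enumerate(tcn_channels):
        if i < pooled_layers:
            length //= 2                     # stride-2 pooling halves the length
        shapes.append((f"tcn{i + 1}", (length, channels)))
    shapes.append(("gru", (gru_hidden,)))    # only the last GRU state is kept
    return shapes


if __name__ == "__main__":
    for name, shape in pipeline_shapes():
        print(name, shape)
```

The GRU therefore runs over only 8 steps instead of 32, which is the motivation given for downsampling with the TCN before the recurrent layer.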

Equipment target detection based on YOLOv3 model
The judgment of unsafe actions is closely related to the necessary equipment. Therefore, equipment target detection is an important part of unsafe action judgment.
For object detection, the YOLO model has achieved good results; it can quickly detect targets and mark their positions [41]. In this paper, the YOLOv3 model [42] is used for real-time detection of necessary equipment such as safety helmets, insulating gloves and five-prevention keys. For the Darknet53 part of the YOLOv3 model, this paper adopts transfer learning and retrains the model with our own labeled dataset. At present, the system detects ten kinds of equipment or objects, such as persons, helmets, heads, insulated gloves, insulated poles, five-prevention keys and electric boxes.
Fig. 6 The structure and parameter settings of the action recognition model

Unsafe action judgment based on rule classification
The judgment rules of unsafe actions are defined based on the operation rules of power industry.For example, in view of not wearing safety gloves and unlocking without five-prevention keys, a group of unsafe action judgment rules can be defined as follows: Here, we define only two unsafe actions.If new unsafe actions need to be defined, rules can be added.Based on the output of action recognition model and Yolo detection model, unsafe actions can be identified by using the rule classifier.
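One possible way to encode such rules is a lookup from the recognized action to the equipment that must be detected with it. The sketch below is hypothetical: the action labels, equipment labels, the `REQUIRED_EQUIPMENT` table and the function `is_unsafe` are illustrative names, not the rule set used in the paper.

```python
# Hypothetical rule table: action label -> equipment that must be detected.
REQUIRED_EQUIPMENT = {
    "electricity_testing": {"insulated_gloves"},
    "unlocking": {"five_prevention_key"},
}


def is_unsafe(action, detected):
    """Return True when a recognized action lacks any required equipment.

    `action` is the label from the action recognition model and `detected`
    is the set of equipment labels reported by the object detector.
    Actions without a rule are treated as safe (an assumption).
    """
    required = REQUIRED_EQUIPMENT.get(action, set())
    return not required.issubset(detected)


if __name__ == "__main__":
    print(is_unsafe("electricity_testing", {"person", "helmet"}))
    print(is_unsafe("unlocking", {"person", "five_prevention_key"}))
```

Under this layout, defining a new unsafe action only requires adding an entry to the table, which matches the extensibility described above.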

Experiment
To verify the real-time performance of the edge computing architecture, we built a proof-of-concept platform. The sensing terminal uses a Raspberry Pi 3B as the controller, connected to a camera with a built-in speaker, a focal length of 12 mm and a resolution of 1080p with zoom. Its operating system is Raspbian, and it connects to the LAN by Wi-Fi. The edge computing server has an Intel Xeon E5-2600 v3 CPU and an NVIDIA Quadro K4200 graphics card, and its operating system is Ubuntu Linux. The terminal and the edge computing server are in the same local area network. The central cloud uses an Alibaba Cloud enterprise-level general-purpose ecs.g6.2xlarge instance, and its operating system is Ubuntu Linux. The edge computing server connects to the central cloud through the Internet.

Time validation of the architecture
After the action recognition model and the YOLOv3 model are trained, the unsafe action detection entity is run on the terminal, on the edge computing server and in the cloud, in the three ways shown in Fig. 8.
The transmission time $T_{tr}$, the calculation time of unsafe action judgment $T_c$ and the total time $T_{tol}$ are given by the following formulas:
$$T_{tr} = T_{tol} - T_c = (t_{video\_r} - t_{video\_s}) + (t_{out\_r} - t_{out\_s}) \tag{9}$$
$$T_c = t_{out\_s} - t_{video\_r} \tag{10}$$
$$T_{tol} = t_{out\_r} - t_{video\_s} \tag{11}$$
where $t_{video\_s}$ is the time when the terminal sends the video, $t_{video\_r}$ is the time when the edge or cloud receives the video, $t_{out\_s}$ is the time when the identification service finishes sending the identification result, and $t_{out\_r}$ is the time when the terminal receives the identification result.
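The timing quantities can be computed from the four timestamps with a few lines of Python. In this sketch the transmission time is taken as the total time minus the calculation time, which is an assumption about how the two defined quantities relate; the function name `timing` is illustrative.

```python
def timing(t_video_s, t_video_r, t_out_s, t_out_r):
    """Return (T_tr, T_c, T_tol) in the same unit as the timestamps.

    t_video_s / t_out_r are read on the terminal clock;
    t_video_r / t_out_s are read on the edge or cloud server clock.
    """
    t_c = t_out_s - t_video_r      # calculation time on the server
    t_tol = t_out_r - t_video_s    # total time seen by the terminal
    t_tr = t_tol - t_c             # assumed: the remainder is transmission
    return t_tr, t_c, t_tol


if __name__ == "__main__":
    # Hypothetical timestamps in seconds, not measurements from the paper.
    print(timing(0.0, 0.25, 1.75, 2.0))
```

A practical advantage of this decomposition is that each difference uses timestamps from a single machine, so the terminal and server clocks do not need to be synchronized.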
In the three ways shown in Fig. 8, the time spent to determine whether a video action segment (K = 32) is unsafe is shown in Table 1.
In the experiment, the recognition time of cloud is relatively high, which is due to the low performance of the cloud server we applied for.However, it can be seen from the table that the transmission time of the edge computing is far lower than that of the cloud, and the recognition calculation time is far lower than that of the terminal.The determination result can be returned in about 2 seconds using the edge computing architecture, which can meet the requirements of real-time performance.

Verification of accuracy
To verify the effectiveness of the action recognition model based on TCN+GRU temporal feature extraction, we use the standard MSR Action3D dataset [43]. The MSR Action3D dataset records 20 kinds of actions performed by 10 subjects, and each subject performed each action 2-3 times. The resolution of the depth map is 640 × 240. As in reference [38], 116 sample actions of subjects 3 and 9 are taken as the testing set, and 451 sample actions of the other subjects are used as the training set.
The unsafe action judgment is made not on a complete action but on part of an action video. Therefore, to verify the recognition ability of our method on action segments, we use a sliding window to take part of the action frames as an action segment. The number of samples obtained with different sliding window lengths K is shown in Table 2.
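Segment generation with a sliding window can be sketched as follows. The stride of one frame and the function name `sliding_windows` are assumptions, since the step between consecutive windows is not stated.

```python
def sliding_windows(frames, k, stride=1):
    """Split a frame sequence into segments of k consecutive frames.

    Returns an empty list when the sequence is shorter than k.
    The default stride of 1 is an assumption of this sketch.
    """
    return [frames[i:i + k] for i in range(0, len(frames) - k + 1, stride)]


if __name__ == "__main__":
    video = list(range(40))                  # stand-in for 40 frames
    segments = sliding_windows(video, k=32)
    print(len(segments))                     # 40 - 32 + 1 = 9 segments
```

With stride 1, a video of N frames yields N - K + 1 segments, so shorter windows produce more samples from the same recordings.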
To verify the advantage of the TCN+GRU temporal feature extraction proposed in this paper, after the spatial features are extracted, TCN [38], GRU [30] and TCN+GRU are each used to extract temporal features, and then the action segments are classified. The experiment shows that, whatever the value of K, the accuracy of action recognition based on the TCN+GRU temporal feature extraction method is higher than that of the other two methods. The main reason is that the TCN+GRU method extracts both global and local features, so it is less affected by sample quality and better overcomes the shortcomings of a single feature extraction method. This demonstrates the superiority and rationality of the method proposed in this paper.

Detection of unsafe action in electric power operation
There are safety regulations for operations in the power plant, such as electric testing, unlocking, and opening or closing switches. For example, anyone performing electric testing must wear a safety helmet and gloves; otherwise the action is unsafe. In this experiment, we test the two kinds of unsafe actions shown in Fig. 10: one is electric testing without insulating gloves, and the other is unlocking without using five-prevention keys.

Video sampling
In the experiment, we asked four people to perform 25 groups of the safe and unsafe actions shown in Fig. 10, captured at 30 frames per second, and obtained 100 video samples. The action types of the video samples are shown in Table 3.

Action recognition
Action prediction does not recognize the whole action; it recognizes the specific action from a small part of the action frames. Therefore, when processing the action video, one image is taken every three frames, that is, 10 images are taken from one second of video. A sliding window of k = 32 frames is then used to collect action fragments, so the duration of one action segment is 3.2 seconds. Because some action videos are long, a total of 887 sets of action clips are obtained. 100 groups of action fragments are randomly selected as the verification set and the rest as the test set. Because the distinction between the two types of actions is obvious, the recognition rate for both types of actions is 100% with the proposed action recognition method.
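The sampling arithmetic in this paragraph can be checked with a short sketch. The function names `subsample` and `segment_duration` are illustrative, and keeping the first frame of every group of three is an assumption about which frame is retained.

```python
def subsample(frames, step=3):
    """Keep one frame out of every `step` frames (the first of each group)."""
    return frames[::step]


def segment_duration(k=32, fps=30, step=3):
    """Duration in seconds covered by k subsampled frames."""
    return k * step / fps


if __name__ == "__main__":
    one_second = list(range(30))             # 30 captured frames
    print(len(subsample(one_second)))        # 10 images per second
    print(segment_duration())                # 32 * 3 / 30 seconds
```

With the stated settings, each 32-frame segment spans 96 captured frames, which corresponds to the 3.2 seconds given above.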

Object detection
To detect objects with the YOLOv3 model, we labeled 556 images with the LabelImg tool. The detection objects include person, helmet, face, insulated gloves, pair key, insulating pole, hand and so on. The original YOLOv3 model is retrained on the labeled pictures to obtain a new model. Some object recognition results are shown in Fig. 11.

Unsafe action judgment
The YOLOv3 model is not stable for target detection, and a target may not be detected in every frame. Therefore, to improve the reliability of target detection, we run YOLOv3 on all 32 frames of an action segment. As long as a piece of equipment is detected in two images, it is considered present, and the unsafe action is then determined by the rules. The key is sometimes not detected because of occlusion and other reasons, so a safe action may be judged unsafe, which reduces the accuracy of unsafe action detection to 91% on the test set.
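The presence test over a segment amounts to counting the frames in which a label appears. The sketch below assumes each frame's detections are available as a set of label strings; the function name `equipment_present` and the labels are illustrative.

```python
def equipment_present(frame_detections, label, min_frames=2):
    """Decide whether `label` counts as present in a segment.

    `frame_detections` holds one set of detected labels per frame; the
    equipment is accepted once it appears in at least `min_frames` frames.
    """
    hits = sum(1 for labels in frame_detections if label in labels)
    return hits >= min_frames


if __name__ == "__main__":
    # Hypothetical 32-frame segment in which the key is visible in 2 frames.
    segment = [{"person"} for _ in range(32)]
    segment[5] = {"person", "five_prevention_key"}
    segment[20] = {"person", "five_prevention_key"}
    print(equipment_present(segment, "five_prevention_key"))
    print(equipment_present(segment, "insulated_gloves"))
```

A low threshold such as two frames tolerates missed detections, but an object that is occluded throughout the segment is still reported as absent, which is the failure case described above.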

Conclusion
In this paper, the problem of real-time judgment of unsafe actions in power operation is discussed, and an intelligent monitoring architecture based on edge cloud technology and a method for judging unsafe actions are explored. According to the above analysis and experiments, it can be seen that: 1. The unsafe action detection architecture based on edge cloud in this paper can solve the problem of network transmission delay and meet the needs of continuous learning and upgrading of the model. 2. By adding a GRU layer to extract the global temporal information, the proposed action recognition model increases the recognition ability for action segments. At the same time, combined with the results of equipment detection by the YOLOv3 model, unsafe actions can be judged by rule classification, and the purpose of early warning is achieved. Due to time limits, this work can still be improved, for example in how to better combine the action recognition model and the YOLOv3 model, and more work needs to be done in the future.
Fog computing extends cloud computing by introducing an intermediate fog layer between mobile devices and the cloud, which addresses cloud computing's inability to perceive location and its high latency. In 2016, the team [9] at Wayne State University gave the first formal definition of edge computing and studied its application scenarios. Then, with the joint release of edge computing reference architecture 3.0 [10] by ECC and AII in 2018, artificial intelligence solutions based on edge computing became a research hotspot.

Fig. 9 The recognition accuracy of action segments with different lengths in different models

2. In order to improve the accuracy of action recognition, an action recognition model is proposed in this paper, in which the TCN is used to extract local temporal features, and a recurrent neural network GRU is added to extract global temporal features.

Table 1 Time of different modes

Table 2 Number of samples of action fragments of different lengths generated from the MSR dataset