IoT-enabled directed acyclic graph in a Spark cluster

Real-time data streaming fetches live sensory segments of a dataset in a heterogeneous distributed computing environment. This process assembles data chunks at a rapid encapsulation rate through a streaming technique that bundles sensor segments into multiple micro-batches and extracts them into a repository. Recently, the acquisition process has been enhanced with an additional feature for exchanging IoT devices' datasets, which comprise two components: (i) sensory data and (ii) metadata. The body of sensory data includes record information, while the metadata consists of logs, heterogeneous events, and routing path tables used to transmit micro-batch streams into the repository. The real-time acquisition procedure uses a Directed Acyclic Graph (DAG) to extract live query outcomes from in-place micro-batches through MapReduce stages and returns a result set. However, several bottlenecks degrade performance during execution: (i) formation of homogeneous micro-batches only, (ii) complexity of dataset diversification, (iii) processing of heterogeneous data tuples, and (iv) a strictly linear DAG workflow. As a result, extracting event-enabled IoT datasets incurs large processing latency and additional cost. Thus, even a Spark cluster, which processes Resilient Distributed Datasets (RDDs) at a fast pace using random access memory (RAM), falls short of the expected robustness when processing IoT streams in the distributed computing environment. This paper presents an IoT-enabled Directed Acyclic Graph (I-DAG) technique that labels micro-batches at the stage of building a stream event and arranges stream elements with event labels. In the next step, heterogeneous stream events are processed through the I-DAG workflow, which supports non-linear DAG operations for extracting query results in a Spark cluster. The performance evaluation shows that I-DAG resolves the homogeneous stream event issue and provides an effective heterogeneous stream event solution for IoT-enabled stream processing in the Spark cluster.


Introduction
Real-time streaming empowers an organization to process live data feeds generated through an on-line data production system [1]. In the late 90s, the American computer scientist Peter J. Denning presented a streaming idea for keeping in-process bits available, so that complex calculations could be solved much faster than with traditional machine processing. This method helps create, process, and observe the data stream of an instrument and generate a statistical result set [2]. The complexity of such processing also includes the management of an enormous number of indices in non-tabular datasets, which ultimately raises the concept of big data management for handling large-scale datasets [6]. Several enterprises in the market offer big data management systems, such as SQLstream [7], TIBCO [8], IBM [9], Striim [10], and the Apache Software Foundation [11]. Among these, the Apache group offers several open-source big data stream engines under the Apache License 2.0, i.e., Flume [12], Spark [13], Storm [14], NiFi [15], Apex [16], Kafka [17], Samza [18], Flink [19], Beam [20], and Ignite [21], which include various streaming features, as shown in Table 1.
These streaming engines are programmed to handle several forms of data types, such as structured, unstructured, and semi-structured data [22]. These data types are generated through sources that include sensory devices and web-based intelligent portals [23]. An Internet of Things (IoT) device is a sensory device consisting of an intelligent processor, a sensor that detects and stores records in its cache storage, and an interface for exchanging datasets with global networks [24]. Such a device generates a continuous flow of data that requires persistent storage, and streaming engines categorize its data into three forms: unprocessed, processed, and replicas [25]. Unprocessed data is a non-filtered collection that holds only an association of tuples with indices, whereas processed data is the result of query extraction over the unprocessed data. A replica is a block of processed data ready to be exchanged with streaming engines to perform real-time analytics in a distributed computing environment [26], as shown in Fig. 1.
IoT devices also generate several metadata events, e.g., monitoring the temperature of factory devices through smart meters, recording a credit card transaction, and detecting an unwanted object in a surveillance camera [27]. These events are a crucial part of the metadata, along with logs and routing path information, and they direct streaming queries to identify data tuples in the repository [28]. By default, streaming through Apache engines involves a few steps: (i) stream sourcing, (ii) stream ingestion, (iii) stream storage, and (iv) stream processing [29]. Stream sourcing represents an IoT device that provides a continuous flow of datasets, and stream ingestion consumes the same sourced data chunks to queue the tasks inside a streaming engine systematically. The stream storage then formulates a micro-batch, a collection of live data feed having an adequate size s in time t sequentially, and stream processing enables the system to execute queries and retrieve a real-time result set [30], as shown in Fig. 2.
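The size-and-time micro-batch formation described above can be sketched as follows; this is a minimal illustration, not Spark's actual batching code, and the function name `micro_batch` and its parameters are hypothetical:

```python
import time

def micro_batch(stream, max_size, max_wait, now=time.monotonic):
    """Group a stream of records into micro-batches.

    A batch is emitted when it reaches max_size records (size s) or
    when max_wait seconds have elapsed since the batch was opened
    (time t), whichever comes first.
    """
    batch, opened = [], now()
    for record in stream:
        if not batch:
            opened = now()          # a new batch starts its time window
        batch.append(record)
        if len(batch) >= max_size or now() - opened >= max_wait:
            yield batch
            batch = []
    if batch:                       # flush the final partial batch
        yield batch

# Example: 7 sensor readings bundled into batches of at most 3
batches = list(micro_batch(range(7), max_size=3, max_wait=1.0))
```

Real engines additionally handle back-pressure and out-of-order arrival, which this sketch omits.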
The data transformation phase divides micro-batches into four further subtypes, i.e., local generation, file system (HDFS) generation, dataset-to-dataset generation, and cache generation [31]. This transformation process is considered lazy because it performs only an abstract extraction of datasets without any real action. Thus, stream processing requires a task route mapper that can redirect dataset extractions per query into the respective repository. For this, the streaming engine uses a built-in feature, the directed acyclic graph (DAG), that extracts micro-batches to respective column fields without directed cycles [32]. The DAG workflow consists of n MapReduce stages and transforms micro-batches through a scheduler, which transports the dataset through resource allocations using stage functions. By default, a simple DAG consists of Stage 0→1 stages, whereas a multi-purpose DAG involves Stage 0→n stages to transform a stream into a dataset, as shown in Fig. 2a and b. This workflow facilitates live query extraction from a micro-batch; however, it does not recognize the type of IoT data tuples during micro-batch formation. Thus, when processing IoT stream events, it encounters four problems: (i) homogeneous micro-batches, (ii) dataset diversification, (iii) heterogeneous data tuples, and (iv) the linear DAG workflow issue [33].
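The lazy, lineage-recording behaviour described above can be illustrated with a toy class; this is a minimal sketch of the idea, not Spark's RDD implementation, and the class name `LazyDataset` is hypothetical:

```python
class LazyDataset:
    """Toy lazy DAG: each transformation only records a node in the
    lineage; nothing executes until an action (collect) is called."""

    def __init__(self, data, lineage=None):
        self._data = data
        self.lineage = lineage or []

    def map(self, fn):
        return LazyDataset(self._data, self.lineage + [("map", fn)])

    def filter(self, pred):
        return LazyDataset(self._data, self.lineage + [("filter", pred)])

    def collect(self):
        # The action: replay the recorded DAG over the source data.
        out = self._data
        for kind, fn in self.lineage:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

# Two transformations are recorded but nothing runs until collect().
ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
result = ds.collect()
```

In Spark the recorded lineage additionally allows lost partitions to be recomputed, which is what makes RDDs resilient.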
This article proposes an IoT-enabled Directed Acyclic Graph (I-DAG) for heterogeneous stream events that minimize the processing discrepancy issue in data transformation. The presented I-DAG enhances workflow operation by reading labeled stream tags in heterogeneous event stream containers and scheduling workflow task processing in a spark cluster. Thus, I-DAG contains additional features of processing IoT tuples and managing the existing DAG properties mentioned below.
The significant contributions of I-DAG are highlighted as:
• A novel event stream tag manager
• A novel parser to filter heterogeneous event streams in the stream engine
• An innovative workflow manager that bypasses unnecessary tasks queued in the stages of a MapReduce operation:
  - Stage 0→1 I-DAG workflow
  - Stage 0→n I-DAG workflow
The remaining paper is organized in the following manner. The "Motivation" section discusses the benefits and complications; the "IoT-enabled directed acyclic graph (I-DAG)" section explains the proposed I-DAG model; the "Performance evaluation" section presents the experimental evaluation over the Spark cluster; and the "Conclusion" section presents the conclusion with future work.

Motivation
I-DAG is an enhancement of the existing workflow for executing event streams in Spark clusters. Consider the benefits and complications of a smart meter use case in a smart grid. Smart meters cope with on-ground streaming, which includes the continuous submission of record streams for grid analytics. A smart grid evaluates the functional and procedural performance of distribution end units through this stream. It simultaneously observes the performance of the smart meters themselves, i.e., stream accuracy, optimal workload management, and proper functioning of components. A smart grid generates a complicated scenario in bi-directional processing, where the system confirms the accuracy of a stream through the functionality of the source object. Thus, a smart grid cannot verify the accuracy of streaming analytics through the transformed dataset alone; it must also monitor the error accuracy of the smart meters. Therefore, it requires a streaming event analyzer that copes with Stage 0→n transformations concurrently, and I-DAG provides such features through label-based stream event analytics [34,35].
Smart meters generate heterogeneous IoT events concurrently through bi-directional streaming, which creates asynchrony problems in the smart grid, i.e., far more metadata than traditional processing produces and overwhelmed analytical accuracy. Thus, when the I-DAG technique is applied, it acquires cache containers to skip the few MapReduce tasks that a developer would normally omit from the programming model [36,37].
Nowadays, the world is moving towards an unpredictable scale of IoT devices and their streaming event analytics. This growth will increase drastically with time, and resource management will become a vital issue that must be handled on a priority basis. At that point, a customized directed acyclic graph for IoT event stream processing will fulfill this demand. Such an IoT-enabled directed acyclic graph would address future heterogeneous workflow event stream operations in the Spark cluster [38].

IoT-enabled directed acyclic graph (I-DAG)
From a functional perspective, we divide I-DAG into three sub-components:
• Label-based event streaming
• Heterogeneous stream transformation
• IoT-enabled DAG workflow

Label-based event streaming
Let IoT device events be a sequence of error, backup, and information messages represented as E_i, B_i, and I_i, where each message belongs to a sensory device Device_i in the distributed computing environment, as shown in Fig. 3. At each time interval t, the stream generated through a function f_i holds an array of event messages,

f_i(t) = [E_{1..x}, B_{1..y}, I_{1..z}].

Therefore, when a new occurrence of event messages arrives, the function representation changes to G[i++], and the individual event message collection at each node can be represented as

G[i++] = Σ_t f_i(t),

where G[i++] is a container managing the arrival of multiple event messages with x ≥ 0. To approximate the inner function elements, x(E[1..n]), y(B[1..n]), and z(I[1..n]) are added into the stream instruction set in proportions (E_i, x)++, (B_i, y)++, and (I_i, z)++, which returns an output approximation

Event_m = (E_i, x)++ + (B_i, y)++ + (I_i, z)++,

where Event_m > 0 and represents the container of processed heterogeneous event messages.
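The event container G and its per-label counts (x, y, z above) can be sketched in a few lines; this is an illustrative model, not the paper's implementation, and the class name `EventContainer` is hypothetical:

```python
from collections import Counter

LABELS = ("E", "B", "I")   # error, backup, information messages

class EventContainer:
    """Toy container G that accumulates labelled event messages and
    keeps per-label counts (the x, y, z proportions in the text)."""

    def __init__(self):
        self.messages = []
        self.counts = Counter()

    def append(self, label, payload):
        # Models G[i++]: one new event message arrives.
        assert label in LABELS, "unknown event label"
        self.messages.append((label, payload))
        self.counts[label] += 1

G = EventContainer()
for label, payload in [("E", "disk fault"), ("I", "reading=21.5"), ("E", "timeout")]:
    G.append(label, payload)
```

The container preserves arrival order while the counter gives the (E_i, x), (B_i, y), (I_i, z) totals in O(1) per event.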

Fig. 3 Label-based Heterogeneous Streaming Workflow
The individual data segments of E_i, B_i, and I_i arrive at nodes N_o, N_p, and N_q through the incremental function G[i++], which assembles segments in formation order. This order summarizes stream segments such that SE_{o,p,q} = Σ N_{o,p,q}. Since the constraints residing within N_o, N_p, and N_q are independent, the expectation can be retrieved as

E[N_{o,p,q}] = (1/2)(+1) + (1/2)(−1) = 0.

After that, by linearity of expectation,

E[SE_{o,p,q}] = Σ E[N_{o,p,q}],

where E[SE_{o,p,q}] manages the heterogeneous events with independent expectation parameters.

Lemma-4: averaging T_1 and T_2 copies of SE_{o,p,q}

Let A be the output of Algorithm 1, so E[A] = |SE_{o,p,q}| and Var[A] ≤ 2 E[A]^2. To reduce the variance, we average T_1 independent copies A_1, ..., A_{T_1}, B = (1/T_1) Σ A_j, so that Var[B] = Var[A]/T_1. Applying Chebyshev's inequality [40] with √2 ε > 1, the bound on the stream segment is obtained as

Pr[ |B − E[A]| ≥ ε E[A] ] ≤ Var[B] / (ε E[A])^2 ≤ 2 / (T_1 ε^2).

At this point a streaming bound δ could be obtained, but a dependence of 1/δ is still present. Therefore, we apply the Hoeffding lower-bound inequality [41]: execute the median function Z over the T_1·T_2 averaged copies B_1, ..., B_{T_1 T_2}, obtaining

Pr[ |Z − E[A]| ≥ ε E[A] ] ≤ δ when T_2 = O(log(1/δ)).

The stream approximation can thus be obtained as (1 ± ε)|SE_{o,p,q}| with probability at least 1 − δ. This stream approximation establishes that heterogeneous parameters can be managed in the I-DAG.

Heterogeneous stream transformation
The distributed stream elements with probability α(t) are sampled at time t with a computed average that keeps the error ε constant over time. To perform encapsulation, reservoir sampling is used because it adds the first k stream elements to the sample and then admits the t-th item with probability k/t. Thus, for every t and i ≤ t, the sample probability is evaluated as

P_{i,t} = k/t,

and for t + 1, the sample probability becomes

P_{i,t+1} = (k/t) · (t/(t+1)) = k/(t+1).

This is mandatory because the inter-connected heterogeneous IoT tuples must be incorporated within the interval of time. Processing t + 1 with i ≤ t eventually reduces the role of s_i and admits s_{t+1} into the sample. The frequency table of stream events uses the event arrival probability P_{i,t+1} in a space-saving manner, like a count-min sketch, to bring order between transformed heterogeneous stream events, as shown in Fig. 3. This space-saving function provides an approximation f̂_x of f_x for every x and consumes memory of O(1/ε). Therefore, when a stream vector G[n] is processed with G[i] ≥ 0 for all i ≤ t, it estimates the heterogeneous stream Ĝ of G as the minimum counter over the d hash rows, and the frequency table of the heterogeneous events stream can be retrieved from it. This establishes that the heterogeneous events stream is accessible and enlisted in the I-DAG.
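The reservoir sampling step above (first k elements kept, t-th element admitted with probability k/t) can be sketched as follows; this is the textbook algorithm, with the function name `reservoir_sample` chosen for illustration:

```python
import random

def reservoir_sample(stream, k, rng=random):
    """Uniform sample of k elements from a stream of unknown length:
    the first k elements fill the reservoir; element t (t > k)
    replaces a uniformly chosen slot with probability k/t."""
    sample = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            sample.append(x)
        else:
            j = rng.randrange(t)   # uniform in [0, t)
            if j < k:              # happens with probability k/t
                sample[j] = x
    return sample

s = reservoir_sample(range(1000), k=10)
```

A short induction shows every element survives in the final sample with probability exactly k/t, matching the P_{i,t} = k/t derivation above.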

Lemma-5: Ĝ[i] ≥ g[i]

The estimated count of the heterogeneous events stream Ĝ[i] remains ≥ g[i] for all i, where g(i) is the true update frequency. For a stream element with the collision indicator I_{o,p,q} = 1 if h_p(i) = h_p(k) and 0 otherwise, the excess count in row p can be retrieved as

Ĝ_p[i] − g[i] = Σ_{k ≠ i} g[k] · I_{o,p,q}.

Now, this stream is well connected and the counters cannot be read independently. Therefore, we apply Markov's inequality together with pairwise independence of the hashes, obtaining

Pr[ Ĝ_p[i] − g[i] ≥ ε ||g||_1 ] ≤ 1/e

for a fixed value of i, as shown in Figs. 4 and 5. Thus, we observe that the events are synchronized to a central container with independence of accessibility.
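A count-min sketch with the one-sided guarantee of Lemma-5 (the estimate never undercounts) can be sketched as below; this is a generic illustration using Python's built-in `hash` with per-row salts rather than the pairwise-independent hash family the analysis assumes:

```python
import random

class CountMinSketch:
    """Tiny count-min sketch: d rows of w counters. The estimate
    min over rows of row[p][h_p(x)] satisfies estimate >= true count,
    overcounting only on hash collisions."""

    def __init__(self, w, d, seed=0):
        rng = random.Random(seed)
        self.w, self.d = w, d
        self.rows = [[0] * w for _ in range(d)]
        self.salts = [rng.getrandbits(32) for _ in range(d)]

    def _h(self, p, x):
        return hash((self.salts[p], x)) % self.w

    def add(self, x, count=1):
        for p in range(self.d):
            self.rows[p][self._h(p, x)] += count

    def estimate(self, x):
        # Minimum over rows: each row only ever overcounts.
        return min(self.rows[p][self._h(p, x)] for p in range(self.d))

cms = CountMinSketch(w=256, d=4)
for _ in range(5):
    cms.add("E")   # five error events
cms.add("B")       # one backup event
```

Choosing w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉ gives the ε||g||_1 overcount bound with probability 1 − δ.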

I-DAG workflow
The events generated through IoT devices, ordered sequentially subject to P[∀j : A_{o,p} ≥ ε||G||_1], are scheduled onto the I-DAG, which consists of an identifier Locator_{I-DAG} that reads event labels in the source file and shuffles the pointer between n stages, as shown in Fig. 6. To perform the stage-predictor evaluation, the workflow targets the transition stage(n) → stage(n+1) with Locator_{I-DAG} : stage(n) → stage(n+1), keeping the error under a loss function ϑ : stage(n+1) × stage(n+1) → R. The predictor error can be obtained as

err(Locator_{I-DAG}) = E[ ϑ(Locator_{I-DAG}(stage(n)), stage(n+1)) ].

Fig. 5 Heterogeneous Event Stream Transformation
This predictor error manages the discrepancies of interconnection in the I-DAG workflow.
The Locator_{I-DAG}, with an approximated finite set of heterogeneous event labels, can be sampled with S_{I-DAG} = stage(n)_1, ..., stage(n)_m. The workflow loss functions are categorized into two types: (i) regression and (ii) classification. The regression loss on the predictor Locator_{I-DAG} is expressed as

ϑ(ŷ, y) = (ŷ − y)^2,

and the classification loss on the predictor Locator_{I-DAG} is expressed as

ϑ(ŷ, y) = 1 if ŷ ≠ y, and 0 otherwise.

Thus, I-DAG is ready to facilitate the independent heterogeneous IoT entries with the prediction locator.

Performance evaluation
The I-DAG technique is incorporated into a Spark cluster with a virtualized distributed environment, as shown in Table 2.

Environment
The Spark cluster's master node consists of an Intel Xeon processor with a core computation capacity of 8 CPU units, 64 GB RAM, and persistent storage media of a 2 TB disk and a 1 TB SSD. The remaining worker nodes consist of Intel Core i7 processors with 4 cores, 16 GB RAM, and persistent storage media of a 1 TB disk along with a 500 GB SSD. The virtual environment consists of VirtualBox 5.2 running five virtual machines, as mentioned in Table 3.
The AWS dataset contains a collection of 4500 files storing stream data, with a total volume of 8.6 GB.
The experiments performed on the AWS dataset consist of (i) Events labeling, (ii) Labeling error factor, (iii) Joining heterogeneous streams, (iv) Heterogeneous dataframes, (v) Workflow endurance, and (vi) Cluster performance.

Metrics of evaluation
I-DAG is evaluated on two performance metrics: (i) merging of disjoint streams and (ii) stage bypassing. Disjoint stream merging overlaps the individual elements and strengthens connectivity between heterogeneous streams. Stage bypassing reduces unnecessary RAM consumption and decreases the redundant garbage values that appear as a result of regular stage processing.

Results
This section discusses the experimental results generated through the proposed I-DAG task-processing approach.

Events labeling
IoT devices generate events of errors, backups, and record information in the form of text data, which the stream engine receives for micro-batch transformation. Event labels mark the stream elements with an I-DAG tag sequence derived from a hash function. This tagging establishes trust: a tag no longer requires a paired prefix or postfix, and the transformation function uses the same hash to bundle stream elements into the core engine. The labeling function consists of several sub-routines: (i) data ingest, (ii) element queuing, (iii) stream chunk tagger, (iv) hash element, and (v) element dispatcher.
The data_ingest function fetches an enormous number of individual stream elements from several devices and uses heap memory to enlist element arrivals in the stream engine. The element_queuing feature then assigns indices to the respective heap entries in FCFS (first come, first served) order. The stream_chunk_tagger method assigns a label Stream_{E_i, B_i, I_i} to each indexed entry and allocates a hash_element value for identifying any particular index in the stream; finally, the dispatcher encapsulates the tags and transforms the event streams, as shown in Table 4. The stream engine recognizes tagged events much more effectively than regular heterogeneous events, as shown in Fig. 7.
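The queuing-tagging-hashing pipeline above can be sketched as one function; this is an illustrative model of the sub-routines, not the paper's implementation, and the tag format `Stream_<label>` and the SHA-1 hash choice are assumptions:

```python
import hashlib
from collections import deque

def stream_chunk_tagger(elements):
    """Sketch of the labeling sub-routines: queue elements in FCFS
    order, assign an index and a label tag, and derive a short hash
    for later index lookup."""
    queue = deque(elements)            # (ii) element queuing, FCFS
    tagged, idx = [], 0
    while queue:
        label, payload = queue.popleft()
        # (iv) hash element: a stable digest over index, label, payload
        digest = hashlib.sha1(f"{idx}:{label}:{payload}".encode()).hexdigest()[:8]
        tagged.append({
            "index": idx,              # assigned in arrival order
            "tag": f"Stream_{label}",  # (iii) stream chunk tagger
            "hash": digest,
            "payload": payload,
        })
        idx += 1
    return tagged                      # (v) ready for the dispatcher

events = [("E", "sensor fault"), ("B", "snapshot ok"), ("I", "reading=42")]
tags = stream_chunk_tagger(events)
```

Each tagged record carries everything the transformation step needs to bundle like elements without re-scanning the raw stream.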

Labeling error factor
Errors in the event labeling process appear due to improper placement of a tag. They can arise for several reasons: (i) improper ingest, (ii) queue out of bound, (iii) abnormal tagging, (iv) inaccuracy in a tag, and (v) partial release of an element. During tag formation, a stream element can be ingested improperly due to concurrent intakes at the same time. The queue responsible for managing the stream may suffer a buffer overflow if the tagging time interval grows beyond the usual timeline. The stream could also be released without a proper index and hash function due to continuous inaccurate tag application. The errors in label-based events, as well as healthy stream formation, can be observed in Fig. 8.

Heterogeneous streams join
The tagged stream elements require a join operation to combine like events in the stream engine. This is necessary because of the live ingestion of heterogeneous stream feeds from enormous numbers of IoT devices. Functionally, the join operation parses tagged stream elements adjacent to each other, so that streaming ingestion falls within the same time range, along with a conjunctive condition that joins elements with similar tagging. This conjunction function correlates element n to n+1 through a forward-feed chain in the data transformation environment. The stream element join is executed through the syntax Stream_{E_i} = join parse(Tag_n, Tag_{n+1}) → (Tag_n, Tag_{n+1}), keeping the group-by phrase as a priority along with aggregate operators. The heterogeneous stream joins of error, backup, and information record events through query operators can be observed in Table 5.
In the same way, the heterogeneous stream joins of error, backup, and information record events through the diff operator can be observed in Table 6. The comparative effectiveness of the tagged heterogeneous stream joins can be observed in Fig. 9.
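A tag-keyed hash join, combining elements that carry the same tag, can be sketched as follows; this illustrates the conjunctive join described above in plain Python rather than Spark's join API, and the function name `join_by_tag` is hypothetical:

```python
from collections import defaultdict

def join_by_tag(stream_a, stream_b):
    """Hash join of two tagged streams: build buckets keyed by tag
    from one stream, then pair each element of the other stream with
    every bucket entry sharing its tag (like events joined together)."""
    buckets = defaultdict(list)
    for elem in stream_a:
        buckets[elem["tag"]].append(elem)     # build side
    joined = []
    for elem in stream_b:                     # probe side
        for match in buckets.get(elem["tag"], []):
            joined.append((match, elem))
    return joined

a = [{"tag": "Stream_E", "v": 1}, {"tag": "Stream_B", "v": 2}]
b = [{"tag": "Stream_E", "v": 3}, {"tag": "Stream_I", "v": 4}]
pairs = join_by_tag(a, b)
```

A production join would additionally bound the build side by the ingestion time window mentioned in the text; here only the tag equality condition is shown.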

Heterogeneous data frames
The label-based stream elements are stored in a heterogeneous data frame, which comprises a table with data-structure properties. This data table assigns a sequence of indices to the stream elements, which are declared as equal-length vectors. The frame is categorized into several subsections: (i) header, (ii) data row, and (iii) cell. The header represents the top line of the tabular structure and manages column names only. A data row depicts a stream element with a prefixed index value, and a cell is a stream element member of a row. The data frame supports event-labeling transformation through a prior metadata information set of the stream elements. Thus, tagged stream elements are retrieved much more efficiently than traditional stream elements, as shown in Fig. 10.

Workflow endurance
Workflow endurance is measured by the issues encountered while heterogeneous data streams are in process during stage execution. The IoT-enabled workflow uses data frames to learn about tagged stream elements already enlisted in the data table. Therefore, when a stream join is processed on the source file, the table allows the I-DAG workflow to skip unnecessary steps wherever they are encountered. This step-skipping practice is illustrated through two case studies: (i) Stage 0→1 and (ii) Stage 0→n. Stage 0→1 consists of two stages with three operations in total: flatMap, Map, and Reduce. If a labeled stream element has already been processed through the Map functionality, control can jump from flatMap directly to the Reduce operation. In the case of Stage 0→n, when the compiler parses a source file that contains a schedule, control bypasses the unscheduled operations in the stages. Thus, it reduces the energy consumption and computing capacity required of a cluster, while avoiding functional latency issues. The Stage 0→1 and Stage 0→n performance can be observed in Tables 7 and 8.
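The stage-bypassing idea (skip an operation whose label is already recorded as processed) can be sketched as a small driver; this is an illustrative model of the behaviour, not Spark's scheduler, and the function name `run_stages` is hypothetical:

```python
def run_stages(value, stages, already_done):
    """Sketch of I-DAG stage bypassing: each stage is (name, fn);
    stages whose label is recorded in the data table as already
    processed are skipped instead of re-executed."""
    executed = []
    for name, fn in stages:
        if name in already_done:   # bypass, e.g. jump flatMap -> Reduce
            continue
        value = fn(value)
        executed.append(name)
    return value, executed

# Stage 0->1 case: Map was already applied to this labeled element,
# so control jumps from flatMap straight to Reduce.
stages = [
    ("flatMap", lambda v: v * 2),
    ("Map",     lambda v: v + 1),
    ("Reduce",  lambda v: v - 3),
]
result, ran = run_stages(10, stages, already_done={"Map"})
```

The skipped stage costs no CPU time and produces no intermediate values, which is where the RAM and garbage-value savings described under the evaluation metrics come from.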

Cluster performance
The parameters measuring cluster performance comprise stage activity, which includes map and reduce task processing and the exchange of I/O operations. I-DAG enables a cluster to switch between stage tasks depending on the source file's requirements. If a task does not need to produce map values, it bypasses the operation and moves on to the next task, unlike a traditional DAG, which has to go through each individual operation, producing I/O latency along with additional operational cost, as shown in Fig. 11.

Conclusion
This paper proposes a novel technique that identifies different IoT devices' stream events over a graph processing layer in a Spark cluster. The proposed approach provides a broad analytical perspective on how stream events are generated, followed by their convergence in heterogeneous form. Finally, the I-DAG workflow processes individual IoT devices' stream events with a cost-effective mechanism. It reduces the graph workload and decreases the I/O traffic load in the Spark cluster.