Advances, Systems and Applications

Rough fuzzy model based feature discretization in intelligent data preprocess

Abstract

Feature discretization is an important preprocessing technology for massive data in industrial control. It improves the efficiency of edge-cloud computing by transforming continuous features into discrete ones, so as to meet the requirements of high-quality cloud services. Compared with other discretization methods, discretization based on rough set has achieved good results in many applications because it can make full use of the known knowledge base without any prior information. However, the equivalence class of a rough set is an ordinary set, which makes it difficult to describe the fuzzy components in the data, and its accuracy is low on some complex data types in the big data environment. Therefore, we propose a rough fuzzy model based discretization algorithm (RFMD). Firstly, we use fuzzy c-means clustering to obtain the membership of each sample to each category. Then, we fuzzify the equivalence classes of the rough set with the obtained memberships, and establish the fitness function of a genetic algorithm based on the rough fuzzy model to select the optimal discrete breakpoints on the continuous features. Finally, we compare the proposed method with the discretization algorithms based on rough set, information entropy, and chi-square test on remote sensing datasets. The experimental results verify the effectiveness of our method.

Introduction

Edge-cloud computing builds on the core of cloud computing and the capability of edge computing, forming an elastic cloud platform on the edge infrastructure [1,2,3]. As an extension of the centralized cloud, the edge cloud provides low-latency, self-organizing, and schedulable distributed cloud services for terminals [4, 5]. As shown in Fig. 1, the edge cloud, the centralized cloud, and the Internet of Things terminals constitute an end-to-end technical architecture of “cloud-edge-terminal collaboration”. By allocating computing, network forwarding, storage, and other work to the edge for intelligent data preprocessing, the cloud pressure, response delay, and bandwidth cost can be reduced [6, 7]. Feature discretization is an important reduction technology for massive data in industrial control [8, 9]. By transforming continuous features into discrete ones that are easier to understand, use, and interpret, it can filter abnormal data, reduce system load, and improve the performance of intelligent algorithms [10], so as to improve the efficiency of edge-cloud computing and, to a certain extent, prevent network attacks [11, 12].

Fig. 1 Edge-cloud computing framework

In recent years, feature discretization has gradually become a key technology of intelligent data preprocessing, which has attracted extensive attention all over the world and produced fruitful research results [13]. Obtaining the optimal discretization scheme has been proved to be an NP-complete problem [14]. Most current methods realize the discretization of continuous features based on specific partition criteria, such as the equal width algorithm [15], the equal frequency algorithm [15], the discretization algorithm based on information entropy [16], and the discretization algorithm based on chi-square test [17]. However, due to the complex correlations between features, relatively fixed partition criteria cannot comprehensively measure the discrete intervals. In addition, the distribution of sample attribute values in a dataset is often difficult to learn. Therefore, the discretization results obtained by these algorithms are often not the optimal scheme in specific application scenarios, and may even fail to meet the accuracy requirements of the system [18].

Compared with the above discretization methods, discretization based on rough set [19] has achieved good results in many applications because it can make full use of the known knowledge base without any prior information. On the other hand, since feature discretization is a complex constrained optimization problem [13], it is very difficult to solve this kind of problem by traditional methods, and the genetic algorithm is more effective than traditional methods because of its group search strategy and a calculation method that does not depend on gradient information [20]. Through crossover and mutation operations, the genetic algorithm balances global and local search ability. Compared with other swarm intelligence optimization algorithms, the genetic algorithm can use more mature analysis methods to estimate the convergence rate [21]. Therefore, the combination of rough set and genetic algorithm can obtain better results than other methods. Chen et al. propose a genetic algorithm for discretization [22]. They conduct experiments on several datasets from the UCI machine learning repository, using several optimization strategies to continuously refine the genetic algorithm. The experimental results show that the genetic algorithm is effective in both time complexity and accuracy. Ren et al. propose a heuristic genetic algorithm to discretize continuous attributes of a decision table [23]. The algorithm takes the importance of continuous cut sets as heuristic information and constructs a new operator, which not only keeps the identifiability of the selected cut sets but also improves the local search ability of the algorithm. Dai uses the rough set model to construct the individual fitness function of a genetic algorithm to evaluate the uncertainty of the information system, balancing system consistency with a minimal set of breakpoints [24].
With the advantage of rough set in dealing with incomplete information, the above methods can use the strong search ability of the genetic algorithm to obtain the minimum number of breakpoints while ensuring that the compatibility of the system is not destroyed. However, in the big data environment, there are often large amounts of complex types of data, and uncertainty in decision-making is caused by unclear category boundaries. The equivalence class of a rough set is an ordinary set, which makes it difficult to describe the fuzzy components in the data, and the accuracy obtained on these complex data types is low. Fuzzy set is a mathematical tool used to describe fuzziness, and the combination of fuzzy set and rough set can better deal with the uncertainty of data [25].

For this reason, we propose a rough fuzzy model based discretization algorithm (RFMD). The main contributions of this article are as follows: (1) we create a fuzzy set for each category in the dataset, and use fuzzy c-means [26] to get the membership function of each category; (2) we use the membership function to fuzzify the equivalence relationship of rough set, and establish the fitness function of genetic algorithm [18] based on rough fuzzy model [27] to select the best breakpoints on continuous features.

The rest of this paper is arranged as follows: the second part introduces the basic concepts of feature discretization, rough set, and fuzzy set; the third part describes the discretization algorithm based on rough fuzzy model; the fourth part introduces the experimental environment and datasets, and analyzes and discusses the experimental results; the fifth part summarizes the full text.

Background

We introduce the basic process of feature discretization and the binary coding of feature discretization in the genetic algorithm. Then, we explain the related definitions of rough sets and fuzzy sets, leading to the rough fuzzy model.

Feature discretization and genetic coding

Discretization is to divide the continuous features (also known as continuous attributes) into a finite number of subintervals by some specific method, and associate these subintervals with a group of discrete values (also known as breakpoints) [28]. Through discretization, the data scale can be greatly reduced, thus improving the efficiency of massive data processing at the edge nodes of edge-cloud computing, and greatly relieving the pressure of transmitting data back to the centralized cloud [11]. The basic process of feature discretization is shown in Fig. 2.

Fig. 2 Feature discretization process

In the beginning, the values of the continuous attributes are sorted and duplicate values are deleted to get a set of candidate breakpoints; then, the partition breakpoints of the continuous attributes are selected from the candidate breakpoint set, and the algorithm decides whether to divide an interval or merge adjacent subintervals according to its judgment criteria; if the termination condition is satisfied, the discretization result is output; otherwise, the remaining breakpoints are selected from the candidate breakpoint set to continue the discretization of the attributes.
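The first step above (sort, deduplicate, collect candidate breakpoints) can be sketched as follows; the function name and the convention of using midpoints between adjacent distinct values are illustrative assumptions, not taken from the paper:

```python
def candidate_breakpoints(values):
    """Sort a continuous attribute, drop duplicate values, and take the
    midpoints of adjacent distinct values as candidate breakpoints."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

# Four distinct values yield three candidate breakpoints.
cuts = candidate_breakpoints([2.0, 1.0, 2.0, 4.0, 3.0])  # [1.5, 2.5, 3.5]
```

The discretization algorithm then repeatedly selects a subset of these candidates and tests it against its judgment criteria.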

Genetic algorithm is a probabilistic evolutionary algorithm for global optimization [29], which has achieved good performance in many optimization problems [30]. Genetic algorithms use a fitness function to evaluate the quality of individuals in the population, and transform the problem-solving process into a process similar to the crossover and mutation of chromosomal genes in biological evolution. In many complex combinatorial optimization problems, the genetic algorithm can quickly obtain better optimization results than some conventional optimization algorithms [20, 21]. However, the genetic algorithm cannot directly deal with the parameters of the problem space, so the problem to be solved must be expressed as a chromosome or individual in genetic space by coding. This conversion is called genetic coding [30]. Genetic coding adopts the following criteria [18]: (1) completeness: all candidate solutions in the problem space can be represented as chromosomes in genetic space; (2) soundness: every chromosome in genetic space corresponds to a candidate solution in the problem space; (3) non-redundancy: chromosomes and candidate solutions are in one-to-one correspondence.

The discretization problem can be seen as the selection of candidate breakpoints [30]. Each chromosome in the population represents a possible discretization scheme. The length of chromosome is equal to the number of candidate breakpoints. We use binary coding to encode the candidate breakpoints. Each bit in the binary code corresponds to a candidate breakpoint. The values of ‘1’ and ‘0’ represent that the corresponding breakpoint is selected and not selected, respectively. The set of selected candidate breakpoints is a possible discretization scheme.
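As a minimal illustration of this binary coding (names are ours, not from the paper), a chromosome decodes to the subset of candidate breakpoints whose bit is 1:

```python
def decode(chromosome, candidates):
    """Map a binary chromosome to a discretization scheme: keep the
    candidate breakpoints whose bit is set to 1."""
    return [cut for bit, cut in zip(chromosome, candidates) if bit == 1]

candidates = [1.5, 2.5, 3.5, 4.5]          # one bit per candidate breakpoint
scheme = decode([1, 0, 1, 0], candidates)  # selects 1.5 and 3.5
```

Completeness, soundness, and non-redundancy all hold here: every 4-bit string is a valid scheme, and distinct strings give distinct schemes.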

Rough sets

Rough set is a mathematical theory proposed by Pawlak to solve the problem of data uncertainty [31]. Rough set regards knowledge as the ability to classify objects in the universe. An equivalence relation on the universe represents a knowledge.

Definition 2.1

The two-tuple K = (U, ℛ) is a knowledge base, where U is the universe and ℛ is a family of equivalence relations on U.

Definition 2.2

For x ∈ U and R ∈ ℛ, the equivalence class of x under R is [x]R = {y ∈ U | (x, y) ∈ R}. The quotient set U/R = {[x]R | x ∈ U} is called a knowledge.

Definition 2.3

Suppose U is a non-empty finite universe, and R is a binary equivalence relation on U. For any X ⊆ U, the lower and upper approximations of X with respect to R are:

$$ {R}_{-}X=\left\{x\in U|{\left[x\right]}_R\subseteq X\right\} $$
(1)
$$ {R}^{-}X=\left\{x\in U|{\left[x\right]}_R\cap X\ne \varnothing \right\} $$
(2)

Discretization based on rough set evaluates the result of discretization according to the degree of dependence of X on R. The degree of dependence of X on R is:

$$ {\gamma}_R(X)=\frac{\mid {R}_{-}X\mid }{\mid U\mid } $$
(3)

Where, | · | is the cardinality of the set. It is easy to see that discretization based on rough set can make full use of the known knowledge base without any prior information. However, [x]R is an ordinary set, which makes it difficult to describe the fuzzy components in data.
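Equations (1)–(3) can be checked with a small sketch in which a knowledge U/R is represented as a partition of the universe into equivalence classes (this set-based representation and the function names are illustrative assumptions):

```python
def lower_approx(partition, X):
    """Eq. (1): union of equivalence classes fully contained in X."""
    out = set()
    for cls in partition:
        if cls <= X:
            out |= cls
    return out

def upper_approx(partition, X):
    """Eq. (2): union of equivalence classes that intersect X."""
    out = set()
    for cls in partition:
        if cls & X:
            out |= cls
    return out

def dependence(partition, X, U):
    """Eq. (3): degree of dependence of X on the partition."""
    return len(lower_approx(partition, X)) / len(U)

U = {1, 2, 3, 4}
partition = [{1, 2}, {3}, {4}]        # U/R
X = {1, 2, 3}
gamma = dependence(partition, X, U)   # |{1, 2, 3}| / |U| = 0.75
```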

Rough fuzzy model

Fuzzy set is a mathematical theory proposed by Zadeh to describe the fuzziness of data [32]. Compared with the ordinary set which can only express crisp concepts, fuzzy sets can represent not only crisp concepts, but also fuzzy concepts.

Definition 2.4

Let A be a mapping from a set X to [0, 1]; A is called a fuzzy set on X, and the function A(x) is the membership of x in the fuzzy set A. The fuzzy set A is expressed as follows when X is a finite set and when X is an infinite set, respectively:

$$ A={\sum}_{i=1}^nA\left({x}_i\right)/{x}_i $$
(4)
$$ A={\int}_XA(x)/x $$
(5)

Through the membership function A(x), the equivalent classes of rough sets can be fuzzified to obtain the rough fuzzy model [33]. If X is a finite set, then the cardinality of fuzzy set A is:

$$ \mid A\mid =\sum \limits_{x\in X}A(x) $$
(6)

Definition 2.5

Let U be a non-empty finite universe, R a binary equivalence relation on U, and A a fuzzy set on U. For any x ∈ U, the lower and upper approximations of x in the rough fuzzy model established by R and A are:

$$ {R}_{-}A(x)=\underset{y\in U}{\operatorname{inf}}\left\{A(y)|\left(x,y\right)\in R\right\} $$
(7)
$$ {R}^{-}A(x)=\underset{y\in U}{\sup}\left\{A(y)|\left(x,y\right)\in R\right\} $$
(8)

Accordingly, the approximate accuracy of the above rough fuzzy model is:

$$ \eta =\frac{\mid {R}_{-}A\mid }{\mid {R}^{-}A\mid } $$
(9)

Since R₋A(x) ≤ R⁻A(x), we have 0 ≤ η ≤ 1. The closer the value of η is to 1, the higher the overall approximation accuracy. In the application of edge-cloud computing, the massive data collected often contain incomplete, fuzzy, and other uncertain information. The rough fuzzy model has the advantages of both rough set and fuzzy set. It can make full use of the known knowledge base without any prior information, and it uses a membership function to fuzzify the equivalence relation so as to describe the fuzzy components inside the data, thereby improving the accuracy of massive data processing at the edge nodes of edge-cloud computing [27, 34].
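A small sketch of Eqs. (6)–(9), with the fuzzy set stored as a membership dictionary (the representation and names are illustrative assumptions):

```python
def rf_lower(partition, A):
    """Eq. (7): R_-A(x) = inf of memberships over x's equivalence class."""
    return {x: min(A[y] for y in cls) for cls in partition for x in cls}

def rf_upper(partition, A):
    """Eq. (8): R^-A(x) = sup of memberships over x's equivalence class."""
    return {x: max(A[y] for y in cls) for cls in partition for x in cls}

def eta(partition, A):
    """Eq. (9), using the fuzzy cardinality of Eq. (6)."""
    lo, up = rf_lower(partition, A), rf_upper(partition, A)
    return sum(lo.values()) / sum(up.values())

A = {'x1': 0.8, 'x2': 0.5, 'x3': 0.4}   # memberships to one class
partition = [{'x1', 'x2'}, {'x3'}]
# eta = (0.5 + 0.5 + 0.4) / (0.8 + 0.8 + 0.4) = 1.4 / 2.0 = 0.7
```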

Rough fuzzy model based discretization algorithm

We introduce the process of calculating membership by fuzzy c-means clustering. Then, we detail the fitness function based on rough fuzzy model. Finally, we describe the whole process of the proposed method.

Membership calculated by fuzzy c-means clustering

Fuzzy c-means embodies the essence of fuzzy theory [35]. Compared with the hard clustering of k-means, fuzzy c-means provides more flexible clustering results [36]. In most cases, the objects in a dataset cannot be divided into crisp clusters: it is hard to assign an object to a specific cluster, and errors may occur. Therefore, it is necessary to assign a weight between each object and each cluster to indicate the degree to which the object belongs to the cluster. Probability-based methods can also give such weights, but it is difficult to determine an appropriate statistical model. Therefore, fuzzy c-means, with its natural and non-probabilistic characteristics, is a better choice [37].

The dataset can be represented by information table S = (U, R, V, f). Where, U is the non-empty finite universe, R is the set of the attributes, V is the range of attribute values, and f is the mapping function from the object to the range of attribute values. Suppose that U contains N samples, C categories, M attributes, xih is the value of sample xi on the h-th attribute, 1 ≤ i ≤ N, 1 ≤ h ≤ M, and the class center cj of the j-th class is initialized to \( {c}_j^0 \), 1 ≤ j ≤ C, then the membership of xi to the j-th class is initialized as follows:

$$ {u}_{ij}^0=1/{\sum}_{k=1}^C\left(\frac{\sum_{h=1}^M{\left({x}_{ih}-{c}_{jh}\right)}^2}{\sum_{h=1}^M{\left({x}_{ih}-{c}_{kh}\right)}^2}\right) $$
(10)

Where, cjh is the value of the current class center cj on the h-th attribute. After the current membership is obtained, the class center cj is updated to:

$$ {c}_j^1={\sum}_{i=1}^N\left({\left({u}_{ij}^0\right)}^2\times {x}_i\right)/{\sum}_{i=1}^N{\left({u}_{ij}^0\right)}^2 $$
(11)

uij and cj are updated iteratively until the following termination condition is met:

$$ {\max}_{ij}\left\{|{u}_{ij}^{t+1}-{u}_{ij}^t|\right\}<\varepsilon $$
(12)

Where, t is the number of iterations, and ε is the error threshold. In this way, the membership of each sample in U is obtained, as shown in Algorithm 1.

Algorithm 1
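The iteration of Algorithm 1 can be sketched in one dimension as follows (fuzzifier m = 2, matching the squared memberships of Eqs. (10)–(12); the guard against zero distances and all names are illustrative assumptions):

```python
def fcm(samples, centers, eps=1e-4, max_iter=100):
    """Toy one-dimensional fuzzy c-means: returns (memberships, centers)."""
    u_prev = None
    for _ in range(max_iter):
        # Eq. (10): membership of sample i to class j from squared distances.
        u = []
        for x in samples:
            d2 = [max((x - c) ** 2, 1e-12) for c in centers]  # avoid /0
            u.append([1.0 / sum(d2[j] / d2[k] for k in range(len(centers)))
                      for j in range(len(centers))])
        # Eq. (11): update class centers with squared memberships.
        centers = [sum(u[i][j] ** 2 * samples[i] for i in range(len(samples)))
                   / sum(u[i][j] ** 2 for i in range(len(samples)))
                   for j in range(len(centers))]
        # Eq. (12): stop when memberships change by less than eps.
        if u_prev and max(abs(u[i][j] - u_prev[i][j])
                          for i in range(len(samples))
                          for j in range(len(centers))) < eps:
            break
        u_prev = u
    return u, centers

u, c = fcm([0.0, 0.1, 0.9, 1.0], [0.0, 1.0])
# Samples near 0 end with high membership to the first cluster.
```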

Fitness function based on rough fuzzy model

After obtaining the membership of each sample in U to each category, we create a fuzzy set for each category:

$$ {A}_j\left({x}_i\right)={u}_{ij},1\le i\le N,1\le j\le C $$
(13)

Where, Aj is the corresponding fuzzy set of the j–th class. According to (7) and (8), we can calculate the lower and upper approximations of xi in the rough fuzzy model established by attribute set R and Aj:

$$ {R}_{-}{A}_j\left({x}_i\right)=\underset{y\in U}{\operatorname{inf}}\left\{{A}_j(y)|\left({x}_i,y\right)\in R\right\} $$
(14)
$$ {R}^{-}{A}_j\left({x}_i\right)=\underset{y\in U}{\sup}\left\{{A}_j(y)|\left({x}_i,y\right)\in R\right\} $$
(15)

Accordingly, the average approximation accuracy of the rough fuzzy sets of all classes is:

$$ \overset{\_}{\eta }=\frac{1}{C}{\sum}_{j=1}^C\frac{\mid {R}_{-}{A}_j\mid }{\mid {R}^{-}{A}_j\mid } $$
(16)

Since the optimal discretization scheme is the best trade-off between data consistency and the number of breakpoints [38], the fitness function should be determined by both the average approximation accuracy and the number of breakpoints. Let |D| be the number of breakpoints removed by discretization scheme D; the fitness function is as follows:

$$ {\displaystyle \begin{array}{l} Fit=\alpha \times \mid D\mid +\beta \times \overset{\_}{\eta}\\ {} where\ \alpha \ge 0,\beta \ge 0, and\ \alpha +\beta =1\end{array}} $$
(17)

Where, α and β are weight coefficients. The selection of these parameters is an open problem, as no single choice can adapt to all datasets. Generally, the rationality of the parameters is judged according to the characteristics of the datasets and experimental observation [39]. |D| determines the magnitude of the reduction in the number of breakpoints, while \( \overset{\_}{\eta } \) controls the accuracy of the data. If α is much greater than β, the accuracy of the data will be very low; if α is far less than β, the number of breakpoints will be large, so the purpose of discretization cannot be achieved. Generally, in order to obtain as few breakpoints as possible while ensuring the accuracy of the data, 0.1 ≤ α ≤ 0.5, i.e., 0.5 ≤ β ≤ 0.9. The purpose of this paper is to improve classification accuracy after discretization, and classification accuracy is directly related to the average approximation accuracy of the rough fuzzy sets. Therefore, we set β larger than α (α = 0.1, β = 0.9), and achieve good results in the experiments.
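A hedged sketch of Eq. (17); normalizing |D| by the total number of candidate breakpoints so that both terms lie in [0, 1] is our assumption for illustration, not something specified in the paper:

```python
def fitness(reduced, total, eta_bar, alpha=0.1, beta=0.9):
    """Eq. (17): Fit = alpha * |D| + beta * eta_bar, with |D| (the number
    of removed breakpoints) normalized by the candidate count here so the
    two terms are on comparable scales (an assumed scaling)."""
    assert alpha >= 0 and beta >= 0 and abs(alpha + beta - 1.0) < 1e-12
    return alpha * (reduced / total) + beta * eta_bar

# A scheme that removes 50 of 100 candidates with eta_bar = 0.9:
fit = fitness(50, 100, 0.9)   # 0.1 * 0.5 + 0.9 * 0.9 = 0.86
```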

Based on this fitness function, we iteratively perform genetic operation to find the optimal breakpoint set on continuous features. The whole process is shown in Algorithm 2. At first the membership function of each category is obtained through Algorithm 1, and the fitness function based on rough fuzzy model is established. Then, for each individual in the population, the average approximation accuracy of the corresponding discretization scheme is obtained by calculating the upper and lower approximations of all samples. Finally, in each genetic operation, the fitness of all individuals in the population is calculated by the number of breakpoints and the average approximation accuracy of the discretization scheme, and the global variable is updated by the individual with the highest fitness. When the accuracy requirement of the system is met or the set number of iterations is exceeded, the program is stopped and the optimal discretization scheme is output. Otherwise, the genetic algorithm will continue to be executed until the termination conditions are met.
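The genetic loop described above can be sketched as follows; the specific operators (truncation selection, single-point crossover, bit-flip mutation) and all parameter values are simplified placeholders chosen for illustration, not the paper's exact configuration, and `evaluate` stands in for the fitness of Eq. (17):

```python
import random

def run_ga(num_bits, evaluate, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Evolve binary chromosomes (breakpoint selections) under `evaluate`
    and return the best individual found across all generations."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(num_bits)]
           for _ in range(pop_size)]
    best = max(pop, key=evaluate)
    for _ in range(generations):
        # Truncation selection: keep the fitter half as parents.
        pop.sort(key=evaluate, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size:
            a, b = rng.choice(parents), rng.choice(parents)
            if rng.random() < crossover_rate:          # single-point crossover
                p = rng.randrange(1, num_bits)
                a, b = a[:p] + b[p:], b[:p] + a[p:]
            for child in (a, b):
                children.append([1 - g if rng.random() < mutation_rate else g
                                 for g in child])      # bit-flip mutation
        pop = children[:pop_size]
        cand = max(pop, key=evaluate)
        if evaluate(cand) > evaluate(best):            # update global best
            best = cand
    return best

# Toy fitness standing in for Eq. (17): reward removing breakpoints.
best = run_ga(8, lambda ind: 1.0 - sum(ind) / 8)
```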

Algorithm 2

Rough fuzzy model versus rough model

The fuzzification and rough set components of RFMD enable it to reason about uncertainty. Figure 3 is a simple example illustrating the advantages of RFMD over discretization methods based on rough set. Suppose the dataset contains three samples (x1, x2, and x3), the corresponding attribute values are v1, v2, and v3, and the corresponding categories are C1, C1, and C2. Through fuzzy c-means, the memberships of the three samples to C1 and C2 are: C1(x1) = 0.8, C1(x2) = 0.5, C1(x3) = 0.4, C2(x1) = 0.2, C2(x2) = 0.5, C2(x3) = 0.6. When the dataset needs to be divided into two intervals, there are two discretization schemes to choose from, shown in Fig. 3b and c. We can see that the membership of x2 is quite different from that of x1; in comparison, the membership of x2 is closer to that of x3. Obviously, the division in Fig. 3c is more reasonable.

Fig. 3 A simple example illustrating that RFMD has advantages over rough set-based discretization methods

We use RFMD and rough set-based discretization methods to discretize the original information table in Fig. 3a, and verify the effectiveness of RFMD by comparing the discretization results:

(1) RFMD selects the best discretization scheme by comparing the average approximation accuracies η1 and η2 of Fig. 3b and c. In Fig. 3b, the equivalence classes under Attribute are {x1, x2} and {x3}. According to (14) and (15), the lower approximations are Attribute₋C1(x1) = inf{0.8, 0.5} = 0.5, Attribute₋C1(x2) = inf{0.8, 0.5} = 0.5, Attribute₋C1(x3) = inf{0.4} = 0.4, Attribute₋C2(x1) = inf{0.2, 0.5} = 0.2, Attribute₋C2(x2) = inf{0.2, 0.5} = 0.2, Attribute₋C2(x3) = inf{0.6} = 0.6, and the upper approximations are Attribute⁻C1(x1) = sup{0.8, 0.5} = 0.8, Attribute⁻C1(x2) = sup{0.8, 0.5} = 0.8, Attribute⁻C1(x3) = sup{0.4} = 0.4, Attribute⁻C2(x1) = sup{0.2, 0.5} = 0.5, Attribute⁻C2(x2) = sup{0.2, 0.5} = 0.5, Attribute⁻C2(x3) = sup{0.6} = 0.6. Then |Attribute₋C1| = 0.5 + 0.5 + 0.4 = 1.4, |Attribute₋C2| = 0.2 + 0.2 + 0.6 = 1.0, |Attribute⁻C1| = 0.8 + 0.8 + 0.4 = 2.0, |Attribute⁻C2| = 0.5 + 0.5 + 0.6 = 1.6. According to (16), η1 = (1.4/2.0 + 1.0/1.6)/2 = 0.6625. Similarly, in Fig. 3c, the equivalence classes under Attribute are {x1} and {x2, x3}; then |Attribute₋C1| = 0.8 + 0.4 + 0.4 = 1.6, |Attribute₋C2| = 0.2 + 0.5 + 0.5 = 1.2, |Attribute⁻C1| = 0.8 + 0.5 + 0.5 = 1.8, |Attribute⁻C2| = 0.2 + 0.6 + 0.6 = 1.4. According to (16), η2 = (1.6/1.8 + 1.2/1.4)/2 = 0.8730. It can be seen that η2 > η1, that is, the discretization scheme in Fig. 3c is more accurate than that in Fig. 3b, which is consistent with the conclusion drawn from the previous analysis.

(2) Rough set-based discretization methods use (3) as the evaluation standard of system compatibility after discretization. In Fig. 3b, |Attribute₋C1| = |{x1, x2}| = 2 and |Attribute₋C2| = |{x3}| = 1, so γ1 = γAttribute(C1) + γAttribute(C2) = (2 + 1)/3 = 1. Similarly, in Fig. 3c, |Attribute₋C1| = |{x1}| = 1 and |Attribute₋C2| = |∅| = 0, so γ2 = γAttribute(C1) + γAttribute(C2) = (1 + 0)/3 = 0.3333. Since γ1 > γ2, the discretization methods based on rough set will choose the scheme in Fig. 3b, so the best discretization scheme cannot be obtained.
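The arithmetic above can be reproduced with a short script; the per-class memberships are those listed for Fig. 3, and the helper name is ours:

```python
C1 = {'x1': 0.8, 'x2': 0.5, 'x3': 0.4}
C2 = {'x1': 0.2, 'x2': 0.5, 'x3': 0.6}

def eta_bar(partition, fuzzy_sets):
    """Eq. (16): average approximation accuracy over all classes.
    Each member of an equivalence class shares the same inf/sup value."""
    total = 0.0
    for A in fuzzy_sets:
        lo = sum(len(cls) * min(A[y] for y in cls) for cls in partition)
        up = sum(len(cls) * max(A[y] for y in cls) for cls in partition)
        total += lo / up
    return total / len(fuzzy_sets)

eta1 = eta_bar([{'x1', 'x2'}, {'x3'}], [C1, C2])   # 0.6625  (Fig. 3b)
eta2 = eta_bar([{'x1'}, {'x2', 'x3'}], [C1, C2])   # ~0.8730 (Fig. 3c)
# eta2 > eta1, so RFMD prefers the partition of Fig. 3c.
```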

In summary, RFMD not only makes full use of the known knowledge base to generate rules, as rough set-based discretization methods do, but also fully considers the uncertainty caused by the fuzzy components in the data, so samples with large internal component differences will not be classified into the same interval during discretization, thereby obtaining a discretization scheme with higher precision.

Experiments

We introduce the experimental environment and datasets. Then, we compare the optimal breakpoint set obtained by the RFMD algorithm with the discretization results of current mainstream methods, mainly in terms of the number of intervals, data consistency, and classification accuracy.

Data source

The datasets used in this paper are as follows: (1) a Landsat 8 image from the northwestern region of Zhejiang Province, China, and a GF-2 image from Lingshui County, Hainan Province, China, as shown in Fig. 4. The Landsat 8 satellite data contain seven bands, while the GF-2 satellite data contain four bands [40]. In the experiment, the objects on the Landsat 8 image were divided into seven categories: broadleaf, town, needles, farmland, lei bamboo, water, and moso bamboo; the objects on the GF-2 image were divided into five categories: construction, bare land, farmland, vegetation, and water. (2) Two methylation datasets, covering N6-methyladenine (6mA) and N4-methylcytosine (4mC) [41, 42]. The three attributes of the first methylation dataset are mean, model prediction, and interpulse duration ratio; the three attributes of the second methylation dataset are error, model prediction, and interpulse duration ratio. (3) The banknote authentication dataset extracted from banknote-like images [43], which is divided into genuine and forged banknotes and contains four attributes, namely variance, skewness, kurtosis, and entropy.

Fig. 4 Area used for study

Configuration of experimental environment

In order to verify the effectiveness of the proposed method, all algorithms were executed on a computer with an Intel(R) Core(TM) i5-5200U CPU @ 2.20 GHz, 12 GB of RAM, and a 512 GB hard disk. Visualization, programming, simulation, testing, and numerical calculation were implemented in the MATLAB R2016a environment. Radiometric calibration of the images, atmospheric correction, and comparison of results before and after discretization were performed in the ENVI 5.3 environment.

Datasets

The ground reflection or emission spectral signal obtained by a remote sensor is recorded by pixel. A pixel whose interior contains only one type of ground object is called a pure pixel. In most cases, however, a pixel contains many kinds of surface features; such a pixel is called a mixed pixel, and it records the comprehensive spectral information of various types of ground objects. Several areas covering the seven categories were randomly selected from the Landsat 8 image and labeled. After integration, they were used as the training samples to be discretized, with a total of 2621 samples: 308 cases were broadleaf, 245 were town, 322 were needles, 675 were farmland, 296 were lei bamboo, 262 were water, and 513 were moso bamboo. We used another group of samples with the same per-class counts as the test set. Let N be the number of samples and C be the number of categories; then the initial fuzzy segmentation matrix of the training set is:

$$ {PM}^0=\left[\begin{array}{cccccc}{f}_1\left({x}_1\right)& {f}_1\left({x}_2\right)& .& .& .& {f}_1\left({x}_N\right)\\ {}{f}_2\left({x}_1\right)& {f}_2\left({x}_2\right)& .& .& .& {f}_2\left({x}_N\right)\\ {}.& .& .& .& .& .\\ {}.& .& .& .& .& .\\ {}.& .& .& .& .& .\\ {}{f}_C\left({x}_1\right)& {f}_C\left({x}_2\right)& .& .& .& {f}_C\left({x}_N\right)\end{array}\right] $$
(18)

Where,

$$ {\displaystyle \begin{array}{l}{f}_j\left({x}_i\right)=\left\{\begin{array}{cc}1,& {x}_i\ \mathrm{belongs}\ \mathrm{to}\ \mathrm{class}\ j\\ {}0,& \mathrm{otherwise}\end{array}\right.\\ {}\mathrm{where}\ 1\le i\le N\ \mathrm{and}\ 1\le j\le C\end{array}} $$
(19)
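The construction of PM⁰ from hard labels (Eqs. 18–19) can be illustrated as follows; the function name and label encoding are ours:

```python
def initial_partition_matrix(labels, num_classes):
    """Eq. (18)/(19): row j, column i is 1 if sample i belongs to
    class j, and 0 otherwise."""
    return [[1 if labels[i] == j else 0 for i in range(len(labels))]
            for j in range(num_classes)]

# Four samples with class labels 0, 1, 1, 2 over three classes:
PM0 = initial_partition_matrix([0, 1, 1, 2], 3)
# [[1, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 1]]
```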

In the beginning, the above segmentation matrix is substituted into (11) to initialize the cluster center of each category. Then, all the pixels contained in each band were sorted and deduplicated according to their brightness values, and the initial breakpoints of the seven bands numbered 1314, 1517, 1056, 1211, 1086, 1920, and 1832, totaling 9936.

Similarly, in the GF-2 image, there were 7554 training samples to be discretized: 2094 cases were construction, 775 were bare land, 1478 were farmland, 2251 were vegetation, and 956 were water. We used another group of samples with the same per-class counts as the test set. All the pixels contained in each band were sorted and deduplicated according to their brightness values; the initial breakpoints of the four bands numbered 3685, 3769, 2535, and 757, totaling 10,746.

In the methylation datasets, there were 3709 training samples to be discretized: 1290 cases were 6mA and 2419 were 4mC. A total of 1500 samples were used for testing: 500 cases were 6mA and 1000 were 4mC. All the values contained in each attribute of the first methylation training set were sorted and deduplicated, and the initial breakpoints of the three attributes numbered 1718, 1748, and 960, totaling 4426; for the second methylation training set, the initial breakpoints of the three attributes numbered 564, 1748, and 960, totaling 3272.

In the banknote authentication dataset, there were 1072 training samples to be discretized: 562 cases were genuine banknotes and 510 were forged banknotes. A total of 300 samples were used for testing: 200 were genuine and 100 were forged. All the values contained in each attribute were sorted and deduplicated, and the initial breakpoints of the four attributes numbered 1052, 996, 1015, and 940, totaling 4003.

Our method was compared with RS-GA [24], EDiRa [16], CVD [17], and RLGA [18], mainly in terms of data consistency and the number of intervals. Finally, we trained a neural network classifier with the discretized samples of each method, and verified the effectiveness of the proposed method by comparing the classification accuracy obtained by each method.

Data consistency and number of breakpoints

The discretization results obtained on Landsat 8 image by RFMD, RS-GA, EDiRa, CVD, and RLGA are shown in Table 1 and Table 2.

Table 1 Number of discrete intervals in each band of Landsat 8 image
Table 2 Number of data errors in Landsat 8 image

It can be seen that the number of intervals obtained by the RFMD algorithm is 487, the fewest among all algorithms, with no data errors. The number of intervals of the RS-GA algorithm is the largest, reaching 570, followed by the EDiRa algorithm with 520; the numbers of data errors of these two algorithms are 5 and 13, respectively. The number of intervals of the CVD algorithm is only 17 more than that of RFMD, but its number of data errors is the largest of all algorithms, with 17 errors. The number of intervals of RLGA is 493, and its number of data errors is 2, second only to RFMD. Table 3 and Table 4 show the number of intervals in each band and the data inconsistency obtained by RFMD, RS-GA, EDiRa, CVD, and RLGA on the GF-2 image.

Table 3 Number of discrete intervals in each band of GF-2 image
Table 4 Number of data errors in GF-2 image

It can be seen that the number of intervals obtained by the RFMD algorithm is 1035, the fewest among all algorithms, with no data errors. The number of intervals of the RS-GA algorithm is the largest, reaching 1391, followed by the EDiRa algorithm with 1307; the numbers of data errors of these two algorithms are 14 and 25, respectively. The number of intervals of the CVD algorithm is 118 more than that of RFMD, and its number of data errors is the largest of all algorithms, at 30. RLGA has 1078 intervals and 7 data errors, second only to RFMD. Table 5 and Table 6 show the number of intervals in each attribute and the data inconsistency obtained by RFMD, RS-GA, EDiRa, CVD, and RLGA on the first methylation dataset.

Table 5 Number of discrete intervals in each attribute of the first methylation dataset
Table 6 Number of data errors in the first methylation dataset

It can be seen that RFMD obtains 537 intervals and 12 data errors, both the fewest among all algorithms. RS-GA produces the most intervals, 669, followed by EDiRa with 571; these two algorithms have 80 and 113 data errors, respectively. CVD produces 26 more intervals than RFMD and has the most data errors, 259. RLGA produces 556 intervals with 71 data errors, second only to RFMD. Tables 7 and 8 show the number of intervals in each attribute and the data inconsistency obtained by RFMD, RS-GA, EDiRa, CVD, and RLGA on the second methylation dataset.

Table 7 Number of discrete intervals in each attribute of the second methylation dataset
Table 8 Number of data errors in the second methylation dataset

It can be seen that RFMD obtains 715 intervals, the fewest among all algorithms, with no data errors. RS-GA produces the most intervals, 871, followed by EDiRa with 782; these two algorithms have 6 and 11 data errors, respectively. CVD produces 36 more intervals than RFMD and has the most data errors, 15. RLGA produces 722 intervals with 3 data errors, second only to RFMD. Tables 9 and 10 show the number of intervals in each attribute and the data inconsistency obtained by RFMD, RS-GA, EDiRa, CVD, and RLGA on the banknote authentication dataset.

Table 9 Number of discrete intervals in each attribute of the banknote authentication dataset
Table 10 Number of data errors in the banknote authentication dataset

It can be seen that RFMD obtains 27 intervals, the fewest among all algorithms, with no data errors. RS-GA produces the most intervals, 39, followed by EDiRa with 37; these two algorithms have 1 and 2 data errors, respectively. CVD produces 8 more intervals than RFMD and has the most data errors, 3. RLGA produces 30 intervals with no data errors, second only to RFMD in the number of intervals.
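Throughout these comparisons, "data errors" denotes data inconsistency after discretization: samples that fall into identical discrete intervals on every attribute yet carry different class labels. A minimal sketch of how such a count can be computed (the majority-label convention below is an assumption, not necessarily the authors' exact implementation):

```python
from collections import defaultdict

def count_data_errors(discretized_rows, labels):
    """Count samples whose discretized feature vector also occurs
    with a different class label (data inconsistency)."""
    groups = defaultdict(list)
    for row, label in zip(discretized_rows, labels):
        groups[tuple(row)].append(label)
    errors = 0
    for members in groups.values():
        majority = max(set(members), key=members.count)
        # every sample disagreeing with the majority label of its
        # equivalence class counts as one error
        errors += sum(1 for m in members if m != majority)
    return errors

rows = [(1, 2), (1, 2), (1, 2), (3, 4)]
labels = ["water", "water", "town", "water"]
print(count_data_errors(rows, labels))  # 1
```

A discretization with fewer intervals generally merges more samples into each equivalence class, which is why interval count and data errors pull against each other in the tables above.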

Although the discretization criteria adopted by EDiRa and CVD are reasonable to some extent, their relatively fixed partition criteria cannot comprehensively evaluate the discrete intervals. In addition, both EDiRa and CVD require the distribution of sample attribute values in the dataset to improve the accuracy of interval partition. RS-GA adopts a rough set based discretization criterion, so it achieves good results without any prior information; however, it cannot describe the fuzzy components in the data, and its performance often degrades on datasets with complex types. RLGA introduces a reinforcement learning mechanism into the crossover and mutation operations to improve the search efficiency of the genetic algorithm; it keeps the data error at a low level while constantly seeking solutions with fewer intervals. However, like RS-GA, its fitness function is based only on rough sets and lacks the ability to describe the fuzzy components in the data. RFMD combines the advantages of rough sets and fuzzy sets, fully considers the fuzziness of the data and the correlation among attributes, and determines breakpoints across multiple continuous variables through evolutionary search. As a result, RFMD adapts to most datasets with complex types and obtains the best discretization results among the five algorithms. The key differences among these methods are summarized in Table 11.

Table 11 Key differences among the mentioned discretization methods
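The fuzzy memberships that let RFMD describe fuzzy components in the data come from fuzzy c-means clustering. As a point of reference, here is a minimal fuzzy c-means sketch (not the authors' implementation; the fuzzifier m, iteration count, and initialization are assumptions):

```python
import numpy as np

def fcm_memberships(X, c=2, m=2.0, n_iter=100, seed=0):
    """Minimal fuzzy c-means: return an (n_samples, c) membership
    matrix whose rows sum to 1."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), c))
    U /= U.sum(axis=1, keepdims=True)  # start from a valid fuzzy partition
    for _ in range(n_iter):
        W = U ** m                                    # fuzzified weights
        centers = (W.T @ X) / W.sum(axis=0)[:, None]  # weighted means
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)                         # avoid divide-by-zero
        inv = d ** (-2.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)      # standard FCM update
    return U

# two well-separated 1-D clusters -> near-crisp memberships
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
U = fcm_memberships(X, c=2)
print(np.round(U, 2))
```

On ambiguous samples the rows of U stay far from 0/1, and it is exactly this graded information that the rough set equivalence classes in RS-GA and RLGA discard.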

Classification accuracy

We trained a neural network classifier with the samples discretized by each of the five algorithms and obtained the classification results on the Landsat 8 and GF-2 images, as shown in Tables 12 and 13.

Table 12 Classification results in Landsat 8 image
Table 13 Classification results in GF-2 image
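Before the classifier is trained, each continuous feature value is mapped to the index of the discrete interval delimited by the selected breakpoints. A minimal sketch of that mapping (the breakpoint values below are illustrative, not those found by any of the algorithms):

```python
import numpy as np

def discretize(X, breakpoints_per_feature):
    """Map each continuous feature to the index of the interval
    delimited by its sorted breakpoints."""
    X = np.asarray(X, dtype=float)
    cols = [np.searchsorted(np.sort(bps), X[:, j], side="right")
            for j, bps in enumerate(breakpoints_per_feature)]
    return np.stack(cols, axis=1)

X = np.array([[0.2, 10.0],
              [0.7, 35.0]])
bps = [[0.5], [20.0, 30.0]]       # illustrative breakpoints
print(discretize(X, bps))         # interval indices: [[0 0], [1 2]]
```

A feature with k breakpoints thus yields k + 1 discrete intervals, which is how the interval counts in the preceding tables relate to the breakpoint sets each algorithm selects.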

It can be seen that RFMD achieves the best classification accuracy among the five algorithms. RS-GA, EDiRa, and RLGA produce fewer data errors than CVD and accordingly achieve higher classification accuracy. Figure 5 shows the classification map of the Landsat 8 image obtained by RFMD: the texture of the ground features is clear, the boundaries between different object types are distinct, and there are almost no noise spots. The seven categories of broadleaf, town, needles, farmland, lei bamboo, water, and moso bamboo are effectively identified. Figure 6 shows the classification map of the GF-2 image obtained by RFMD; again the texture of the ground features is clear, the boundaries between object types are very distinct, and the five categories of construction, bare land, farmland, vegetation, and water are effectively identified.

Fig. 5 Classification effect map of Landsat 8 image obtained by RFMD

Fig. 6 Classification effect map of GF-2 image obtained by RFMD

Tables 14, 15, and 16 show the classification results of the five algorithms on the first methylation dataset, the second methylation dataset, and the banknote authentication dataset, respectively. RFMD again achieves the best classification accuracy among all algorithms, confirming that the discretization scheme it obtains also performs well in terms of classification accuracy.

Table 14 Classification results in the first methylation dataset
Table 15 Classification results in the second methylation dataset
Table 16 Classification results in the banknote authentication dataset

Conclusion and future work

The data collected by edge nodes are often large in scale and complex in type, with incomplete, fuzzy, and other uncertain information. To lighten the system load, decrease data inconsistency, and relieve pressure on the centralized cloud, a discretization algorithm based on a rough fuzzy model (RFMD) is proposed for intelligent data preprocessing in edge-cloud computing. The main contributions of this paper are as follows: (1) we create a fuzzy set for each category and initialize all cluster centers according to the attribute values of the samples and the initial fuzzy partition matrix; (2) we use fuzzy c-means to obtain the membership function of each category and establish the fitness function of the genetic algorithm based on the rough fuzzy model; (3) for each individual in the population, the average approximation accuracy of the corresponding discretization scheme is obtained by calculating the upper and lower approximations of all samples; (4) in each genetic operation, the fitness of every individual is computed from the number of breakpoints and the average approximation accuracy of its discretization scheme, and a global variable reserves the individual with the highest fitness, yielding the optimal discretization scheme; (5) simulation experiments on real remote sensing datasets show that the proposed method achieves good results in the number of discrete intervals, data consistency, and classification accuracy.
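The fitness evaluation described in steps (2) through (4) can be sketched as follows. The exact weighting between breakpoint count and average approximation accuracy is not given in this excerpt, so the combining form and the weight alpha below are assumptions:

```python
import numpy as np

def approximation_accuracy(interval_ids, memberships):
    """Average rough-fuzzy approximation accuracy of a discretization.

    interval_ids : discrete interval index of each sample (the
                   equivalence class induced by the breakpoints)
    memberships  : (n_samples, n_classes) fuzzy memberships, e.g. from FCM

    The lower (upper) approximation assigns each sample the min (max)
    membership over its equivalence class; accuracy is the ratio of the
    two fuzzy cardinalities, averaged over classes."""
    ids = np.asarray(interval_ids)
    M = np.asarray(memberships, dtype=float)
    accs = []
    for k in range(M.shape[1]):
        mu = M[:, k]
        lower = upper = 0.0
        for block in np.unique(ids):
            in_block = mu[ids == block]
            lower += in_block.min() * len(in_block)
            upper += in_block.max() * len(in_block)
        accs.append(lower / upper if upper > 0 else 1.0)
    return float(np.mean(accs))

def fitness(n_breakpoints, interval_ids, memberships, alpha=0.5):
    """Hypothetical GA fitness: reward approximation accuracy, penalize
    breakpoints (alpha and the penalty form are assumptions)."""
    acc = approximation_accuracy(interval_ids, memberships)
    return alpha * acc + (1.0 - alpha) / (1.0 + n_breakpoints)

# a partition whose blocks are pure in the fuzzy sense scores 1.0
crisp = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
print(approximation_accuracy([0, 0, 1, 1], crisp))  # 1.0
```

When the memberships are crisp 0/1 values this reduces to the classical rough set approximation accuracy, which is consistent with RFMD generalizing the rough set fitness used by RS-GA and RLGA.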

Future work includes: (1) comparing the performance of the proposed method across multiple classifiers to optimize the algorithm model and further improve the efficiency of edge-cloud computing; (2) testing and improving the proposed method on different datasets to expand its scope of application and further reduce the cost of data analysis and security management in the edge cloud.

Availability of data and materials

The Landsat 8 and GF-2 datasets, the methylation datasets, and the banknote authentication dataset used to support the findings of this study are included within the article. All data and materials in this article are available.

References

  1. Taleb T, Samdanis K, Mada B et al (2017) On multi-access edge computing: a survey of the emerging 5G network edge cloud architecture and orchestration. IEEE Commun Surveys Tutorials 19(3):1657–1681
  2. Pan J, Mcelhannon J (2018) Future edge cloud and edge computing for internet of things applications. IEEE Internet Things J 5(1):439–449
  3. Fernando N, Loke SW, Rahayu W et al (2019) Computing with nearby mobile devices: a work sharing algorithm for mobile edge-clouds. IEEE Transact Cloud Comput 7(2):329–343
  4. Rodrigues TG, Suto K, Nishiyama H et al (2017) Hybrid method for minimizing service delay in edge cloud computing through VM migration and transmission power control. IEEE Trans Comput 66(5):810–819
  5. Wu H, Li X, Deng Y (2020) Deep learning-driven wireless communication for edge-cloud computing: opportunities and challenges. J Cloud Comp 9:21
  6. Jarray A, Karmouch A, Salazar J et al (2017) Efficient resource allocation and dimensioning of media edge clouds infrastructure. J Cloud Comp 6:27
  7. Liu H, Eldarrat F, Alqahtani H et al (2018) Mobile edge cloud system: architectures, challenges, and approaches. IEEE Syst J 12(3):2495–2508
  8. Garcia S, Luengo J, Saez JA et al (2013) A survey of discretization techniques: taxonomy and empirical analysis in supervised learning. IEEE Trans Knowl Data Eng 25(4):734–750
  9. Chen Q, Huang M, Wang H et al (2018) A feature preprocessing framework of remote sensing image for marine targets recognition. In: 2018 OCEANS - MTS/IEEE Kobe Techno-Oceans (OTO), pp 1–5
  10. Simon HA (1996) The sciences of the artificial, 3rd edn. MIT Press, Cambridge
  11. Dbouk T, Mourad A, Otrok H et al (2019) A novel ad-hoc mobile edge cloud offering security services through intelligent resource-aware offloading. IEEE Trans Netw Serv Manag 16(4):1665–1680
  12. Liu J, Wu J, Sun L et al (2020) Image data model optimization method based on cloud computing. J Cloud Comp 9(1):1
  13. Ramirezgallego S, Garcia S, Mourinotalin H et al (2016) Data discretization: taxonomy and big data challenge. Wiley Interdisciplin Rev Data Mining Knowl Discov 6(1):5–21
  14. Chlebus BS, Nguyen SH (1998) On finding optimal discretizations for two attributes. Lect Notes Comput Sci:537–544
  15. Wong AK, Chiu D (1987) Synthesizing statistical knowledge from incomplete mixed-mode data. IEEE Trans Pattern Anal Mach Intell 9(6):796–805
  16. De Sa CR, Soares C, Knobbe A et al (2016) Entropy-based discretization methods for ranking data. Inform Sci 329:921–936
  17. Wu B, Zhang L, Zhao Y et al (2014) Feature selection via Cramer's V-test discretization for remote-sensing image classification. IEEE Trans Geosci Remote Sens 52(5):2593–2606
  18. Chen Q, Huang M, Xu Q et al (2020) Reinforcement learning-based genetic algorithm in optimizing multidimensional data discretization scheme. Math Probl Eng 2020(1):1–13
  19. Nguyen SH, Skowron A (1995) Quantization of real value attributes - rough set and Boolean reasoning approach. In: Proc. Second Joint Ann. Conf. Information Sciences (JCIS), pp 34–37
  20. Kara N, Soualhia M, Belqasmi F et al (2014) Genetic-based algorithms for resource management in virtualized IVR applications. J Cloud Comp 3:15
  21. Nikravesh AY, Ajila SA, Lung C (2018) Using genetic algorithms to find optimal solution in a search space for a cloud predictive cost-driven decision maker. J Cloud Comp 7:20
  22. Chen C, Li Z, Qiao S et al (2003) Study on discretization in rough set based on genetic algorithm. In: International Conference on Machine Learning and Cybernetics, pp 1430–1434
  23. Ren ZH, Hao Y, Wen B et al (2011) A heuristic genetic algorithm for continuous attribute discretization in rough set theory. Adv Mater Res 2011:132–136
  24. Dai J (2004) A genetic algorithm for discretization of decision systems. In: International Conference on Machine Learning and Cybernetics, pp 1319–1323
  25. Ishibuchi H, Yamamoto T, Nakashima T (2001) Fuzzy data mining: effect of fuzzy discretization. In: Proc. IEEE Int'l Conf. Data Mining (ICDM), pp 241–248
  26. Krinidis S, Chatzis V (2010) A robust fuzzy local information c-means clustering algorithm. IEEE Trans Image Process 19(5):1328–1337
  27. Saltos R, Weber R, Maldonado S et al (2017) Dynamic rough-fuzzy support vector clustering. IEEE Trans Fuzzy Syst 25(6):1508–1521
  28. Dougherty J, Kohavi R, Sahami M et al (1995) Supervised and unsupervised discretization of continuous features. In: International Conference on Machine Learning. Elsevier, pp 194–202
  29. Goldberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Professional, USA
  30. Ramirezgallego S, Garcia S, Benitez JM et al (2016) Multivariate discretization based on evolutionary cut points selection for classification. IEEE Trans Cybern 46(3):595–608
  31. Pawlak Z (1992) Rough sets: theoretical aspects of reasoning about data. Kluwer Academic Publishers, Norwell
  32. Zadeh LA (1965) Fuzzy sets. Inf Control 8(3):338–353
  33. Mitra S, Banka H, Pedrycz W (2006) Rough-fuzzy collaborative clustering. IEEE Trans Syst Man Cybern B Cybern 36(4):795–805
  34. Han Y, Shi P, Chen S (2015) Bipolar-valued rough fuzzy set and its applications to the decision information system. IEEE Trans Fuzzy Syst 23(6):2358–2370
  35. Dash S, Luhach AK, Chilamkurti N et al (2019) A neuro-fuzzy approach for user behaviour classification and prediction. J Cloud Comp 8:17
  36. Ismaeel S, Karim R, Miri A (2018) Proactive dynamic virtual-machine consolidation for energy conservation in cloud data centres. J Cloud Comp 7:10
  37. Elrawy M, Awad A, Hamed H (2018) Intrusion detection systems for IoT-based smart environments: a survey. J Cloud Comp 7:21
  38. Jin R, Yuri B, Chibuike M (2009) Data discretization unification. Knowl Inf Syst 19(1):1–29
  39. Huang M, Chen Q, Wang H (2020) A multivariable optical remote sensing image feature discretization method applied to marine vessel targets recognition. Multimed Tools Appl 2020:4597–4618
  40. Wu D, Huang M, Zhang Y, Bhatti UA, Chen Q (2018) Strategy for assessment of disaster risk using typhoon hazards modeling based on chlorophyll-a content of seawater. EURASIP J Wirel Commun Netw 2018(1)
  41. Xiao C, Zhu S, He M et al (2018) N6-methyladenine DNA modification in the human genome. Molecular Cell 71(2):306–318
  42. Yuan D, Xing J, Luan M et al (2020) DNA N6-methyladenine modification in wild and cultivated soybeans reveal different patterns in nucleus and cytoplasm. Front Genet. https://doi.org/10.3389/fgene.2020.00736
  43. Li Y, Huang M, Zhang Y et al (2020) Automated Gleason grading and Gleason pattern region segmentation based on deep learning for pathological images of prostate cancer. IEEE Access 8:117714–117725


Acknowledgments

The authors would like to thank the support of the laboratory, university and government.

Funding

This work was supported by the Hainan Provincial Natural Science Foundation of China (Grant No. 2019CXTD400) and the National Key Research and Development Program of China (Grant No. 2018YFB1404400). (Corresponding author: Mengxing Huang.)

Author information

Affiliations

Authors

Contributions

All authors took part in the discussion of the work described in this paper. All authors read and approved the final manuscript.

Authors’ information

Qiong Chen received his B.Eng. degree from Beijing University of Posts and Telecommunications, P.R. China, in 2007, and his M.Eng. degree from Politecnico di Torino, Italy, in 2012. He is currently a Ph.D. student at the College of Information Science and Technology, Hainan University, P.R. China. His research interests include remote sensing image processing, evolutionary computing, granular computing, fuzzy decision-making, rough sets, big data analytics, and multi-source data fusion.

Mengxing Huang received the Ph.D. degree from Northwestern Polytechnical University, Xi'an, China, in 2007. He then joined the Research Institute of Information Technology, Tsinghua University, as a Postdoctoral Researcher. In 2009, he joined Hainan University. He is currently a Professor and a Ph.D. Supervisor of computer science and technology, and the Dean of the School of Information and Communication Engineering. He is also the Executive Vice-President of the Hainan Province Institute of Smart City and the Leader of the Service Science and Technology Team at Hainan University. He has authored or coauthored more than 60 academic papers as first or corresponding author, reported 12 invention patents, owns 3 software copyrights, and has published 2 monographs and 2 translations. He has been awarded Second Class and Third Class Prizes of the Hainan Provincial Scientific and Technological Progress Award. His current research interests include signal processing for sensor systems, big data, and intelligent information processing.

Corresponding authors

Correspondence to Qiong Chen or Mengxing Huang.

Ethics declarations

Competing interests

The authors declare that they have no competing interests. The sponsors had no role in the design, execution, interpretation, or writing of the study. All authors have seen the manuscript and approved its submission to the journal. We confirm that the content of the manuscript has not been published or submitted for publication elsewhere.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Chen, Q., Huang, M. Rough fuzzy model based feature discretization in intelligent data preprocess. J Cloud Comp 10, 5 (2021). https://doi.org/10.1186/s13677-020-00216-4


Keywords

  • Feature discretization
  • Preprocessing technology
  • Edge-cloud computing
  • Fuzzy c-means
  • Rough fuzzy model