Visibility estimation via deep label distribution learning in cloud environment

Visibility estimation has great research and application value in many production fields. To estimate visibility, we can use cameras to obtain images as evidence. However, the camera only solves the image acquisition problem; analyzing image visibility requires strong computational power. To realize effective and efficient visibility estimation, we employ cloud computing to achieve high-throughput image analysis. Our method combines cloud computing and image-based visibility estimation into a powerful and efficient monitoring framework. To train an accurate model for visibility estimation, it is important to obtain a precise ground truth for every image. However, the ground-truth visibility is difficult to label due to its high ambiguity. To solve this problem, we associate a label distribution with each image. The label distribution contains all the possible visibilities with their probabilities. To learn from such annotation, we employ a CNN-RNN model for visibility-aware feature extraction and a conditional probability neural network for distribution prediction. The estimation result can be further improved by fusing the predictions of multiple images from different views. Our experiments show that labeling images with visibility distributions boosts the learning performance, and that our method can obtain the visibility from images efficiently.


Introduction
Meteorological visibility is a crucial index for reporting daily air quality, and it has an important bearing on environmental protection. Visibility has a wide range of applications such as traffic safety [1], industrial and agricultural production, and smart cities. Traditionally, visibility is estimated by specialized equipment, such as a transmissometer or forward scattering sensor [2]. However, since this equipment is usually expensive and inconvenient, it can be placed at only a few weather stations to detect the visibility of some fixed scenes. This cannot satisfy the requirements of many monitoring applications. To realize ubiquitous and intelligent monitoring as done in [3], we can utilize abundant and inexpensive cameras as an alternative. By analyzing the images taken by these cameras, we can obtain direct evidence for visibility estimation effectively and efficiently.
*Correspondence: songmf@seu.edu.cn. 1 School of Computer Science and Engineering, Southeast University, Nanjing, China. 2 Key Lab of Computer Network and Information Integration (Ministry of Education), Southeast University, Nanjing, China. Full list of author information is available at the end of the article.
Image-based visibility estimation constructs a mapping between the image representation and the visibility value. At present, many image-based visibility analysis methods have been proposed [4][5][6], focusing on creating complex models to extract visibility-aware cues from the image. The recent trend is toward using deep learning [7,8] to obtain the visibility-aware feature. These methods assume that the computing power is sufficient and do not consider the actual deployment problem. Normal cameras cannot perform the image analysis themselves due to their limited computational power. It is impractical to assign an external computer to every camera, since even powerful CPUs cannot handle deep learning tasks efficiently; a graphics processing unit (GPU) is usually required, but its cost is much higher. Another way is to use programmable cameras with embedded computing devices. However, a typical deep model is difficult to deploy on such devices. Due to their limited resources, the model must be compressed to fit the available storage space [9,10], which might reduce performance. Meanwhile, the inference speed is slow, since the computational capacity of embedded devices is much worse than that of external devices with dedicated graphics cards. Moreover, the deployed model is hard to update since it is stored on an offline device. This brings a high cost to the operation and maintenance of the visibility monitoring system whenever the model is upgraded. Therefore, an efficient and effective way to estimate the visibility of an image is needed.
To realize the real-time image-based visibility estimation, an effective way is cloud computing. Cloud computing is a distributed computing paradigm for sharing configurable computing resources. By such a technique, data partitioning and sampling [11] are utilized to improve the processing of big data. At the same time, the images generated from multiple cameras can be analyzed with high parallelism and programmable flexibility through the distributed architecture [12] and fast data exchange [13]. Recently, cloud computing has been used in video surveillance for traffic services [14], which improves the response efficiency of the services. Inspired by these methods, our method integrates cloud computing into the image-based visibility monitoring process. This alleviates the lack of computing power in the visibility estimation applications.
However, it is difficult to deploy an effective visibility estimation model in the cloud environment. Currently, most visibility estimation methods use deep learning to construct the visibility prediction model [7,8]. Since these methods formulate the visibility estimation as a regression problem, we should label the visibilities of all the training images accurately. However, it is difficult to obtain the precise ground truth of image visibility. The human specification of absolute visibility from a single image is unreliable [15]. The specialized equipment cannot generate accurate visibility labels either due to deployment variations and environmental influences. The annotation problem brings many challenges to the construction of a cloud computing system for visibility estimation.
To overcome this problem, one way is to use image pairs labeled by ranking their visibilities as the supervision [16]. Though it is much easier to obtain a relative visibility annotation given a pair of images, absolute visibility cannot be derived directly from the ranking information. Besides, the annotation burden of relative visibility increases significantly since there are many image pairs to be annotated.
Inspired by a novel machine learning paradigm called Label Distribution Learning (LDL) [17], we propose to label an image with a mixture of visibility values of different intensities, described as a distribution. Since such a label distribution contains several possible visibility values, the problem of inaccurate annotation is overcome. To obtain such a label, we transform an absolute visibility label provided by humans or equipment into a visibility distribution. The transformation is based on the following observation: images with close visibilities have a high degree of similarity. Accordingly, we adopt a one-dimensional Gauss distribution for visibility annotation, where the absolute visibility label is the mean of the distribution (as shown in Fig. 1). Thus, the absolute visibility has the highest intensity, while the intensities of the other visibilities decrease with their distance from the absolute label.
Compared with previous labeling types, the label distribution has two advantages for visibility estimation. On the one hand, the label distribution considers both the ambiguity of the visibility label and the convenience of visibility prediction. The label distribution improves the robustness of model learning under uncertain labels, while the absolute visibility can still be obtained directly by searching for the highest intensity in the distribution. On the other hand, the label distribution provides more informative supervision. Since every training image labeled with a particular visibility is also relevant to nearby visibilities, it can influence the learned models of adjacent visibilities. In other words, the training images for a particular visibility are substantially increased without extra image collection and annotation.
In this paper, a novel visibility estimation method using deep LDL in a cloud environment is proposed. The method employs cloud computing to process the visibility image data, which satisfies the real-time requirements of meteorological monitoring applications. To obtain the visibility estimation model for the cloud center, we use deep label distribution learning to train the deep neural network. Given a set of images labeled with absolute visibility, we first transform the absolute label into a label distribution. Then, we integrate LDL, CNN, and RNN into a unified framework for visibility model learning. The LDL module focuses on the label level and employs the label distribution to remove the label ambiguity. The CNN module focuses on the global feature level and is responsible for learning the overall visibility from the whole image. The RNN module focuses on the local region level and searches for the farthest region to provide richer information for visibility estimation.
By integrating the three levels of information including the label, global feature, and local region, we can construct a more effective image representation to estimate the visibility. The learned model can generate the distribution of all the possible visibilities given a test image. We finally use the visibility with the highest probability as the predicted visibility. To improve the robustness of the automatic estimation, the Dempster-Shafer theory is used to fuse the visibility estimation results of multiple images obtained from different views at the same location and time.
The contributions of this study are summarized as follows: 1) We propose to adopt cloud computing to deal with image-based visibility estimation, which improves real-time performance. 2) We propose to use the label distribution as the supervision for visibility estimation, which can not only overcome the inaccurate annotation problem, but also boost the learning performance without the increase of training examples. 3) We utilize the Dempster-Shafer theory to fuse multiple predictions of the images from different views, which promotes the stability and robustness of the algorithm.

Related work
This work relates to several research areas. In this section, we briefly review the existing work on image-based visibility estimation, cloud computing, and information fusion.

Image-based visibility estimation
Early image-based visibility estimation methods use hand-crafted features, such as contrast, image inflection points, or the dark channel. For example, Busch et al. [18] employ the wavelet transform to extract the contrast information of image edges for visibility estimation. Jourlin et al. [19] select critical points to construct a scene depth map with a stereo vision method. Bronte et al. [20] measure the visibility in poor weather conditions by calculating the vanishing point of the horizon as the farthest visible pixel. Graves et al. [21] train a log-linear model by combining local contrast features and the dark channel prior. Xiang et al. [22] integrate the average Sobel gradient operator and dark channel theory to detect the daytime visibility level. These methods focus on low-level image processing techniques and cannot produce practical and stable estimation results across different scenes.
To improve the performance of visibility estimation, some researchers create physical probabilistic models based on meteorological laws. Babari et al. [4,5] employ a non-linear regression model based on Koschmieder's theory, which describes the relationship between the contrast distribution and visibility. The measurement of the image contrast can be further improved by the extinction coefficient [6]. However, the performance of these physical models is affected by weather conditions, illumination, and scene variations, and it is extremely difficult to hardcode the enormous variability of these complex factors in general.
To improve the adaptiveness, a more sophisticated way is to extract the visibility prediction model from data by deep learning. Li et al. [7] employ a generalized regression neural network (GRNN) and a pre-trained CNN model for visibility estimation. Giyenko et al. [23] detect the visibility range of weather images by building a shallow CNN with fewer layers. Palvanov et al. [8] use three streams of deep integrated convolutional neural networks to extract the visibility-aware feature from spectrally filtered images, FFT-filtered images, and RGB images. Wang et al. [24] use a multimodal CNN architecture to learn visibility-aware features by combining two sensor modalities. Li et al. [25] propose a transfer learning method based on feature fusion for visibility estimation. Lo et al. [26] further introduce PSO feature selection into the transfer learning method to improve the performance. These methods all treat the visibility estimation problem as a regression problem, which suffers from the label ambiguity challenge. To solve it, You et al. [16] propose a relative CNN-RNN model to learn relative atmospheric visibility from images. They ask annotators to rank two given images by visibility using a crowdsourcing technique [27]. Though the label ambiguity is eliminated by such supervision, the annotation cost increases significantly. Our method is inspired by these approaches, but we propose to use the label distribution as the supervision. This provides a cheap and effective way to resolve the label ambiguity issue in visibility estimation.
Moreover, meteorological visibility is also temporally correlated, and can be predicted by time-aware dynamic analysis techniques [28,29]. For example, Wu et al. [30] utilize environmental state information to predict the visibility at an airport. In contrast, our method does not rely on temporal information and estimates the visibility from real-time images.

Cloud computing
Cloud computing is a computing paradigm that can process billions of data items in seconds and provide strong network services [1,14]. It can effectively reduce offloading and maximize system utility by balancing resource allocation [31,32]. With the rapid development of cloud computing, more user records are generated and stored in data centers for other applications [33]. Cloud computing has been widely used in applications with high load and high concurrency, such as smart cities [34], intelligent manufacturing [35], and the Internet of Things [36]. Based on its advantages for large-scale data processing, we integrate cloud computing and image processing into a unified visibility monitoring framework with a huge network of cameras, which improves both the real-time performance and the accuracy of visibility estimation.

Information fusion
Since a single-sensor system is unstable, it is difficult to meet the demands of a complex environment. To improve the robustness of the system, information fusion is used to combine the prediction results from multiple sensor devices [37]. Such a technique makes multiple sensors cooperate with each other and produces more accurate results by synthesizing complementary information [38,39]. There are many information fusion techniques, such as Dempster-Shafer theory [40], fuzzy logic [41], and rough set theory [42]. Among these techniques, we choose Dempster-Shafer (D-S) theory for its high fusion accuracy and flexibility. Through the belief function and the plausibility function, it can calculate the uncertainty interval to describe the trust distribution of different pieces of information. Currently, D-S theory has been widely used in event probability prediction [43], image target recognition [44], and case decision analysis [45]. Our method uses D-S theory to combine the predictions of multiple images from different views. When the visibility information derived from different images is inconsistent, D-S theory can extract the homogeneous information through common credibility. This ensures that the accuracy of the fused result is better than that of any original single result. To the best of our knowledge, this is the first time D-S theory has been introduced into image-based visibility estimation.

Overview
We propose a visibility estimation method based on deep LDL in a cloud environment. The overall idea is to combine cloud computing and image processing to estimate visibility efficiently. With cloud computing, we can make the front-end monitoring device thinner: the camera is only responsible for capturing and compressing the images. A high-performance network then transmits the image data from the camera to the cloud platform, so the complex image analysis task can be performed quickly and remotely in the cloud on a high-performance computing cluster. The results are fed back to the monitoring center. This alleviates the lack of front-end computing power in the visibility estimation application. Our method saves the required resources by utilizing the high-performance parallel computing of graphics processing units (GPUs) in cloud computing services. With a GPU cluster, we can adopt larger and deeper models to improve the accuracy of visibility estimation without sacrificing efficiency. Meanwhile, another advantage of cloud computing is the convenience of system maintenance: compared with front-end computing on a local PC or programmable camera, the deep model is much easier to upgrade, which helps ensure the stability and consistency of the system.
Figure 2 illustrates the whole pipeline of our method. The camera nodes collect the images and transmit them to the cloud data center. These cameras can adjust their angles continuously to obtain different images. Since these images are generated at the same location and time, they have nearly identical visibility. The images are analyzed by the pre-trained deep model stored in the data center. The visibility at the location of the camera node is finally obtained by fusing the predictions of multiple images from different views.
To train the deep model, we label the image visibility with a distribution vector, and integrate deep learning and LDL for visibility estimation. The training input is a training image set S, where every image x ∈ S is labeled with an absolute visibility y. We also annotate the farthest region of every image with the coordinates of a bounding box b. The training output is a visibility estimation model, which includes a deep CNN-RNN module and an LDL module. Overall, the training session contains three parts: label transformation, feature learning, and distribution learning. In the following, we first describe each part of the training stage in detail, and then introduce the prediction fusion method used when estimating the visibility.

Label transformation
For an image x, the label distribution is a vector D = {d_x^{y_1}, ..., d_x^{y_c}}, where each element d_x^{y_i} ∈ [0, 1] is called the description degree of the image x, meaning the probability of its visibility being y_i. Due to the complexity of the label distribution, fully manual annotation is impractical. Thus, we prefer an automatic way to generate the distribution from an absolute visibility label y.
For visibility estimation, the label space is intrinsically continuous. To ease the learning, we quantize the continuous label space into a discrete space Y with an equal step size Δy. In our method, we set the label space Y = [1 km : Δy : 12 km] (MATLAB notation) with step size Δy = 0.1 km. Thus, the label distribution D of every image x is a 111-dimensional vector, which satisfies Σ_i d_x^{y_i} = 1.
To generate the label distribution D, we first determine the distribution type according to the characteristics of the visibility estimation problem. Given an image x labeled with an absolute visibility y, it is reasonable to make the corresponding description degree d_x^y ∈ D the highest in the final distribution D. Meanwhile, since neighboring visibilities look similar in appearance, we can increase the description degrees adjacent to the label y. Naturally, the description degree d_x^{y_i} should be gradually reduced as the visibility y_i moves away from the label y. Based on this observation, we choose the probability density function of the one-dimensional Gauss distribution for the label transformation.
The one-dimensional Gauss distribution is a continuous probability distribution for a real-valued random variable. Its general form is:

p(y) = 1 / (√(2π) σ) · exp(−(y − ȳ)² / (2σ²))    (1)

where the parameter ȳ is the mean of the distribution and σ is its standard deviation.
According to Eq. 1, we transform the absolute label y into a probability vector. Since we expect the description degree d_x^y ∈ D to be the highest, we set the parameter ȳ to the absolute label y. The parameter σ is a hyper-parameter, which is optimized by a simple grid search described in the experiment section. After determining these two parameters, we compute the probability density at each y_i as its description degree, i.e., d_x^{y_i} = p(y_i). Such a label distribution D is a discrete distribution obtained by densely sampling the one-dimensional Gauss distribution. The discretization makes the sum of the vector D unequal to 1, yielding an invalid probability distribution. To force Σ_i d_x^{y_i} = 1, the vector is numerically normalized to obtain the expected label distribution D. The distribution D is used in the following learning process, which makes the distribution predicted by the deep network consistent with the ground-truth distribution D.
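As a concrete illustration, the label transformation above can be sketched in a few lines of NumPy (a minimal sketch; the function name is our own, and the default σ² = 1.5 follows the value selected later in the experiment section):

```python
import numpy as np

def label_to_distribution(y, sigma=1.5 ** 0.5, y_min=1.0, y_max=12.0, step=0.1):
    """Transform an absolute visibility label y (in km) into a discrete
    label distribution over the quantized label space Y = [1 km : 0.1 : 12 km]."""
    labels = np.arange(y_min, y_max + step / 2, step)   # 111 visibility levels
    # Sample the one-dimensional Gauss density (Eq. 1) at every quantized label.
    density = np.exp(-(labels - y) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
    # Dense sampling breaks normalization, so renormalize the vector to sum to 1.
    return labels, density / density.sum()

labels, dist = label_to_distribution(5.0)   # absolute label: 5 km
```

The returned vector peaks at the absolute label and decays symmetrically, matching the shape shown in Fig. 1.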

Feature learning
To simulate the procedure of humans judging the visibility, we follow the relative CNN-RNN method [16]. Since that method uses ranked image pairs to train the deep network, its architecture contains two similar CNN-RNN branches. Instead, our method uses only one CNN-RNN model to extract the visibility-aware feature, which is more efficient.
Specifically, the CNN-RNN model imitates the coarse-to-fine way humans detect the farthest target in an image. The CNN module learns the overall visibility from the global image, while the RNN module simulates the region search to realize the coarse-to-fine attention shift, as shown in Fig. 3. There are other advanced methods for region searching, such as temporal CNN [46], LSTM [47], and GRU [48]. The temporal CNN needs to maintain the sequence in memory, while LSTM and GRU contain more parameters than RNN. Thus, we choose RNN for its fewer parameters and higher efficiency; in the experiment, we find that RNN works well in practice. By combining the CNN and RNN models, the final global feature becomes more sensitive to the farthest region in the image, which is an important cue for visibility estimation. Figure 4 illustrates the architecture of our method. For the CNN module, we follow the design of AlexNet [49] for global feature extraction. Our CNN module contains 7 layers: 5 convolution layers and 2 fully connected layers.
For the RNN module, we construct K states for each of the first six CNN layers, and 1 state for the last CNN layer. In total, there are 6K + 1 states arranged in a sequential manner in the RNN module. Every state predicts a bounding box r_t, and the list of bounding boxes shows the search process for the farthest region, namely, from the whole image to the farthest region.
At each recursive step, the RNN module first crops the image x into a sub-region c_t^x based on the predicted bounding box r_{t−1}, the localization result of the previous state. Then, the internal state h_t is updated by the core network g_h according to the sub-region c_t^x and the historical state h_{t−1}. The generated state h_t encodes the knowledge of searching for the farthest region. Finally, we use the internal state h_t to predict the next bounding box r_t through the location network, a two-layered network, as r_t = g_r(h_t). To exchange information between CNN and RNN, shortcut connections are added between the (7 − i)-th (i = 0, ..., 6) layer of the CNN and the (Ki + 1)-th state of the RNN.
To train the RNN module, we need to indicate the ground-truth bounding box of every state. To this end, we assume that the list of bounding boxes is evenly distributed. Accordingly, given the annotated bounding box b of the farthest region, we generate the whole ground-truth list of bounding boxes B = {b_1, b_2, ..., b_{6K+1}} by average sampling. During the training phase, we expect to minimize the divergence between the predicted bounding boxes {r_1, r_2, ..., r_{6K+1}} and the ground truth. Accordingly, we define the objective function of the RNN as a location L2-norm loss:

L_loc = Σ_{t=1}^{6K+1} ||r_t − b_t||²

To integrate the information of CNN and RNN, the output of the CNN module and the last state h_{6K+1} of the RNN module are concatenated into a global vector representation f for visibility estimation.
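The average sampling of ground-truth boxes and the location loss above can be sketched as follows (a minimal sketch; the paper does not specify the sampling scheme, so the linear interpolation from the whole-image box to the annotated farthest-region box b is our assumption, and all names are hypothetical):

```python
import numpy as np

def average_sampling(whole_image_box, farthest_box, n_states):
    """Generate the ground-truth box list B = {b_1, ..., b_{6K+1}} by evenly
    interpolating from the whole-image box to the annotated farthest region b."""
    start = np.asarray(whole_image_box, dtype=float)
    end = np.asarray(farthest_box, dtype=float)
    alphas = np.linspace(0.0, 1.0, n_states)[:, None]   # one weight per RNN state
    return (1.0 - alphas) * start + alphas * end

def location_loss(pred_boxes, gt_boxes):
    """Location L2-norm loss between predicted and ground-truth box lists."""
    pred = np.asarray(pred_boxes, dtype=float)
    gt = np.asarray(gt_boxes, dtype=float)
    return float(np.sum((pred - gt) ** 2))

# K = 1 gives 6K + 1 = 7 states; boxes are (x1, y1, x2, y2).
gt_list = average_sampling([0, 0, 100, 100], [40, 40, 60, 60], n_states=7)
```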

Distribution learning
To utilize the label distribution D, we integrate it into the network architecture. Naturally, we can use several fully connected layers with a softmax layer to turn the feature f into a distribution directly. However, this yields lots of weights between the feature layer and the output layer, which makes it difficult to obtain the optimal solution. Thus, we follow the design of the conditional probability neural network (CPNN) [50].
As shown in Fig. 4, CPNN contains three fully connected layers. The input of CPNN consists of the feature f and a discrete visibility label y, and its output is a single value p(y|f), the conditional probability. To realize the training, the Kullback-Leibler divergence is employed to measure the difference between the estimated distribution and the ground-truth distribution D:

L_dis = Σ_i d_x^{y_i} ln( d_x^{y_i} / p(y_i|f) )

Finally, we define the entire objective function as the sum of the location loss and the distribution loss:

L = L_loc + L_dis

Given the entire objective function, we simultaneously optimize the CNN-RNN module and the CPNN module through back-propagation with stochastic gradient descent [51]. To accelerate the training, we use a CNN pre-trained on ImageNet. According to the learned model, the predicted visibility is the one with the maximum description degree.

Prediction fusion
Due to diverse image quality and inherent feature noise, the system cannot reliably estimate the visibility from a single image. To this end, we employ Dempster-Shafer theory to fuse multiple predictions derived from different images. There are two steps in Dempster-Shafer theory. First, a set of related propositions is created with their subjective probabilities as degrees of belief. For visibility estimation, we can create a proposition for every image captured from a different view. The proposition takes the form: the visibility of the image is y. We take the output of CPNN as its degree of belief.
Second, the degrees of belief are combined based on the mass and belief functions, as shown in Fig. 5. The D-S rule of combination corresponds to the normalized conjunction of mass functions. Based on Shafer's description, given two independent and distinct pieces of evidence on the same frame of discernment, their combination is the conjunctive consensus between them. When using the D-S rule for multi-view visibility estimation, we can obtain the belief function of every proposition derived from every image. The belief function considers both the relevance and the conflict between different views, which makes the fusion result more reliable.
The core of D-S theory is the construction of the proposition set. To realize it for multi-view prediction fusion, a direct way is to create c mutually exclusive propositions for every single visibility, where c is the size of the label space Y. However, we found such propositions cannot always improve the performance, since the quality of prediction fusion is affected by the possible strong conflict between different views [52].
To solve this, we design several fuzzy propositions to reduce the evidence conflict. A fuzzy proposition corresponds to a subset of the visibility label space instead of a single visibility. To avoid combinatorial explosion, we utilize the continuity of the label space to remove subsets with low confidence. First, the subset should be a continuous sequence of visibility labels. Second, the length of the sequence should be at most 3; this limit is selected empirically, since a longer sequence makes the fuzzy proposition carry very little information. Accordingly, our proposition set A contains two types of propositions. The first type is unambiguous, denoted A_j; the proposition A_j means that the visibility of the multi-view images is y_j ∈ Y. The second type is fuzzy, i.e., the visibility of the image is one element of a continuous sequence. A fuzzy proposition is denoted A_jk (j < k ≤ j + 2), meaning that the corresponding sequence is {y_j, ..., y_k}. Obviously, A_i ⊂ A_jk if j ≤ i ≤ k. To simplify the notation, we also denote A_jk as A_j in the following section.
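The proposition set described above can be enumerated with a short sketch (a minimal sketch; we encode each proposition as an index interval (j, k), with k = j for the unambiguous case, and the function name is our own):

```python
def build_propositions(num_labels):
    """Enumerate the proposition set A over a label space of size c:
    every single-visibility proposition A_j plus every fuzzy proposition
    A_jk covering the continuous sequence {y_j, ..., y_k}, j < k <= j + 2."""
    props = []
    for j in range(num_labels):
        props.append((j, j))             # unambiguous proposition A_j
        for k in (j + 1, j + 2):         # fuzzy propositions A_jk
            if k < num_labels:
                props.append((j, k))
    return props

props = build_propositions(5)
```

For a label space of size c this yields 3c − 3 propositions, far fewer than the 2^c subsets of the full power set.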
To support the fusion of such a proposition set, we first modify the input label of our CPNN to obtain the confidences of all the propositions. The original network can only take a single visibility y_j as input. We expand the input label set by adding the continuous sequences that appear in the fuzzy propositions. To realize the training of the CPNN, we measure the ground-truth description degree of a fuzzy proposition A_jk by the sum of all the related probabilities:

d_x^{y_jk} = Σ_{i=j}^{k} d_x^{y_i}

The deep network is re-trained with the extended input label set. Based on the deep network, we can obtain the normalized confidences of all the propositions given an image.
Then, we use Dempster's rule to combine the pieces of evidence derived from multiple images. Given the multi-view image set V = {View_1, ..., View_n}, our deep network generates the probabilities of all the propositions for each image View_i. We denote by m_i(A_j) the confidence of the image View_i supporting the proposition A_j, and use it to define the mass function, i.e., the basic probability assignment. Based on the mass functions, the combined mass is computed by the following rule:

m(A) = (1/K) Σ_{A_1 ∩ ... ∩ A_n = A} Π_{i=1}^{n} m_i(A_i)

where K is the normalization constant:

K = Σ_{A_1 ∩ ... ∩ A_n ≠ ∅} Π_{i=1}^{n} m_i(A_i)

The belief function Bel(A_j) = Σ_{B ⊆ A_j} m(B) then describes the confidence of all the images supporting the visibility estimation proposition. Finally, we choose the proposition with the highest belief as the result of prediction fusion. If this proposition A_jk is fuzzy, we then choose its single-visibility subset A_i (j ≤ i ≤ k) with the highest plausibility as the fusion result.
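The pairwise form of Dempster's rule over this interval-encoded proposition set can be sketched as follows (a minimal sketch of the standard rule, not the authors' implementation; combining n views amounts to folding this function over the view list, and all names are our own):

```python
from itertools import product

def ds_combine(m1, m2):
    """Dempster's rule of combination for two mass functions defined over
    interval propositions (j, k) on the quantized visibility label space."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        lo, hi = max(a[0], b[0]), min(a[1], b[1])
        if lo <= hi:                               # non-empty intersection
            combined[(lo, hi)] = combined.get((lo, hi), 0.0) + ma * mb
        else:
            conflict += ma * mb                    # mass assigned to the empty set
    k_norm = 1.0 - conflict                        # normalization constant K
    return {prop: mass / k_norm for prop, mass in combined.items()}

def belief(m, prop):
    """Belief of a proposition: the total mass of all of its subsets."""
    return sum(v for a, v in m.items() if prop[0] <= a[0] and a[1] <= prop[1])

# Two views: the first strongly supports y_0, the second is split.
fused = ds_combine({(0, 0): 0.6, (0, 2): 0.4}, {(0, 0): 0.5, (1, 1): 0.5})
```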

Implementation detail
For the CNN part, it takes a long time to converge if random initialization is used. Accordingly, we use the ImageNet dataset to pre-train the model, owing to the similarity of basic characteristics between ImageNet and our dataset; the pre-training greatly accelerates convergence. For the RNN part, we pre-train the model without connection to the CNN layers. TensorFlow is used to implement the models and comparative experiments. The whole model is optimized by the Adam optimizer. During training, the batch size is 64, the learning rate is 0.00001, and the number of epochs is 50. All experiments are conducted on an NVIDIA GeForce RTX 2080 Ti.

Experimental setup
The method is evaluated on 3 image sets: FROSI (Foggy Road Sign Images) [53], FRIDA (Foggy Road Image Dataset) [54], and RMID (Real Multiview Images Dataset). FROSI and FRIDA are two synthetic datasets. FROSI contains images of simple road scenes and traffic signs, while FRIDA contains images of urban road scenes under different weather conditions. The original images are blurred by adding synthetic fog effects, and every original image generates four synthetic images. 70% of the images of the two datasets are selected as the training set and the remaining images are used for testing. RMID is a real multi-view image dataset collected by ourselves, which includes 3000 images. These images are grouped according to their capturing time and place; every group of images differs only in camera parameters. The visibility is labeled based on the reports of the weather stations, ranging from 1 km to 12 km with a precision of 0.1 km. Thus, there are 111 visibility levels. Due to the rarity of heavy fog, the distribution of visibility is imbalanced. Figure 6 shows several images in RMID. When using RMID to train our deep model, we randomly select 100 images for every visibility level to create a uniformly distributed training set, and the other images form the test set.
We further divide the three image sets by splitting the visibility range equally into four grades: blurry, sub-blurry, sub-clear, and clear. As shown in Table 1, the distributions of the three image sets are very different.
To compare the performance of different methods, we use the mean absolute error (MAE) as the evaluation index, which is defined as

MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|,

where y_i and \hat{y}_i are the ground truth and the predicted visibility of the i-th testing image respectively, and n is the number of testing images.
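The MAE metric above translates directly into code; a minimal sketch:

```python
def mae(y_true, y_pred):
    """Mean absolute error between ground-truth and predicted visibilities."""
    assert len(y_true) == len(y_pred)
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```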

Distribution analysis
We evaluate our choice of the Gauss distribution and the standard-deviation parameter σ. To study the influence of the distribution type, we compare the performance of the Gauss distribution, the average distribution, and the triangle distribution. Figure 7 shows the shapes of the other two distributions. To search for the optimal standard deviation σ, we run our method with 5 different values. Figure 8 shows the shapes of the Gauss distribution with different σ. To simplify the comparison, we remove the prediction fusion and measure the accuracy of visibility estimation for a single image.
As shown in Table 2, different distributions have different impacts on the prediction results. From the table, we can see that the optimal distribution is the Gauss distribution. Moreover, the value of the standard deviation has a significant effect on the final performance. As shown in Fig. 8, if the value is too large, the Gauss distribution becomes very similar to the average distribution. If the value is too small, the Gauss distribution is over-concentrated and degenerates into the absolute label. Accordingly, we set σ² = 1.5 in the following experiments since it leads to the best performance.
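A discrete Gauss label distribution of the kind described above can be generated as follows, normalised over the 111 visibility levels; the function name is our own, and σ² = 1.5 follows the choice above.

```python
import math

def gauss_label_distribution(y, levels, sigma2=1.5):
    """Discrete Gauss distribution centred on the ground-truth visibility y,
    normalised so the probabilities over all visibility levels sum to 1."""
    w = [math.exp(-(l - y) ** 2 / (2 * sigma2)) for l in levels]
    s = sum(w)
    return [v / s for v in w]
```

The peak of the resulting distribution sits at the annotated visibility, and neighbouring levels receive probability mass that shrinks as σ² decreases, which is exactly the degeneration toward the absolute label discussed above.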

CNN-RNN analysis
We then explore the influence of the hyper-parameter K in the RNN module. This parameter controls the information exchange between the CNN and the RNN. To select the optimal value, we run our method on the FROSI set with three different values. The experimental setting excludes the prediction fusion stage, the same as in the distribution analysis experiment. The results are shown in Table 3, where the best performance indicator is marked in bold. From the table, we can see that the effect of this parameter is relatively stable. In the following experiments, we set the parameter K to 3 since it gives the best performance.
Comparison with state-of-the-art methods
Figure 9 shows the test results of some images in the dataset, including the real value and the predicted value. Our results are very close to the ground truth. We also measure the timing of each step for processing the images in the cloud environment. During the training session, it takes 257 ms on average to update the parameters of the deep model for one image. Training on 2000 images thus takes about 3 hours. During the inference session, it costs 62 ms on average to estimate the visibility of a single image, and 55 ms for prediction fusion. We also measure the network transfer time with a bandwidth of 1000 Mbps. The size of every image is about 900 KB on average, so the transfer delay is about 7 ms per image. As our prediction fusion takes 4 images as input, the whole inference time with cloud computing is 62 ms + 55 ms + 4 × 7 ms = 145 ms (the 4 images are estimated in parallel through the deep network). This satisfies the requirement of real-time monitoring applications. We also deployed the model on a Huawei Atlas 200. In this setting, it takes about 214 ms on average to estimate the visibility of one image, and 275 ms for prediction fusion. Thus, the whole inference time is 214 ms + 275 ms = 489 ms when using the programmable camera, which is 3.4 times that of cloud computing. This shows that cloud computing is a superior choice in the application scenarios of visibility estimation. Table 4 reports the comparison of our method and previous state-of-the-art methods on the three datasets in terms of prediction accuracy. According to the results in Table 4, our method achieves the best results. By analyzing the estimation results, we can see that the label distribution is more effective than the absolute label or the ranking label when the network architecture is the same.
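The latency accounting above can be reproduced with a small helper; the timing constants are those measured above, and the assumption that per-view transfers are serial while the per-view estimations run in parallel follows the text (the function name is ours).

```python
def cloud_inference_latency(per_image_ms=62, fusion_ms=55,
                            transfer_ms=7, n_views=4):
    """End-to-end inference latency for multi-view visibility estimation.

    The n_views transfers are counted sequentially, but only one
    per-image estimation term appears because the views are processed
    in parallel through the deep network."""
    return per_image_ms + fusion_ms + n_views * transfer_ms
```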

Ablation study
To prove the effectiveness of the different designs, we compare our full method against alternatives in which one of our choices is disabled. We run the following variants:
No CNN - We remove the CNN module from the network.
No RNN - We remove the RNN module from the network.
No LDL - We use the absolute label instead of the distribution and turn the task into a regression problem.
No Fusion - We use only one image for visibility estimation instead of fusing the predictions from different views.
We run our method and the variants on the three image sets. As shown in Table 4, our method achieves a significant performance boost compared with the other variants. Among all the choices, the CNN module plays the most important role. The LDL, fusion, and RNN modules also boost performance significantly. This demonstrates our contributions to visibility estimation.
We also compare different prediction fusion methods: average fusion, voting fusion, and max fusion. The average fusion method takes the average score of the multi-view images as the output. The max fusion method chooses the visibility with the maximum score across the multi-view images as the result. The voting fusion method uses a voting strategy to combine the predictions: every image votes for the visibility value with the highest probability, and the visibility value with the most votes is the output of the fusion. As shown in Table 4, our prediction fusion method achieves the best performance.
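One plausible reading of the three baseline fusion strategies, sketched over per-view score vectors (one probability per visibility level per view); the exact tie-breaking rules and the function names are our assumptions, not taken from the paper.

```python
from collections import Counter

def average_fusion(dists, levels):
    """Average the per-view score vectors, then return the arg-max level."""
    n = len(dists)
    avg = [sum(d[i] for d in dists) / n for i in range(len(levels))]
    return levels[max(range(len(avg)), key=avg.__getitem__)]

def max_fusion(dists, levels):
    """Return the level holding the single highest score across all views."""
    _, best_level = max((p, l) for d in dists for l, p in zip(levels, d))
    return best_level

def voting_fusion(dists, levels):
    """Each view votes for its own arg-max level; the majority level wins."""
    votes = Counter(levels[max(range(len(d)), key=d.__getitem__)]
                    for d in dists)
    return votes.most_common(1)[0][0]
```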

Conclusion
We observe that image-based visibility estimation cannot successfully learn precise models when the labels are ambiguous. To solve this problem, we propose a deep label distribution learning method for visibility estimation. In our method, the visibility of every image is annotated by a label distribution. To learn from such annotation, we integrate CNN, RNN, and CPNN into a unified method, which locates the farthest region in the image while minimizing the difference between the predicted distribution and the ground-truth distribution. To support practical deployment for real-time visibility monitoring, we combine cloud computing and our visibility estimation method into a single framework. The experiments show that, compared with the absolute label or the ranking label, the label distribution achieves the best performance for visibility estimation. Meanwhile, our method can obtain visibility from the image efficiently.

Limitation and future work
The robustness and effectiveness of our method have been demonstrated by extensive experiments. However, it has some limitations in special cases. Figure 10 shows two such cases. Our method relies on the information of the farthest region; if this region is not salient, the visibility may not be well predicted. Moreover, backlighting and strong local light still disturb the prediction. Our method can be improved in several directions in the future. Currently, model learning and prediction fusion are separate. One future direction is to incorporate deep multi-view learning [58] into the training session, which would lead to an end-to-end multi-view visibility prediction framework. Another direction is integrating edge computing [1,14] to construct a more efficient and robust visibility monitoring framework.