Big data applications require extensive set of resources like storage, processing power, and communication channels due to their inherent characteristics. To handle this gigantic pile of data, it is common that techniques, frameworks, environments, and methodologies are continuously reviewed, analyzed and developed. This section explores the work done for big data analytics using Cloud computing and the use of Docker as well as Docker Swarm for the purpose of managing and orchestrating clusters for load balancing.
A Microservice based architecture for big data knowledge discovery which aims to acknowledge scalability and efficiency issues in processing is proposed by Singh et al. [16]. Naik et al. [17], have demonstrated the in workings of a model based on big data processing centered on Docker containers in multiple clouds by automatic assignment of big data clusters using Hadoop and Pachyderm. In the development phase, the environment used ensures the accurate working of code but it may fail during the testing or production phase due to environmental changes and/or differences. Containerization comes into play to handle this issue. Hardikar et al. [18] explored several facets of Containerization like automation, deployment, scaling, and load balancing with a focus on Docker as the runtime environment and Kubernetes is deployed for orchestration. The focus is mainly on containerization, but handling the big data microservice is not focused on directly in the study. In microservices, based on neighbourhood divison, a container scheduling approach called CSBND was proposed in [19] to optimize the system performance using response time and load balancing. The research did not handle big data and microservice based applications to be deployed on containers.
Big data analytics
Big Data Analytics deals with discovering knowledge from large datasets popularly known as big data for strategic planning, decision making, and prediction purposes [20]. To analyze these colossal datasets, dynamic environments are required which need to be scalable enough to manage varying workloads as conventional methods often fail to process these large sizes of data. Big Data Analytics is an assortment of tools, technologies, methodologies combined in a system/platform or framework to perform knowledge discovery through processes like data gathering, cleaning, modelling, and visualization [21]. Techniques like machine learning and deep neural networks are utilized to perform the analysis process.
The authors in [20, 22] provide an insight into various machine learning and deep learning algorithms which prove to be beneficial in Big Data Analytics. These processes require sophisticated architecture for storage, processing, and visualization. Cloud computing is considered to be an effective solution for it. The authors illustrated the affinity of Big Data with cloud with respect to its characteristics [23]. A web server load balancing mechanism focused on memory exploitation using Docker Swarm was proposed by Bella et al. [24]. This work focused on web server load balancing, however the service discovery part is not considered in the paper. Big data applications requires extensive use of resources and resource utilization for big data is also not discussed.
Containerization using Docker
To increase the efficiency of methods and optimize development as well as the deployment cost of applications over the cloud, there have been numerous architectures, frameworks, environments, and paradigms examined in the literature extensively. Docker, which is an open source containerization tool, is fast emerging as an alternative for application deployment over any cloud based architecture. Container centric virtualization is a substitute for virtualization done using hypervisor where containers share all resources like hardware, operating system and supporting libraries while maintaining abstraction and isolation [25]. Docker is a well-known lightweight tool providing prompt development and relocation with improved efficiency and flexibility in resource provision [26].
A distinct host can be used to create numerous containers in multiple user spaces, which is unlike VMs [27]. Container-based applications fabricated using Microservice architecture require traffic management and load balancing at high workloads. This issue is handled through container load balancing. A load balancer for a container results in higher availability and scalability of applications for client requests. This ensures seamless performance of Microservice applications running in containers. Tools like Docker Swarm as well as Kuberbnetes provide support to manage and deploy containers. Figure 1 gives an illustration of a distributing application client load to containerized microservices using a load balancer.
Docker Swarm
Management of containers is an important and crucial aspect of containerization. Load Balancing is required to handle requests dynamically. To manage Docker clusters, Docker Swarm, a cluster administration and orchestration tool is used that links and controls all Docker nodes [28]. Docker Swarm offers features like reliability, security, availability, scalability, and maintainability. It helps in the balanced distribution of any load and checks host machines for failed containers. If any failed containers are found, Docker Swarm redeploys it [23]. It is an enhancement of Docker.
Docker Swarm is made up of two types of nodes, manager and worker nodes. All membership and allocation processes are handled by the manager node while worker nodes execute swarm based services in Docker Swarm. The Manager node uses its own IP address and port to expose swarm services to all clients. Requests from clients are channelled to a chosen worker node by the swarm manager's internal load balancing mechanism so that requests are evenly distributed [29]. Although the Docker Swarm load balancing process distributes the load, the ability to monitor resource utilization according to available limits is not provided. This can lead to uneven load distribution making any Big Data Microservice prone to collapse. In this study, we will distribute Microservice based loads in Docker Swarm by checking resource consumption of host machines creating an even load distribution mandated by available limits.
Microservice architecture
Monolithic architectures are the most common conventional architectures used to deploy applications. These architectures work on more or less three basic layers i.e. presentation, business, and data logic in order to handle simple to complex tasks. The architecture is simple and easy to use since everything is under one autonomous deployment unit. However, the architecture may limit the application to scale and make updates a difficult task when a complex task needs to be managed. Microservice architectures aim to minimize the issues that exist in monolithic architectures by dividing the entire application into lightweight and loosely coupled components [30, 31]. Every component has its individual code repository and can be updated independently, making any complex application far more scalable, resilient and efficient. Service Discovery and Load Balancing are two critical as well as fundamental aspects of Microservices. Service Discovery can be defined as a registry of running instances for one or many services. It is required by Microservices to collaborate. Systems’ scalability, throughput, execution time, response time, and performance is largely influenced by load balancing [32].
Container-based virtualization and Microservices make a perfect association as containers provide a decentralized environment and are lightweight in nature. Today, Docker is used to build modules called Microservices [33], to decentralize packages and distribute jobs into distinct, stand alone applications that collaborate with each other. Microservices can be considered as small applications that must be deployed in their individual VM instances to have discrete environments. But to dedicate an entire VM [34] instance to just a part of an application is not an efficient approach. Docker containers require less computing resources when compared to virtual machines, therefore deploying hundreds and thousands of Microservices on Docker containers will reduce performance overhead and will increase the overall efficiency of the applications [35]. In this study, we will distribute the load of Big Data applications inside a Docker Swarm by utilizing resources of host machines. The main objective is to balance the load by checking memory consumption of all host machines based on known memory limits. This research aims at service discovery and server-side load balancing for Big Data applications based on Microservices using Docker Swarm.