Table 3 Literature summary (AI-Focused Scheduling)

From: A survey of Kubernetes scheduling algorithms

Each entry lists: Objectives; Methodology/Algorithms; Experiments; Findings; Applications; Limitations ("–" where the survey gives none).

[46]
Objectives: Develop an effective deep learning cluster resource scheduler.
Methodology/Algorithms: Predicts model convergence during training via online fitting and builds performance models that estimate training speed as a function of the resources allotted to each job. Deep learning tasks are placed, and resources are dynamically allocated, to minimize job completion time (a minimal sketch follows this entry).
Experiments: Deployed on a deep learning cluster running 9 MXNet training jobs on 7 CPU servers and 6 GPU machines.
Findings: Optimus outperforms comparable cluster schedulers by about 139% in job completion time and 63% in makespan.
Applications: Production clusters with deep learning workloads.
Limitations: Further testing and deployment may reveal limitations and future improvements.
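
A minimal sketch of the marginal-gain idea described above, assuming an illustrative speed model f(w) = w/(a + b*w) and made-up job parameters; the paper's online fitting and placement logic are richer than this:

```python
# Hypothetical sketch of Optimus-style marginal-gain resource allocation.
# The speed model and job parameters below are illustrative assumptions,
# not the authors' exact formulation.

def training_speed(workers: int, a: float, b: float) -> float:
    """Fitted performance model: training steps/sec as a function of workers."""
    return workers / (a + b * workers)

def remaining_time(job, workers):
    """Estimated time to convergence = remaining steps / speed."""
    return job["remaining_steps"] / training_speed(workers, job["a"], job["b"])

def allocate(jobs, total_workers):
    """Greedily give each worker to the job whose completion time drops most."""
    alloc = {j["name"]: 1 for j in jobs}          # every job gets one worker
    for _ in range(total_workers - len(jobs)):
        best, best_gain = None, 0.0
        for j in jobs:
            w = alloc[j["name"]]
            gain = remaining_time(j, w) - remaining_time(j, w + 1)
            if gain > best_gain:
                best, best_gain = j["name"], gain
        if best is None:
            break
        alloc[best] += 1
    return alloc

jobs = [
    {"name": "resnet", "remaining_steps": 8000, "a": 1.0, "b": 0.05},
    {"name": "lstm",   "remaining_steps": 2000, "a": 0.5, "b": 0.20},
]
print(allocate(jobs, total_workers=10))
```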

[47]
Objectives: Schedule data processing jobs on distributed computing clusters.
Methodology/Algorithms: Uses neural networks and reinforcement learning to learn workload-specific scheduling algorithms (Decima).
Experiments: Prototype integrated with Spark on a 25-node cluster.
Findings: Compared to hand-tuned scheduling heuristics, Decima improves average job completion time by at least 21%.
Applications: Scheduling data processing jobs on distributed compute clusters.
Limitations: Open research questions remain on resource management and computation optimization in edge computing.

[48]
Objectives: Develop a fair-share scheduler for deep learning training on GPU clusters that balances the competing demands of efficiency and fairness.
Methodology/Algorithms: Gandivafair provides performance isolation between users and allocates cluster-wide GPU time fairly among active users. It incentivizes users to adopt older GPUs through a novel resource trading mechanism that maximizes cluster efficiency without weakening fairness guarantees (a simplified sketch follows this entry).
Experiments: Implemented and assessed on a heterogeneous 200-GPU cluster with realistic multi-user workloads.
Findings: Gandivafair achieves both fairness and efficiency.
Applications: GPU clusters for deep learning training.
Limitations: Further testing and deployment may reveal limitations and future improvements.
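
A minimal sketch of the fair-share-plus-trading idea; the ticket counts and the 5x exchange rate between GPU generations are assumptions for illustration, and the paper's actual mechanism is more involved:

```python
# Hypothetical sketch of Gandivafair-style fair share plus resource trading.
# All figures below are illustrative assumptions.

def fair_share(tickets, total_gpu_time):
    """Proportional split of cluster GPU time among active users."""
    total = sum(tickets.values())
    return {user: total_gpu_time * t / total for user, t in tickets.items()}

def trade_fast_for_slow(fast_gpus, speedup):
    """Compensate a user who yields fast GPUs with an equivalent amount of
    slow-GPU time, leaving aggregate fair shares intact."""
    return fast_gpus * speedup

print(fair_share({"alice": 2, "bob": 1, "carol": 1}, total_gpu_time=200.0))
print(trade_fast_for_slow(1, speedup=5.0))   # 1 fast-GPU-hour ~ 5 slow-GPU-hours
```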

[49]
Objectives: Fully exploit the power of big data, speed up processing times, and improve overall Kubernetes cluster performance.
Methodology/Algorithms: Presents ProCon, a progress-based container placement strategy. ProCon considers both the workers' current resource usage and a projection of future resource demand, reducing completion time and makespan while balancing resource contention across the cluster (a sketch follows this entry).
Experiments: Extensive experiments were conducted to test ProCon.
Findings: ProCon boosts overall performance by 23.0%, cuts completion time for certain jobs by up to 53.3%, and improves makespan by up to 37.4% over the default Kubernetes scheduler.
Applications: Improving the performance of Kubernetes clusters.
Limitations: –
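
A minimal sketch of scoring nodes by current usage plus projected demand; the linear-trend projection, the 0.5 weight, and the usage samples are assumptions, not ProCon's actual model:

```python
# Hypothetical sketch of ProCon-style progress-aware placement: score each
# node by current usage plus a projection of future demand, then place the
# container on the lowest-scoring node.

def projected_usage(samples, horizon=3):
    """Extrapolate recent usage samples with a simple linear trend."""
    slope = (samples[-1] - samples[0]) / (len(samples) - 1)
    return samples[-1] + slope * horizon

def score(node):
    return node["usage"][-1] + 0.5 * projected_usage(node["usage"])

nodes = [
    {"name": "node-a", "usage": [0.30, 0.40, 0.55]},   # rising load
    {"name": "node-b", "usage": [0.60, 0.55, 0.50]},   # falling load
]
best = min(nodes, key=score)
print("place container on", best["name"])              # node-b
```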

[50]
Objectives: Create a general-purpose, effective deep learning cluster scheduler that gets the most out of expensive deep learning clusters.
Methodology/Algorithms: Presents DL2, a deep-learning-driven scheduler for deep learning clusters that combines supervised learning and reinforcement learning: offline supervised learning warms up the neural network, which is then fine-tuned with reinforcement learning during the training of DL jobs. DL2 makes online resource allocation decisions for jobs using neural networks (a sketch of the two-phase training follows this entry).
Experiments: DL2 was implemented on Kubernetes, enabling dynamic resource scaling for DL jobs on MXNet. A thorough analysis compared DL2 with the fairness scheduler DRF and the expert heuristic scheduler Optimus.
Findings: In terms of average job completion time, DL2 outperforms DRF by 44.1% and Optimus by 17.5%.
Applications: Improving resource scheduling in deep learning clusters.
Limitations: –
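
A toy sketch of the two-phase training described above: supervised warm-up toward a stand-in heuristic, then REINFORCE fine-tuning on a toy reward. The linear policy, the four-feature state, and the reward are all illustrative assumptions:

```python
# Hypothetical toy of DL2's two-phase training: supervised warm-up on a
# stand-in heuristic's decisions, then REINFORCE fine-tuning.
import numpy as np

rng = np.random.default_rng(0)
N_ACTIONS = 3                       # e.g. which of 3 jobs gets the next worker

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

W = np.zeros((N_ACTIONS, 4))        # policy weights: state -> action logits

def state():
    return rng.random(4)            # toy cluster state

def heuristic(s):
    return int(np.argmax(s[:N_ACTIONS]))     # stand-in expert scheduler

# Phase 1: offline supervised warm-up toward the heuristic's labels.
for _ in range(2000):
    s = state()
    p = softmax(W @ s)
    y = np.eye(N_ACTIONS)[heuristic(s)]
    W += 0.1 * np.outer(y - p, s)   # cross-entropy gradient step

# Phase 2: REINFORCE fine-tuning against a toy reward signal.
def reward(s, a):
    return 1.0 - abs(s[a] - s[:N_ACTIONS].mean())   # favors balanced picks

for _ in range(2000):
    s = state()
    p = softmax(W @ s)
    a = int(rng.choice(N_ACTIONS, p=p))
    grad = -p
    grad[a] += 1.0                  # d log p(a|s) / d logits
    W += 0.05 * reward(s, a) * np.outer(grad, s)

print("sample decision after fine-tuning:", int(np.argmax(W @ state())))
```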

[51]
Objectives: Improve scheduling for deep learning applications in Kubernetes clusters and optimize resource management for deep learning workloads.
Methodology/Algorithms: Proposes SpeCon, a container scheduler tailored to short-lived deep learning applications, built on virtualized containers such as Kubernetes and Docker. Its algorithms track training progress and speculatively migrate slow-growing models to free up resources for quickly converging ones (a sketch follows this entry).
Experiments: Extensive experiments were performed to evaluate the proposed scheduler.
Findings: SpeCon reduces individual job completion time by up to 41.5%, and improves makespan by 24.7% and system-wide performance by 14.8%.
Applications: Optimizing scheduling for deep learning workloads on Kubernetes.
Limitations: –
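
A minimal sketch of the speculative step: flag jobs whose loss has stopped improving as migration candidates. The growth-rate metric, the 0.02 threshold, and the job records are assumptions for illustration:

```python
# Hypothetical sketch of SpeCon-style speculative migration: track each
# model's recent loss improvement and move slow-growing jobs off the busy
# node so fast-converging ones get its resources.

def growth_rate(loss_history):
    """Relative loss improvement over the last few epochs."""
    old, new = loss_history[-3], loss_history[-1]
    return (old - new) / old

def pick_migrations(jobs, threshold=0.02):
    """Select jobs whose convergence has stalled as migration candidates."""
    return [j["name"] for j in jobs if growth_rate(j["loss"]) < threshold]

jobs = [
    {"name": "cnn-fast", "loss": [0.90, 0.60, 0.40]},   # converging quickly
    {"name": "gan-slow", "loss": [0.52, 0.51, 0.51]},   # stalled
]
print("migrate:", pick_migrations(jobs))   # ['gan-slow']
```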

[52]
Objectives: Adaptively distribute independent batch jobs across multiple federated cloud computing clusters.
Methodology/Algorithms: RLSK, a deep reinforcement learning-based job scheduler, implemented on Kubernetes (a toy sketch of RL-based cluster selection follows this entry).
Experiments: Simulations were conducted to evaluate the performance of RLSK.
Findings: RLSK outperforms traditional scheduling algorithms.
Applications: Scheduling independent batch jobs in federated cloud computing clusters.
Limitations: –
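
Since the row gives little algorithmic detail, here is only a toy of the underlying idea: a tabular Q-learner that discovers it should dispatch to the lighter cluster. RLSK itself uses deep RL over a much richer state; everything below is an illustrative assumption:

```python
# Hypothetical toy of RL-based cluster selection, not RLSK's actual design.
import random

random.seed(0)
N = 2                                   # two federated clusters
q = [[0.0] * N for _ in range(N)]       # state: index of the lighter cluster

def observe():
    load = [random.random() for _ in range(N)]
    return load, min(range(N), key=lambda c: load[c])

load, s = observe()
for _ in range(5000):
    # epsilon-greedy choice of the cluster that receives the next batch job
    if random.random() < 0.1:
        a = random.randrange(N)
    else:
        a = max(range(N), key=lambda c: q[s][c])
    r = 1.0 - load[a]                   # dispatching to a light cluster pays off
    load, s2 = observe()
    q[s][a] += 0.1 * (r + 0.9 * max(q[s2]) - q[s][a])
    s = s2

print("Q-table (rows: lighter cluster, cols: action):")
for row in q:
    print([round(v, 2) for v in row])   # diagonal dominates: pick the lighter one
```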

[53]
Objectives: Create a machine learning cluster scheduling system and increase its efficiency and precision.
Methodology/Algorithms: A heuristic scheduling approach (MLFS) that considers an ML job's spatial and temporal characteristics. When the system is overloaded, a load-control method removes tasks that yield little or no accuracy gain and shifts tasks from overloaded to underloaded servers according to task priority (a sketch follows this entry).
Experiments: Large-scale simulations driven by real-world data, plus real experiments.
Findings: Compared to current ML job schedulers, MLFS decreases JCT by up to 53% and makespan by up to 52%, and improves accuracy by up to 64%.
Applications: Job scheduling for large-scale machine learning clusters.
Limitations: –
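
A minimal sketch of the two load-control steps, with the gain floor, load thresholds, and task records assumed for illustration:

```python
# Hypothetical sketch of MLFS-style load control: when the cluster is
# overloaded, drop tasks whose expected accuracy gain is negligible, then
# move the lowest-priority tasks from overloaded to underloaded servers.

def load_control(servers, gain_floor=0.001, high=0.9, low=0.5):
    for s in servers:
        # 1) remove tasks contributing little or no accuracy gain
        s["tasks"] = [t for t in s["tasks"] if t["acc_gain"] >= gain_floor]
    hot = [s for s in servers if s["load"] > high]
    cold = [s for s in servers if s["load"] < low]
    for s in hot:
        if not cold or not s["tasks"]:
            continue
        # 2) shift the lowest-priority task to the least-loaded server
        t = min(s["tasks"], key=lambda t: t["priority"])
        s["tasks"].remove(t)
        target = min(cold, key=lambda c: c["load"])
        target["tasks"].append(t)

servers = [
    {"load": 0.95, "tasks": [{"priority": 1, "acc_gain": 0.01},
                             {"priority": 5, "acc_gain": 0.0}]},
    {"load": 0.30, "tasks": []},
]
load_control(servers)
print([len(s["tasks"]) for s in servers])   # [0, 1]
```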

[54]
Objectives: Present KaiS, a learning-based scheduling framework for Kubernetes-oriented edge-cloud systems that raises the long-term throughput rate of request processing.
Methodology/Algorithms: KaiS employs a coordinated multi-agent actor-critic method for decentralized request dispatch and dynamic dispatch spaces within the edge cluster. It also uses graph neural networks to embed system state information, combining the embeddings with multiple policy networks to reduce orchestration dimensionality via stepwise scheduling (a sketch follows this entry).
Experiments: Experiments were conducted using real workload traces.
Findings: KaiS learned appropriate scheduling policies regardless of request arrival patterns and system scales, increasing system throughput rate by 14.3% and decreasing scheduling cost by 34.7% compared to baselines.
Applications: Improving the performance of Kubernetes-oriented edge-cloud systems.
Limitations: –
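
A minimal sketch of the dispatch pipeline's shape: one round of mean-aggregation message passing embeds per-node state, and an actor head scores each node for the next request. The topology, features, and random weights are illustrative assumptions, and the coordination and critic parts are omitted:

```python
# Hypothetical sketch of KaiS-style dispatch: a one-hop graph embedding of
# cluster state feeds a per-node policy that picks where to send a request.
import numpy as np

rng = np.random.default_rng(1)

# Toy edge cluster: 4 nodes, undirected links, 3 features per node
# (cpu load, mem load, queue length).
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
feats = rng.random((4, 3))

def embed(adj, x, w):
    """One round of mean-aggregation message passing (GNN-style)."""
    deg = adj.sum(axis=1, keepdims=True)
    neigh = adj @ x / np.maximum(deg, 1.0)
    return np.tanh(np.concatenate([x, neigh], axis=1) @ w)

W_embed = rng.normal(size=(6, 8)) * 0.1
w_score = rng.normal(size=8) * 0.1

h = embed(adj, feats, W_embed)          # per-node embeddings
logits = h @ w_score                    # actor head: one score per node
probs = np.exp(logits) / np.exp(logits).sum()
print("dispatch request to node", int(np.argmax(probs)))
```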

[55]
Objectives: Create a custom scheduler for the Kubernetes orchestrator that uses a model-based, multi-purpose Multi-Agent System (MAS) platform to distribute the scheduling task among the processing nodes.
Methodology/Algorithms: A MAS platform integrated with the Kubernetes orchestrator.
Experiments: –
Findings: The new scheduling approach is faster than the default Kubernetes scheduler.
Applications: Fog-in-the-loop (FIL) applications.
Limitations: –

[56]
Objectives: Develop a Kubernetes scheduling plan that combines LSTM neural network prediction with grey system theory.
Methodology/Algorithms: Uses grey system theory and LSTM-based prediction to optimize the container scheduling algorithm (a sketch of the grey prediction step follows this entry).
Experiments: The method was tested using experimental results.
Findings: The algorithm reduces resource fragmentation in the cluster and increases cluster resource utilization.
Applications: Improving how Kubernetes manages containers in a cluster.
Limitations: Not discussed in the text; future work may focus on improving the algorithm's performance and testing it in more complex scenarios.
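
A minimal sketch of the grey-prediction half of the scheme: a textbook GM(1,1) forecast of a node's next usage sample. The LSTM component is omitted, and the CPU series is an illustrative assumption:

```python
# Hypothetical sketch: GM(1,1) grey model forecasting the next resource-usage
# value; the survey's method pairs such predictions with an LSTM.
import numpy as np

def gm11_forecast(x0, steps=1):
    """Classic GM(1,1): fit on series x0, forecast `steps` values ahead."""
    x0 = np.asarray(x0, dtype=float)
    x1 = np.cumsum(x0)                          # accumulated series
    z1 = 0.5 * (x1[1:] + x1[:-1])               # background values
    B = np.column_stack([-z1, np.ones(len(z1))])
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]
    n = len(x0)

    def x1_hat(k):                              # fitted accumulated value
        return (x0[0] - b / a) * np.exp(-a * k) + b / a

    return [x1_hat(n + i) - x1_hat(n + i - 1) for i in range(steps)]

cpu = [41.0, 43.5, 45.2, 47.9, 50.1]            # % CPU usage samples
print("next predicted usage:", round(gm11_forecast(cpu)[0], 1))
```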

[57]
Objectives: Create a dynamic resource scheduler in Kubernetes for distributed deep learning training.
Methodology/Algorithms: Combines the methods of DRAGON and OASIS to create a scheduler that supports weighted autoscaling and gang scheduling for its jobs (a gang-scheduling sketch follows this entry).
Experiments: Evaluation using a set of TensorFlow jobs.
Findings: The scheduler increases training speed by over 26% compared to the default Kubernetes scheduler.
Applications: Distributed deep learning training in Kubeflow.
Limitations: –
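
A minimal sketch of the gang-scheduling rule: place all of a job's pods atomically or none at all. Node capacities and per-pod GPU requests are illustrative assumptions:

```python
# Hypothetical sketch of the gang-scheduling rule such a scheduler enforces:
# a distributed training job starts only if every one of its pods can be
# placed at once; otherwise nothing is reserved.

def gang_schedule(job_pods, nodes):
    """Place all pods of a job atomically, or none of them."""
    free = dict(nodes)                      # tentative copy of free GPUs
    placement = []
    for pod in job_pods:
        fit = next((n for n, gpus in free.items() if gpus >= pod), None)
        if fit is None:
            return None                     # one pod failed => whole gang waits
        free[fit] -= pod
        placement.append(fit)
    nodes.update(free)                      # commit only on full success
    return placement

nodes = {"node-a": 2, "node-b": 1}
print(gang_schedule([1, 1, 1], nodes))      # ['node-a', 'node-a', 'node-b']
print(gang_schedule([2, 2], nodes))         # None: gang cannot fit, no commit
```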

[58]
Objectives: Propose a more effective scheduling method for deep learning platforms that supports team collaboration, integrating a Docker-based deep learning platform with an improved Kubernetes scheduling algorithm.
Methodology/Algorithms: Models each team's users as virtual clusters and routinely checks the load on the clusters.
Experiments: The proposed scheduling algorithm was tested on a deep learning platform with multiple teams of users.
Findings: The proposed algorithm ensures load balance and meets users' needs.
Applications: Deep learning platforms that support multi-team collaboration.
Limitations: –

[59]
Objectives: Offer a balanced, fine-grained scheduling mechanism for deep learning workloads.
Methodology/Algorithms: Builds a balanced, fine-grained scheduling model that accounts for a DL task's resource consumption characteristics, realized as a scheduling system named KubFBS with dedicated GPU-sniffer and balance-aware scheduler modules (a scoring sketch follows this entry).
Experiments: Evaluated with real-world DL tasks on a 16-node Kubernetes cluster.
Findings: KubFBS accelerates the completion of DL tasks and improves the cluster's load-balancing capability.
Applications: Scheduling deep learning tasks in a Kubernetes cluster.
Limitations: Further experiments and evaluations are needed to demonstrate the system's effectiveness in different scenarios; future work could also improve its scalability.
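
A minimal sketch of a balance-aware placement score: prefer the node where utilization after placement is lowest and most even across resource dimensions. The variance penalty and the node/task numbers are assumptions, not KubFBS's exact formulation:

```python
# Hypothetical sketch of a balance-aware placement score: prefer the node
# where, after adding the task's demand, CPU/mem/GPU utilization stays
# lowest and most even.
from statistics import mean, pvariance

def balance_score(node_util, task_demand):
    """Mean utilization after placement, penalized by imbalance across
    resource dimensions (lower is better)."""
    after = [u + d for u, d in zip(node_util, task_demand)]
    return mean(after) + pvariance(after)

# per-node utilization as fractions: (cpu, mem, gpu)
nodes = {
    "node-1": (0.70, 0.20, 0.10),    # uneven: CPU-heavy
    "node-2": (0.35, 0.30, 0.30),    # even, moderately loaded
}
task = (0.10, 0.10, 0.10)
best = min(nodes, key=lambda n: balance_score(nodes[n], task))
print("schedule DL task on", best)   # node-2
```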