Low-power multi-cloud deployment of large distributed service applications with response-time constraints

Distributed service applications make heavy use of clouds and multi-clouds, and must (i) meet service quality goals (e.g. response time) while (ii) satisfying cloud resource constraints and (iii) conserving power. Deployment algorithms must (iv) provide a solution meeting these requirements within a short time to be useful in practice. Very few existing deployment methods address the first three requirements, and those that do take too long to find a deployment. The Low-Power Multi-Cloud Application Deployment (LPD) algorithm fills this gap with a low-complexity heuristic combination of generalized graph partitioning between clouds, bin-packing within each cloud and queueing approximations to control the response time. LPD has no known competitor that quickly finds a solution that satisfies response time bounds. A host execution time approximation for contention is fundamental to achieving sufficient solution speed. LPD is intended for use by cloud managers who must simultaneously manage hosts and application deployments and plan capacity to offer services such as Serverless Computing. On 104 test scenarios deploying up to 200 processes with up to 240 replicas (for scaling), LPD always produced a feasible solution within 100 s (within 20 seconds in over three-quarters of cases). Compared to the Mixed Integer Program solution by CPLEX (which took far longer and sometimes found no solution at all), LPD solutions gave power consumption equal to MIP in a third of cases and within 6% of MIP in 95% of cases. In 93% of all 104 cases the power consumption is within 20% of an (unachievable) lower bound. LPD is intended as a stand-alone heuristic to meet solution time restrictions, but could easily be adapted for use as a repair mechanism in a Genetic Algorithm.


Introduction
Applications with many services are increasingly common (e.g., microservice systems) and use multiple clouds or data centres to obtain suitable resources, to access regional data sets, or for reliability. Deployment must respect cloud resources, power consumption, and user response time, and must provide a fast solution for re-computation when adapting to changes. Network delays between clouds are large enough that they must be considered explicitly in a useful deployment solution. It is clear from the literature that solution by Mixed Integer Programming (MIP) or by metaheuristics such as genetic algorithms takes too long for larger problems, so heuristics must be used.
The service applications considered here have deployable units called "tasks" which may be virtual machines or containers. A user request results in tasks calling each other, possibly many times, in diverse patterns modeled as random interactions with average numbers of calls. They use the internet for calls between clouds, which introduces important delays which may add up to violate response time requirements. The heuristic Low Power Multi-Cloud Application Deployment (LPD) algorithm takes an application in the form of a directed Application Graph with tasks as nodes and their interactions as edges (Fig. 1a), with a maximum anticipated load level, and finds a deployment of tasks to hosts in multiple clouds (Fig. 1b), including scaled-out replicas as necessary. Under autoscaling the deployment is then feasible (although perhaps not optimal) at lower load levels as well, using fewer replica tasks and hosts.
LPD has four goals. It should: Goal 1. have low total power use by the hosts (ideally, minimum power), Goal 2. satisfy arbitrary constraints on computing and memory capacity (heterogeneous clouds), Goal 3. satisfy constraints on the response time, including accounting for intercloud network delay (here, the mean response time is constrained), Goal 4. provide a solution quickly enough to be useful.
A suitable target maximum solution time depends on how often a deployment must be calculated. It could be a few seconds or minutes; for the examples considered here it is taken to be 1 minute on the grounds that cloud adaptation (such as deployment of a new virtual machine) occurs on a timescale of minutes.
Metaheuristics such as genetic algorithms, particle swarm optimization or stochastic annealing address goals 1-3 by combining them in utility functions that include weighted penalties for delay and other constraints. Soft constraints do not prevent constraint violation without additional, possibly complex effort, so they are excluded here, and this excludes most metaheuristics. Additionally, their solution takes too long, as discussed below, and this applies also to versions which "repair" the solution to enforce actual constraints. Excluding soft constraints, the above combination of four goals is not resolved by existing methods, which motivates this work. Table 1 lists the combinations of the first three goals and scalable approaches that satisfy them, apart from MIP and metaheuristics. Some combinations have elementary solutions.
The LPD approach adds power reduction to both graph partitioning (to allocate tasks to clouds) and bin-packing (to allocate tasks to hosts). Using an approximate bound in the response time constraint reduces the effort to compute it and brings the solution time within the desired range.

Related work
Optimal deployment is a classic problem. There have been recent surveys of optimal deployment in clouds by Zhang et al. [4] (emphasizing efficient algorithms), Helali et al. [5] (emphasizing adaptation by consolidation), Masdari and Zangakani [6] (emphasizing predictive adaptation), and Skaltsis et al. [7], oriented to deploying multi-agent systems. They list a variety of approaches, including:
• Mixed Integer Programming (MIP),
• Genetic Algorithms (GA) and other metaheuristics such as ant colonies, particle swarms, and simulated annealing, which apply varieties of random search,
• Application Graph Partitioning (AGP), which focuses on the interactions between clouds,
• other approaches including bin-packing, hill-climbing and custom heuristics.
The use of MIP is evaluated in several papers. Malek et al. [8] also gives additional references to MIP. Li et al. [9] combine MIP with bin-packing, to optimize power. Ciavotta et al. [10] use MIP to partition a software architecture and deploy it. The latter study minimized cost while constraining response time (without network delays) as computed by a simple queueing model, and then used a more detailed layered queueing model and a customized local search. MIP can address Goals 1-3 of LPD, and is formulated for this purpose below. However, all these studies found that MIP scaled too poorly for practical use.
GA evolves deployments by modifying a set of initial candidates by random mutations and by combining existing candidates, to optimize a fitness function. The fitness weights together the objectives and constraints, making them soft constraints. GA has been applied for Goals 1-3 in [3,8,11]. The use of other metaheuristics is referenced in these papers and in the surveys. A strength of the fitness function approach is that it can be extended to include additional goals such as reliability, and to deal with multiple fitness functions. It can find sets of Pareto-optimal (i.e. non-dominated) solutions for multiple objectives, which goes beyond our goals. In Frincu et al.
[12] the objectives are response time, resource usage, availability, and fault-tolerance; in Guerrero et al. [13] they include failure rates and total network delay as a proxy for response time. Frey et al. [14] optimize deployment over multiple clouds and multiple cloud vendors with three objectives of cost, mean response time, and SLA violation rate. Fitness is evaluated by a simulator, and constraints are enforced by rejecting infeasible candidates. Ye et al. [15] consider four objectives including energy minimization in the cloud and the network.
The inability of unmodified GA to directly enforce constraints leads us to exclude it as an approach. A recent method used in [15] enforces constraints by "repairing" infeasibilities for each fitness evaluation. The repair changes the candidate solution to make it feasible. Similar changes are part of the partitioning algorithm in LPD. While it is not considered here, the potential use of part of LPD for GA repair is discussed later. It is also clear from the experience in the literature that GA is too slow for our fourth goal, for large deployments.

Table 1 Scalable approaches for combinations of Goals 1-3:
Goals 1,2: 2-D bin-packing (with dimensions of capacity and memory) augmented by considering power (e.g. [2]) (used in LPD).
Goals 1,3: Graph partitioning among clouds to control delay; for each cloud deploy to the most efficient processor first (as in LPD without capacity constraints).
Goals 2,3: Minimize response time (e.g. by graph partitioning accounting for delay and capacity as in [3]), and accept the solution if it meets the response time constraint.
Goals 1,2,3: New in LPD: use graph partitioning methods among clouds to control delay, with two-dimensional bin-packing (capacity and memory) within clouds augmented by decisions based on power.
Malek et al. [8] compare MIP, GA and a heuristic and find that both MIP and GA are orders of magnitude too slow. Guerout et al. [16] compare MIP and GA for single-cloud deployment to optimize a utility function that combines four QoS objectives including energy, using a stagewise approach to improve MIP scalability, and find that MIP and GA are both too slow for practical use. Their calculated response times ignore network delays; including them would undoubtedly make both algorithms slower. Alizadeh and Nishi [17] carefully compare MIP solution times by CPLEX [18], and by GA, on a very large MIP unrelated to deployment. Their GA strictly enforces constraints by constraint repair. They find that the solution time of GA is at best an order of magnitude less than for MIP. Since MIP is several orders of magnitude too slow for Goal 4, this emphasizes that GA is unsuitable on the grounds of solution time.
Most papers consider only a single cloud or data centre. Multi-cloud systems have more serious response time concerns because of the network delay for messages that cross between clouds. Application Graph Partitioning (AGP) addresses these delays as costs on the arcs of an application graph as in Fig. 1(a). AGP algorithms such as in Fiduccia-Mattheyses [19] efficiently reduce or minimize the total cost, and can also include host constraints and costs. AGP is used for multi-cloud deployments in [20], and in [3] to minimize bandwidth (rather than delay), in [1] to minimize the sum of network and energy cost, and in [21] (via a metaheuristic) to minimize power in mobile clouds. The approach called Virtual Network Embedding [22] is also an adaptation of AGP. In summary, AGP deals efficiently with network delays but cannot resolve all the remaining goals; it needs to be augmented.
Bin-packing deals directly with capacity constraints (Goal 2) and is combined with performance models in [23] (ignoring power). It is adapted by Arroba et al. in [2] to reduce power (Goal 1) by applying CPU speed and voltage scaling, but they do not address response time.
Custom heuristics are described in [8] and other works; in [8] they are the only option that solved in a useful time, but this work does not address response time.
The response time includes delay due to processor contention and saturation, which are estimated by performance models. A survey of models for clouds is given by Ardagna et al. [24]. Some recent work on deployment that reflects resource contention (always on the mean response time): Aldhalaan and Menasce [25] use a queueing model to place VMs in a cloud, finding the number of replicas by a hill-climbing technique. Ciavotta et al. [10] use a simple queueing approximation in a first stage, and a detailed layered queueing model in a second stage of optimization. Molka and Casale [26] use queueing models with GA, bin-packing and non-linear optimization, to allocate resources for a set of in-memory databases. Wada et al. [27] use queuing models to predict multiple service level metrics and a multi-objective genetic algorithm to seek pareto-optimal deployments. Calheiros et al. [28] and Shu et al. [11] use a queueing model to scale a single task to meet response time requirements, and [11] also models response failures. None of these studies consider power.
Deployment to minimize power has been part of many studies, e.g. energy is included in the utility function in [8]. Wang and Xia [29] apply MIP to VM placement in the cloud to minimize power subject to processing and memory constraints (but ignoring response time and network delays). Tunc et al. [30] use a real-time control rule based on performance monitoring and a Value of Service metric that enforces the response time constraint while controlling for energy consumption. This is close to our stated problem, but they do not include network delays in the response time. Chen et al. [31] use a random search heuristic to minimize the money cost of power and network operations (but not delay).
For deployment tools, Arcangeli et al. [32] describe many tools from the viewpoint of a system operator, including MIP, GA, AGP and others.
To summarize, a heuristic solution for the minimum-power application deployment problem subject to constraints on response time and host capacities is needed because the solution times to find provable optima via methods such as MIP are too long. Metaheuristics like GA also take too long and have difficulty enforcing actual constraints. LPD, developed in the theses of Kaur [33] for a single cloud and Singh [34] for multi-clouds, uses instead a novel combination of graph partitioning and bin-packing to meet all the goals stated for LPD.

Background on techniques used in LPD
LPD uses customized algorithms derived from the following two families.
K-way graph partitioning (KGP) [19] divides a graph into K parts, where each part is a subset of the graph node set, the parts are non-intersecting, and cover the entire node set. A common objective is to minimize the cut-weight, which is the sum of the edge weights that connect nodes in different parts. KGP is NP-hard [35] and so does not scale well, but approximate solutions are provided more quickly using multi-level KGP schemes proposed by Karypis et al. [36,37]. In LPD, the edge weights are defined so the cut-weight is the network delay, and multi-level KGP is used to reduce the cut-weight until the response time constraint is satisfied, but not necessarily minimized. Graph partitioning algorithms often enforce a balance constraint to distribute the node weights equally among the partitions, but this is not required in the deployment problem.
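As a concrete illustration, the cut-weight objective can be computed as below (a minimal sketch with hypothetical data shapes, not the paper's implementation):

```python
def cut_weight(edges, part):
    """Cut-weight of a K-way partition: the sum of the weights of edges
    whose endpoints lie in different parts.

    edges: iterable of (u, v, w) tuples; part: mapping node -> part index.
    """
    return sum(w for u, v, w in edges if part[u] != part[v])
```

In LPD the edge weights are message counts scaled by inter-cloud latency, so this sum is the total network delay contributed by the partition.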
2-D bin packing described in [1] is an NP-hard algorithm to pack a set of 2-dimensional items into a minimum number of 2-dimensional bins. In LPD, heuristic 2-D bin packing is incorporated into the KGP move evaluation, to reduce power consumption. After partitioning, a more rigorous bin packing is applied to each partition to further reduce power consumption.

The model for multi-cloud application deployments
We consider K heterogeneous clouds, with heterogeneous hosts, and different network latencies between clouds. Communication delays within the clouds are ignored. The set of users is treated as an additional cloud, which has only the users; there is no other deployment to the user cloud. Figure 2 shows the model, and Table 2 defines its parameters. The processing capacity of cloud hosts is defined in ssj_ops according to the SPECpower benchmark for processors [38].
A deployment D of tasks to individual hosts in the clouds must provide a user throughput of λ responses/sec, and have a latency (response time) of no more than R_req msec.

The application model and application graph
The analysis begins from an Application Graph G of vertices (denoted v_i) representing deployable components, and edges (v_i, v_j) representing their interactions, labelled by the number of messages c_ij exchanged per user response. Optionally, G may be derived from an Application Model as in Fig. 3(a), which shows the components as "tasks" and the inter-task calls that invoke operations, labelled by resource demands. The Application Model uses the notation of layered queueing models [39,40] in which the components are shown as rectangles (e.g. t0) and their operations as attached rectangles (e.g. e2601). Each invocation of an operation has a mean CPU demand and makes calls as shown by arrows, labeled by the mean calls. Derivation of the graph parameters from a layered queueing model is described in [41]. Service operations may have highly variable resource demands, depending on the nature of the user request, and the data that it accesses. This is modeled by random values for CPU demands and calls, with average values as parameters. Figure 3(b) gives an example of an Application Graph derived from the Application Model of Fig. 3(a). Alternatively, the Application Graph could be derived directly from measurements on the deployment units. Table 3 defines the properties of the Application Graph.
The task processing demand v_i^ops is measured in operations/s in units of ssj_ops as defined in the SPECpower benchmark [38], executed on a reference host h_ref at a nominal user throughput of λ responses/s. To obtain the demand on any other host type it is scaled by the host speed factor to give v_i^ops/h_kl^sf.

Power model
Power consumption h_kl^pow for each host h_kl is related to the work it is doing, modeled by a linear least-squares fit to its SPECpower [38] benchmark results, stated as:

h_kl^pow = h_kl^base + h_kl^pRate · h_kl^ops

where h_kl^base is the base turn-on power in W, h_kl^pRate in W/ssj_ops is the slope of the fitted line, and h_kl^ops is the load. Power Usage Effectiveness, defined as the ratio of cloud power consumption to total host power consumption [42], is assumed for simplicity to be unity for all clouds, so cloud power equals total host power. If it is less (but still the same for all clouds), power can be scaled.
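The linear fit can be sketched as follows (illustrative parameter names and values, not measured benchmark data):

```python
def host_power(ops, base_w, p_rate):
    """Linear SPECpower fit for one host: base (turn-on) power in W plus
    the fitted slope (W per ssj_op) times the current load in ssj_ops."""
    return base_w + p_rate * ops
```

For example, a host with a 50 W base power and a slope of 0.01 W/ssj_op draws 60 W at a load of 1000 ssj_ops.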

Bound on Total execution delay
The response time is the network delay plus the execution delay. The total execution delay is the time (per user response) spent executing and waiting to be scheduled on some processor. LPD applies a capacity constraint on each host that restricts its CPU utilization below a nominal value U nom . Since execution delay increases with utilization, the delay for utilization = U nom is a bound.
For a given task v_i on host h_kl the maximum delay is calculated by the well-known processor-sharing approximation to host queueing (see e.g. [43]) as:

time(v_i, h_kl) = (v_i^ops / (λ h_kl^sf)) / (1 − U_nom)

The total delay bound D_Tproc is the sum of time(v_i, h_kl) over all tasks, which simplifies to

D_Tproc = (1 / (λ(1 − U_nom))) Σ_i v_i^ops / h_kl^sf

Adding this value to the total network delay gives an upper bound on response time, defined as D_RT below.
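The bound above can be computed directly (a sketch of the reconstructed formula; the argument names are assumptions):

```python
def total_exec_delay_bound(demands, speed_factors, lam, u_nom=0.8):
    """Upper bound on total execution delay per user response, in seconds.

    demands[i]: task demand v_i^ops in ssj_ops at throughput lam responses/s;
    speed_factors[i]: speed factor of the host that task i is placed on.
    Each task's per-response execution time is inflated by 1/(1 - u_nom),
    the processor-sharing waiting factor at the nominal utilization cap.
    """
    return sum(d / (lam * sf)
               for d, sf in zip(demands, speed_factors)) / (1.0 - u_nom)
```

Because every host is constrained to utilization at most U_nom, the actual contention delay can never exceed this value, which is what makes the bound safe to use inside the optimization loop.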

Deployment model
Using the concepts and notation defined above, a deployment is described by binary variables d_ikl, where d_ikl = 1 if task v_i is deployed on host h_kl and 0 otherwise. A feasible deployment must not exceed the processing capacity (in ssj_ops) and memory capacity (in MB) of any host and must satisfy the overall response time requirement R_req (in msec).

Mixed-integer quadratic programming optimization
The deployment optimization gives a mixed-integer quadratic programming (MIQP) model (quadratic because the network delay depends on the deployment of pairs of tasks). MIQP solutions found using CPLEX [18] were used as the baseline for comparison in the evaluation below.

The low-power multi-cloud application deployment (LPD) algorithm

LPD first scales the application by creating replicas of each task to provide sufficient processing power. Then it deploys tasks to clouds by a graph-partitioning algorithm with a first phase that iteratively reduces D_RT until D_RT ≤ R_req, and a second phase that minimizes D_pow while maintaining D_RT ≤ R_req. This is carried out for a set of initial partitions, including random partitions. Finally, the host deployments d_ikl within each cloud are determined by a power-aware bin-packing stage which satisfies the capacity constraints for each host and reduces power further.
There are six stages of LPD as described in the Algorithm summary and illustrated in Fig. 4.

LPD algorithm summary
INPUTS: clouds, hosts, users, application, control parameters, network delays.
Stage 1: Select a subset of clouds C to host the application.
Stage 2: Generate the application graph G from the application model.
Stage 3: Scale the application by adding task replicas, and coarsen the graph to G_coarse.
Stage 4: Create a set of initial deployments of G_coarse to the clouds.
Stage 5: Partition G_coarse among the clouds; then uncoarsen and bin-pack tasks to hosts within each cloud.
Stage 6: Select the deployment with the lowest power consumption.

LPD stages 1-4: preparation and initial deployment
The first stage selects the clouds for deployment, and stage 2 creates the application graph G. Stage 3 adds replicas to provide sufficient capacity for each task, even if deployed on the slowest host in the system (call its speed factor s_min); for task v_i, the replica count is the smallest integer such that each replica's share of the processing demand, scaled by s_min, fits on the slowest host. Each replica receives an equal share of the calls to v_i and duplicates the calls made by v_i. Stage 3 then coarsens the resulting graph to G_coarse by repeatedly merging the nodes (tasks) joined by the edge with the largest weight, subject to maximum nodal processing and memory requirements (which ensure that the merged node will fit on any host). It terminates when the largest edge weight is less than the original average edge weight.
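The coarsening step can be sketched as below (a simplified sketch of Stage 3; the data shapes and merge-naming convention are assumptions, not the paper's implementation):

```python
def coarsen(nodes, edges, max_ops, max_mem):
    """Coarsen an application graph by repeatedly merging the endpoints of
    the heaviest edge, while the merged node still fits the per-host caps.
    Stops when the heaviest mergeable edge weight falls below the original
    average edge weight, per the termination rule in the text.

    nodes: {name: (ops, mem)}; edges: {frozenset({u, v}): weight}.
    Mutates and returns its arguments.
    """
    if not edges:
        return nodes, edges
    avg_w = sum(edges.values()) / len(edges)
    blocked = set()  # edges whose merge would exceed the caps
    while True:
        cands = {e: w for e, w in edges.items()
                 if e not in blocked and w >= avg_w}
        if not cands:
            break
        e = max(cands, key=cands.get)
        u, v = sorted(e)
        ops = nodes[u][0] + nodes[v][0]
        mem = nodes[u][1] + nodes[v][1]
        if ops > max_ops or mem > max_mem:
            blocked.add(e)  # too big to merge; try the next-heaviest edge
            continue
        m = u + "+" + v  # merged node
        nodes[m] = (ops, mem)
        del nodes[u], nodes[v], edges[e]
        new_edges = {}
        for k, w in edges.items():  # redirect edges to the merged node
            k2 = frozenset(m if x in (u, v) else x for x in k)
            if len(k2) > 1:  # drop self-loops; combine parallel edges
                new_edges[k2] = new_edges.get(k2, 0) + w
        edges, blocked = new_edges, set()
    return nodes, edges
```

Merging heavily-communicating tasks first keeps their traffic off the inter-cloud network no matter how the coarse graph is later partitioned, which is the point of coarsening here.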
Experience showed that the partitioning sometimes found local optima. To improve the solution, Stage 4 creates a set of initial deployments using three strategies: 1. Random: N random initial deployments. 2. Greedy on Power/Delay: with two variants: repeatedly deploy the task with the largest CPU demand on the feasible cloud that has space and (is most powerefficient)/(adds the least delay) (2 deployments), 3. Bin-packing: all combinations of two bin-packing strategies, packing on memory or power, and sorting bins in increasing or decreasing order (8 deployments).

LPD stage 5: partitioning
Partitioning among clouds uses the sub-algorithm described below with a simplified set of constraints. The response time constraint was simplified by computing D_Tproc just once based on the slowest host, and then constraining D_Tnet, which is the cut weight of the partition, by

D_Tnet ≤ R_req − D_Tproc (6)

Simplified capacity constraints for processing and memory in each cloud are computed based on the most restricted hosts in the cloud. These simplifications are conservative, in that any feasible solution satisfies the original constraints. Partitioning uses the K-way variant of the Fiduccia-Mattheyses (FM) graph partitioning refinement heuristic [19,35]. It recursively identifies clusters to move from one cloud to another, based on the improvement in a fitness function calculated for each potential move. In phase 1 it obtains feasibility by making moves that reduce D_Tnet until Eq. (6) is satisfied, and then in phase 2 it minimizes power while maintaining feasibility. The Partition sub-algorithm uses this additional notation:
• move(v_i, h_to,l) moves task v_i to host l on cloud C_to,
• Δ_RT(move) = amount of response time reduction,
• Δ_pow(move) = amount of power reduction.
It partitions G_coarse among clouds to minimize power subject to response time and capacity constraints.

Partition sub-algorithm
INPUTS: D_init, G, C, δ, R_req
D ← D_init. Calculate D_RT and D_pow and sort the tasks in G in decreasing order of v_i^ops.
Repeat:
    Initialize best power-reduction move: move_pow ← φ, max_pow ← 0.
    Initialize best latency-reduction move: move_RT ← φ, max_RT ← 0.
    For each task v_i in the sorted task list:
        For each cloud C_to other than C_0 or C_from:
            Find the most power-efficient available host h_to,l in C_to that can accept v_i.
            Calculate the power and latency reductions Δ_pow and Δ_RT for moving v_i to h_to,l.
            If Δ_pow > max_pow, then max_pow ← Δ_pow and move_pow ← move(v_i, h_to,l).
            If Δ_RT > max_RT, then max_RT ← Δ_RT and move_RT ← move(v_i, h_to,l).
    If move_pow ≠ φ and it does not violate R_req, then accept move_pow.
    Else if R_req is not met and move_RT ≠ φ, then accept move_RT.
    If a move has been accepted, make it and update D, D_RT, and D_pow.
    Else (no move is found): exit.
OUTPUT: D.
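The acceptance rule at the end of each pass can be sketched as follows (a minimal sketch; the candidate tuple shape and names are assumptions standing in for the Δ_pow and Δ_RT evaluations in the text):

```python
def choose_move(candidates, rt_current, r_req):
    """Acceptance rule of the Partition sub-algorithm: prefer the best
    power-reduction move if it keeps the response time feasible; otherwise,
    while the response time constraint is not met, take the best
    response-time reduction.

    candidates: list of (move, d_pow, d_rt, rt_after) tuples.
    Returns the chosen candidate, or None to stop the refinement loop.
    """
    move_pow = max((c for c in candidates if c[1] > 0),
                   key=lambda c: c[1], default=None)
    move_rt = max((c for c in candidates if c[2] > 0),
                  key=lambda c: c[2], default=None)
    if move_pow is not None and move_pow[3] <= r_req:
        return move_pow  # phase 2: reduce power while staying feasible
    if rt_current > r_req and move_rt is not None:
        return move_rt   # phase 1: work toward feasibility first
    return None          # no improving move: terminate
```

Returning None when neither move applies is what gives the sub-algorithm its natural stopping point.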
After partitioning among clouds, Stage 5 uncoarsens G and restores the original constraints. It applies a final 2-D bin-packing step within each cloud (tasks deployed on hosts) to reduce power consumption and satisfy the constraints of individual hosts. It uses two variants of Hybrid-Best-Fit (HBF) two-dimensional bin-packing [44] with dimensions of processing (v_i^ops) and memory.
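The power-aware packing idea can be sketched as below (a greedy simplification of Hybrid-Best-Fit, not the paper's exact algorithm; the data shapes are assumptions, and every task is assumed to fit some host type):

```python
def pack_min_power(tasks, host_types):
    """Place each task (ops, mem) on the option -- an already-open host or a
    newly opened one -- that fits both dimensions and adds the least power
    under the linear power model.

    host_types: list of dicts with keys ops_cap, mem_cap, base_w, p_rate
    (unlimited supply of each type). Returns the list of opened hosts.
    """
    open_hosts = []  # each: {"type": t, "ops": used, "mem": used}
    for ops, mem in sorted(tasks, reverse=True):  # largest tasks first
        options = []
        for h in open_hosts:  # marginal power on an open host: slope only
            t = h["type"]
            if h["ops"] + ops <= t["ops_cap"] and h["mem"] + mem <= t["mem_cap"]:
                options.append((t["p_rate"] * ops, h))
        for t in host_types:  # opening a new host also pays its base power
            if ops <= t["ops_cap"] and mem <= t["mem_cap"]:
                options.append((t["base_w"] + t["p_rate"] * ops,
                                {"type": t, "ops": 0, "mem": 0, "new": True}))
        cost, h = min(options, key=lambda o: o[0])
        if h.pop("new", False):
            open_hosts.append(h)
        h["ops"] += ops
        h["mem"] += mem
    return open_hosts
```

Charging the base power only when a host is opened is what steers the packing toward fewer, fuller hosts, the main lever for power reduction in this stage.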

Stage 6: selecting the final deployment
Among all the deployments found from all the initial deployments, stage 6 returns the one with the lowest power consumption D pow .

LPD design choices
Preliminary experiments described in [34] refined the choice of algorithm steps and parameters, showing that:
• Most solutions arise from the random initial deployments; N_random = 60 assured a solution.
• Coarsening using the stated termination condition reduced runtime and gave better quality solutions for both power and delay.
• The final bin-packing using HBF improved the solution in 15% of cases, mostly due to HBF-Mem. Other approaches to bin-packing did not improve on this.

Time complexity of LPD
There are n tasks and |E| edges in G_scaled, K clouds each with O(h) hosts, and d initial deployments. The important operations have these time complexities (see [37] for related complexity analyses of the partitioning). Assuming that the number of interactions of each task is bounded, |E| is O(n); the partitioning refinement (operation (v)) is dominant, and the overall time complexity is O(dKn log n).

LPD for genetic algorithm solution repair
Stage 4 of the LPD generates numerous infeasible initial deployments. The partitioning in stage 5a first works to render the initial solution feasible, and then to improve the objective function value. It is simple to terminate the partitioning at the first feasible solution. This truncated stage 5a partitioning step can be used as a quick repair mechanism for infeasible intermediate solutions generated by a Genetic Algorithm. This possibility will be investigated in future research.

Evaluation experiments: setup
The quality of LPD deployments was compared to (i) MIQP solutions when they could be obtained, and (ii) a loose power bound. Response time feasibility was verified against the results from the LQNS solver, which gives a more refined calculation of delay.
Comparisons of a measure m found by two methods A and B were made using the relative error measure Δ_A:B defined as:

Δ_A:B = (m_A − m_B) / m_B × 100%
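In code, the comparison measure is simply (names are illustrative):

```python
def rel_err(m_a, m_b):
    """Relative error of measure m between methods A and B, expressed as a
    percentage of B's value (the Delta_A:B comparison measure)."""
    return 100.0 * (m_a - m_b) / m_b
```

So, for example, an LPD power of 110 W against an MIQP power of 100 W gives a relative error of 10%.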

Application test cases
Random test models were generated, structured to provide many variations of the layered architecture typical of cloud service applications such as e-commerce and social media, and of microservices generally. Application models consisted of 4, 8, 18, 24, or 30 tasks and had a wide range of workload characteristics. One hundred and four test models were generated by the random model generator lqngen [45]. Model properties ranged widely, to represent a wide range of applications and of deployment difficulty:
• Number of application components, including scaled-out replicas: 20-240,
• Total processing requirement: 5-84% of the available total multi-cloud capacity,
• Total memory requirement: 5-87% of the total multi-cloud capacity,
• Total number of calls per user request: 3-11,050.
Six comparisons were made, presented in the subsections below.

Cloud environment
The test cases considered four clouds as shown in Fig. 2, with parameters in Table 4 below. Table 5 shows the network delays, and Table 6 describes the hosts. The experiments were run with U nom = 0.8, which is a typical target value for maximum utilization, and with 60 initial deployments.

Software and hardware
The algorithm was implemented in Java, using the Java Universal Graph Framework (JUNG) v2.0.1 and Guava v19.0, with Google Core Libraries for Java for the Multimap data structure that represents the deployment. LPD was run on an Intel i5 3570 PC with 3.4 GHz processor, 16 GB memory and with Ubuntu v16.04 Linux, while the MIQP solutions were run on a faster Intel Core i9-9980HK running at 2.40GHz.

Experimental results
The experiments evaluate the effectiveness of LPD and explore its properties on applications of different sizes and requirements. The principal concerns are (1) closeness of the power to the bound on the optimal value, (2) meeting the response time constraint, and (3) algorithm solution time versus the target of 1 minute for practical use in deploying models in this size range. Also examined were (4) the effectiveness of the strategy used by LPD versus two simple alternatives, and (5) the impact of tighter response time constraints on solution effort and quality.

Power consumption: LPD vs. MIQP
MIQP solutions by the CPLEX optimizer [18] give the minimum power, which was compared with LPD. Forty-four cases were solved by CPLEX within a time limit of 300 seconds per solution. Fig. 5 shows a histogram of Δ_LPD:MIQP for power. LPD power is equal to MIQP power (within a very small tolerance) in 15 of the 44 cases (34%) and is within 6% in 42 of 44 cases (95%). The largest difference is 10.8%. The overall success rate of LPD in finding near-optimum low-power solutions for these 44 cases is excellent. For the remaining 60 larger cases a comparison was made to a bound.
The 300-second MIQP limit is well beyond the assumed useful time limit of 60 s and the limited solvable cases demonstrate the poor scalability of the MIQP approach.

Power consumption: LPD vs. a lower bound
A simple lower bound P bound on total power consumption is calculated by assigning fractions of the pooled total workload to the most power-efficient processors across all clouds and taking the sum of their power consumptions. The construction of P bound ignores the granularity of processing by the tasks, and the response time constraints, which will make it optimistic and unattainable, but it provides a comparison for the cases in which the MIQP solver exceeds its time limit.
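The construction of the bound can be sketched as follows (a sketch of the description above; the exact handling of host counts and the operating point are assumptions):

```python
def power_lower_bound(total_ops, host_types, u_nom=0.8):
    """Fractional lower bound on total power: pour the pooled workload onto
    hosts in order of power efficiency (W per ssj_op at the nominal
    utilization), ignoring task granularity and response-time constraints.

    host_types: list of (count, ops_cap, base_w, p_rate) tuples.
    """
    def watts_per_op(t):
        _, cap, base, rate = t
        load = u_nom * cap
        return (base + rate * load) / load  # efficiency at the cap
    remaining, power = total_ops, 0.0
    for count, cap, base, rate in sorted(host_types, key=watts_per_op):
        for _ in range(count):
            if remaining <= 0:
                return power
            load = min(remaining, u_nom * cap)  # fractional assignment
            power += base + rate * load
            remaining -= load
    return power
```

Because the workload is split fractionally with no placement or delay constraints, no feasible deployment can consume less power, which is what makes P_bound a valid (if unattainable) comparison point.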
The histogram of the difference Δ_LPD:bnd for power in Fig. 6 shows that the power found by LPD is within 10% of P_bound for 22 of the 104 cases and within 10-20% in a further 75 cases. Of the 7 cases with a larger difference, 3 had MIQP solutions with power identical to LPD, leaving just four cases with (possibly) unsatisfactory power solutions.
For the 60 cases where the MIQP timed out, the LPD power averages 14.3% greater than P bound , with 56 of the 60 cases (93%) closer than 20%.

Verification of the response time constraint
In the 104 LPD solutions, the simplified response time approximation D_RT^LPD always satisfies the response time limit R_req. If there are many calls between processes then network delay reduction may take precedence over power reduction and the response time may be close to the constraint. If the response time constraint is too tight a feasible solution may not be found (or even be possible). Figure 7 shows a histogram of the relative amount by which LPD met the response constraint; the solution was always feasible and in some cases the structure of the application messaging, combined with power optimization, gave response times much smaller than the constraint.

A more accurate estimate of the response time of the LPD deployment, denoted D_RT^LQNS, can be found using the Layered Queueing Network Solver LQNS [39]. In the LQNS solution the host utilizations depend on the deployment (rather than assuming the value U_nom) and the solver exploits more sophisticated queueing approximations. Because the actual utilizations can never be higher, the LQNS values are always feasible for an LPD solution. This was verified in the 84 cases for which LQNS could find a solution; in the remaining cases the model was too large for the LQNS solver. The histogram in Fig. 8 shows the relative margin by which the response time found by LQNS was smaller than by LPD.

Variants on LPD
Two variants of LPD were created to consider alternative strategies:
Variant1: Stop at the first solution that satisfies the response time constraint. This gives priority to reducing latency until the response time constraint is met.
Variant2: Continue from the Variant1 solution to a final solution that minimizes response time. This gives absolute priority to a short response time.
The two variants were applied to 16 test cases. All the solutions met the response time constraint, with Variant2 providing the lowest values. However, LPD gave the lowest power consumption in every case, exceeding P_bound by just 6.5% on average (vs 18.4% for Variant1 and 21.1% for Variant2), confirming that it provides a suitable balance between power and response time.

Algorithm runtime vs constraint tightness
Algorithm runtime is influenced by the difficulty of the deployment, which depends on the tightness of the constraints. To examine this relationship, the response-time constraint of one model was gradually tightened. The response time includes an execution delay of R_ptD = 608 msec. Table 7 shows the impact of the reduced values of R_req on the LPD runtime T_LPD with 52 starting points. At first almost all trials find feasible solutions, with increasing T_LPD; then T_LPD levels off and there are increasing failures. At some level of tightness, no trial finds a feasible solution at all.
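The constraint-tightening experiment can be sketched as a sweep over R_req values, running a batch of trials from random starting points at each value. The toy stand-in below only mimics the qualitative trend (tighter bound, longer runtime, more failures); none of its constants come from LPD:

```python
import random

def run_lpd_trial(r_req_ms, seed):
    """Toy stand-in for one LPD run from a random starting point.
    Returns (feasible, runtime_s). Purely illustrative, not the real solver."""
    rng = random.Random(seed)
    achieved_rt = 600 + rng.uniform(0, 300)          # ms; cannot go below the
                                                     # execution-delay floor
    runtime_s = 1.0 + max(0.0, 900 - r_req_ms) / 50  # tighter bound -> more work
    return achieved_rt <= r_req_ms, runtime_s

def sweep(r_req_values, n_starts=52):
    """Tighten R_req stepwise and report feasibility and average runtime."""
    for r_req in r_req_values:
        results = [run_lpd_trial(r_req, s) for s in range(n_starts)]
        feasible = sum(ok for ok, _ in results)
        avg_t = sum(t for _, t in results) / n_starts
        print(f"R_req={r_req} ms: {feasible}/{n_starts} feasible, "
              f"avg runtime {avg_t:.1f} s")

sweep([1200, 1000, 800, 700, 650])
```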

Scalability of CPLEX MIQP vs LPD
Besides constraint tightness, the runtime is affected by the size of the task graph, the number of clouds and hosts, the capacities of the hosts, and network latencies. For a simple evaluation of algorithm scalability, runtime was related to problem "size" expressed as the number of tasks in the graph. Figure 9 summarizes runtime vs. problem size for LPD and the CPLEX MIQP solution on the 104 test cases. For technical reasons, CPLEX was run on a faster machine with an Intel Core i9-9980HK running at 2.40GHz (average CPU mark 15,089), roughly 3 times faster than the machine used for LPD. Thus the runtime difference between CPLEX and LPD is three times as great as the figure indicates. CPLEX converges to a solution in less than 300 seconds in just 44 cases. For half of these it has a runtime of less than 10 seconds and is faster than LPD; for the other 22 cases LPD is faster than CPLEX. CPLEX is able to provide a first feasible solution (not optimal) within 60 sec of scaled time (time multiplied by three) for 85 of 104 models (81.7%), but with more power consumption than LPD in 64 (75%) of them.
LPD always runs the complete set of 60 initial deployments, taking as long as needed. If that takes less than 10 seconds, then it runs additional random initial deployments until 10 seconds have elapsed. The runtime breaks out from the minimum of 10 seconds at about 120 tasks. LPD succeeds in all cases, with a largest runtime of 100 seconds, and a runtime greater than 60 seconds in just 6 cases. The time of the first feasible solution was not recorded, but in all 6 cases, multiple feasible solutions were reached by 100 seconds, so at least one feasible solution was likely available at the 60 second mark and could be used for deployment.
Overall, Fig. 9 shows that LPD scales well with problem size as compared to the MIQP.

Conclusions
The deployments found for the 104 test cases with four clouds and up to 240 deployable units show that LPD always produced a feasible solution within 100 s, with power consumption close to the MIQP optimum where one was found and within 20% of the lower bound P_bound in most cases (Fig. 9).
It was also found that (1) coarsening/uncoarsening of the application graph considerably improves the quality of the LPD solutions and reduces the time to obtain them; (2) random initial deployments provide almost all the final solutions; (3) the execution-time bound of Section 2.4 is essential to obtaining small solution times. Since the bound overestimates the response time, it may eliminate some feasible deployments.
LPD finds a deployment for the maximum expected level of the workload. When the load is lighter, simple autoscaling (which adapts the number of replicas of a task for varying loads) will always be able to meet the SLA requirement, although power optimality is not addressed.
Overall, LPD finds a balance between problem detail, solution accuracy, and run-time.

Threats, issues and extensions
Some aspects of LPD and its evaluation may be, or appear to be, limitations to its usefulness.
An evaluation on an industrial case study might be preferred over the set of randomly created examples used here. However, actual service systems take on a huge variety of sizes, structures and patterns of interaction. The difficulty of obtaining data on real cases would limit their number severely, while a large number of randomly created models spans the range of possibilities better and gives more confidence that LPD can handle the variety of systems it may encounter in practice. An even larger number of cases would strengthen this confidence further.
An LQN Application Model is assumed here and could be a burden to create. However it is not actually required, since the input to LPD is the Application Graph, which could be constructed from data obtained by monitoring (that is, from the mean total execution time for each deployable unit, and the mean number of interactions between pairs of units, per User response).
The Users are modeled as a single group. However, different kinds of users can be represented by having this group issue a variety of request types to the system.
The response time constraint applies to predicted mean delays, while SLAs often constrain percentiles of delay. To accommodate a target percentile, the measured shape of the response distribution can provide the ratio of the percentile value to the mean, and the requirement can be translated into a requirement on the mean.
The Power Usage Effectiveness (PUE) of the clouds could easily be included in the power calculation, to give total power rather than just the power used by the application.
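The percentile-to-mean translation can be sketched as follows, using an empirical response-time sample; the nearest-rank percentile estimate and all values are illustrative assumptions:

```python
def mean_requirement_from_percentile(samples, percentile, target):
    """Translate a percentile SLA (e.g. 95th percentile <= target) into a
    bound on the mean, using the measured shape of the distribution."""
    xs = sorted(samples)
    # Nearest-rank estimate of the percentile from the measured sample.
    idx = min(len(xs) - 1, int(percentile / 100.0 * len(xs)))
    ratio = xs[idx] / (sum(xs) / len(xs))  # percentile value / mean
    return target / ratio                  # resulting bound on the mean, R_req

# Example with hypothetical measured response times (ms): a 95th-percentile
# SLA of 1000 ms becomes a bound on the mean via the measured ratio.
measured = [100, 120, 150, 180, 200, 220, 260, 300, 350, 600]
r_req = mean_requirement_from_percentile(measured, 95, 1000.0)
```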
Cloud internal network delays are ignored, and may be significant. They can be included in two ways: either as a fixed mean allowance for every interaction, which introduces a constant additional delay for the application, or by modeling the internal cloud network and performing the Partition algorithm on it for the final deployment, with bin-packing only on individual hosts. This would give one additional partitioning per cloud.
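The first option, a fixed mean allowance for every interaction, amounts to a constant additive term on the response time. A minimal sketch, assuming one fixed mean delay per intra-cloud interaction (names and values are illustrative):

```python
def rt_with_internal_delay(rt_base_ms, interactions_per_response, mean_delay_ms):
    """Add a fixed mean allowance for every intra-cloud interaction
    to the base response time (all values in milliseconds)."""
    return rt_base_ms + interactions_per_response * mean_delay_ms

# Example: 40 interactions per user response at 0.5 ms each add 20 ms.
rt = rt_with_internal_delay(600.0, 40, 0.5)  # 620.0 ms
```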
Storage operations are not mentioned; however, they can be modeled as services and deployed by LPD.
The choice of 1 minute as a maximum solution time is arbitrary and could be larger (say, 10 minutes) without changing the conclusion that MIQP and meta-heuristics do not scale. One minute is an attractive target time for practical use, given that scaling-out operations may take only a few minutes.

Extensions
Extensions that could improve LPD are (1) improved heuristics for the initial deployments, and (2) concurrent execution of Stage 5 of LPD. Also the LPD Partition algorithm could be exploited for feasibility repair in a Genetic Algorithm. The Partition algorithm determines a sequence of moves that respect capacity constraints and decrease response time and power, with initial priority to response time. It would be terminated at the first solution that is feasible for response time.
Limitations of the approach that would require substantial extensions are: (1) the Users all have the same network delays to access any cloud, and the same response requirement; (2) execution of each operation is sequential within a single response (parallelism is ignored); (3) limited concurrency within a task is not modeled; (4) host and network failures are ignored.