The proposed methodology was designed to take two functionally identical versions of an application, one architected as a monolith and one as containerized microservices, and compare their function and performance across a wide range of dimensions. The results were then analyzed to show how the underlying host is better utilized when the components of an application are broken apart into services and containerized, so that individual services can be scaled up on the same host to add capacity under higher workloads. In a monolithic architecture, if the application's function is to book travel arrangements, additional capacity for a login service in high demand cannot be created without deploying the entire application to another host.
During the implementation, the containerized microservice version of the application performed better, as expected; however, the bottleneck that emerged was not one that was foreseen when planning the testing. It was hypothesized that, due to the more serialized nature of the monolithic architecture in which user workflows occur, one of the services would eventually fail to keep up with demand under the right load, although which component would become the bottleneck was unknown. At a high level, the workflow on each host is a frontend that the user interfaces with, a backend that processes travel arrangements, and a database where user, destination, booking, and other stateful information is stored. Wherever the bottleneck occurred, the underlying host was expected to become CPU bound first, since smaller-scale testing suggested that once the application reached a certain point it became CPU bound, slowing down transactions and making the application unresponsive. For that reason, the autoscaling groups behind the AWS application load balancers for both architectures were based on average CPU utilization, as discussed in detail later. Due to how the monolithic architecture handled garbage collection in the large-scale testing, however, the application would reach a severely degraded state without ever exceeding the average CPU threshold that was configured for both architectures to keep overall cluster utilization at fifty percent.
Consideration was given to using different autoscaling metrics, such as network bytes in or request count per target. However, these metrics would have scaled the microservice cluster in ways that would not demonstrate the intent or purpose of this paper. Consideration was also given to changing the size of the AWS EC2 instances on which both architectures ran. A test was performed with instances using half as many CPUs as outlined in Sect. 3.1 Proposed Methodology, and this did not change the outcome, since the garbage collection issues would have occurred the same way regardless of the number of cores per host.
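As an illustration of the kind of scaling policy described above, the following sketch (Python with boto3) attaches a target-tracking policy that keeps an Auto Scaling group's average CPU utilization near fifty percent. The group name, policy name, and region are hypothetical and this is not the exact configuration used in the testing; it only shows the shape of such a policy.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")  # region is an assumption

# Target-tracking policy: keep the group's average CPU utilization near 50%,
# the threshold used for both application clusters in this work.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="monolith-app-asg",       # hypothetical group name
    PolicyName="keep-average-cpu-at-50-percent",   # hypothetical policy name
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            # The alternatives mentioned above would instead use
            # "ASGAverageNetworkIn" or "ALBRequestCountPerTarget"
            # (the latter also requires a ResourceLabel).
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```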
In the end, while the scaling elements of the methodology did not go as expected for the monolithic architecture, the results and insights gained from the testing are no less significant to the core thesis that monolithic architecture presents challenges when trying to utilize resources under unpredictable load.
This section evaluates how a containerized microservice architecture utilizes resources better than a monolithic architecture; there are many dimensions to consider. It explores several of the metrics gathered on the performance of both architectures and of the hosts each application runs on, highlighting each architecture's key advantages or disadvantages with respect to those metrics.
Throughput
Throughput represents the amount of work an instance or host can perform, typically expressed as the number of requests that can be fulfilled per instance per host. As the number of requests approaches the instance's maximum throughput, response time will likely increase, queuing will take place, or errors will occur, rendering the application unusable. Throughput is also crucial because if requests start being rejected or queued, timeouts can occur. Timeouts result in retries, which exacerbate the problem by flooding the system with even more requests as users or other systems keep retrying a failed request until it succeeds, effectively resulting in an unintentional distributed denial-of-service attack.
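To make the retry amplification concrete, the sketch below shows a client-side retry loop with a capped number of attempts and exponential backoff with jitter, a common way to keep timed-out requests from flooding an already saturated service. The URL, attempt count, and timing values are hypothetical and were not part of the test setup.

```python
import random
import time

import requests


def fetch_with_backoff(url: str, max_attempts: int = 4, base_delay: float = 0.5) -> requests.Response:
    """Retry a request a bounded number of times, backing off between attempts.

    Without the cap and the growing delay, every timeout would immediately turn
    into another request, amplifying load on a service that is already saturated.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            response = requests.get(url, timeout=5)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter: 0.5 s, 1 s, 2 s, ... plus randomness.
            time.sleep(base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.25))


# Example with a hypothetical endpoint:
# page = fetch_with_backoff("http://example.com/easytravel/")
```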
Figure 8 shows that until 2 am the number of requests being served by both application architectures remained nearly identical. At 2 am, when the fifth load generation node was added, a divergence occurred: the monolithic architecture was no longer able to keep up with the total number of incoming requests. At 5 am, when the load generation nodes were reduced to one per architecture, the throughput of the two architectures converged once again. This also validates that there is no meaningful difference in the number of requests generated by the load generation clusters against their respective application architectures, as mentioned in Sect. 3.1, since the load generation was designed to simulate real users and had some variance in the type of requests performed each time the load generator issued a new request.
The detailed breakdown of load balancer activity in Fig. 9 shows that the monolithic architecture has more active connections but fewer new connections than its microservice counterpart, shown in Fig. 10. This demonstrates that when the monolithic architecture reaches its maximum throughput, it cannot accept new connections because it is still trying to fulfill requests from the already active connections.
The test results in Fig. 10 show the opposite for the microservice architecture. This is a result of fulfilling requests by adding capacity in the form of additional container instances as needed, through rules designed to bring more container instances online for any service nearing its maximum throughput. The number of requests per minute for the microservice architecture is nearly double that of the monolithic architecture.
Response time
Response time represents the amount of time it takes to fulfill a request, from when the request is initiated until it is completed and returned successfully. Many different factors can affect application response time, such as the device that initiates the request, the network latency between client and server, and the capacity of the infrastructure serving the request, among others.
Looking first at the monolithic architecture overall in Fig. 11, response time spikes up substantially once the load reaches about 400 requests per minute, when the fifth load generation instance was activated. Response time got so high that the number of successful requests dropped considerably. The monolithic architecture could not keep pace with the load being generated against it; at peak load generation, requests were taking nearly 50 s.
The microservice architecture handled the load increases much better in comparison. Figure 12 shows a few dips in its total throughput; however, it responded to the increasing load before reaching a point where response time dramatically increased. When response time did increase, the microservice-based architecture quickly added units of capacity in the form of more replicas of the service under the most load. Despite serving nearly 4× the number of requests under peak load, the microservice architecture's average response time never exceeded four seconds, and the overall average was around 300 ms, compared to the monolithic architecture, which spiked to around 50 s.
In addition to the load generation traffic directed at the application architectures, a synthetic monitor was set up to test the application from locations outside the AWS cloud and measure response time as if a real user were accessing the application from different parts of the USA. These tests were set up to run from Oregon, Chicago, Cheyenne, Los Angeles, South Carolina, and Texas. As shown in Fig. 13, this external monitoring validates the results gathered from the monitoring agents installed on the application hosts, showing the same spike in response time around 2 am and a return to the baseline around 5 am for the monolithic architecture. These synthetic monitors are the equivalent of a simulated user visiting the application in a web browser. They record response time and availability along with W3C timings and the load times of each element on a given URL. Synthetic monitors can also simulate user input and be multi-step, but only single-URL monitors were used for this work.
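The synthetic monitors used in this work come from a commercial monitoring product, but the core idea of a single-URL check can be sketched as below. The URL and check interval are assumptions for illustration only, and a real monitor would additionally capture W3C navigation timings and per-element load times in a browser.

```python
import time

import requests


def check_url(url: str, timeout: float = 30.0) -> dict:
    """One synthetic 'visit': record availability and wall-clock response time."""
    start = time.perf_counter()
    try:
        response = requests.get(url, timeout=timeout)
        available = response.status_code < 500
        status = response.status_code
    except requests.RequestException:
        available, status = False, None
    return {
        "url": url,
        "available": available,
        "status": status,
        "response_time_s": time.perf_counter() - start,
    }


# Hypothetical single-URL monitor loop, e.g. one check per minute per location:
# while True:
#     print(check_url("http://example-load-balancer.amazonaws.com/"))
#     time.sleep(60)
```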
The microservice architecture's agent-based monitoring results were also validated with synthetic external requests. A small spike in response time was observed around 2 am as requests increased, but it quickly returned to the normal baseline as more microservice replicas were added, as shown in Fig. 14.
Errors
While many types of errors can occur in an application during use, the testing focused on bad requests (HTTP 4xx errors) and server errors (HTTP 5xx errors). 4xx errors indicate that the server responded to a request but will not process it, usually due to something client-side such as a malformed request or invalid syntax. 5xx errors indicate that the server is unable to process the request at all. 5xx errors are more severe than 4xx errors, since a 4xx error may be retried and succeed, whereas a 5xx error typically means the system is partially or entirely down and unable to process the transaction at all.
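The classification used in the analysis can be illustrated with the small helper below; it is a minimal sketch of the error classes described above, not part of the test harness itself.

```python
def classify_status(code: int) -> str:
    """Map an HTTP status code to the classes discussed above."""
    if 200 <= code < 400:
        return "success"
    if 400 <= code < 500:
        # Client-side problem (malformed request, invalid syntax, ...);
        # the server received the request, and a corrected retry may succeed.
        return "4xx client error"
    if 500 <= code < 600:
        # Server-side failure; the request could not be processed at all.
        return "5xx server error"
    return "other"


# e.g. classify_status(404) -> "4xx client error", classify_status(503) -> "5xx server error"
```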
In Fig. 15, the number of 4xx errors is split by load balancer. While there are periods where some 4xx errors occur for the microservice architecture, these are short, isolated periods. The monolithic architecture sees a steady climb in 4xx errors from 2 am until 5 am, when demand on the application was highest. While errors are never desirable, the fact that the server was still sending responses acknowledging the requests means the application was not down, which is consistent with the other perspectives that have yet to be analyzed.
Unlike 4xx errors, which can occur for organic reasons and, in low quantities, may not need to be investigated, any number of 5xx errors should not be ignored. Figure 16 shows small quantities of these errors for the microservice architecture over short periods. In contrast, the monolithic architecture sees these spikes grow higher and higher as the load on the application increases; the server cannot send any response back to the simulated users driven by the load generators.
CPU Utilization
CPU is an important dimension when factoring in the allocation of resources, and it is a complex resource that can be allocated and used in several different ways within a datacenter. For instance, it is critical to know the requirements of the software to be run in terms of clock speed, number of cores, memory cache sizes, and even the register size of the processor. Other important factors include whether an application will run on a dedicated machine or a virtual machine, where virtual resources can be changed or modified later as required.
Per the proposed methodology, CPU utilization percentage was the determining factor for scaling either application cluster up or down in its total number of cluster nodes.
Figure 17 shows the average CPU utilization broken apart per node instance in the monolithic cluster.
In Fig. 18, the bars represent the number of running monolithic cluster frontend instances at a given time, while the line represents the average CPU utilization of the monolithic EasyTravel frontend, which is responsible for most of the application-related CPU utilization on each cluster node. One thing that becomes immediately obvious in either Fig. 17 or Fig. 18 is that the average utilization never crosses the threshold set to scale the cluster (50%), even though, as shown in the previous sections, the application's throughput suffered after a certain amount of load was generated. In Sect. 3.1, it was expected that the monolithic cluster would not scale as efficiently as the microservice-based cluster because each monolithic node would only have one running instance of each of the application's components. This bottleneck was expected to stretch the node instances' resources and trigger the node management mechanism, which was set to keep the overall utilization of either cluster at fifty percent by adding or removing cluster nodes. As the monolithic cluster nodes' average utilization approached fifty percent, the application's overall throughput dropped, and the cluster could no longer handle the number of requests it received. There are two instances, around 2 am and 4:15 am, where the CPU utilization criteria did add cluster nodes: one additional node at 2 am and two additional nodes at 4:15 am. Slight increases in the application's throughput can be seen at both points, but the new nodes immediately become just as overwhelmed as the nodes already running. As explored in greater detail in Sect. 4.2.6, the nodes did not spike higher in CPU utilization, and thus trigger additional cluster nodes to be added, because of the garbage collection of objects.
Figure 19 shows the average CPU utilization broken apart per node instance in the microservice cluster.
In Fig. 20, the bars represent the number of running microservice cluster frontend instances during the timeframe, while the line represents the average CPU utilization of the microservice version of the EasyTravel application. A few things immediately stand out compared to the corresponding monolithic architecture representations (Figs. 17 and 18). First, the average CPU utilization per frontend instance is significantly lower than its monolithic counterpart. Second, the total number of running frontend instances is far greater than in the monolithic counterpart. This difference in the number of instances is intuitive and was expected. The microservice architecture splits the application into three separate containers: the frontend, backend, and database. These components run as their own containerized services, and each can be scaled to a larger number of instances on each microservice cluster node, per the values set up in the methodology, to produce additional supply when demand increases. This is in stark contrast to the monolithic architecture, which can only scale by adding additional cluster nodes. The overall CPU utilization of the frontend instances for the microservice architecture remains relatively low, while the number of instances approaches 60 at times. By contrast, the maximum number of frontend instances for the monolithic architecture is six, since the number of running instances is constrained by the total number of running monolithic cluster nodes.
Another stark contrast between the microservice-based architecture and the monolithic architecture is that the overall cluster node CPU utilization keeps rising as the load and demand on the application increase. As hypothesized, this causes the number of cluster nodes to scale up as expected in the methodology. The issues and limitations of the monolithic architecture are overcome in the microservice architecture by adding additional frontend capacity on a cluster node before the running containers can no longer handle the requests.
Availability/Operation
Perhaps one of the most important metrics to consider when testing or analyzing applications is availability. Application availability is the measure of whether an application is available, operational, and able to fulfill its purpose. It is one thing for an application to load, but if a user can browse a site like Amazon.com yet cannot check out, that is a significant issue. In Sect. 2.2, the analysis looked at internal server metrics such as the application's throughput and the architectures' overall response time. Here, the analysis looks at the synthetic tests set up outside the AWS availability zone where all the testing occurred. Since the load balancer for each architecture was publicly reachable over the internet, a set of synthetic monitors was set up to track each architecture's homepage availability, whether it could be loaded, how long it took to load, and other metrics about its overall performance from six different datacenter locations in the United States, as shown in Fig. 21. These tests were set up to run from Oregon, Chicago, Cheyenne, Los Angeles, South Carolina, and Texas.
In Fig. 21, availability is shown as a percentage for both application architectures. As in the previous sections, once the monolithic architecture starts to have problems, the site's overall availability is again seen to drop. The microservice instance remains fully available throughout the duration of the test.
Figure 22 shows availability broken apart by location/datacenter for the microservice cluster. Also visualized are general statistics such as downtime, visually complete (a point-in-time metric that measures when the visible area of a page has finished loading), and total load duration, to name a few.
The availability graphs broken apart by location/datacenter for the monolithic architecture in Fig. 23 are much more interesting to dissect and interpret. The overall picture differs in statistics such as downtime, visually complete, and total load duration. When looking at the downtime for individual locations, represented as the red sections of the bar graphs, the downtime is not concentrated at particular times; otherwise, the outage sections for all locations would line up with one another. These non-overlapping sections show that the application could still be reached from at least one of the locations, meaning that despite the monolithic architecture being less responsive and less available overall, it was still processing transactions for the entire duration of the test. Whether a given transaction succeeded or failed, however, was a function of the system's resources.
Garbage collection/suspension time
In the interest of brevity, only a cursory overview of garbage collection is given here, along with how it affected the application testing performed and its negative impact on application and virtual machine performance. To understand the topics covered in this section more thoroughly and their importance to the results of this work, the reader is encouraged to consult papers and articles dedicated to garbage collection. It is also assumed that the reader has fundamental knowledge of modern computer architecture, including CPU and memory, and of how objects are created, modified, and destroyed by the host OS, guest OS, containers, and the applications running on any of those systems. In essence, garbage collection refers to how objects in memory are managed once they are no longer in use by any program, that is, how those objects are destroyed to free up memory for reuse. There are many ways garbage collection can be handled, and there is no one-size-fits-all solution: what is optimal for one application may be inefficient for another, and a poor garbage collection strategy can cause considerable performance problems. Regardless of the strategy used, there will be periods during which application processing on the CPU must be suspended while garbage collection occurs. This stop is referred to simply as suspension. The duration of garbage collection is referred to as garbage collection time and is expressed as a value of time, whereas suspension is expressed as a percentage. As an analogy, consider driving down a straight multilane road with traffic lights along it. The number of times traffic had to stop at a light, rather than passing straight through, is the suspension percentage, and the amount of time spent stopped at the lights is the garbage collection time. All traffic must follow these rules regardless of the number of lanes, which in this analogy would be the cores or threads of a CPU. The more times you stop, or the longer you are stopped, the more time it takes to get where you are going, or, in a computer's case, the worse the application's performance.
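The application under test is Java-based, but the distinction between garbage collection time and suspension can be illustrated with the sketch below, which uses CPython's garbage collection callback hooks to accumulate collection time and express it as a percentage of elapsed wall-clock time. It is an illustration of the two metrics only, not how the figures in this section were gathered.

```python
import gc
import time

gc_time_total = 0.0       # accumulated "garbage collection time"
_collection_start = None  # timestamp of the collection currently in progress


def _gc_callback(phase: str, info: dict) -> None:
    """Accumulate the wall-clock time spent inside each collection."""
    global gc_time_total, _collection_start
    if phase == "start":
        _collection_start = time.perf_counter()
    elif phase == "stop" and _collection_start is not None:
        gc_time_total += time.perf_counter() - _collection_start
        _collection_start = None


gc.callbacks.append(_gc_callback)

run_start = time.perf_counter()
# ... application work that allocates and discards many objects would run here ...
elapsed = time.perf_counter() - run_start

# "Suspension" expressed as the share of elapsed time spent collecting garbage.
suspension_percent = 100.0 * gc_time_total / elapsed if elapsed > 0 else 0.0
print(f"GC time: {gc_time_total:.3f}s, suspension: {suspension_percent:.1f}%")
```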
Figure 24 shows the garbage collection time for the microservice-based application. The garbage collection time per interval is relatively low due to having a much higher number of instances (177 with the microservice architecture versus 6 with the monolithic architecture) to spread the work around. At peak times, during the highest loads, it was only a few seconds.
Similarly, in Fig. 25, the percentage of time spent in garbage collection (suspension) is less than one percent and, at peak load, never went higher than four percent, since the microservice architecture had many more instances to spread the work around than the monolithic architecture: 177 versus 6, respectively.
Figure 26 shows how the time services spend executing on a CPU is distributed over time. Due to limitations in how the data can be exported, little context is given. The critical point in this graph is that the areas taking up the most CPU time in the microservice architecture were background services and the services required to run the application, such as the frontend component; garbage collection accounted for very little of the overall time compared to everything else.
Figure 27, in stark contrast to the microservice suspension time, shows that the application is stopped half of the time so that garbage collection can take place. Recall that all cores are suspended while garbage collection occurs. Because the cores are suspended, this time does not directly contribute to the overall CPU utilization numbers, and the test methodology relies on CPU utilization to increase the number of monolithic cluster nodes. This suspension is the primary reason the monolithic cluster did not scale up its node count as the load increased in the predicted way.
In Fig. 28, garbage collection time in the monolithic instance is predictably much higher than what was seen in the microservice architecture, measured in minutes rather than seconds. This can be attributed to the monolithic architecture having far fewer instances to spread the work around.
Turning to Fig. 29 for the monolithic architecture, the amount of time services spend executing on a CPU over time is expressed in minutes rather than the seconds seen in the microservice architecture. The reason for including this visualization is that the tool used to capture this data can hide time spent in background tasks and garbage collection. While hiding these makes no significant difference for the microservice architecture, there is a significant change when the same CPU utilization graph is displayed with the garbage collection time removed.
In Fig. 30, with the background/garbage collection CPU usage hidden, the CPU time of the monolithic architecture's application components is now measured in seconds and is similar to the same visualization for the microservice architecture (Fig. 26). This difference is likely due to how the application must create and then destroy objects in memory before handling more transactions. The microservice architecture most likely does not have this same bottleneck because of its ability to create more instances to handle more transactions: even if it uses the same garbage collection strategy, only the container running the single service is suspended during garbage collection rather than the entire application.
Application scaling
As mentioned earlier in this section, monolithic scaling did not operate as intended per the methodology. It was theorized that the monolithic architecture would not handle as many transactions per cluster node because it cannot increase the capacity of specific services on the same host, or on hosts within the same cluster, the way the microservice-based architecture can. The expected outcome was therefore that the monolithic cluster would end up with a higher node count for the same or a smaller number of transactions than its microservice counterpart. With the garbage collection issues the monolithic architecture experienced, which kept CPU utilization just below the autoscaling threshold, the implementation ended up showing something different than was predicted.
In Fig. 31, the cluster node count of all running nodes for each cluster can be observed. The microservice and monolithic load generation clusters scale identically, so the light blue line paints over the orange node count for the microservice load generators. Until around 2 am, the lines for the microservice application cluster (yellow) and the monolithic application cluster (purple) also paint over one another, as they are the same. Just after 2 am, the monolithic cluster increases its node count from three to four, yet about an hour later the average CPU of the cluster falls below the threshold and it drops back down to three for a while, before adding and removing a couple of nodes around the test's peak load times. The microservice application cluster behaves precisely as predicted, with both the number of cluster nodes and the number of microservice instances rising as needed as the load generation increases over time.
In Fig. 32, the Pod scaling within the microservice application cluster reveals some fascinating things about the scaling behavior and perhaps about the application itself. Initially, the Kubernetes horizontal Pod autoscaler in EKS was set to have one instance of the database running and two instances each of the EasyTravel frontend and backend components. Like the cluster itself, criteria around CPU utilization of the Pods were put in place to scale the number of Pods so as to keep the average CPU of all Pods of any given type at 80%. The CPU target was set much higher here than for the cluster nodes because of the speed at which more Pods can be created: since there is no virtual or physical machine to boot, with its load dependencies and ramp-up/warm-up time, as long as there is capacity within the cluster for a Pod to be created, creation is nearly instantaneous and work can begin immediately. The least noteworthy behavior is that of the backend service, which never increased until the load started to reach its peak, and even then the increase was not as drastic as for the other two services. The database service scaled very linearly: as the load increased, so did the database Pod instance count. Finally, and by far the most interesting, is the frontend service scaling. The frontend Pods, for the most part, stay about the same, but they spike up very high on several occasions and stay that way for around twenty minutes before scaling back down. This spike in frontend Pod count could not be correlated with any other metric. It does not follow the load generation, which is much more linear and with which the EasyTravel database Pod instance count has a much stronger correlation. There are also no spikes in response time, requests, errors, or other metrics that suggest a need for more running instances of that type of Pod. Out of all the data gathered, the biggest question left unanswered is what produced that behavior.
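A minimal sketch of the kind of Horizontal Pod Autoscaler configuration described above, written with the official Kubernetes Python client, is given below. The deployment names, namespace, and maximum replica counts are assumptions, while the minimum replicas (one database, two frontend, two backend) and the 80% average CPU target follow the values stated in the text; this is illustrative rather than the exact configuration used in the testing.

```python
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig pointing at the EKS cluster
autoscaling_api = client.AutoscalingV1Api()


def make_hpa(deployment: str, min_replicas: int, max_replicas: int) -> client.V1HorizontalPodAutoscaler:
    """Build an autoscaling/v1 HPA targeting 80% average CPU across a deployment's Pods."""
    return client.V1HorizontalPodAutoscaler(
        metadata=client.V1ObjectMeta(name=f"{deployment}-hpa"),
        spec=client.V1HorizontalPodAutoscalerSpec(
            scale_target_ref=client.V1CrossVersionObjectReference(
                api_version="apps/v1", kind="Deployment", name=deployment
            ),
            min_replicas=min_replicas,
            max_replicas=max_replicas,              # upper bounds are assumptions
            target_cpu_utilization_percentage=80,   # average CPU target from the methodology
        ),
    )


# Hypothetical deployment names and namespace for the three EasyTravel services.
for name, minimum, maximum in [("frontend", 2, 60), ("backend", 2, 60), ("database", 1, 20)]:
    autoscaling_api.create_namespaced_horizontal_pod_autoscaler(
        namespace="easytravel", body=make_hpa(name, minimum, maximum)
    )
```

Because Pod creation carries none of the boot and warm-up cost of a new cluster node, a higher CPU target can be tolerated at the Pod level: by the time utilization nears 80%, new replicas can be scheduled almost immediately, which is consistent with the behavior observed for the microservice cluster.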