The random placement pessimistically assumes that job schedulers are agnostic to resource disaggregation.

Our second analysis focuses on the behavior of jobs, quantifying the probability that a job has to span more racks to find resources than the minimum number of racks it could occupy based on its requested number of nodes. This also captures the correlation between the different resource types assigned to the same job. In particular, we sample 210 randomly chosen timestamps from our dataset. For each timestamp, we record which jobs are executing, their resource utilization, and their size in nodes. Because KNL jobs lack memory bandwidth measurements, we focus only on Haswell jobs. For each job's resource utilization metric, such as memory occupancy, we measure the maximum utilization among all nodes reserved for the job throughout the job's execution. We use a job's maximum node utilization to account for the worst-case scenario. Then, we execute 16 random experiments of the following form for each randomly chosen timestamp and report the maximum probability across experiments. In each random experiment, we look at all jobs running at the chosen timestamp and assign them to nodes. Though job placement that prioritizes resource disaggregation would likely yield better results, we leave this as future work, since such job placement policies are still emerging. Therefore, in each experiment, we allocate racks of Haswell nodes and assign them to jobs in a random fashion. For each job, the number of nodes is the same as when it was running on Cori. For each randomly chosen rack, we reserve resources to cover the job's utilization at the randomly chosen timestamp. If the job still has resource requirements remaining, we continue by allocating another random rack. At the end of each experiment, we record the percentage of jobs that had to allocate resources from more racks than the minimum number of racks they could be placed in based on their size in nodes.
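The following is a minimal Python sketch of one such experiment, under simplifying assumptions: per-node demand for a single resource type is expressed in node-equivalents, and `NODES_PER_RACK`, the job representation, and the greedy reservation loop are illustrative choices rather than the study's exact procedure.

```python
import random
from math import ceil

NODES_PER_RACK = 32  # illustrative rack size, not the study's exact value

def run_experiment(jobs, num_racks, reduction):
    """One random-placement experiment at a single timestamp.

    jobs: list of (num_nodes, demand) pairs, where demand is the job's
    worst-case utilization of one resource type, in node-equivalents.
    Returns the fraction of jobs that spanned more racks than the
    minimum implied by their node count.
    """
    # Each rack offers the capacity of its nodes, scaled down by the
    # fractional resource reduction under study (e.g., 0.5 for 50%).
    rack_free = [NODES_PER_RACK * (1.0 - reduction)] * num_racks
    violations = 0
    for num_nodes, demand in jobs:
        min_racks = ceil(num_nodes / NODES_PER_RACK)
        racks_used, remaining = 0, demand
        for r in random.sample(range(num_racks), num_racks):
            if remaining <= 0:
                break
            take = min(remaining, rack_free[r])
            if take > 0:
                rack_free[r] -= take
                remaining -= take
                racks_used += 1
        if racks_used > min_racks:
            violations += 1
    return violations / len(jobs)

def worst_case_probability(jobs, num_racks, reduction, trials=16):
    """Maximum violation fraction across the 16 random experiments."""
    return max(run_experiment(jobs, num_racks, reduction)
               for _ in range(trials))
```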

These are the jobs that had to span more racks because of a lack of resources. This analysis is performed for each resource type separately. Results are shown in Figure 14. Without reducing resources, there is still some worst-case probability that a job spans more racks than the minimum to allocate memory capacity, because of unfavorable random job placement. However, the average across our random experiments remains near zero. With a 20% reduction, the same worst-case probability becomes 11%; with a 50% reduction, 22.3%; and with an 80% reduction, 56%. For NIC and memory bandwidth, for reductions of up to 85% the probability is near zero. For a reduction of 95%, the probability is 12.2% for NIC bandwidth and 15.5% for memory bandwidth. While these results are sensitive to our assumptions and the random choices of our algorithm, they indicate that intra-rack disaggregation suffices the majority of the time, except when reducing resources aggressively.

To illustrate potential benefits and thus further motivate intra-rack resource disaggregation, we use per-node statistics to calculate the average resource utilization of each rack. We do this for the node-to-rack mapping in Cori but also for a large set of randomized mappings of nodes to racks, because Cori's scheduler is agnostic to resource disaggregation. This increases the computational complexity of this analysis, so we only perform it across four days of our dataset. For each mapping, we use node statistics to calculate per-rack utilization for each resource, averaged across the four-day sampling period. We then take the maximum utilization for each resource among all the node-to-rack mappings and all racks within each mapping, to capture the worst-case rack utilization for each metric. We use that maximum to derive how much we could reduce in-rack resources and still satisfy the worst-case average rack utilization in our chosen four-day period. This does not indicate that no job will have to cross a rack boundary to find resources, or that application performance will be unaffected. Instead, this analysis focuses on resource utilization statistics. Memory bandwidth reduction is based on the per-node theoretical maximum of 136 GB/s from the eight memory modules.
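This rack-utilization analysis can be summarized in a short sketch. It assumes `node_util` holds each node's utilization of one resource, averaged over the sampling window, as a fraction between 0 and 1; the number of randomized mappings (100 here) is our choice for illustration, not the paper's.

```python
import random
import statistics

def worst_rack_utilization(node_util, nodes_per_rack, mappings=100):
    """Highest per-rack average utilization of one resource across
    many randomized node-to-rack mappings.

    node_util: dict mapping a node id to its utilization (0..1),
    averaged over the sampling period.
    """
    nodes = list(node_util)
    worst = 0.0
    for _ in range(mappings):
        random.shuffle(nodes)  # one randomized node-to-rack mapping
        for i in range(0, len(nodes), nodes_per_rack):
            rack = nodes[i:i + nodes_per_rack]
            worst = max(worst, statistics.mean(node_util[n] for n in rack))
    return worst

def feasible_reduction(node_util, nodes_per_rack):
    """The complement of the worst-case rack utilization bounds how far
    the resource could be reduced while still covering the worst-case
    average rack demand."""
    return 1.0 - worst_rack_utilization(node_util, nodes_per_rack)
```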

The percentage by which we can reduce each resource and still satisfy the worst-case average rack utilization is shown in Table 2. As shown, except for memory capacity in KNL racks, the other resources can be reduced substantially. Resource disaggregation-aware job scheduling may further improve these findings.

Based on Section 4, there are also opportunities for rack-level disaggregation in ML workloads and GPU-accelerated systems. We observe a strong variability of resource requirements among different neural networks, and therefore application domains, but also between inference and training. Inference usually requires higher CPU-to-GPU ratios than training and uses less GPU memory. In contrast, training leads to high GPU utilization in terms of computation and memory but lower CPU utilization. However, this also varies with the workload. Job schedulers can exploit a disaggregated system by allocating more CPUs and less GPU memory to an inference job and giving the remaining resources to a training job, which requires substantial GPU memory and generally less CPU time. While we observe that GPU utilization is generally high, CPU resources and network bandwidth are underutilized most of the time. Although we cannot determine peak bandwidth demands with our methodology, we see that the average bandwidth utilization over longer periods is low. As we strong-scale training to multiple nodes, the bandwidth demands increase but other resources, such as GPU memory, become less utilized. A disaggregated system allows us to provision unused GPU memory for other jobs, and since bandwidth demand depends on scale, we can give more bandwidth to large-scale jobs and less to small-scale jobs. The sketch below illustrates this kind of complementary packing.
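The following sketch shows how a scheduler might pack complementary jobs into one rack's disaggregated pools. The `Demand` and `RackPool` types and all numbers are hypothetical, chosen only to reflect the observation that inference favors CPUs while training favors GPU memory.

```python
from dataclasses import dataclass

@dataclass
class Demand:
    cpus: int
    gpus: int
    gpu_mem_gb: int

@dataclass
class RackPool:
    cpus: int
    gpus: int
    gpu_mem_gb: int

    def fits(self, *jobs):
        """True if the combined demand of the jobs fits within the
        rack's disaggregated resource pools."""
        return (sum(j.cpus for j in jobs) <= self.cpus and
                sum(j.gpus for j in jobs) <= self.gpus and
                sum(j.gpu_mem_gb for j in jobs) <= self.gpu_mem_gb)

# Hypothetical demands: inference wants a high CPU-to-GPU ratio and
# little GPU memory; training wants the opposite.
inference = Demand(cpus=48, gpus=2, gpu_mem_gb=8)
training  = Demand(cpus=16, gpus=6, gpu_mem_gb=72)
rack      = RackPool(cpus=64, gpus=8, gpu_mem_gb=80)

assert rack.fits(inference, training)  # the pair packs into one rack
```

Neither job alone would use the rack's pools fully, but together they saturate CPUs, GPUs, and GPU memory at once.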

Sampling a production system has significant value, because it demonstrates what users actually execute and how resources are used in practice. At the same time, sampling a production system inevitably has practical limitations. For instance, it requires privileged access and sampling infrastructure in place. In addition, even though Cori is a top-20 system that executes a wide array of open-science HPC applications, observations are affected by the set of applications and the hardware configuration of the system. Therefore, HPC systems substantially different from Cori can use our analysis as a framework to repeat a similar study. In addition, sampling typically does not capture application executable files or input data sets. Therefore, reconstructing our analysis in a system simulator is impossible, and it would also be impractical due to the vast slowdown of simulators for large-scale simulations compared to a real system. Similarly, sampling a production system provides no way to detect when application demands exceed available resources, or to measure the resulting slowdown. For this reason, and because of our 1s sampling period, our study focuses on sustained behavior and cannot make claims about the impact of resource disaggregation on application performance.

While it is hard to speculate how important HPC applications will evolve over the next decade, we have witnessed little change in fundamental HPC algorithms during the previous decade. It is those fundamentals that currently cause the imbalance that motivates resource disaggregation. Another consideration is application resource demands relative to available resources in future systems. For instance, if future applications require significantly more memory than the memory available per CPU today, then this may motivate full-system disaggregation of memory, especially if there is significant variability across applications. Similarly, if a subset of future applications requests non-volatile memory, then this may also motivate full-system disaggregation of NVM, similar to how file system storage is disaggregated today. However, future systems may have larger racks or nodes with more resources, strengthening the case for intra-rack resource disaggregation. When it comes to specialized fixed-function accelerators, a key question is how much data transfer they require and how many applications can use them. This can help determine which fixed-function accelerators should be disaggregated within racks, hierarchically, or across the system. Different resources can be disaggregated at different ranges. Ultimately, the choice should be made for each resource type for a given mix of applications, following an analysis similar to our study.

Future work should explore the performance and cost trade-off when allocating resources to applications whose utilization is dynamic. For instance, providing enough memory bandwidth to satisfy only the application's average demand is more likely to degrade the application's performance, but increases average resource utilization, as the sketch below illustrates. Future work should also consider the impact of resource disaggregation on application performance, taking into account the underlying hardware that implements resource disaggregation and the software stack. Job scheduling for heterogeneous HPC systems should be aware of job resource usage and disaggregation hardware limitations. For instance, scheduling all nodes of an application in the same rack benefits locality but also increases the probability that all nodes will stress the same resource, thus hurting resource disaggregation.
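As a toy illustration of the average-versus-peak provisioning trade-off, the following sketch uses an invented bandwidth demand trace; the `demand` samples and both provisioning policies are hypothetical.

```python
import statistics

# Hypothetical per-sample memory bandwidth demand of one job, in GB/s.
demand = [20, 25, 22, 90, 24, 21, 88, 23]

for label, provisioned in (("average", statistics.mean(demand)),
                           ("peak", max(demand))):
    # Fraction of samples in which the job is throttled by the allocation.
    throttled = sum(d > provisioned for d in demand) / len(demand)
    # Utilization of the provisioned bandwidth, averaged over samples.
    utilization = statistics.mean(min(d, provisioned) / provisioned
                                  for d in demand)
    print(f"provision to {label} ({provisioned:5.1f} GB/s): "
          f"{utilization:.0%} utilization, {throttled:.0%} throttled")
```

With this trace, provisioning to the average yields roughly 68% utilization but throttles a quarter of the samples, while provisioning to the peak throttles nothing but leaves utilization near 43%.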

By the end of the 20th century, the most pervasive world-changing technology was the internet, because of how it revolutionized the daily productivity of modern society. Mark Schueler, a Ph.D. student at Southampton University, illustrates the explosive growth of the Internet from its inception to more recent years in Figure 1, showing just how rapidly new technology can pique the public's interest when it positively influences the majority. The World Economic Forum estimates that about 2.5 billion people are connected to the internet today, a third of the world's population, and projects that 4 billion users will be connected by 2020, more than half the global population. With so much of the world's population currently having little or no internet connectivity, this poses the question: can the infrastructure that society counts on to carry all this digital traffic keep up with the accelerating demand? There is a growing need to produce the most computing power per square foot at the lowest possible cost of energy and resources.

More recently, the electricity used by data centers has garnered the most intense interest, partly because of the importance of these facilities to the broader economy and because the power used by individual data centers rivals that of some large industrial facilities. A report to Congress from the United States Environmental Protection Agency on data center electricity use leading up to 2006 shows that electricity consumption nearly tripled from 2000 to 2006. Had people chosen to disregard improving energy efficiency and technology within U.S. data centers, domestic energy consumption would have nearly doubled by 2011. Fortunately, the data center industry has remained aware of the vast energy required to operate its infrastructure and has experimented with improving data center operation, as recent data suggest. In 2008, data center electricity demand was approximately 69 billion kWh, or 1.8% of total 2008 U.S. electricity sales. It comes as no surprise, given how connected everyone is to the internet, that data centers nowadays are likely to be increasingly large, powerful, energy-intensive, always running, and out of sight.

Because of the sheer size and number of servers involved, data centers are loaded with energy inefficiencies. For a traditional data center connected to the electric grid, less than 35% of the energy from the fuel source that is supplied to the power plant is delivered to the data center. The most significant inefficiencies result from power plant generation losses and transmission and distribution losses. Figure 3 outlines the process of power loss through the transmission and distribution network of the electric grid, starting from the fuel source and ending with the power supplied to the consumer. The largest inefficiency stems from generation at the power plant, with additional losses from transmission and distribution to the data center, so the data center receives roughly 30% of the total energy that could ideally have been supplied from the fuel source. Considering the operations within a data center, there are further losses associated with the infrastructure required for reliable daily operation.
The additional power consumed by cooling, lighting, and energy storage means that less than approximately 17.5% of the energy supplied to the power plant is ultimately delivered to the servers.

Recognizing the significant energy losses at the power plant level, Microsoft made a commitment in May 2012 to make its operations carbon neutral: to achieve net-zero emissions for its data centers, software development labs, offices, and employee business air travel.
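As a back-of-the-envelope check of the loss chain quoted above, the sketch below assumes stage efficiencies of roughly 38% for generation, 93% for transmission and distribution, and a facility PUE of 2; these specific factors are illustrative assumptions chosen to be consistent with the ~35% and ~17.5% figures, not values from the cited reports.

```python
# Back-of-the-envelope energy loss chain; all stage efficiencies are
# illustrative assumptions consistent with the figures cited above.
fuel_energy = 1.00                      # energy at the fuel source (normalized)
after_generation = fuel_energy * 0.38   # ~62% lost as heat at the power plant
delivered = after_generation * 0.93     # transmission and distribution losses
to_servers = delivered / 2.0            # PUE of 2: cooling, lighting, storage

print(f"delivered to the data center: {delivered:.1%}")   # ~35%
print(f"ultimately reaching servers:  {to_servers:.1%}")  # ~17.7%
```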