The James Webb Space Telescope has provided impressive views of the cosmos since its first images were revealed in July, but it also provides data for other science projects, including cosmology research such as that at Durham University in the North East of England.
Durham is part of the Distributed Research Infrastructure Using Advanced Computing (DiRAC), established to provide supercomputing facilities for theoretical modeling and HPC-based research in particle physics, astronomy, and cosmology at sites across multiple UK universities.
The Durham part of the system, COSMA (COSmology MAchine), was built with lots of memory per compute core, making it ideal for memory-intensive workloads such as large cosmological simulations. The latest system, COSMA8, includes 360 compute nodes, each with two 64-core AMD Epyc processors and one terabyte (1TB) of memory.
This is used for large-scale simulations of the universe, where models can be modified according to various theories of how the universe evolved into the state we see today, and predictions from the models compared with real data from the James Webb Space Telescope to measure how well they represent reality.
“We start with the Big Bang, then we propagate these simulations through time and evolve the universe over billions of years and see how it changes,” COSMA HPC Service Manager Dr. Alastair Basden told us.
“So for things that we don’t really understand yet, like dark matter and dark energy, we’re able to adjust the input parameters, like whether gravity behaves differently over long distances and things like that. And we’re able to change all of those settings and then try to match them to what we see in the James Webb images once they’ve been calibrated and scaled.”
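The idea of evolving a model universe under adjustable parameters can be sketched in miniature. The toy below is an assumption-laden illustration, not anything run on COSMA: it integrates a simplified Friedmann-style expansion law for a flat universe, with the matter and dark-energy densities as the tunable inputs, so changing them changes how the model universe grows over time.

```python
import math

def evolve_scale_factor(omega_m, omega_lambda, h0=1.0, dt=1e-3, t_end=1.0):
    """Toy Friedmann integration: evolve the cosmic scale factor a(t)
    forward from just after the Big Bang, for a flat universe with the
    given matter (omega_m) and dark-energy (omega_lambda) densities.
    Simple Euler stepping; an illustration only, not a production code."""
    a, t = 1e-3, 0.0
    while t < t_end:
        hubble = h0 * math.sqrt(omega_m / a**3 + omega_lambda)
        a += a * hubble * dt        # da/dt = a * H(a)
        t += dt
    return a

# Adjusting the dark-energy input parameter changes how fast the toy
# universe expands at late times; real simulation campaigns compare
# such runs against observations, including calibrated JWST images.
a_no_de   = evolve_scale_factor(omega_m=0.3, omega_lambda=0.0)
a_with_de = evolve_scale_factor(omega_m=0.3, omega_lambda=0.7)
```

Real cosmology codes evolve billions of particles rather than a single scale factor, but the workflow is the same: pick parameters, run forward in time, compare the outcome to observation.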
Durham has multiple generations of COSMA running concurrently, and the latest incarnation, COSMA8, and its predecessor were designed in conjunction with Dell to achieve the optimal configuration for the current workloads.
“We have about 8GB of RAM per core on each node. If you look at a more conventional HPC system, they’ll have about a quarter of that, and that means to run the same simulations that we can run here, you’d need four times as many cores to get the results in the same amount of time, so it’s a bespoke design for those cosmology simulations,” Basden said.
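That figure can be checked from the node specs quoted earlier; the following is just a back-of-envelope sketch of the arithmetic.

```python
# Back-of-envelope check of the COSMA8 memory-per-core figure,
# using the node specs quoted in the article.
NODES = 360
CORES_PER_NODE = 2 * 64            # two 64-core AMD Epyc CPUs per node
RAM_PER_NODE_GB = 1024             # 1TB of memory per node

ram_per_core_gb = RAM_PER_NODE_GB / CORES_PER_NODE
total_cores = NODES * CORES_PER_NODE
print(f"{ram_per_core_gb:.0f} GB per core across {total_cores} cores")

# A conventional node with a quarter of the memory per core would need
# roughly four times as many cores to hold the same simulation in RAM.
scale_up = ram_per_core_gb / (ram_per_core_gb / 4)
```

This reproduces Basden's 8GB-per-core figure and his four-times-the-cores comparison.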
Another feature of COSMA8 is a high-performance NVMe-based checkpoint storage system built on the Lustre file system. This is a common feature of HPC deployments, allowing a large workload with a long runtime to store its state as it goes, so it doesn’t have to start from scratch in case of failure.
“It’s a very fast filesystem, about 400GB per second, capable of sucking up the checkpoint data. So the simulation will continue, and it will flush out a checkpoint every few hours or so, so if something goes wrong, you’ve got a point at which you can restart the simulation,” Basden explained.
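The checkpoint/restart pattern Basden describes can be sketched generically. This is a minimal illustration in Python, not the actual Lustre-based system; the file name, state layout, and step counts are all invented for the example.

```python
import os
import pickle

CHECKPOINT = "state.ckpt"   # invented path for this sketch

def save_checkpoint(state, path=CHECKPOINT):
    """Write state atomically: dump to a temp file, then rename, so a
    crash mid-write never corrupts the last good checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CHECKPOINT):
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"step": 0}

state = load_checkpoint()
while state["step"] < 10:        # stand-in for billions of simulated years
    state["step"] += 1           # one unit of simulation work
    if state["step"] % 5 == 0:   # every N steps, ~ "every few hours"
        save_checkpoint(state)
```

If the process dies, rerunning it picks up from the last flushed checkpoint rather than from the Big Bang.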
COSMA8’s core file system is built on Dell PowerEdge R6525 and R7525 rack servers and PowerVault ME484 JBOD enclosures, all connected to a 200 Gbps InfiniBand fabric.
Validating technology, including liquid cooling
But Dell’s relationship with the Durham team goes beyond that of a vendor or systems integrator: the university often gets early access to new or experimental technologies, which allows both parties to see how well they hold up when put to work, according to Tim Loake, UK Vice President of Dell’s Infrastructure Solutions Group.
“Durham is one of our key partners and one of our HPC centers of excellence, in that we help them try out some of our new technologies, and they test them and provide feedback,” Loake said.
“We are giving Alastair and the Durham team access to our labs and pre-release products, and getting their feedback, to help us take the knowledge and experience they have gained through operating a very high-end HPC system and inject it into the development of our products, as well as bringing new technologies to them,” he explained.
As an example, Dell introduced into the system a switchless interconnect from a company called Rockport Networks. This distributes traffic management to smart NICs in each node, which are linked together through a passive hub called a SHFL.
Another area where Durham has been instrumental in validating technology is liquid cooling, according to Loake. This was installed as part of COSMA8 in early 2020 and extended about a year later.
“It was probably the largest direct liquid cooling system we’ve deployed, certainly in the UK and probably all of Europe when we first launched it,” Loake said.
“Obviously, direct liquid cooling is now becoming more common in many data centers and HPC systems, but from a directional perspective, a big part of what we learned from our work with Alastair and the Durham team was then brought into the design of the products we’re now bringing out with the next generation of PowerEdge servers,” he added.
This deployment used direct liquid cooling, where coolant circulates through heat sinks attached to components that generate the most heat, such as the CPU.
However, interest is now turning to immersion cooling, where the entire system is immersed in a dielectric fluid that conducts heat but not electricity.
“Total immersion is something we’re very interested in, and we’re currently trying to get funding for an immersion system,” Basden said.
“Part of the benefit of immersion cooling is that you take out all the fans, all the moving parts, so you don’t put any spinning drives in there either, it has to be a pure flash system, and no moving parts means the need for maintenance is hopefully greatly reduced,” Loake said.
However, most immersion cooling systems Dell works with have the ability to lift an individual node out of the fluid, if access is needed, he added.
“Think of it like a 42U rack tilted sideways and then you can just pull a node up like you pull it out the front of a normal rack, but they go up and obviously the liquid drains into the bath. The rest of the systems are unaffected and you can do whatever maintenance you need,” he said.
Other technologies tested include FPGA accelerators and Nvidia BlueField data processing units (DPUs), while Dell is also looking at other types of processors to see whether performance comes down purely to raw cores, or whether different processors can deliver more performance per watt.
According to Basden, some of the technologies they test are being evaluated for immediate use in projects, while others look further ahead. One of these is the ExCALIBUR project, part of the UK’s preparations for exascale computing.
“It’s mostly software efforts to get the code ready to run on large exascale systems, but a small part is also hardware, researching what new hardware might have potential in an HPC system,” he said.
This includes the Rockport interconnect technology, as well as Liqid’s composable framework, which allows GPUs to be assigned to different nodes. Liqid develops a software layer that pools components connected via a PCIe 4.0 fabric.
The latter is useful for Durham’s setup because of the nature of the workloads it runs, according to Basden. Due to their large memory footprint and the dynamic nature of the computations, cosmology codes tend not to be well suited to GPU acceleration, but some computations can benefit from it, and the composable framework allows GPUs to be attached to a node when necessary.
“For some simulations that require a large number of GPUs, perhaps on a smaller number of nodes, they may have that,” he said.
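The composable idea can be sketched abstractly: a fabric-attached pool of GPUs that any node can borrow and later return. This is a conceptual toy only, assuming nothing about Liqid's actual software or the PCIe fabric; all the names below are invented for the example.

```python
# Conceptual sketch of composable infrastructure: a shared pool of GPUs
# attached to whichever node needs them, then returned to the fabric.
class GPUPool:
    def __init__(self, gpu_ids):
        self.free = list(gpu_ids)
        self.attached = {}           # node name -> list of GPU ids

    def attach(self, node, count):
        """Compose `count` GPUs from the pool onto `node`, if available."""
        if count > len(self.free):
            raise RuntimeError("not enough free GPUs in the fabric")
        granted = [self.free.pop() for _ in range(count)]
        self.attached.setdefault(node, []).extend(granted)
        return granted

    def release(self, node):
        """Return a node's GPUs to the pool when its job finishes."""
        self.free.extend(self.attached.pop(node, []))

pool = GPUPool(["gpu0", "gpu1", "gpu2", "gpu3"])
pool.attach("node-a", 3)             # a GPU-heavy run on a single node
pool.release("node-a")               # GPUs freed for other workloads
```

The point of the real hardware is that this reassignment happens over the PCIe fabric without physically moving cards between servers.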
At the moment, composable infrastructure is only implemented in a small prototype system, but “it’s one of those things that if we were ever going to do this in a future large-scale system, we would need to have built the confidence first that it was going to work,” Basden explained. ®