The first exascale supercomputer has a hardware failure every day

In brief: Frontier, the world’s most powerful supercomputer, is online but still far from operational. Its director has confirmed reports that it’s experiencing a system failure every few hours, but insists that is par for the course.

Frontier is in a class of its own. It has 9,408 HPE Cray EX235a nodes, each powered by an AMD Trento 7A53 Epyc 64-core CPU equipped with 512 GB of DDR4, and four AMD Instinct MI250X GPUs / accelerators each equipped with 128 GB of HBM2e. Summed, the system has 602,112 CPU cores and 8,138,240 GPU cores in total, and 4.6 PB of both DDR4 and HBM2e.
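As a rough check on those totals, the short sketch below recomputes the CPU core count and the memory capacity from the per-node figures quoted above. It assumes binary prefixes (1 PB taken as 1024 × 1024 GB), which is what makes the 4.6 PB figure come out. The GPU core total is left alone because it depends on how GPU cores are counted.

```python
# Per-node figures as quoted in the article.
NODES = 9_408
CPU_CORES_PER_NODE = 64
DDR4_GB_PER_NODE = 512
GPUS_PER_NODE = 4
HBM2E_GB_PER_GPU = 128

# System-wide totals.
cpu_cores = NODES * CPU_CORES_PER_NODE
ddr4_gb = NODES * DDR4_GB_PER_NODE
hbm2e_gb = NODES * GPUS_PER_NODE * HBM2E_GB_PER_GPU

# Assumption: binary prefixes, so 1 PB = 1024 * 1024 GB.
GB_PER_PB = 1024 * 1024
ddr4_pb = ddr4_gb / GB_PER_PB
hbm2e_pb = hbm2e_gb / GB_PER_PB

print(f"CPU cores: {cpu_cores:,}")
print(f"DDR4: {ddr4_pb:.1f} PB, HBM2e: {hbm2e_pb:.1f} PB")
```

Each node carries 512 GB of DDR4 and 4 × 128 GB of HBM2e, so the two memory totals are identical.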

In May, Frontier joined the TOP500 as the first supercomputer to break the exascale barrier after it completed the HPL benchmark with a score of 1.102 Exaflop/s. Since then, the Oak Ridge National Laboratory in Tennessee, which manages the supercomputer, has been readying it for scientific research scheduled to start in January.

However, there have been reports that the launch of Frontier could be waylaid by excessive hardware failures. Seeking answers, InsideHPC arranged an interview with the Program Director at Oak Ridge, Justin Whitt. In the interview, he confirmed Frontier was experiencing daily system failures but asserted that this was inevitable in such a large system.

“Mean time between failure on a system this size is hours, it’s not days,” he said. “So you need to make sure you understand what those failures are and that there isn’t any pattern to those failures that you need to be concerned with.” Whitt added that going a day without a failure “would be excellent.”

“Our goal is still hours,” says Justin Whitt, Program Director at the OLCF.
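To see why failures measured in hours are plausible at this scale, here is a minimal sketch of the standard series-system approximation: if components fail independently at a constant rate, system MTBF is roughly the component MTBF divided by the number of components. The five-year per-node MTBF is a made-up illustrative value, not a figure from Oak Ridge or the interview, and the function name is hypothetical.

```python
HOURS_PER_YEAR = 24 * 365


def system_mtbf_hours(component_mtbf_hours: float, component_count: int) -> float:
    """Approximate MTBF of a system in which any single component failure
    counts as a system failure, assuming independent components with
    constant (exponential) failure rates."""
    if component_count <= 0:
        raise ValueError("component_count must be positive")
    # Failure rates add up, so the combined MTBF shrinks linearly with count.
    return component_mtbf_hours / component_count


# Hypothetical input: each node fails on average once every five years.
node_mtbf_hours = 5 * HOURS_PER_YEAR
nodes = 9_408  # node count quoted in the article

mtbf = system_mtbf_hours(node_mtbf_hours, nodes)
print(f"Estimated system MTBF: {mtbf:.1f} hours")
```

Under these assumed inputs, even nodes that individually last years between failures give a machine-wide failure every few hours, which is consistent with the order of magnitude Whitt describes.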

There have been rumors that the hardware problems were being caused by the new AMD Instinct MI250X, but Whitt refuted them. The MI250X is AMD’s most powerful GPU/accelerator, and it only sells it to select partners. It has 220 CUs containing 14,080 cores clocked at 1700 MHz in a 500 W package.

“The issues span lots of different categories, the GPUs are just one,” Whitt remarked. “It’s been a pretty good spread among common culprits of parts failures that have been a big part of it. I don’t think that at this point that we have a lot of concern over the AMD products,” he added.

“We’re dealing with a lot of the early-life kind of things we’ve seen with other machines that we’ve deployed, so it’s nothing too out of the ordinary.”

Whitt conceded that the unprecedented scale of Frontier had made fine-tuning it “a little bit harder” but said they were still following the schedule set back in 2018-19 despite delays caused by the pandemic.

Head over to InsideHPC to read the full interview.
