December 15, 2022
Federal Agency Taps High-Performance Computing for National Security
To keep pace with fast-moving changes in the world, a U.S. federal agency dramatically increased processing speeds while reducing infrastructure in its classified environment.
For one federal agency with a critical national security mission, outdated technology was proving to be problematic.
Speed to decision is critical, but long processing times were becoming an issue. The agency needed a high-performance computing solution to ensure it could keep pace with fast-changing global dynamics.
Sirius, a CDW company, came through with a modern solution. By finding creative ways to deliver upgrades in a classified environment, Sirius dramatically accelerated processing speeds while reducing the physical footprint of the agency’s infrastructure, putting the agency on more solid footing for the future.
An Urgent Need for Upgrades
In recent years, the changing military landscape created new pressures on the agency, which was tasked with handling an ever-increasing flow of data in support of timely decision-making.
“As the customer’s mission has increased, and the amount of data involved has increased, it had to change how it handles and analyzes that data,” says Jeff Grunewald, data center practice manager at Sirius.
The agency had a sprawling infrastructure in place to support that workload: almost 2,000 CPUs in dual-socket servers spread across 60 equipment racks, with some 15,000 cores available for computing linear math equations in support of its national security mission.
However, that infrastructure hadn’t been kept current. “A data center like that typically would receive anywhere from $5 million to $50 million every two to three years to refresh, just to stay up to speed and to be relevant,” says Sirius account manager Alonzo George. “For over a decade, it had been receiving very minimal funds, enough to just keep the lights on.”
Unable to refresh its computers at the needed pace, the agency fell behind the technology curve. In some cases, it had 7-year-old machines in place. “When they run the workloads against those assets or resources, sometimes it would take months for that workload to finish,” George says. “That’s just not fast enough to protect national security.”
Nor could those linear-computing machines support the new and emerging artificial intelligence workloads critical to deriving insights from vast data sets.
Agency leaders knew that a modern approach could help them meet their mission more effectively. “We now have ways of processing in parallel, and this was an opportunity for the customer to upgrade their technology, to be able to do things faster,” Grunewald says.
The agency needed a turnkey solution that would integrate into its existing environment for management reasons but not rely on any of the existing infrastructure. The aim was to be completely free of any backward compatibility issues.
The solution would need to improve the speed and responsiveness of all its computational workloads. It would have to operate at the highest levels to deliver the most valuable insights as fast as possible. “They were looking to go from months or years to minutes,” George says.
It would also need to go beyond raw computing power. In the world of high-performance computing, “there’s the compute, the storage, the accelerator, the management — there’s a lot of different pieces,” says Grunewald. “All that has to be considered when you’re architecting something like this.”
Sirius took all of this into account as it brought together its solution.
“We allowed developers from each of the manufacturers that were involved to remotely log in and tune the system for the specs of the customer, so that it ran exactly how they expected it before they even showed up onsite.” — Alonzo George, Account Manager, Sirius
A Modern Approach
The team designed an NVIDIA DGX pod with eight DGX A100 systems, a 200-gigabit InfiniBand HDR fabric and NetApp EF600 series storage holding about a petabyte of ultrafast NVMe capacity, all tied together by the IBM Spectrum Scale parallel file system.
“That meant that all of the nodes within the pod could address all of the storage anytime, with the most performance,” George says.
“We dropped in a pod, which is basically a self-contained computational node with four equipment racks. Each equipment rack is 42 rack units, which are 6 feet tall. In two of those racks, we dropped in eight DGX A100 systems,” George says, adding that each of those contained eight NVIDIA GPUs: “So that’s 64 GPUs, and those all run internally at 800Gb.”
The solution “is purpose-built and configured for high-performance workloads,” he says. It delivers ultralow latency, lower even than what one might see on a Wall Street trading floor, routinely posting response times in the single-digit microsecond range.
The team integrated all of this with the existing system over a 100Gb Ethernet backbone. This enables operators to move data and workloads as needed, while the new system runs independently of everything else in the data center.
A Unique Challenge
In crafting the solution, the Sirius team faced a special hurdle: The agency works in a disconnected, classified environment. It’s a dark network completely isolated from the public internet.
“That means you can’t get to any software developers, you can’t get to any firmware or software patches, you can’t get to anything else,” George says. Still, bringing the end product to life would require coordinating personnel from a range of vendors.
To overcome this hurdle, the team first organized the build outside the classified space. With the system up and running in that environment, “we allowed developers from each of the manufacturers that were involved to remotely log in and tune the system for the specs of the customer, so that it ran exactly how they expected it before they even showed up onsite,” George says.
“We had developers from Sweden and other European nations logging in to tune it. All the manufacturers were involved,” he adds. “Then the customer logged in to it and ran on it remotely to make sure it was tuned the way they liked for what they were intending to do.”
Then the team shut it all down, packed it up into suitcases, loaded it onto a truck and shipped it to the client location. “They installed it onsite, cabled it, labeled it, powered it and made sure to run all of the same benchmarks we ran before we shipped the system to verify the performance onsite was exactly the same when we delivered it,” George says.
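That final verification step, rerunning the pre-shipment benchmarks onsite and confirming the numbers match, can be sketched in a few lines. The benchmark names, scores and 5 percent tolerance below are illustrative assumptions, not the agency’s actual workloads or acceptance criteria.

```python
# Hypothetical sketch of a benchmark parity check: compare onsite results
# against the pre-shipment baseline and flag any metric that drifts beyond
# a relative tolerance. All names and figures here are illustrative.

def benchmark_mismatches(baseline, onsite, rel_tol=0.05):
    """Return benchmarks whose onsite score deviates from the pre-shipment
    baseline by more than rel_tol (5% by default), or that are missing."""
    failures = {}
    for name, base_score in baseline.items():
        score = onsite.get(name)
        if score is None or abs(score - base_score) / base_score > rel_tol:
            failures[name] = (base_score, score)
    return failures

# Illustrative scores recorded before shipping and again after installation.
baseline = {"linpack_gflops": 74000.0, "storage_read_gbps": 80.0}
onsite   = {"linpack_gflops": 73500.0, "storage_read_gbps": 79.2}

print(benchmark_mismatches(baseline, onsite))  # {} -> performance verified
```

An empty result means every metric reproduced within tolerance, which is the condition the team was checking before handing the system over.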
The customer first performed parallel tests, running identical workloads side by side on the old infrastructure and on the new system to ensure it met expectations. Parallel testing “gave the customer an opportunity to see how the new solution performed” versus its existing setup, Grunewald says. “When you have something that works and you’re going to do something to make it work even better, you have to convince yourself of that.”
Sirius was able to go from design to production in just 120 days instead of the 12 to 18 months it might take for an agency to implement such a solution on its own using conventional procurement processes. Strong vendor relationships enabled it to use a creative procurement mechanism that prioritizes certain types of national security projects to accelerate the acquisition process.
“That makes all of your equipment go to the front of the line for any type of logistics supply chain,” George says. “We also had the vendors pre-build all of their solutions ahead of time, in advance of actually receiving the order, so that it would ship almost immediately.”
4: The number of equipment racks needed to house one federal agency’s new high-performance computing solution, down from 60
Moving Toward HPC
For agencies looking to embrace high-performance computing, a few key takeaways emerged from this project.
HPC Tackles Sprawl: Agencies can look to HPC to drive a more efficient physical infrastructure. In this case, the agency reduced its physical footprint from 60 racks down to just four.
Orchestration Is Critical: HPC has many moving parts, with multiple technology providers typically involved. A strong partner with deep vendor relationships, one that can coordinate the deliverables to ensure all the pieces come together, is vital to success.
HPC Accelerates Decisions: When quick, correct decisions are critical to the mission, HPC is a game changer. This agency, for example, nearly quadrupled processing speeds as it sought to meet time-sensitive mission requirements.
A Leap into the Future
The new solution gives the agency a massive jump forward in its ability to meet its mission. Operators now have vastly improved speeds, while administrators benefit from having a markedly reduced infrastructure footprint.
“We eclipsed all the computing power in their entire data center fivefold. It was literally five times their entire data center, in a compact solution, literally 7 percent of the rack space,” George says.
The new solution sits in four racks instead of 60, and it took the agency from 56Gb to 200Gb on the fabric, while storage moved to ultrafast NVMe in place of the solid-state and spinning disk drives it had been using. “NVMe can do more than double or quadruple the performance on the storage side, and with this capability, they can actually do those AI workloads,” George says.
All this has had a practical impact. “When you’re dealing with national defense, the faster you can get your answers, the better off you are,” Grunewald says. “With very complex calculations and very complex data, the faster you can analyze it, the faster you can get the answers you’re looking for, which means that you can better support the national security mission.”
As an added benefit, the success of the project has helped the agency jump-start modernization across the rest of its infrastructure.
“The agency essentially eclipsed everything it had in its data center and moved into a new era of modern high-performance computing. As a result, it is now ripping and replacing everything else in its data center to try and match this,” George says. “Because of the impact of this system, it also received additional congressional funding to do that modernization.”
Story by Adam Stone, who writes on technology trends from Annapolis, Md., with a focus on government IT, military and first-responder technologies.
Photography by Getty Images: klmax; Gerville; montsitj