Amazon Web Services, working directly with AI firm Anthropic, is building an enormous supercomputing cluster known as “Project Rainier.” Backed by Amazon’s $8 billion investment in Anthropic, the venture aims to supply the computational muscle for training the firm’s next generation of AI models. The system is already coming online this year across several US locations, marking a significant escalation in the race for AI dominance.
Unprecedented Scale and Power
The sheer magnitude of Project Rainier is staggering. One site alone, located in Indiana, will eventually consist of thirty 200,000-square-foot data centers, which together will draw a massive 2.2 gigawatts of power. This sprawling, multi-site network is engineered to operate as a single, unified environment for training Anthropic’s most advanced AI models.
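For a rough sense of what that means per building, a back-of-the-envelope calculation (assuming the load is spread evenly, which the reporting does not specify) puts each data center at roughly 73 megawatts:

```python
# Back-of-the-envelope: per-building power at the Indiana site, assuming
# (purely for illustration) that 2.2 GW is spread evenly across 30 buildings.
total_power_gw = 2.2
buildings = 30
per_building_mw = total_power_gw * 1000 / buildings
print(f"~{per_building_mw:.0f} MW per data center")  # ~73 MW
```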
Anthropic is already using a portion of this new infrastructure to gain a critical edge in its development efforts.
A Strategic Shift to Custom Silicon
Project Rainier distinguishes itself by moving away from the industry’s reliance on GPUs. The entire supercluster will be powered by Amazon’s own custom-designed Trainium2 AI accelerators, representing the largest-ever deployment of this proprietary technology.
Gadi Hutt, an engineering director at Amazon’s Annapurna Labs, noted that the focus extends beyond raw chip speed to achieving optimal “goodput”: the real-world, effective throughput of the entire system once reliability and uptime are factored in.
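As a rough illustration of the idea (not Amazon’s internal metric), goodput can be modeled as peak throughput discounted by downtime and checkpointing overhead; all figures in the sketch below are hypothetical:

```python
# Illustrative sketch of "goodput": useful training throughput after
# accounting for failures and recovery. All numbers are hypothetical,
# not published Project Rainier figures.

def goodput(peak_tokens_per_s: float,
            mtbf_hours: float,
            recovery_hours: float,
            checkpoint_overhead: float) -> float:
    """Effective tokens/s once downtime and checkpointing are included."""
    # Fraction of wall-clock time the cluster is actually up.
    availability = mtbf_hours / (mtbf_hours + recovery_hours)
    # Fraction of up-time spent on useful work rather than checkpointing.
    useful_fraction = 1.0 - checkpoint_overhead
    return peak_tokens_per_s * availability * useful_fraction

# Hypothetical cluster: fails every 50 hours, takes 0.5 hours to restart
# from the last checkpoint, and spends 2% of its time writing checkpoints.
print(f"{goodput(1e9, 50.0, 0.5, 0.02):.3e} tokens/s")
```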
The Trainium2 chip is a complex piece of hardware, combining two 5nm compute dies with high-bandwidth memory. It is specifically designed to handle both the training and inference stages of AI development, a vital capability for sophisticated methods like reinforcement learning.
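To see why handling both phases on one chip matters for reinforcement learning, consider the shape of an RL loop: every iteration first generates samples (inference), then updates the model on them (training). The toy REINFORCE-style sketch below, over a tiny softmax “policy” in plain NumPy, is purely illustrative and not Anthropic’s method:

```python
import numpy as np

# Toy REINFORCE loop showing how RL interleaves inference and training.
rng = np.random.default_rng(0)
logits = np.zeros(4)                  # "policy" over 4 actions
reward = np.array([0., 0., 1., 0.])   # action 2 is the "good" one
lr = 0.5

for step in range(200):
    # Inference phase: sample a batch of actions from the current policy.
    probs = np.exp(logits) / np.exp(logits).sum()
    actions = rng.choice(4, size=32, p=probs)
    # Training phase: REINFORCE gradient update on the sampled batch.
    grad = np.zeros(4)
    for a in actions:
        grad += reward[a] * (np.eye(4)[a] - probs)  # reward-weighted log-prob grad
    logits += lr * grad / len(actions)

print(np.round(np.exp(logits) / np.exp(logits).sum(), 3))  # mass shifts to action 2
```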
While a single Trainium2 may not exceed the performance of a top-tier Nvidia B200 on every individual metric, Amazon’s strategy centers on the integrated system’s overall efficiency and cost-effectiveness.
An Architecture Built for the Future
The fundamental building block of Project Rainier is the Trn2 instance, each of which contains 16 Trainium2 accelerators. These instances offer a compelling alternative to competitor systems, especially for training workloads that can exploit sparsity.
| | Trn2 | DGX B200 |
| --- | --- | --- |
| CPUs | 2x 48C Intel Sapphire Rapids | 2x 56C Intel Emerald Rapids |
| System memory | 2TB DDR5 | Up to 4TB |
| Accelerators | 16x Trainium2 | 8x B200 GPUs |
| HBM | 1536GB | 1440GB |
| Memory bandwidth | 46.4TB/s | 64TB/s |
| Dense FP8 | 20.8 petaFLOPS | 36 petaFLOPS |
| Sparse FP8 | 83.2 petaFLOPS | 72 petaFLOPS |
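The instance totals are simply 16x the per-chip Trainium2 figures implied by the table (96GB of HBM at 2.9TB/s, 1.3 dense and 5.2 sparse FP8 petaFLOPS per accelerator), as the sanity-check sketch below multiplies out:

```python
# Sanity check: Trn2 instance totals are 16x the per-chip Trainium2
# figures implied by the table above.
chips = 16
per_chip = {
    "HBM (GB)": 96,            # 16 x 96   = 1536GB
    "Memory BW (TB/s)": 2.9,   # 16 x 2.9  = 46.4TB/s
    "Dense FP8 (PF)": 1.3,     # 16 x 1.3  = 20.8 petaFLOPS
    "Sparse FP8 (PF)": 5.2,    # 16 x 5.2  = 83.2 petaFLOPS
}
for name, value in per_chip.items():
    print(f"{name}: {value * chips:g}")
```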
Within each Trn2 instance, the accelerators are connected in a 4×4 2D torus via AWS’s high-speed NeuronLink v3. This switchless interconnect design minimizes latency and power usage, enabling the systems to be air-cooled.
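In a 2D torus, every chip links directly to four neighbors, with edges wrapping around, so no central switch is required. A minimal sketch of the 4×4 wraparound neighbor map (illustrative only, not AWS’s actual routing logic):

```python
# Illustrative 4x4 2D torus: each chip connects to four neighbors, with
# wraparound links replacing a central switch.
SIDE = 4

def neighbors(x: int, y: int) -> list[tuple[int, int]]:
    """Four torus neighbors of chip (x, y), wrapping at the edges."""
    return [((x + 1) % SIDE, y), ((x - 1) % SIDE, y),
            (x, (y + 1) % SIDE), (x, (y - 1) % SIDE)]

# Chip (0, 0) wraps around to (3, 0) and (0, 3) on its "far" sides.
print(neighbors(0, 0))  # [(1, 0), (3, 0), (0, 1), (0, 3)]
```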
To achieve even greater scale, four Trn2 instances are combined into an “UltraServer,” creating a 64-chip domain with a 3D torus interconnect.
| | Trn2 UltraServer | DGX GB200 NVL72 |
| --- | --- | --- |
| Accelerators | 64x Trainium2 | 72x Blackwell GPUs |
| HBM | 6.1TB | 13.4TB |
| Memory bandwidth | 186TB/s | 576TB/s |
| Interconnect bandwidth | 68TB/s | 130TB/s |
| Dense FP8 | 83.2 petaFLOPS | 360 petaFLOPS |
| Sparse FP8 | 332.8 petaFLOPS | 720 petaFLOPS |
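Extending the wraparound idea to three dimensions keeps hop counts low as the domain grows. The sketch below assumes the 64 chips form a 4×4×4 grid (consistent with the counts above, though AWS has not confirmed the exact dimensions) and shows that no chip is ever more than six hops from any other:

```python
# Illustrative 3D torus distances, assuming the UltraServer's 64 chips
# form a 4x4x4 grid (an assumption, not a confirmed AWS layout).
from itertools import product

SIDE = 4

def torus_hops(a: tuple[int, int, int], b: tuple[int, int, int]) -> int:
    """Minimum hop count between two chips, using wraparound links."""
    return sum(min(abs(i - j), SIDE - abs(i - j)) for i, j in zip(a, b))

chips = list(product(range(SIDE), repeat=3))
worst = max(torus_hops(chips[0], c) for c in chips)
print(f"64 chips, worst-case distance: {worst} hops")  # 6 hops
```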
The complete “UltraCluster” will be formed by linking tens of thousands of these UltraServers with Amazon’s custom EFAv3 network. Looking forward, Amazon has already hinted at its next-generation Trainium3 chips, which promise a fourfold performance boost, suggesting that Project Rainier’s immense capabilities are set to grow even further.
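For a sense of the resulting scale (a purely hypothetical figure, since the article says only “tens of thousands”), 10,000 UltraServers would put the cluster well past the exaFLOPS mark:

```python
# Hypothetical cluster-scale arithmetic: 10,000 UltraServers is an
# illustrative placeholder for "tens of thousands", not a reported count.
ultraservers = 10_000
dense_fp8_pf_each = 83.2            # from the UltraServer table above
total_exaflops = ultraservers * dense_fp8_pf_each / 1000
chips = ultraservers * 64
print(f"{chips:,} Trainium2 chips, ~{total_exaflops:,.0f} dense FP8 exaFLOPS")
```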