Chinese Startup's GPGPU Chip Targeted at AI, Data Center Applications
At Hot Chips, Chinese startup Biren emerged from stealth, detailing a large general-purpose GPU (GPGPU) chip aimed at AI training and inference in the data center. The BR100 is composed of two identical compute chiplets, built on TSMC's 7-nm process at 537 mm² each, plus four stacks of HBM2e in a CoWoS package.
“We wanted to build larger chips, so we needed to be creative with packaging to make the BR100's design economically viable,” said Biren CEO Lingjie Xu. “The BR100's cost is justified by better architectural efficiency in terms of performance per watt and performance per square millimeter.”
The BR100 can achieve 2 POPS of INT8 performance, 1 PFLOPS of BF16, or 256 TFLOPS of FP32. The 32-bit figure doubles to 512 TFLOPS when using Biren's new TF32+ number format. The GPU will also support other 16- and 32-bit formats, but not 64-bit (64-bit is rarely needed for AI workloads outside of scientific computing).
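As a quick sanity check on those figures, note that each step down in precision roughly doubles peak throughput. A minimal Python sketch of the scaling relative to FP32, with the quoted numbers hard-coded:

```python
# Quoted BR100 peak-throughput figures, in TOPS/TFLOPS.
peak = {"INT8": 2000, "BF16": 1000, "TF32+": 512, "FP32": 256}

# Each format's throughput as a multiple of the FP32 baseline.
for fmt, rate in peak.items():
    print(f"{fmt:6s} {rate:5d}  ({rate / peak['FP32']:.0f}x FP32)")
```

This prints the expected 8x/4x/2x/1x ladder, with TF32+ sitting at exactly twice the FP32 rate.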
Using chiplets meant Biren could break the reticle limit while keeping the yield advantages of smaller dies, which lowers cost. Xu said that compared with a hypothetical reticle-sized design based on the same GPU architecture, the two-chiplet BR100 achieves 30% more performance (it is 25% larger in total compute die area) and 20% better yield.
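The yield argument follows from standard defect-density modeling: the probability that a die comes out defect-free falls off exponentially with its area. A minimal sketch using a Poisson yield model, with an assumed defect density; the 858-mm² figure is the approximate lithography reticle limit, and neither number comes from Biren:

```python
import math

D = 0.0005       # assumed defect density, defects per mm^2 (0.05/cm^2)
chiplet = 537    # mm^2, one BR100 compute chiplet
reticle = 858    # mm^2, approximate reticle limit for a monolithic die

# Poisson model: the chance a die has zero defects is exp(-D * area).
y_chiplet = math.exp(-D * chiplet)
y_reticle = math.exp(-D * reticle)
print(f"chiplet yield:  {y_chiplet:.1%}")   # ~76%
print(f"reticle yield:  {y_reticle:.1%}")   # ~65%
print(f"advantage:      {y_chiplet / y_reticle - 1:.1%}")  # ~17%
```

Even with this arbitrary defect density, the per-die yield advantage lands in the same ballpark as the 20% Biren quoted.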
Another advantage of the chiplet design is that the same tapeout can be used to make multiple products. Biren also has the single-chiplet BR104 on its roadmap.
The BR100 will come in the OCP accelerator module (OAM) form factor, while the BR104 will come on PCIe cards. Eight BR100 OAM modules together will form “the world's most powerful GPGPU server, purpose-built for AI,” said Xu. The company is also working with OEMs and ODMs.
PETAFLOPS-CAPABLE
High-speed serial links between the chiplets offer 896 GB/s of bidirectional bandwidth, which allows the two compute tiles to function as a single SoC, said Biren CTO Mike Hong.
Alongside its GPU architecture, Biren has also developed a dedicated 412-GB/s chip-to-chip (BR100 to BR100) interconnect called BLink, with eight BLink ports per chip. BLink is used to connect to other BR100s within a server node.
Each compute tile has 16 streaming processor clusters (SPCs), connected by a 2D mesh-style network on chip (NoC) with multicasting capability.
Each SPC has 16 execution units (EUs), which can be grouped into compute units (CUs) of four, eight, or 16 EUs.
Each EU has 16 streaming processing cores (V-cores) and one tensor core (T-core). The V-cores are general-purpose SIMT processors with a full ISA for general-purpose computing: they handle data preprocessing, perform operations like batch normalization and ReLU, and manage the T-core. The T-core accelerates matrix multiplication and addition, plus convolution; these operations form the majority of a typical deep-learning workload.
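Multiplying out the stated hierarchy gives the chip's total core counts. The totals below are our arithmetic from the per-level figures, not numbers Biren quoted:

```python
# BR100 hierarchy as described: 2 tiles -> 16 SPCs -> 16 EUs -> 16 V-cores.
tiles, spcs_per_tile, eus_per_spc, vcores_per_eu = 2, 16, 16, 16

total_eus = tiles * spcs_per_tile * eus_per_spc
print(f"EUs per chip:     {total_eus}")                   # 512
print(f"V-cores per chip: {total_eus * vcores_per_eu}")   # 8192
print(f"T-cores per chip: {total_eus}")                   # one T-core per EU
```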
Biren has also devised its own number format, E8M15, which it calls TF32+. The format is intended for AI training; it has the same-sized exponent (and hence the same dynamic range) as Nvidia's TF32 format but five extra bits of mantissa (in other words, it's five bits more precise). This means the BF16 multiplier can be reused for TF32+, simplifying the design of the T-core.
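To see how the formats relate, recall that FP32, BF16, TF32, and TF32+ all share an 8-bit exponent and differ only in mantissa width (23, 7, 10, and 15 bits, respectively). A minimal Python sketch that emulates each format by truncating a float32 mantissa; real hardware may round rather than truncate:

```python
import struct

def to_format(x: float, mantissa_bits: int) -> float:
    """Keep the 8-bit exponent, truncate the 23-bit float32 mantissa."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    drop = 23 - mantissa_bits
    bits &= ~((1 << drop) - 1)  # zero the discarded low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

x = 1.0 / 3.0
for name, m in [("BF16  (E8M7) ", 7), ("TF32  (E8M10)", 10),
                ("TF32+ (E8M15)", 15), ("FP32  (E8M23)", 23)]:
    print(f"{name} {to_format(x, m):.9f}")
```

Because the exponent field is untouched, dynamic range is identical across all four formats; only precision changes, which is what makes it possible to reuse BF16 multiplier hardware for TF32+.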
Xu said the company has already submitted results to the next round of MLPerf inference benchmarks, which should be available within the next couple of weeks.