July AI Chip Topics | AI Inference Chip Trend Analysis Amid a Field of Contenders (Part 1)
As mentioned earlier, if AI agents are to be adopted at large scale, relying on "large models" alone is not enough; computing capability must be integrated across multiple tiers, including the cloud, the edge, and endpoint devices. Once an LLM goes live as a service, the cost of inference often exceeds the cost of training, and a single day's inference volume can reach hundreds of millions of tokens, at which point a chip's computational efficiency and power consumption are magnified and scrutinized. Take NVIDIA's Blackwell as an example: NVIDIA claims it can cut the energy consumption and operating cost (OPEX) of LLM inference by up to 25 times, which underscores the importance of specialized hardware (e.g., GPUs and ASICs) in inference scenarios. In addition, cloud inference demands millisecond-level response latency, while edge devices are constrained by power and thermal budgets. In summary, I believe the two most critical challenges in today's inference technology are the following, and they will be the deciding factors in the competition among inference chips:
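To give a rough sense of why inference cost becomes the dominant, recurring expense, the back-of-envelope sketch below estimates daily inference energy and electricity cost from an assumed token volume and per-token energy figure. All of the numbers are hypothetical placeholders for illustration, not measurements of any specific chip or deployment.

```python
# Back-of-envelope estimate of daily LLM inference energy and cost.
# All figures below are hypothetical assumptions for illustration only.

tokens_per_day = 500_000_000       # assumed daily inference volume (tokens)
joules_per_token = 0.5             # assumed energy per generated token (J)
electricity_cost_per_kwh = 0.10    # assumed electricity price (USD/kWh)

daily_energy_joules = tokens_per_day * joules_per_token
daily_energy_kwh = daily_energy_joules / 3.6e6   # 1 kWh = 3.6e6 J
daily_cost_usd = daily_energy_kwh * electricity_cost_per_kwh

print(f"Daily inference energy: {daily_energy_kwh:,.1f} kWh")
print(f"Daily electricity cost: ${daily_cost_usd:,.2f}")

# Unlike training, this cost recurs every day the service runs and scales
# linearly with traffic, which is why per-token efficiency is scrutinized.
```

The takeaway is not the absolute figure but the scaling: every improvement in energy per token is multiplied by the daily token volume for the entire lifetime of the service.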
Challenge 1: Balancing and optimizing throughput and latency
Challenge 2: Specializing for the prefill stage and the decode stage
Throughput in the first point refers to the amount of work a system can complete per unit of time, for example, the number of tokens processed or requests served per second.
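To make the throughput-versus-latency tension concrete, here is a minimal toy simulation of batched inference: larger batches finish more tokens per second overall, but every request in the batch waits longer before its result is ready. The timing model and constants are assumptions for illustration only, not measurements of real hardware.

```python
# Minimal toy model of the throughput/latency trade-off in batched inference.
# Timing constants are hypothetical; real chips behave far less linearly.

def batch_metrics(batch_size: int,
                  tokens_per_request: int = 256,
                  fixed_overhead_s: float = 0.05,
                  per_token_s: float = 0.0005) -> tuple[float, float]:
    """Return (throughput in tokens/s, per-request latency in s) for one batch.

    Assumed model: a batch pays a fixed launch overhead plus a per-token cost
    that grows only mildly with batch size (weights are reused across requests).
    """
    batch_time_s = fixed_overhead_s + tokens_per_request * per_token_s * (1 + 0.1 * batch_size)
    total_tokens = batch_size * tokens_per_request
    throughput = total_tokens / batch_time_s   # rises as the batch grows
    latency = batch_time_s                     # each request waits for the whole batch
    return throughput, latency

for bs in (1, 4, 16, 64):
    tput, lat = batch_metrics(bs)
    print(f"batch={bs:3d}  throughput={tput:8.0f} tok/s  latency={lat * 1000:6.1f} ms")
```

Under these assumed constants, growing the batch from 1 to 64 raises throughput by roughly an order of magnitude while pushing per-request latency from under 200 ms to nearly a second, which is exactly the trade-off an inference chip and its scheduler must negotiate.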