
Large Language Models (LLMs) are revolutionizing the way we interact with computers. From composing realistic dialogue (ChatGPT) to generating creative text (Bard), these AI models are rapidly growing in popularity 1. However, this progress comes at a significant cost: the immense computational power required to train these models translates into a massive energy footprint.
The Power-Hungry Beasts of AI
Consider the colossal training needs of GPT-3, the model behind ChatGPT. OpenAI reportedly used 10,000 Nvidia A100 GPUs running for a month 2. These high-performance GPUs can consume between 500 and 700 watts each 3. Factoring in additional power for networking and cooling, the total power consumption could reach a staggering 10 megawatts (MW), enough to rival a small city according to figures from the US Energy Information Administration (EIA) 4.
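The arithmetic behind the 10 MW estimate can be sketched as follows. The per-GPU wattage midpoint and the overhead multiplier for networking and cooling are assumptions chosen to illustrate the calculation, not measured values.

```python
# Back-of-the-envelope estimate of GPT-3-scale training power draw.
# GPU count and wattage follow the figures cited in the text; the
# overhead multiplier (networking + cooling) is an assumed, PUE-like factor.
NUM_GPUS = 10_000
WATTS_PER_GPU = 600            # midpoint of the 500-700 W range
OVERHEAD = 1.6                 # assumed networking + cooling multiplier

gpu_power_mw = NUM_GPUS * WATTS_PER_GPU / 1e6   # convert watts to megawatts
total_power_mw = gpu_power_mw * OVERHEAD

print(f"GPU power:   {gpu_power_mw:.1f} MW")    # 6.0 MW
print(f"Total power: {total_power_mw:.1f} MW")  # 9.6 MW, close to the 10 MW cited
```

With these assumed inputs, the GPUs alone draw 6 MW and the facility total lands near the 10 MW figure above.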
Imagine the environmental impact as more companies and institutions develop their own LLMs. Training needs could potentially reach the power demands of entire cities, especially with plans for even larger models utilizing 32,000 or even 64,000 GPUs 5.
Beyond the Meter: The Carbon Cost
The high energy consumption directly translates to a large carbon footprint. Power plants generating this much electricity often rely on fossil fuels, releasing greenhouse gases such as carbon dioxide into the atmosphere. A report by the Cutter Consortium emphasizes that CO2 emissions are the primary concern when considering the environmental impact of LLMs 6.
Research Efforts to Reduce Power Consumption
Significant research is underway to address the high-power usage of AI data centers. Two main areas of focus include:
- Reducing the size and complexity of LLM neural networks 7 8 9 10.
- Improving the power efficiency of GPUs 11 12 13.
However, given the exponential growth in LLM complexity, measured by parameter count (GPT-2: 1.5B; GPT-3: 175B; GPT-4: reportedly 1.76T), additional solutions are needed.
Shared AI Data Centers: A Potential Solution
A promising avenue for addressing the power consumption issue is to explore shared AI data centers. Unlike the current model where organizations build dedicated GPU data centers for their own use, sharing resources could enable smaller players to train large models by pooling resources from multiple data centers owned by different entities. This could democratize AI development and reduce the environmental impact of building and maintaining numerous large-scale data centers.
LLM Training Characteristics
AI training applications, such as Large Language Models (LLMs) like ChatGPT based on the Transformer architecture 14, exhibit distinct characteristics. These models are deep neural networks with a vast number of parameters (weights) 14, exemplified by GPT-3's 175 billion parameters.
Traditionally, neural network training involves running training data through a forward pass, calculating the output error, and then using backpropagation to adjust the weights. However, the immense size of LLMs necessitates parallelization to accelerate processing.
Several parallelization techniques are employed:
- Data parallelization: The training data is divided into batches, with each batch used to train the model on a different GPU 15.
- Model parallelization: The neural network is segmented into parts, each assigned to different GPUs for processing 16. This is often done at the neural network layer level.
- Further parallelization: The model and data can be further subdivided into smaller units, depending on the model’s size, to enhance parallelism.
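Data parallelization, the first technique above, can be sketched in a few lines. Each simulated "GPU" computes a gradient on its own shard of the batch, and the gradients are then averaged (standing in for an all-reduce) before every worker applies the same update. The toy model and all names here are illustrative, not any particular framework's API.

```python
# Minimal sketch of data parallelism on a toy 1-D linear model y = w * x.

def local_gradient(w, shard):
    # Gradient of mean squared error on one data shard (one "GPU's" batch).
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(grads):
    # Stands in for the collective communication step between GPUs.
    return sum(grads) / len(grads)

data = [(x, 3.0 * x) for x in range(1, 9)]   # target weight is 3.0
shards = [data[0:4], data[4:8]]              # one shard per simulated GPU

w = 0.0
for _ in range(200):
    grads = [local_gradient(w, s) for s in shards]   # parallel compute phase
    w -= 0.01 * all_reduce_mean(grads)               # synchronization phase

print(round(w, 3))  # converges toward 3.0
```

The alternation between the parallel compute phase and the synchronization phase in the loop is exactly the cyclical traffic pattern discussed next: no worker can proceed until the averaged gradient is available.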
Extensive research is ongoing to identify the optimal types and combinations of parallelization 17 18 19 20. A key aspect of network traffic in AI training is the cyclical nature of processing: phases of intense data transfer to GPUs are followed by phases where some GPUs wait for others to complete their tasks.
In data parallelization, all GPUs train on their data batches simultaneously and then wait for updated weights from other GPUs before proceeding. In model parallelization, GPUs simulating different layers of a neural network may experience waiting times for other GPUs to complete their layer-specific computations.
During these waiting periods, network latency and latency variation can significantly affect training efficiency. Since GPUs are costly in terms of both procurement and power consumption, minimizing their idle time is essential.
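A rough calculation shows why idle time matters at cluster scale. Every figure here is an assumption chosen to illustrate the arithmetic: the hourly GPU cost and the fraction of each step spent waiting on the slowest worker are not from the sources cited above.

```python
# Illustrative cost of GPU idle time during synchronization waits.
# All inputs are assumptions for the sake of the arithmetic.
NUM_GPUS = 10_000
GPU_COST_PER_HOUR = 2.0      # assumed $/GPU-hour
IDLE_FRACTION = 0.15         # assumed share of step time spent waiting
HOURS = 24 * 30              # one month of training

idle_cost = NUM_GPUS * GPU_COST_PER_HOUR * HOURS * IDLE_FRACTION
print(f"${idle_cost:,.0f} of GPU time spent idle per month")
```

Even a 15% idle fraction on a 10,000-GPU cluster wastes millions of dollars of GPU time per month under these assumptions, which is why the network's latency behavior is treated as a first-order design concern.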
Datacenter Architecture
Due to the critical importance of minimizing latency in AI networks, traditional TCP/IP architecture is generally avoided. The latency introduced by TCP, coupled with its high CPU usage, significantly increases the overall cost of the architecture. Instead, AI networks predominantly employ IP/UDP with credit-based congestion control mechanisms, as demonstrated in 21.
The absence of TCP’s reliability mechanisms means that any packet loss necessitates reprocessing the previous AI training run, a highly time-consuming process. Hence, packet loss is extremely undesirable. To mitigate this, AI networks often use credit-based architectures, where upstream nodes send credits indicating buffer availability, ensuring a lossless transmission system.
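The credit mechanism described above can be modeled in a few lines: the receiver grants credits equal to its free buffer slots, and the sender may only transmit while it holds credits, so the receiver can never be forced to drop. This is a toy single-threaded model of the idea, not a real link-layer protocol; all class and method names are illustrative.

```python
from collections import deque

class Receiver:
    def __init__(self, buffer_slots):
        self.buffer = deque()
        self.credits_granted = buffer_slots   # initial credit grant to sender

    def deliver(self, packet):
        self.buffer.append(packet)            # guaranteed to have room

    def consume(self):
        # Draining one packet frees a slot, returning one credit upstream.
        self.buffer.popleft()
        return 1

class Sender:
    def __init__(self, credits):
        self.credits = credits

    def try_send(self, receiver, packet):
        if self.credits == 0:
            return False                      # must wait: lossless by design
        self.credits -= 1
        receiver.deliver(packet)
        return True

rx = Receiver(buffer_slots=2)
tx = Sender(credits=rx.credits_granted)

sent = [tx.try_send(rx, p) for p in ("p1", "p2", "p3")]
print(sent)                                   # [True, True, False]
tx.credits += rx.consume()                    # a credit returns upstream
print(tx.try_send(rx, "p3"))                  # True
```

The third packet is held at the sender rather than dropped at the receiver, which is the property that spares AI training runs from loss-triggered recomputation.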
InfiniBand and Ultra Ethernet are prime examples of data center networks designed for AI workloads. Both utilize Remote Direct Memory Access (RDMA) 22, allowing the network interface card (NIC) to directly write into GPU memory, bypassing the CPU and achieving microsecond-level latency.
In contrast, wide-area IP/MPLS networks, which predominantly carry TCP traffic, are characterized by packet loss under congestion and considerably higher per-node latency. Consequently, interconnecting AI data centers over current wide-area network technologies is not a practical solution.
Overcoming Challenges in Inter-Data Center Communication
By directly interconnecting AI data centers using dedicated wavelengths over wide-area networks, we can effectively address the limitations of traditional networking for AI training workloads. Here’s how this approach tackles the key challenges:
- Latency Mitigation: While AI data centers typically experience 10-20 microseconds of latency due to their spine-leaf architectures with 3-5 switches between GPUs 23, direct wavelength connections in metro regions introduce a propagation delay on the order of 150 microseconds: light in fiber travels roughly 5 microseconds per kilometer, so about 150 microseconds one way over 30 km and up to 250 microseconds at 50 km. This increased latency remains manageable, especially with asynchronous algorithms designed to tolerate higher latencies in data parallelization 24. Model parallelization, however, may require strategic grouping of GPUs to keep latency low within groups while accepting higher latency between groups.
- Congestion Control: AI data centers typically use IP/UDP with credit-based congestion control. With a direct wavelength to the other data center, these credit-based islands are joined together without their traffic ever traversing TCP/IP-based wide-area networks.
- Enhanced Security: The direct wavelength connection creates an isolated environment between AI data centers, inherently preventing attacks originating from the wider IP network since these networks are not directly interconnected.
- Scalability: AI data centers can generate immense traffic volumes. A 10,000-GPU data center with 40 Gbps links per GPU and 10% peak occupancy could generate 40 Tbps of traffic, rivaling the peak traffic of AT&T's global network 25. To accommodate such volumes, Communication Service Providers (CSPs) may need to deploy all-optical metro networks utilizing ROADMs (Reconfigurable Optical Add-Drop Multiplexers) and the L-band spectrum to support a larger number of wavelengths.
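The latency and traffic estimates in the bullets above can be reproduced with two short calculations. The 5 µs/km figure for light in fiber is a standard rule of thumb; the GPU count, link speed, and occupancy are the assumptions stated in the text.

```python
# Worked versions of the two bullet-point estimates above.

# Propagation delay over a metro-scale direct wavelength
FIBER_DELAY_US_PER_KM = 5.0     # rule of thumb for light in fiber
metro_km = 30
one_way_us = metro_km * FIBER_DELAY_US_PER_KM
print(f"{metro_km} km one-way: {one_way_us:.0f} us")    # 150 us

# Aggregate traffic from a shared AI data center
num_gpus = 10_000
gbps_per_gpu = 40
peak_occupancy = 0.10
total_tbps = num_gpus * gbps_per_gpu * peak_occupancy / 1000
print(f"Peak inter-DC traffic: {total_tbps:.0f} Tbps")  # 40 Tbps
```

Both results match the figures in the bullets: metro propagation delay is roughly an order of magnitude above intra-data-center latency but still in the manageable range, while the aggregate bandwidth demand motivates the all-optical metro designs mentioned above.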
The Role of Communication Service Providers (CSPs)
The CSPs would act as facilitators between players with smaller numbers of GPUs. When a group of GPUs becomes available, its owner signals that availability to the CSP, which can then connect it to a user requiring those GPUs via a directly switched optical wavelength. CSPs are uniquely positioned for this role: they can connect the two parties via wavelengths, and they know which wavelengths are available for a specific connection. This function could be implemented via optical GMPLS 26, which allows a data center (or network user) to request a wavelength to a destination and lets the network respond automatically. In metro networks, distances are usually short enough that optical (3R) regenerators are unnecessary, so a change in a wavelength's bandwidth can be accommodated without changes to the network. This may also require flex-grid ROADMs 27, which allow wavelengths of different speeds to coexist in the same optical network.
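The CSP matchmaking role described above can be sketched as a simple wavelength broker: track which wavelengths on a metro link are free, assign one to each request connecting an available GPU pool to a user, and return it to the pool on release. Real GMPLS signaling (RFC 3471) is far richer; every class, method, and wavelength value here is illustrative.

```python
# Toy sketch of a CSP assigning metro wavelengths to GPU-sharing requests.

class WavelengthBroker:
    def __init__(self, wavelengths):
        self.free = set(wavelengths)       # unused lambdas (nm) on this link
        self.assigned = {}                 # (gpu_pool, user) -> lambda

    def request(self, gpu_pool, user):
        if not self.free:
            return None                    # no capacity: request is blocked
        lam = min(self.free)               # simple first-fit assignment policy
        self.free.remove(lam)
        self.assigned[(gpu_pool, user)] = lam
        return lam

    def release(self, gpu_pool, user):
        lam = self.assigned.pop((gpu_pool, user))
        self.free.add(lam)                 # wavelength returns to the pool

broker = WavelengthBroker(wavelengths=[1550.12, 1550.92, 1551.72])
print(broker.request("pool-A", "user-1"))  # 1550.12
print(broker.request("pool-B", "user-2"))  # 1550.92
broker.release("pool-A", "user-1")
print(broker.request("pool-C", "user-3"))  # 1550.12 again (reused)
```

First-fit is only one possible policy; a flex-grid network would additionally track spectrum width per wavelength, since different connection speeds occupy different slices of the grid.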
Conclusion
Addressing the environmental impact of AI training is a pressing concern. CSPs can play a crucial role in enabling shared AI data centers, facilitating resource sharing, and minimizing the carbon footprint of AI development. By leveraging their network infrastructure and expertise, CSPs can empower smaller players to participate in AI innovation and contribute to a more sustainable and inclusive AI ecosystem.
- UBS: ChatGPT may be the fastest growing app of all time, 2023. ↩︎
- Strubell, Emma, et al. “Energy and policy considerations for deep learning in NLP.” arXiv preprint arXiv:1906.02243 (2019). ↩︎
- Amodei, Dario, et al. “Concrete problems in AI safety.” arXiv preprint arXiv:1606.06565 (2016). ↩︎
- Ampere (microarchitecture). ↩︎
- U.S. Energy Information Administration / Electricity. ↩︎
- We’re getting a better idea of AI’s true carbon footprint. ↩︎
- Frankle, J., & Carbin, M. (2019). The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR). ↩︎
- Jacob, B., Kligys, S., Chen, B., Zhu, M., Tang, M., Howard, A., … & Kalenichenko, D. (2018). Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2704-2713). ↩︎
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531. ↩︎
- Du, N., Li, Y., Lu, Y., Zhou, J., Xue, N., Liu, H., … & Zhou, M. (2022). GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. In International Conference on Machine Learning (pp. 5513-5527). PMLR. ↩︎
- Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., … & Wu, H. (2017). Mixed precision training. arXiv preprint arXiv:1710.03740. ↩︎
- Han, S., Mao, H., & Dally, W. J. (2016, October). Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In ICLR (Poster). ↩︎
- Venkataramani, S., Ranjan, A., Raghunathan, A., & Roy, K. (2014, June). AxNN: energy-efficient neuromorphic systems using approximate computing. In 2014 47th IEEE/ACM International Symposium on Microarchitecture (MICRO) (pp. 27-38). IEEE. ↩︎
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … & Polosukhin, I. (2017). Attention is all you need. In Advances in neural information processing systems (pp. 5998-6008). ↩︎
- Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., … & He, K. (2017). Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677. ↩︎
- Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053 ↩︎
- Z. Jia, M. Zaharia, and A. Aiken. Beyond data and model parallelism for deep neural networks. SysML, 2019. ↩︎
- L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing, J. E. Gonzalez, and I. Stoica. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 559–578, Carlsbad, CA, July 2022. USENIX Association. ↩︎
- C. Unger, Z. Jia, W. Wu, S. Lin, M. Baines, C. E. Q. Narvaez, V. Ramakrishnaiah, N. Prajapati, P. McCormick, J. Mohd-Yusof, X. Luo, D. Mudigere, J. Park, M. Smelyanskiy, and A. Aiken. Unity: Accelerating DNN training through joint optimization of algebraic transformations and parallelization. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 267–284, Carlsbad, CA, July 2022. USENIX Association. ↩︎
- M. Wang, C.-c. Huang, and J. Li. Supporting very large models using automatic dataflow graph partitioning. In Proceedings of the Fourteenth EuroSys Conference 2019, EuroSys ’19, New York, NY, USA, 2019. Association for Computing Machinery. ↩︎
- Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang. Congestion control for large-scale RDMA deployments. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication, SIGCOMM ’15, pages 523–536, New York, NY, USA, 2015. Association for Computing Machinery. ↩︎
- Remote direct memory access. ↩︎
- Differences between InfiniBand and Ethernet Networks. ↩︎
- Large Scale Distributed Deep Networks. ↩︎
- AT&T Business FAQ ↩︎
- RFC 3471 – Generalized Multi-Protocol Label Switching (GMPLS) Architecture: Provides an overview of the GMPLS architecture and its application to various network technologies, including optical networks. ↩︎
- What is a Flex-grid ROADM? ↩︎