Huawei Cloud at KubeCon EU 2024: Unleashing the Intelligent Era with Continuous Open Source Innovation
PARIS, March 25, 2024 /PRNewswire/ -- At KubeCon + CloudNativeCon Europe 2024, held in Paris on March 21, Dennis Gu, Chief Architect of Huawei Cloud, pointed out, in a keynote speech titled "Cloud Native x AI: Unleashing the Intelligent Era with Continuous Open Source Innovation", that the integration of cloud native and AI technologies is crucial for driving industry transformation. Huawei Cloud plans to keep innovating open source projects and collaborating with developers to bring about an intelligent era.
AI poses key challenges to the cloud native paradigm.
In recent years, cloud native technologies have revolutionized traditional IT systems and accelerated digital advancements in areas such as the Internet and government services. Through microservice governance, cloud native has opened up new possibilities, such as lightning-fast service rollout and agile operations like DevOps. These changes have had a significant impact on people's lives. Meanwhile, the rapid growth and widespread adoption of AI, including large-scale models, have become core to industry intelligence.
According to a 2023 survey by Epoch, the compute required for foundation models has been increasing tenfold every 18 months, five times faster than the growth rate Moore's Law predicts for general-purpose compute. The emergence of this "New Moore's Law" driven by AI, together with the prevalence of large-scale AI models, presents challenges for cloud native technologies. In his speech, Dennis Gu outlined the following key points:
- Low average GPU/NPU utilization drives up the cost of AI training and AI inference.
- Frequent failures of large model training clusters decrease training efficiency.
- The complex configuration of large-scale models imposes demanding requirements on AI development.
- Deploying large-scale AI inference carries the risk of unpredictable end-user access delays and involves potential data privacy issues.
Huawei Cloud's AI innovations offer developers ideas for tackling these challenges.
The increasing size of AI models demands more compute, which challenges cloud native technologies but also creates opportunities for innovation in the industry. Dennis Gu shared stories of Huawei Cloud's AI innovation, offering developers a reference point for tackling these challenges.
Huawei Cloud used KubeEdge, a cloud native edge computing platform, to create a multi-robot scheduling and management platform. With this platform, users issue natural language commands, and the system coordinates multiple robots at the edge to accomplish complex tasks. The system is designed with a three-part architecture (cloud, edge node, and robot) to address challenges such as natural language comprehension, efficient scheduling and management of multiple robots, and cross-type robot access management. It uses large models to interpret natural language commands and performs traffic prediction, task assignment, and route planning. The three-part architecture greatly improves the flexibility of the robot platform, raising management efficiency by 25%, cutting the time required for system deployment by 30%, and reducing the time needed to deploy new robots from months to days.
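The release does not detail the implementation, but the division of labor in such a three-part architecture can be sketched in a few lines. In the sketch below, all names (Task, Robot, EdgeScheduler, parse_command) are hypothetical illustrations rather than KubeEdge APIs: the cloud tier stands in for the large model that decomposes a command, and the edge tier assigns the resulting tasks to capable robots.

```python
# Illustrative sketch of a cloud-edge-robot task flow (hypothetical names,
# not actual KubeEdge APIs). The cloud tier turns a natural language command
# into discrete tasks; the edge tier assigns them to available robots.
from dataclasses import dataclass

@dataclass
class Task:
    action: str          # e.g. "deliver", "inspect"
    target: str          # e.g. a warehouse or dock identifier

@dataclass
class Robot:
    name: str
    capabilities: set    # actions this robot type supports
    busy: bool = False

def parse_command(command: str) -> list[Task]:
    """Cloud tier: stand-in for a large model that decomposes a natural
    language command into structured tasks. A real system would call an
    LLM here; this stub handles a single phrase."""
    if "deliver" in command:
        return [Task("deliver", "warehouse-A"), Task("deliver", "dock-3")]
    return []

class EdgeScheduler:
    """Edge tier: assigns tasks to idle robots that can perform them."""
    def __init__(self, robots: list[Robot]):
        self.robots = robots

    def assign(self, tasks: list[Task]) -> dict[str, Task]:
        plan = {}
        for task in tasks:
            for robot in self.robots:
                if not robot.busy and task.action in robot.capabilities:
                    robot.busy = True
                    plan[robot.name] = task
                    break
        return plan

robots = [Robot("agv-1", {"deliver"}), Robot("drone-1", {"inspect"})]
plan = EdgeScheduler(robots).assign(parse_command("deliver the parcels"))
print(plan)   # {'agv-1': Task(action='deliver', target='warehouse-A')}
```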
For one leading content sharing platform in China, with over 100 million monthly active users, the primary service is the recommendation feed on the homepage, powered by a model with almost 100 billion parameters. To train this model, the platform uses a training cluster with thousands of compute nodes, including hundreds of parameter servers (PS) and workers for a single training task, which creates strong demand for better topology scheduling, high performance, and high throughput. Volcano, an open source project, enhances support for AI and machine learning workloads on Kubernetes and offers a range of job management and advanced scheduling policies. Volcano incorporates algorithms such as topology-aware scheduling, bin packing, and Service Level Agreement (SLA)-aware scheduling, resulting in a 20% improvement in overall training performance and a significant reduction in O&M complexity for the platform.
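To make this concrete, below is a minimal sketch of a Volcano Job with separate PS and worker task groups, submitted through the Kubernetes Python client. The replica counts, image names, queue, and namespace are placeholders; the schema follows Volcano's published batch.volcano.sh/v1alpha1 API, and minAvailable enforces gang scheduling so the job starts only when every member can be placed.

```python
# Sketch of a Volcano Job manifest for a PS/worker training task, submitted
# via the Kubernetes custom-objects API. Replica counts, images, queue, and
# namespace are placeholders; the schema follows Volcano's batch API.
from kubernetes import client, config

def training_job(name: str, ps: int, workers: int) -> dict:
    def task(role: str, replicas: int, image: str) -> dict:
        return {
            "name": role,
            "replicas": replicas,
            "template": {"spec": {
                "containers": [{
                    "name": role, "image": image,
                    "resources": {"limits": {"nvidia.com/gpu": 1}}}],
                "restartPolicy": "OnFailure"}},
        }
    return {
        "apiVersion": "batch.volcano.sh/v1alpha1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": "volcano",      # hand the job to Volcano
            "minAvailable": ps + workers,    # gang scheduling: all-or-nothing
            "queue": "default",
            "tasks": [task("ps", ps, "train:latest"),
                      task("worker", workers, "train:latest")],
        },
    }

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="batch.volcano.sh", version="v1alpha1", namespace="default",
    plural="jobs", body=training_job("recsys-train", ps=2, workers=8))
```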
Serverless AI is at the forefront of cloud native development.
Many enterprises and developers face the challenge of running AI applications efficiently and reliably while minimizing operating costs. Huawei Cloud has developed a solution to this problem by identifying the key requirements of cloud native AI platforms and introducing a new concept called Serverless AI.
During his speech, Dennis Gu explained that Serverless AI is designed to simplify complex training and inference tasks by intelligently recommending parallel policies, making it easier for developers to use. It also includes an adaptive GPU/NPU automatic expansion function that dynamically adjusts resource allocation based on real-time workload changes, ensuring efficient task execution. Additionally, Serverless AI provides fault-free GPU/NPU clusters, freeing developers from concerns that hardware faults may interrupt services. Most importantly, Serverless AI is compatible with mainstream AI frameworks, allowing developers to easily integrate their existing AI tools and models.
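The speech described the adaptive expansion function only at a high level. The following is a minimal sketch of the underlying control-loop idea, assuming a hypothetical metrics source (get_gpu_utilization) and scaling hook (set_replicas); the thresholds and the proportional rule are illustrative, not Huawei Cloud's actual algorithm.

```python
# Minimal sketch of adaptive GPU/NPU scaling: a control loop that grows or
# shrinks replicas toward a target utilization band. get_gpu_utilization and
# set_replicas are hypothetical stand-ins for a metrics source and a scaling
# API; the thresholds are illustrative only.
import math
import time

TARGET, LOW, HIGH = 0.6, 0.3, 0.8   # target band for mean utilization

def autoscale(get_gpu_utilization, set_replicas, replicas,
              min_r=1, max_r=64, interval=30):
    while True:                             # would run as a daemon loop
        util = get_gpu_utilization()        # mean utilization in [0, 1]
        if util > HIGH:                     # overloaded: scale out
            replicas = min(max_r, math.ceil(replicas * util / TARGET))
        elif util < LOW:                    # underused: scale in
            replicas = max(min_r, math.ceil(replicas * util / TARGET))
        set_replicas(replicas)
        time.sleep(interval)                # evaluation interval
```

The proportional rule (desired = current x utilization / target) mirrors the approach used by Kubernetes' Horizontal Pod Autoscaler, which is one plausible baseline for this kind of elasticity.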
Serverless AI is also a very significant development for cloud service providers. Serverless AI provides multiple benefits like improved GPU/NPU utilization, more efficient hybrid workloads for training, inference, and development, and green computing through better energy efficiency, so you can save money on electricity. Furthermore, Serverless AI enables GPU/NPU sharing among multiple tenants in difference spaces or at different time, improving the resource reuse rate. The most significant aspect of Serverless AI is its ability to provide guaranteed Quality of Service (QoS) and SLAs for both training and inference tasks, ensuring stable and high-quality service.
Serverless AI uses a flexible resource scheduling layer built on a virtualized operating system, which encapsulates essential functions of application frameworks into an application resource mediation layer. Dennis Gu presented the reference architecture for Serverless AI and argued that this design allows Serverless AI to automatically drive large-scale AI resources: accurately analyzing resource usage patterns, sharing resources from heterogeneous hardware pools, and ensuring fault tolerance during AI training tasks through GPU/NPU virtualization and live load migration. Additionally, multi-dimensional scheduling and adaptive elastic scaling improve resource utilization.
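The reference architecture itself was presented as a diagram. As a rough illustration of the mediation idea, the sketch below matches accelerator requests against a pool of virtualized GPU/NPU partitions; all names are hypothetical, and real virtualization and live migration operate far below this level.

```python
# Illustrative sketch of an "application resource mediation" layer: requests
# for accelerator capacity are matched against a heterogeneous pool of GPU
# and NPU partitions rather than whole physical devices. Hypothetical names.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Partition:
    kind: str       # "gpu" or "npu"
    mem_gb: int
    free: bool = True

class MediationLayer:
    def __init__(self, pool: list[Partition]):
        self.pool = pool

    def acquire(self, kind: str, mem_gb: int) -> Optional[Partition]:
        """Pick the smallest free partition of the right kind that fits,
        keeping large partitions available for large requests."""
        fits = [p for p in self.pool
                if p.free and p.kind == kind and p.mem_gb >= mem_gb]
        if not fits:
            return None
        best = min(fits, key=lambda p: p.mem_gb)
        best.free = False
        return best

pool = [Partition("gpu", 8), Partition("gpu", 24), Partition("npu", 16)]
layer = MediationLayer(pool)
print(layer.acquire("gpu", 6))   # picks the 8 GB partition, not the 24 GB one
```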
At the sub-forum, technical experts from Huawei Cloud noted that AI and machine learning workloads running on Kubernetes have been steadily increasing. As a result, numerous companies are building cloud native AI platforms over multiple Kubernetes clusters spread across data centers and a diverse range of GPU types. Karmada and Volcano can intelligently schedule GPU workloads across multiple clusters, supporting failover and ensuring consistency and efficiency within and across clusters. They can also balance system-wide resource utilization against the QoS of workloads with different priorities, addressing the challenges of managing large-scale, heterogeneous GPU environments.
Karmada offers immediate, reliable automatic application management in multi-cloud and hybrid cloud scenarios, and an increasing number of users are relying on it to build adaptable and effective solutions in production environments. Karmada was officially promoted to a CNCF incubating project in 2023, and the community looks forward to more partners and developers joining.
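As a concrete example of such multi-cluster distribution, a Karmada PropagationPolicy can spread a GPU workload across member clusters and divide its replicas by available capacity. The cluster and workload names below are placeholders; the fields follow Karmada's documented policy.karmada.io/v1alpha1 schema.

```python
# Sketch of a Karmada PropagationPolicy that spreads a GPU inference
# Deployment across two member clusters, dividing replicas dynamically by
# each cluster's remaining capacity. Names are placeholders.
import json

policy = {
    "apiVersion": "policy.karmada.io/v1alpha1",
    "kind": "PropagationPolicy",
    "metadata": {"name": "gpu-inference-policy"},
    "spec": {
        "resourceSelectors": [{
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "gpu-inference",          # workload to propagate
        }],
        "placement": {
            "clusterAffinity": {"clusterNames": ["cluster-a", "cluster-b"]},
            "replicaScheduling": {
                "replicaSchedulingType": "Divided",
                "replicaDivisionPreference": "Weighted",
                # Weight clusters by how many replicas they can still fit,
                # so a cluster short on GPUs receives fewer replicas.
                "weightPreference": {"dynamicWeight": "AvailableReplicas"},
            },
        },
    },
}

print(json.dumps(policy, indent=2))   # render the manifest for review
```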
Volcano Gang Scheduling, a solution for AI distributed training and big data scenarios, addresses the issues of endless waiting and deadlock in distributed training tasks. With task-topology and I/O-aware scheduling, the transmission delay of distributed training tasks is minimized, improving training performance by 31%. Additionally, the minResources mechanism resolves resource contention between the Spark driver and executors in high-concurrency scenarios, optimizing the degree of parallelism and improving performance by 39.9%.
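For illustration, the sketch below shows a Volcano PodGroup of the kind a Spark application might use: minMember gangs the driver with its executors, and minResources reserves capacity for the whole group up front, so a flood of concurrent jobs cannot admit drivers whose executors can never start. The member count and quantities are placeholders; the fields follow Volcano's scheduling.volcano.sh/v1beta1 schema.

```python
# Sketch of a Volcano PodGroup for a Spark application. minMember enforces
# gang admission (driver + executors together); minResources reserves the
# gang's total capacity to avoid driver/executor contention under high
# concurrency. Quantities are placeholders.
podgroup = {
    "apiVersion": "scheduling.volcano.sh/v1beta1",
    "kind": "PodGroup",
    "metadata": {"name": "spark-etl-pg", "namespace": "default"},
    "spec": {
        "minMember": 5,               # 1 driver + 4 executors
        "minResources": {             # reserved for the whole gang
            "cpu": "10",
            "memory": "40Gi",
        },
        "queue": "default",
    },
}
```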
Dennis Gu believes that the key to improving AI productivity lies in the agility of cloud native technologies and the innovation of heterogeneous AI computing platforms. Huawei Cloud is dedicated to open source innovation and aims to work with industry peers to usher in a new era of intelligence.
Photo - https://mma.prnewswire.com/media/2370741/Dennis_Gu_Chief_Architect_Huawei_Cloud.jpg