100 Trillion Parameter AI Training Models
Recommender AI systems are an important component of Internet services today: billion dollar revenue businesses like Amazon and Netflix are directly driven by recommendation services.

AI recommenders get better as they get bigger. Several models have been previously released with billion parameters up to even trillion very recently. Every jump in the model capacity has brought in significant improvement on quality. The era of 100 trillion parameters is just around the corner.

The complicated dense part of the neural network is increasingly computation-intensive, requiring more than 100 TFLOPs per training iteration. A sophisticated mechanism is therefore needed to manage a cluster of heterogeneous resources for such training tasks.

Recently, Kwai Seattle AI Lab and the DS3 Lab from ETH Zurich collaborated to propose a novel system named "Persia" to tackle this problem through careful co-design of both the training algorithm and the training system. At the algorithm level, Persia adopts a hybrid training algorithm that handles the embedding layer and the dense neural network modules differently. The embedding layer is trained asynchronously to improve the throughput of training samples, while the rest of the neural network is trained synchronously to preserve statistical efficiency. At the system level, a wide range of system optimizations for memory management and communication reduction have been implemented to unleash the full potential of the hybrid algorithm.

Cloud Resources for 100 Trillion Parameter AI Models

The Persia 100-trillion-parameter AI workload runs on the following heterogeneous resources:

3,000 cores of compute-intensive Virtual Machines


8 A2 Virtual Machines providing a total of 64 NVIDIA A100 GPUs


30 High Memory Virtual Machines, each with 12 TB of RAM, totalling 360 TB


Orchestration with Kubernetes


All resources had to be launched concurrently in the same zone to minimize network latency. Google Cloud was able to provide the required capacity with very little notice.
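A back-of-envelope calculation shows why roughly 360 TB of aggregate RAM is the right order of magnitude for this model. The bytes-per-parameter figures below are our own assumptions for illustration, not Persia's actual storage layout; most of the 100 trillion parameters sit in sparse embedding tables held in host memory.

```python
# Back-of-envelope storage estimate for a 100-trillion-parameter model.
# Precision choices here are illustrative assumptions, not Persia's layout.
PARAMS = 100e12

for label, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2)]:
    terabytes = PARAMS * bytes_per_param / 1e12  # decimal TB
    print(f"{label}: {terabytes:.0f} TB")

# fp32 storage would need ~400 TB, while half precision needs ~200 TB,
# which fits within the 360 TB of RAM across the 30 high-memory VMs.
```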

AI training needs resources in bursts.

Google Kubernetes Engine (GKE) was utilized to orchestrate the deployment of the 138 VMs and software containers. Containerizing the workload also makes the training portable and repeatable.

Results and Conclusions


With the support of the Google Cloud infrastructure, the team demonstrated Persia’s scalability up to 100 trillion parameters. The hybrid distributed training algorithm introduced elaborate system relaxations for efficient utilization of heterogeneous clusters, while converging as fast as vanilla SGD. Google Cloud was essential to overcome the limitations of on-premise hardware and proved an optimal computing environment for distributed Machine Learning training on a massive scale.

Persia has been released as an open-source project on GitHub, with setup instructions for Google Cloud, so that everyone in both academia and industry can easily train deep learning recommender models at the 100-trillion-parameter scale.
