IBM Resets the Bar for Deep Learning Performance


Artificial intelligence (AI) and other computing forms that mimic human intelligence elicit so much interest and controversy that it’s easy to forget their reliance on highly complex and robust systems. That point underlies the breakthrough in Distributed Deep Learning (DDL) software announced last week by IBM Research.

Saying IBM has blown away what people can expect from and achieve with Deep Learning might seem like an overstatement. However, comparing the company’s results to other DDL efforts makes it difficult to come to any other conclusion.

Deep Learning: The backstory

Concepts around AI arose among computing professionals in the 1950s but gained traction in popular science fiction stories and films far more quickly than they did in the real world. Isaac Asimov’s I, Robot, HAL in 2001: A Space Odyssey, R2-D2 and C-3PO in Star Wars, and the cyborg assassins in The Terminator films all set a high bar for what might eventually be achieved by blending human and machine characteristics.

But the real world moves more slowly and messily than science fiction, especially when it comes to ground-breaking, computationally intensive (i.e. pricey) projects. Early AI examples relied on cumbersome software routines that enabled systems to perform simple tasks. The emergence of machine learning (ML) enabled major leaps forward by using algorithms and large data sets to train systems to analyze and learn from information, and to determine next steps or form predictions.

Deep Learning (DL) complements ML, and the two share enough similarities to make them easy to confuse, but DL is also distinct. Rather than relying on the kinds of algorithms ML does, DL projects leverage neural networks (computational processes and networks that mimic functions of the human brain) and huge data sets to “train” systems for AI.

Not surprisingly, DL training processes are similar to those people use to learn new languages, mathematics and other skills. For example, language students utilize a variety of repetition-based tools, including group recitation of verb/noun forms, flash cards, games and rote memorization to steadily grow their knowledge and capabilities.

Similarly, DL uses automated processes to train AI systems, quickly repeating “lessons” thousands or tens of thousands of times. Plus, GPU-based neural networks enable systems to ingest multiple threads of information and share information across the system, enabling them to “multi-task” in their training far better than human students can. As a result, AI systems can be trained for areas like enhanced speech and visual recognition, medical image analysis and improved fraud detection.
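To make that repetition concrete, here is a minimal, hypothetical sketch in Python (a toy illustration, not IBM’s code): a simple model is shown the same small set of labeled examples thousands of times, and its parameters are nudged slightly after each pass.

```python
# A toy sketch of the repetition at the heart of DL training (not IBM's code):
# the model sees the same labeled examples thousands of times, adjusting its
# weights slightly after each pass until its predictions improve.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                 # 200 toy "lessons" with 2 features each
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # labels the model must learn
w, b = np.zeros(2), 0.0                       # parameters start "untrained"

for epoch in range(10_000):                   # repeat the lessons thousands of times
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))    # current predictions (sigmoid)
    grad_w = X.T @ (p - y) / len(y)           # how wrong we are, per weight
    grad_b = np.mean(p - y)
    w -= 0.1 * grad_w                         # small correction each repetition
    b -= 0.1 * grad_b

print("training accuracy:", np.mean((p > 0.5) == y))
```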

Dr. Hillery Hunter, the IBM Fellow who led the company’s DDL effort, compared the challenge of coordinating many separate learners to the parable of the “Blind Men and the Elephant.” Each man describes the elephant according to the individual body part he touches, but those partial experiences lead them all to a fundamental misunderstanding.

Hunter noted that given enough time, the group might “share enough information to piece together a pretty accurate collective picture of an elephant.” Similarly, information can be shared and synched across multiple graphics processing units (GPUs) to develop AI capabilities.
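The parable maps neatly onto how distributed training is commonly kept in sync: each worker computes gradients on its own slice of the data, then an “allreduce” step combines those partial views into one shared update. The sketch below is an illustrative, framework-agnostic example using mpi4py; it is not IBM’s DDL library, whose internals the announcement does not detail.

```python
# Illustrative sketch (not IBM's DDL) of how distributed training shares partial
# "views" of the data: each worker computes a gradient on its own shard, then an
# allreduce averages those gradients so every worker applies the same update.
# Run with, e.g.:  mpirun -np 4 python allreduce_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each worker's local gradient (a stand-in for what one GPU computes on its shard).
local_grad = np.full(4, float(rank))

# Sum all workers' gradients in place, then divide to get the shared average.
comm.Allreduce(MPI.IN_PLACE, local_grad, op=MPI.SUM)
local_grad /= size

print(f"worker {rank} applies the shared gradient {local_grad}")
```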

IBM’s DDL code fixes GPU shortcomings

GPU advances have drastically reduced the cost of neural networks and DL, bringing them to the fore of AI development. However, GPU-based systems also suffer some significant, fundamental challenges, including scaling limitations and communication bottlenecks that prevent ever-faster GPUs from effectively synching and exchanging information beyond small configurations.

An example: while popular DL frameworks, including TensorFlow, Caffe, Torch and Chainer, can efficiently leverage multiple GPUs in a single system, scaling to multiple clustered servers with GPUs is difficult at best.
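For context, here is a hedged sketch of that “easy” single-server case using TensorFlow’s MirroredStrategy (one common approach among several; again, not IBM’s DDL): the model is replicated on each local GPU and every batch is split among the replicas. Extending the same idea efficiently across dozens of networked servers is the hard part that DDL targets.

```python
# Illustrative sketch (assumption: TensorFlow 2.x; not IBM's DDL library) of
# single-server, multi-GPU data parallelism, the case today's frameworks
# already handle well. The model is mirrored onto each local GPU and every
# batch is split across those replicas automatically.
import numpy as np
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()          # uses all GPUs in this one server
print("local replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():                               # variables created here are mirrored
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy")

# Toy stand-in data; a real image-recognition run would stream ImageNet batches.
x = np.random.rand(1024, 32).astype("float32")
y = np.random.randint(0, 10, size=(1024,))
model.fit(x, y, batch_size=256, epochs=2)
```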

That doesn’t mean DL scaling is impossible. In June, Facebook’s AI Research (FAIR) team posted record scaling results for a cluster with 256 NVIDIA P100 GPUs. The team used a ResNet-50 neural network model with the relatively small ImageNet-1K dataset (about 1.3 million images) and a large batch size (8,192 images). With that configuration, FAIR achieved a respectable 89 percent scaling efficiency for a visual image training run on Caffe2.
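Scaling efficiency is simply the fraction of the ideal, perfectly linear N-GPU speedup a cluster actually delivers. A quick sketch of the arithmetic, using hypothetical throughput numbers rather than FAIR’s or IBM’s actual figures:

```python
# Back-of-the-envelope view of "scaling efficiency": how much of the ideal
# N-GPU speedup a cluster actually delivers. The throughput numbers below are
# hypothetical placeholders, not figures reported by FAIR or IBM.
def scaling_efficiency(single_gpu_imgs_per_sec: float,
                       cluster_imgs_per_sec: float,
                       num_gpus: int) -> float:
    ideal = single_gpu_imgs_per_sec * num_gpus     # perfect linear scaling
    return cluster_imgs_per_sec / ideal

# Example: if one GPU processed 100 images/sec (hypothetical), 89 percent
# efficiency on 256 GPUs means the cluster sustains about 22,784 images/sec
# instead of the ideal 25,600.
print(scaling_efficiency(100.0, 22_784.0, 256))    # -> 0.89
```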

In addition, the effectiveness of DL systems used for visual image training is measured according to dataset size, image recognition accuracy and the length of time required to complete a training run. Microsoft held the previous record on this front: utilizing an ImageNet-22k dataset, the company’s AI team achieved 29.8 percent recognition accuracy in a training run that took 10 days to complete.

How does IBM Research’s new DDL achievement compare? The team utilized a cluster of 64 IBM Power System servers with 256 NVIDIA P100 GPUs to perform numerous image recognition training runs on Caffe. For a ResNet-50 model using the same dataset as the Facebook team, the IBM Research team achieved a near-perfect 95 percent scaling efficiency.

The same IBM Power/NVIDIA cluster was used to train a ResNet-101 neural network model similar to the one used by Microsoft’s team (with an ImageNet-22k dataset and a batch size of 5,120). Despite the larger and considerably more complex dataset, the IBM cluster’s low communication overhead enabled a scaling efficiency of 88 percent, a smidgen less than FAIR’s best effort with a far smaller, simpler dataset.

But IBM also bested Microsoft’s results, delivering 33.8 percent image recognition accuracy. Most impressively, the IBM team completed the more complex ResNet-101 Caffe training run in just 7 hours, compared to the 10 days Microsoft required. In other words, IBM Research’s DDL library offered near-perfect clustered system scaling, supported highly complex datasets and neural networks, and delivered notably better training results in a tiny fraction of the previous world record time.
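A quick back-of-the-envelope calculation puts that time savings in perspective:

```python
# Rough arithmetic behind the headline comparison from the two reported runs:
# a 10-day run versus a 7-hour run is roughly a 34x reduction in time to result.
microsoft_hours = 10 * 24   # 240 hours
ibm_hours = 7
print(f"time-to-result improvement: ~{microsoft_hours / ibm_hours:.0f}x")
```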

Final analysis

Sumit Gupta, VP of IBM’s Machine Learning and AI organization, detailed the team’s efforts using traditional software and a single IBM Power Systems server (a “Minsky” S822LC for HPC) with four NVIDIA P100 GPUs to train a model with the ImageNet-22k dataset using a ResNet-101 neural network.

As Gupta joked, the 16 days required for the training run was “longer than any vacation I’ve taken.” Kidding aside, the results also highlight how inherent single-system limitations leave data scientists working with DL facing delayed time to insight and, in turn, limit their productivity. It’s hardly surprising that those barriers inevitably dampen enthusiasm for AI and related projects.

IBM’s DDL library is available in a preview version in the latest V4 release of IBM’s PowerAI software distribution (running on its Power Systems solutions). The PowerAI V4 release integrates the DDL library and APIs into TensorFlow and Caffe, and IBM Research has also developed a prototype integration for Torch.

Given the remarkable advances in scaling efficiency, training results and overall performance offered by these DDL innovations, it’s clear that IBM Research has blown away many of the inherent limitations of deep learning, along with customers’ expectations. By doing so, IBM has altered the artificial intelligence market landscape for the better.

Charles King, Pund-IT’s president and principal analyst, has deep communications expertise that makes him a valuable and trusted asset for clients. In addition, Charles regularly speaks with the mainstream and technical media on topics from emerging IT products to continuing market trends.