NVIDIA Corp. (NVDA) Shareholder/Analyst Conference (Transcript)
NVIDIA Corp. (NASDAQ:NVDA) Shareholder/Analyst Conference Call March 21, 2023 1:00 PM ET
Company Participants
Simona Jankowski – Vice President of Investor Relations
Colette Kress – Executive Vice President & Chief Financial Officer
Jensen Huang – Co-Founder, Chief Executive Officer & President
Conference Call Participants
Toshiya Hari – Goldman Sachs
C.J. Muse – Evercore
Joe Moore – Morgan Stanley
Tim Arcuri – UBS
Vivek Arya – Bank of America
Raji Gill – Needham
Simona Jankowski
Hi, everyone, and welcome to GTC. This is Simona Jankowski, Head of Investor Relations at NVIDIA. I hope you all had a chance to view [indiscernible] this morning. We also published the press releases and calls detailing today’s announcement. Over the next hour, we will have an opportunity to unpack and discuss today’s event with our CEO, Jensen Huang; and our CFO, Colette Kress, in an open Q&A session with financial analysts.
Before we begin, let me quickly cover our safe harbor statement. During today’s discussion, we may make forward-looking statements based on current expectations. These are subject to a number of significant risks and uncertainties, and our actual results may differ materially. For a discussion of factors that could affect our future financial results and businesses, please refer to our most recent Form 10-K and 10-Q and the reports that we may file on Form 8-K with the Securities and Exchange Commission. All our statements are made as of today based on information currently available to us. Except as required by law, we assume no obligation to update any such statements. We’ll start with a few brief comments by Jensen, followed by a Q&A session with Jensen and Colette Kress.
And with that, let me turn it over to Jensen.
Jensen Huang
Hi, everybody. Welcome to GTC. GTC is our conference for developers, to inspire the world on the possibility of accelerated computing and to celebrate the work of the researchers and scientists who use it. And so please be sure to check in on some of the conference sessions that we have; they cover some really amazing topics. The GTC keynote highlighted several things. Before I go into the slides: Colette and I will just cover basically the first slide; the rest of the slides we provided to you for reference.
But let me make a couple of comments first. At the core of computing today, the fundamental dynamic at work is, of course, influenced by one of the most important technology drivers in the history of any industry, Moore’s Law, which has come to a very significant slowdown. You could argue Moore’s Law has ended. For the very first time in history, it is no longer possible using general-purpose CPUs to gain the necessary throughput without a corresponding increase in cost or power. And that lack of decreasing power or decreasing cost is going to make it really hard for the world to continue to sustain increased workloads while maintaining the sustainability of computing.
So one of the most important factors, one of the most important dynamics in computing today is sustainability. We have to accelerate all the workloads we can so that we can reclaim the power and use whatever we reclaim to invest back into growth. And so the first thing that we have to do is to not waste power, to accelerate everything we possibly can, and to be really focused on sustainability. I gave several examples of workloads to highlight how, in many cases, we can accelerate an application by 40, 50, 60, 70 times, 100 times, while in the process decreasing power by an order of magnitude and decreasing cost by a factor of 20. This approach is not easy. Accelerated computing is a full stack challenge. NVIDIA accelerated computing is full stack. I’ve talked about that in many sessions in the past. It starts from the architecture to the system to the system software, to acceleration libraries to the applications on top.
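As a rough illustration of that power and cost arithmetic, here is a minimal sketch in Python. The node power draws and job runtime are hypothetical figures chosen for the example; only the 50x speedup and the order-of-magnitude energy reduction come from the ranges cited above.

```python
# Illustrative energy-to-solution arithmetic (assumed numbers, not measured data).
# Energy = power x time. A big speedup can cut total energy even if the
# accelerated node draws more instantaneous power than the CPU node.

cpu_node_watts = 400.0   # hypothetical CPU server power draw
gpu_node_watts = 2000.0  # hypothetical accelerated server power draw
speedup = 50.0           # within the 40x to 100x range cited above

cpu_job_hours = 10.0     # hypothetical time to run the workload on CPUs
gpu_job_hours = cpu_job_hours / speedup

cpu_energy_kwh = cpu_node_watts * cpu_job_hours / 1000.0  # 4.0 kWh
gpu_energy_kwh = gpu_node_watts * gpu_job_hours / 1000.0  # 0.4 kWh

print(f"energy reduction: {cpu_energy_kwh / gpu_energy_kwh:.0f}x")  # -> 10x
```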
We’re a data center scale computing architecture. And the reason for that is because once you refactor an application to be accelerated, the algorithms are highly parallelized. Once you do that, you can also scale out. So one of the benefits of accelerated computing from the work that we do is that you can scale up and you can also scale out. The combination of the two has allowed us to bring million-x acceleration factors to many application domains; of course, one of the very important ones is artificial intelligence.
NVIDIA’s accelerated computing platform is also multi-domain. This is really important because data centers, computers, are not single-use devices. What makes computers such an incredible instrument is their ability to process multiple types of applications. NVIDIA’s accelerated computing is multi-domain: particle physics, fluid dynamics, all the way to robotics, artificial intelligence, computer graphics, image processing, video processing, and so on and so forth. All of these types of domains consume an enormous amount of CPU cores [ph] today, enormous amounts of power. We have the opportunity to accelerate all of them and reduce power, reduce cost.
And then, of course, NVIDIA’s accelerated computing platform is cloud to edge. This is the only architecture that is available in every cloud. It’s available on-prem from just about every computer maker in the world, and it’s available at the edge for inferencing systems or autonomous systems, robotics, self-driving cars, so on and so forth. And then lastly, one of the most important characteristics of NVIDIA’s accelerated computing platform is that, although we do it full stack, and we design and architect it at data center scale, and it’s available from cloud to edge, it is completely open, meaning that you can access it from literally any computing platform, from any computer maker, anywhere in the world.
And so this is one of the most important characteristics of a computing platform. And it’s because of its openness, because of our reach, because of our acceleration capability, that the virtuous cycle of accelerated computing has now been achieved. Accelerated computing and artificial intelligence have arrived. We talked about 3 dynamics. One of them is sustainability, which I just mentioned. The second is generative AI. All of the foundational work that has been done over the last 10 years, the really big early breakthroughs in computer vision and perception that led to industrial revolutions in autonomous vehicles, robotics and such, that was just the tip of the iceberg.
And now with generative AI, we have gone beyond perception to the generation of information: no longer just the understanding of the world, but also making recommendations or generating content that is of great value. Generative AI has triggered an inflection point in artificial intelligence and has driven a step function increase in the adoption of AI all over the world and, very importantly, a step function increase in the amount of inference that will be deployed in all the world’s clouds and data centers.
And the third thing that I discussed in the keynote was digitalization. This is really about taking artificial intelligence to the next phase, the next wave of AI, where AI is not only operating on digital information, generating text and generating images, but is operating factories and physical plants and autonomous systems and robotics. In this particular case, digitalization has the real opportunity to automate some of the world’s largest industries. And I spoke about the digitalization of one particular industry. I gave examples of how Omniverse is the digital-to-physical operating system of industrial digitalization, and I demonstrated how Omniverse is used from the very beginning of product conception, the architecture and styling of product designs, all the way through collaboration on the design, the simulation of the product, the engineering of the electronics, the setting up of virtual plants, all the way to digital marketing and retail.
In every aspect of a physical products company, digitalization has the opportunity to automate, to help them collaborate, to bring the world of the physical into the world of the digital, and we know exactly what happens then. Once you get into the world of the digital, our ability to accelerate workflows, our ability to discover new product ideas, our ability to invent new business models, tremendously increases. And so I spoke about digitalization. There were 5 takeaways that we spoke about in the keynote and that we’ll talk about today, and if you have questions in any of these areas, we’d love to entertain them.
The first, of course, is that generative AI is driving accelerating demand for NVIDIA platforms. We came into the year full of enthusiasm with the Hopper launch. Hopper was designed with a transformer engine built for large language models and what people now call foundation models. The transformer engine has proven to be incredibly successful. Hopper has been adopted by just about every cloud service provider that I know of and is available from OEMs. The increase in demand for Hopper versus previous generations, and the acceleration of that demand, really signals an inflection for AI: it used to be about AI research, and now, with generative AI, it is moving into the deployment of AI into all of the world’s industries and, very importantly, a very significant step function in the inference of these AI models.
So generative AI is driving accelerating demand. The second thing is we talked about our new chips that are coming to the marketplace. We care deeply about accelerating every possible workload we can. One of the most important workloads, of course, is artificial intelligence. Another important workload to accelerate is the operating system of the entire data center. You have to imagine that these giant data centers are not computers; they’re fleets of computers that are orchestrated and operated as one giant system. So the operating system of the data center, which includes the containerization, the virtualization, networking, storage and, very importantly, security, the isolation and, in the future, the confidential computing of all of these applications, is a software-defined layer that runs across the entire data center fabric. That software layer consumes a lot of CPU cores.
And frankly, depending on the type of data center being operated, I wouldn’t be surprised if 20%, 30% of the data center’s power is dedicated just to the networking and the fabric and all of the virtualization and the software-defined stacks, basically the operating system stack. We want to offload and accelerate the operating system of modern software-defined data centers. And that processor is called BlueField. We announced a whole bunch of new partners and cloud data centers that have adopted BlueField. I’m very excited about this product. I really believe that this is going to be one of the most important contributions we make to modern data centers.
Some companies design their own; most companies won’t have the resources to design something of this complexity, and cloud data centers will be everywhere. We announced Grace Hopper, which is going to be used for some of the major inference workloads: vector databases, data processing, recommender systems. Recommender systems, as I’ve spoken about in the past, are probably one of the most valuable and most important applications in the world today, and a lot of digital commerce and a lot of digital content is made possible because of sophisticated recommender systems. Recommender systems are moving to deep learning, and this is a very important opportunity for us.
Grace Hopper was designed specifically for that and gives us an opportunity to get a 10x speedup in recommender systems and large databases. We spoke about Grace. Grace is now in production. Grace is also sampling. Grace is designed for the rest of the workloads in a cloud data center, the ones that are not possible to accelerate. Once we accelerate everything, what is left over is software that really wants very strong single-threaded performance. And that single-threaded performance is what Grace was designed for.
We also designed Grace not just to be the CPU of a fast computer, but to be the CPU of a very, very energy-efficient cloud data center. When you think about the entire data center as 1 computer, when the data center is the computer, then the way you design the CPU in the context of an accelerated, AI-first, cloud-first data center is radically different. We designed the Grace CPU, excuse me, [indiscernible] just slightly out of reach. This is the entire computer module. This isn’t just the CPU; this is the entire computer module of the Grace superchip. And this goes into a passively cooled [ph] system, and you could rack up a whole bunch of Grace computers into a cloud data center because it is so energy efficient and yet so performant for single-threaded operation. We’re really excited about Grace, and it’s sampling now.
Let’s see. We spoke a lot about generative AI and how it’s a step function increase in the amount of inference workload that we’re going to see. One of the things that’s really important about inference coming out of the world’s data centers is that, on the one hand, it really wants to be accelerated. On the other hand, it is multimodal, meaning there are so many different types of workloads that you want to inference. Sometimes you want to bring inference and AI to video and augment it with generative AI. Sometimes it’s images, producing a beautiful image and helping to be a co-creator.
Sometimes you’re generating text, very long text. The prompts could be quite long so that you can have a very long context, or it could be generating very long text, writing very long programs. And these applications, each one of them, video, images, text and, of course, also vector databases, all have different characteristics. Now the challenge, of course, is that in the cloud data center, on the one hand, you would like to have specialized accelerators for each one of those modalities, for each one of those diverse generative AI workloads.
On the other hand, you would like your data center to be fungible, because workloads are moving up and down. They’re very dynamic. New services are coming on, new tenants are coming on. People use different services during different times of day, and yet you would like your entire data center to be utilized as much as possible. The power of our architecture is that it is one architecture. You have one architecture with 4 different configurations. They all run our software stack, which means that, depending on the time of day, if one configuration is under-provisioned or underutilized, you can always provision that class of accelerators to other workloads.
And so this fungibility in the data center, one architecture, 4 inference configurations, one inference platform, gives you the ability to accelerate various workloads as best you can without having to perfectly, precisely predict the amount of each workload, because the entire data center is flexible and fungible. So: one architecture, 4 configurations. One of our biggest areas of collaboration and partnership is Google Cloud, GCP.
We’re working across a very large range of accelerated workloads, from data processing with Dataproc [ph] and Spark RAPIDS to accelerate data processing, which probably represents some 10%, 20%, 25% of cloud data center workloads. It’s probably one of the most CPU-core-intensive workloads. We have an opportunity to accelerate it, bring a 20x speedup, bring a lot of cost reduction that customers can enjoy and, very importantly, a lot of the power reduction associated with that. We’re also accelerating inference with the Triton server. We’re also accelerating their generative AI models. Google has world-class, pioneering large language models that we’re now accelerating and putting onto the inference platform, L4.
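As a hedged illustration of what that Spark acceleration looks like from the application side, here is a minimal PySpark configuration sketch for the RAPIDS Accelerator for Apache Spark. The jar path, resource amounts and storage paths are placeholders, and exact settings vary by cluster; the point is that the query code itself does not change.

```python
# Minimal sketch: enabling the RAPIDS Accelerator for Apache Spark so that
# unmodified Spark SQL / DataFrame code is planned onto GPUs. The jar path,
# resource amounts and bucket paths are placeholders, not a tuned config.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids-accelerated-etl")
    .config("spark.jars", "/path/to/rapids-4-spark.jar")   # placeholder path
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin") # RAPIDS plugin
    .config("spark.rapids.sql.enabled", "true")
    .config("spark.executor.resource.gpu.amount", "1")     # 1 GPU per executor
    .config("spark.task.resource.gpu.amount", "0.25")      # 4 tasks share it
    .getOrCreate()
)

# The query itself is unchanged; the acceleration is transparent to the code.
df = spark.read.parquet("gs://your-bucket/events")         # hypothetical path
df.groupBy("user_id").count().write.parquet("gs://your-bucket/daily_counts")
```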
And of course, streaming graphics and streaming video, we have an opportunity to accelerate those, too. So our 2 teams are collaborating to take a large number of workloads that could be accelerated, generative AI and other accelerated computing workloads, and accelerate them with the L4 platform, which has just gone public on GCP. So we’re really excited about that collaboration, and we have much more to tell you soon. The third thing that we talked about was acceleration libraries. As I mentioned before, accelerated computing is a full stack challenge, unlike a CPU, where software is written and compiled using a compiler and, because it’s general purpose, all code runs. That’s one of the wonderful advantages and breakthroughs of a CPU, its general-purpose nature.
With acceleration, if you want to accelerate workloads, you have to redesign the application, you have to refactor the algorithm altogether, and we codify the algorithms into acceleration libraries: libraries for everything from linear algebra to FFT to data processing, to fluid dynamics and particle physics and computer graphics, quantum chemistry, inverse physics for image reconstruction, so on and so forth.
Each one of these domains requires acceleration libraries. Every acceleration library requires us to understand the domain, work with the ecosystem, create the library, connect it to applications in that ecosystem, and power and accelerate the domain of use. We’re constantly improving the acceleration libraries we have so that the installed base benefits from all of our continued optimizations, on top of the capital and infrastructure they have already invested in. So you buy NVIDIA systems and you benefit from acceleration for years to come. It’s not unusual for us, on the same platform, to increase the performance anywhere from 4x to 10x over its life after you’ve installed it.
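One hedged illustration of that "install once, keep getting faster" point: many of these libraries sit behind drop-in interfaces, so application code does not change as the underlying library is optimized. The sketch below uses CuPy, one Python interface to the CUDA math libraries; the signal size is arbitrary.

```python
# Minimal sketch: the same FFT computed on the CPU via NumPy and on the GPU
# via CuPy, which dispatches to the CUDA FFT library (cuFFT) underneath.
# As the underlying library is optimized, this code gets faster unchanged.
import numpy as np
import cupy as cp

signal = np.random.rand(1 << 20).astype(np.float32)  # arbitrary 1M-sample input

spectrum_cpu = np.fft.rfft(signal)       # CPU path

signal_gpu = cp.asarray(signal)          # copy the array to GPU memory
spectrum_gpu = cp.fft.rfft(signal_gpu)   # GPU path, same API shape

# Both paths agree within single-precision tolerance.
assert np.allclose(spectrum_cpu, cp.asnumpy(spectrum_gpu), rtol=1e-3, atol=1e-1)
```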
And so we’re delighted to continue to improve the libraries and bring new features and more optimization. This year, we optimized and released 100 libraries and models so that you can have better performance and better capability. We also announced several very important new libraries. One new library that I’ll highlight is cuLitho. Computational lithography is an inverse physics problem that calculates the [indiscernible] equation as light goes through the optics and interacts with the photoresist on the mask. This ability to do basically inverse physics and image processing makes it possible for us to use wavelengths of light that are much, much larger than the final pattern that you want to create on a wafer.
It’s a miracle, in fact, if you look at modern microchip manufacturing. In the latest generation, we’re using 13.5-nanometer light, which is near x-ray; it’s extreme ultraviolet. And yet, using 13.5-nanometer light, you can pattern a few nanometers, 3-nanometer, 5-nanometer patterns, on a wafer. That’s basically like using a fuzzy light, a fuzzy pen, to create a really fine pattern on a piece of paper. And the ability to do so requires magical instruments like ASML’s, computational libraries from Synopsys, the miracle of the work that TSMC does, and this field of imaging called computational lithography. We’ve worked over the last several years to accelerate this entire pipeline. It is the single largest workload in all of EDA today, so computationally intense that millions and millions of CPU cores are running all the time in order to make it possible to create all of these different masks.
This step of the manufacturing process is going to get a lot more complicated in the coming years, because the magic that we’re going to have to bring to future lithography is going to get increasingly demanding, and machine learning and artificial intelligence will surely be involved. And so the first step for us is to take this entire stack and accelerate it. Over the course of the last 4 years, we’ve now accelerated computational lithography by 50 times. That, of course, reduces the cycle time, the pipeline and the throughput time for all of the chips in the world that are being manufactured, which is really quite fantastic, because these are $40 billion, $50 billion investments in the factory. If you could reduce the cycle time by even 10%, the value to the world is really quite extraordinary.
But the thing that is really fantastic is we also save an enormous amount of power. In the case of TSMC and the work that we’ve done so far, we have the opportunity to take tens of megawatts and reduce that by a factor of 5 to 10. And so that reduction in power, of course, makes manufacturing more sustainable, and it’s a very important initiative for us.
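To put rough numbers on those two claims, here is an illustrative back-of-the-envelope sketch. The fab cost, power figure and reduction factor are taken from the ranges above; the "value of capacity" proxy is a simplifying assumption for the example, not a real financial model.

```python
# Illustrative arithmetic only; assumed figures drawn from the ranges above.

fab_capex_usd = 45e9        # a fab in the $40B to $50B range mentioned
cycle_time_saving = 0.10    # the "even 10%" cycle-time reduction cited
# Crude proxy: treat the capacity unlocked as proportional to capex.
unlocked_value_usd = fab_capex_usd * cycle_time_saving
print(f"~${unlocked_value_usd / 1e9:.1f}B of effective capacity per fab")

litho_power_mw = 30.0       # "tens of megawatts" for computational lithography
power_reduction = 7.0       # within the 5x to 10x reduction range cited
saved_mw = litho_power_mw * (1 - 1 / power_reduction)
print(f"~{saved_mw:.0f} MW reclaimed per computational lithography cluster")
```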
So cuLitho, I’m very excited about. Lastly, I’ll talk about the single largest expansion of our business model in our history. We know that the world is becoming heavily cloud-first, and the cloud gives you the opportunity to engage a computing platform quickly, instantly, through a web browser. Over the last 10 years, the capabilities of clouds have continued to advance: it started with just CPUs running Hadoop or MapReduce or doing queries in the very beginning, and now there are high-performance computing, scientific computing systems and AI supercomputers in the cloud.
And so we are going to partner with all of the world’s cloud service providers. Starting with OCI, we’ve also announced cloud partnerships with Azure and GCP. We’re going to partner with the world’s leading cloud service providers to install and host NVIDIA AI, NVIDIA Omniverse and NVIDIA DGX Cloud in the cloud. The incredible capability of doing so is that, on the one hand, you get the fully optimized multi-cloud stacks of NVIDIA AI and NVIDIA Omniverse, and you have the opportunity to enjoy them in all of the world’s clouds in their most optimized configuration. So you get all of the benefits of the NVIDIA software stack in its most optimal form, and you have the benefit of working directly with NVIDIA computer scientists and experts.
So for companies who have very large workloads and who would like to have the benefit of acceleration, the benefits of the most advanced AI, we now have a direct service through which we can engage the world’s industries. It’s a wonderful way for us to combine the best of what NVIDIA brings with the best of all the CSPs. They have incredible services for security, for storage, for all of the other API services that they offer, and theirs may very well already be the cloud you’ve selected. And so now, for the very first time, we have the ability to combine the best of both worlds: to take NVIDIA’s best, combine it with the CSPs’ best, and make that capability available to the world’s industries.
Among the services we just announced are platform as a service, NVIDIA AI and NVIDIA Omniverse, and infrastructure as a service, NVIDIA DGX Cloud. We also announced a new layer. We have so many customers that we work with, so many industry partners that we work with, to build foundational models. If an enterprise customer, if an industry, would like to have access to foundational models, the most obvious and the most accessible thing is to work with world-leading service providers like OpenAI or Microsoft and Google. These are all examples of AI models that are designed to be highly available, highly flexible and useful for many industries.
There are companies that want to build custom models that are based specifically on their data, and NVIDIA has all of the capabilities to do that. So for customers who would like to build custom models based on their proprietary data, trained, developed and inferenced in their specific way, whether it’s the guardrails they would like to implement, the type of instruction tuning they would like to perform, or the type of proprietary data sets they would like to have retrieved, whatever their very specific requirements in language models, generative image models in 2D, 3D or video, or in biology, we now have a service that allows us to work directly with you to help you create that model, fine-tune that model and deploy that model on NVIDIA DGX Cloud. And as I mentioned, DGX Cloud runs in all of the world’s major CSPs. So if you already have a CSP of your choice, I’m pretty certain that we’ll be able to host it there, okay?
And so NVIDIA cloud services are going to expand our business model: we offer infrastructure as a service, DGX Cloud; platform as a service, NVIDIA AI and NVIDIA Omniverse; and new AI services that are designed to be custom, essentially a foundry of AI models available to the world’s industries, all of it in partnership with the world’s leading CSPs. So that’s it. Those are the announcements that we made. We have a lot to go through. Thanks for joining GTC.
And with that, Colette and I will answer questions for you.
Question-and-Answer Session
A – Simona Jankowski
Thank you, Jensen. Let me welcome our financial analysts to the Q&A session. We’re going to be taking questions over Zoom. [Operator Instructions] And our first question is from Toshiya Hari with Goldman Sachs.
Toshiya Hari
Thank you very much for hosting this follow-up. Jensen, I guess I had 1 question on the inference opportunity. Obviously, you dominate the training space, and you’ve done so for many, many years now. I think on the inference side, the competitive landscape has been a little bit more mixed given incumbency around CPUs. But obviously, very encouraging to see you introduced this new inference platform. I guess with the criticality of recommender systems that you spoke to, [indiscernible] LLMs and your work with Google, it seems like the market is moving in your direction. How should we think about your opportunity in inference, call it, in 3 to 5 years versus where you stand today? And how should we think about Grace playing a role there over the next couple of years?
Jensen Huang
Yes, Toshi, thank you. First of all, I’ll work backwards. In 3 to 5 years: the AI supercomputers that we are building today are unquestionably the most advanced computers the world makes. They are, of course, of gigantic scale. They include computing fabrics like NVLink and large-scale computing fabrics like InfiniBand, and very sophisticated networking that stitches it all together. The software stack, the operating system of it, the distributed computing software: it’s just computer science at the limits.
And so there, what’s really going to be quite exciting is how AI supercomputers [ph] are going to go beyond research and extend into essentially AI factories, because these AI models that people develop are going to be fine-tuned and improved basically forever. And I believe that every company will be an intelligence manufacturer. At the core of all of our companies, we produce intelligence, and the most valuable data we have is all proprietary; it’s inside the walls of the company. And so we now have the capability to build AI systems that help you curate your data and package your data together, which can then be used to help you train your proprietary model, your custom model, which can accelerate your business. That system, that AI training system, is continuous. Second, inference. Inference has largely been a CPU-oriented workload, and the reason for that is because most of the inference in the world today is fairly lightweight. It might be recommending something related to shopping or a book or a query, so on and so forth. Those kinds of recommendations are largely done on CPUs.
There are several reasons why even video is processed on CPUs today. In the future, what is likely to happen comes down to 2 fundamental dynamics that are inescapable at this point; it was inevitable for quite a long time, and it is now inescapable. One of them is just sustainability. You can’t continue to take these video workloads and process them on CPUs. You can’t take these deep learning models and run them on CPUs, even accepting a somewhat lower quality of service; it just burns too much power. And so the first reason why we have to accelerate everything is sustainability. We have to accelerate everything because Moore’s Law has ended. And that sensibility has now permeated just about every single cloud service provider, because the amount of workload they have that requires acceleration has increased so much. So their attention to acceleration, their alertness to acceleration, has increased. And secondarily, just about everybody is at power limits. So in order to grow in the future, you really have to reclaim power through acceleration and then put it back into growth.
And then the second reason is that generative AI has arrived. We’re going to see just about every single industry benefiting from co-creators and co-pilots that accelerate everything we do, from the text we create, the chatbots we interact with, the spreadsheets we use, PowerPoint and Photoshop and so on and so forth: you’re going to be augmented by, accelerated by, inspired by a co-creator or a copilot. And so I think the net of it all is that, for training, AI supercomputers will become AI factories, and every company will have one, either on-prem or in the cloud. And secondarily, just about every interaction you have with computers in the future will have some generative AI connected to it, and therefore the amount of inference workload will be quite large. My sense is that inference will, on balance, be larger than training. But training is going to be right there with it.
Simona Jankowski
Our next question comes from CJ Muse with Evercore.
C.J. Muse
For my question, I’d like to focus on Grace. In the past, you’ve mostly discussed the benefit of Grace and Hopper combined. Today, you’re also focusing a bit more on Grace on a stand-alone basis than I was expecting. Can you speak to whether you’ve changed your view on your expected server CPU share gain outlook? And how should we think about potential revenue contributions over time, particularly across Grace standalone, the Grace superchip and then, obviously, Grace Hopper combined?
Jensen Huang
I’ll start from the punchline and work backwards. I think Grace will be a big business for us, but it will be nowhere near the scale of accelerated computing. And the reason for that is because we genuinely feel that every workload that can be accelerated must be accelerated, everything from data processing to, of course, computer graphics to video processing to generative AI. Every workload that can be accelerated must be accelerated, which basically leaves the workloads that can’t be accelerated, the converse. Another way of saying that is single-threaded code. Because Amdahl’s Law still prevails, everything that is left becomes the bottleneck. And because the single-threaded code that is left is largely related, at this point, to data processing, fetching a lot of data, moving a lot of data, we have to design a CPU that is really good at 2 things. Well, let me just say 2 things plus a design point.
The 2 characteristics that we really, really want for our CPU are, one, extremely good single-threaded performance. It’s not about how many cores you have, but about how good the single-threaded cores you do have are. That’s number one. Number two, the amount of data that you can move has to be extraordinary. This one module here moves 1 terabyte per second of data. That’s just an extraordinary amount of data, and you want to process that data with extremely low power, which is the reason why we innovated this new way of using cellphone DRAM, enhanced for data center resilience, and used it for our servers.
It’s cost effective because, obviously, cell phone volume is very high. It uses 1/8 the power. And moving data is going to be so much of the workload that it is just vital that we reduce its power. And then lastly, we designed the whole system: instead of building just a super fast CPU, we designed a super fast CPU node. By doing so, we can enhance the ability of data centers that are power limited to use as many CPUs as possible. I think the net of it all is that accelerated computing will be the dominant form of computing in the future, because Moore’s Law has come to an end. But what is going to remain is heavy data processing, heavy data movement and single-threaded code. And so CPUs will remain very, very important; it’s just that the design point will be different than in the past.
Simona Jankowski
Our next question will come from Joe Moore with Morgan Stanley.
Joe Moore
I wanted to follow up on the inference question. The cost per query is becoming a major focus for the generative AI customers, and they’re talking about pretty significant reductions in the quarters and years ahead. Can you talk about what that means for NVIDIA? Is this going to be an H100 workload for the longer term? And how do you guys work with your customers to get that cost down?
Jensen Huang
Yes, there are a couple of dynamics moving at the same time. On the one hand, models are going to get larger. The reason they’re going to get larger is because we want them to perform tasks better and better, and there’s every evidence that the capability, the quality and the versatility of a model are correlated to the size of the model and the amount of data that you train it with. So on the one hand, we want them to be larger and larger, more versatile. On the other hand, there are so many different types of workloads. Remember, you don’t need the largest model to inference every single workload. And that’s the reason why we have 530 billion parameter models [ph], we have 40 billion parameter models, we have 20 billion parameter models and even 8 billion parameter models. These different models are created in such a way that, while you always need a large model, the reason you need a large model is that, at the very minimum, the large model is used to help improve the quality of the smaller models, okay? It’s kind of like you need a professor to improve the quality of the students, and so on and so forth.
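The "professor and students" remark maps onto what the literature calls knowledge distillation: training a small model to match a large model's output distribution. Here is a minimal, hedged sketch of the standard formulation; it is a textbook illustration, not a description of NVIDIA's specific training recipe.

```python
# Minimal knowledge-distillation sketch: a small "student" model learns to
# match the softened output distribution of a large "teacher" model.
# Standard textbook formulation; illustrative only.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # Soften both distributions with a temperature, then minimize the KL
    # divergence so the student mimics the teacher's full distribution,
    # not just its top-1 answers.
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 4 examples over a 10-token vocabulary.
loss = distillation_loss(torch.randn(4, 10), torch.randn(4, 10))
print(loss.item())
```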
And so, because there are so many different use cases, you’re going to have different sizes of models, and we optimize across all of them. You should use the right-sized model for the right-sized application. Our inference platform extends all the way from the L4 to the L40. And one of the ones that I announced this week is this incredible thing: the Hopper H100 NVLink, we call it the H100 NVL. This is basically 2 Hoppers connected with NVLink, and as a result, it has almost 190 gigabytes of HBM3 memory. That 190 gigabytes of memory gives you the ability to inference modern, large-sized language models, and, all the way down, if you would like to use it in very small configurations, this dual-H100 solution lets you partition it down to, is it 18? 16? Correct me if I’m wrong later. 16 or 18 of what we call Multi-Instance GPUs, MIGs.
And those miniature GPUs, fractions of GPUs, could be inferencing different language models, or the whole thing could be connected, or 4 of these could be put into a PCI Express server, a commodity server, and used to distribute a large model across them. Because the performance is so incredible, this has already reduced the cost of language model inferencing by a factor of 10 just from the A100. And so we’re going to continue to improve in every single dimension: making the language models better, making the small models more effective, and making each inference more cost-effective with new inference platforms like the NVL.
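A rough sizing sketch of why that partitioning works, with assumed values: the slice count uses the 16 Jensen hedged on above, and the bytes-per-parameter figures assume reduced-precision weights while ignoring the activation and KV-cache overhead that matters in practice.

```python
# Illustrative sizing arithmetic for partitioned inference (assumed values).

h100_nvl_memory_gb = 188    # "almost 190 gigabytes" of HBM3 across 2 GPUs
mig_slices = 16             # plausible slice count; Jensen hedged 16 or 18
per_slice_gb = h100_nvl_memory_gb / mig_slices   # ~11.75 GB per slice

def weights_gb(params_billion: float, bytes_per_param: float) -> float:
    # Weights-only footprint; activations and KV cache add real overhead.
    return params_billion * bytes_per_param

print(weights_gb(8, 1.0))    # 8B params at 8-bit  -> ~8 GB, fits one slice
print(weights_gb(20, 0.5))   # 20B params at 4-bit -> ~10 GB, fits one slice
print(weights_gb(530, 1.0))  # 530B params at 8-bit -> ~530 GB, multi-GPU job
```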
And then, very importantly, the software stack. We’re constantly improving the software stack. Over the course of the last 2 to 3 years, we’ve improved it so much, I mean, by orders of magnitude in just a couple of years. And we’re expecting to continue to do that.
Simona Jankowski
Our next question will come from Tim Arcuri with UBS.
Tim Arcuri
Jensen, I thought I heard you say that Google is inferencing large language models on your systems. I wanted to confirm that that’s what you were saying. And I guess, does that mean that they’re using the new L4 platform? And if they are, is that brand new? In other words, they were using TPUs, but they’re now using your new L4 platform? Just curious for more details there.
Jensen Huang
Our partnership with GCP is a very, very big event. It is an inflection point for AI, but it’s also an inflection point for our partnership. We have a lot of engineers working together to bring the state-of-the-art models that Google has to the cloud. And L4 is a versatile inference platform. You could use it for video inferencing, image generation for generative models, text generation for large language models. I mentioned in the keynote some of the models that we’re working on together with Google to bring to the L4 platform. And so L4 is going to be just a phenomenal inference platform. It is very energy efficient; it’s only 75 watts. The performance is off the charts, and it’s so incredibly easy to deploy. I’ll show it to you. This guy here is an L4, and this is the H100, okay? This one is about 700 watts, and this one is 75 watts.
And so this is the power of our architecture: one software stack can run on this as well as on this. Depending on the model size and the quality of service you would like to deploy, you could have both of these in your infrastructure, and they’re fungible. And so I’m really excited about our partnership with GCP, and the models that we’re going to bring to the inference platforms on GCP are basically across the board.
Simona Jankowski
Our next question will come from Vivek Arya with Bank of America.
Vivek Arya
Thank you, Jensen and Colette, for a very informative event. So I had a near-term and a longer-term question. Near term, just curious about the availability of Hopper: how are we doing in terms of supply? And then longer term, Jensen, we heard about a range of software and service innovations. How should we track their progress, right? The last number I think we heard in terms of software sales was a few hundred million, so about 1% of your sales. What would you consider success over the next few years? What percentage of your sales do you think could come from software and subscriptions over time?
Colette Kress
So let me first start, Vivek, with your question regarding supply of our H100. Yes, we do continue building out our H100 supply for the demand that we’ve seen this quarter. But keep in mind, we’re also seeing stronger demand from our hyperscale customers for all of our data center platforms as they focus on generative AI. So even in this last month, since we’ve talked about earnings, we’re seeing more and more demand. We feel confident that we will be able to serve this market as we continue to build the supply, and we feel we’re in a good space at this time.
Jensen Huang
I think that software and services will be a very substantial part of our business. However, as you know, we serve the market at every layer. We’re a full-stack company, but we’re an open platform, meaning that if a customer would like to work with us at the infrastructure level, at the hardware level, we’re delighted by that. If they would like to work with us at the hardware-plus-library level, we’re delighted by that; at the platform level, we’re delighted by that.
And if a customer would like to work with us all the way at the services level, or at any of the levels, all-inclusive, we’re delighted by that. And so we have the opportunity to grow all 3 layers. The hardware layer is, of course, already a very large business, and as Colette mentioned, generative AI is driving acceleration of that part of our business. The 2 layers above it are just being stood up as cloud services; for companies that would like to have them on-prem, they’re going to be based on subscription. However, as we all know, with the world being multi-cloud today, you really need the software to be in the cloud as well as on-prem. And so the ability for us to be multi-cloud, hybrid cloud, is a real advantage and real benefit for our 2 software platforms. And that is just beginning.
And then lastly, our AI foundation services were just announced and are just beginning. I would say that the model we presented last time already includes the sensibility we’re talking about today. We’ve been laying the foundations and the path towards today. This is a very big day for us and the launch of probably the biggest business model expansion initiative in the history of our company. And so I think the $300 million of platform software and AI software services has, as of today, just been pulled in. But I still think the size of it is consistent with what we’ve described before.
Simona Jankowski
Our next question will come from Raji Gill with Needham.
Raji Gill
Just a question from a technological perspective regarding the relationship between memory and compute. As you mentioned, these generative AI models are creating huge demands for compute. But how do you think about the memory models? Do you view memory as a potential bottleneck? And how do you solve the memory disaggregation problem? That would be helpful to understand.
Jensen Huang
Yes. Well, it turns out that in computing, everything is a bottleneck if you push to the limits of computing, which is what we do for a living. We don’t build normal computers; as you know, we build extreme computers. And when you build the type of computers we build, processing is a bottleneck, the actual computation is a bottleneck, memory bandwidth is a bottleneck, memory capacity is a bottleneck, the computing fabric is a bottleneck, the networking is a bottleneck, utilization is a bottleneck. Everything is a bottleneck. We live in a world of bottlenecks; we are surrounded by bottlenecks. And so the thing that is true, as you were mentioning, is that the amount of memory we use, the memory capacity we use, is increasing tremendously.
And the reason for that is, of course, that most of the generative AI work we do in training the models requires a lot of memory, but inferencing requires a lot of memory, too. The actual inferencing of the language model itself doesn’t necessarily require a lot of memory. However, if you want to connect it to a retrieval model that augments the language model, that augments the chatbot, with proprietary, very well curated data that is custom to you, proprietary to you, very important to you: maybe it’s health care records, maybe it’s a particular domain of biology, maybe it has something to do with chip design. Maybe it’s a database that has all of the domain knowledge of NVIDIA, what makes NVIDIA tick, where all of the proprietary data embedded inside the walls of our company can now, using a large language model, be turned into data sets that augment our language model. And so increasingly, we need not just large amounts of data, but large, fast data. For large amounts of data, there are many ideas: of course, all of the work that’s done with SSDs, all of the work that people are doing with CXL, basically affordable, attached, disaggregated memory.
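For readers unfamiliar with the retrieval pattern being described here, below is a minimal, generic sketch: embed proprietary documents, find the nearest ones to a query, and prepend them to the prompt. Every component, the placeholder embedding function, the document strings, the scoring, is a hypothetical stand-in for illustration, not NVIDIA's service.

```python
# Minimal retrieval-augmentation sketch (generic pattern, placeholder parts):
# embed private documents, retrieve the nearest ones to a query, and prepend
# them to the prompt so the model answers from curated proprietary data.
import numpy as np

def embed(texts):
    # Placeholder embedding model: deterministic random vectors stand in for
    # a real text encoder purely so this sketch runs end to end.
    rng = np.random.default_rng(abs(hash(tuple(texts))) % 2**32)
    return rng.standard_normal((len(texts), 384)).astype(np.float32)

docs = ["internal design doc A", "records-handling policy B", "chip-flow note C"]
doc_vecs = embed(docs)

def retrieve(query: str, k: int = 2):
    q = embed([query])[0]
    # Cosine similarity between the query and every document vector.
    scores = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    return [docs[i] for i in np.argsort(-scores)[:k]]

question = "How does our chip design flow work?"
context = "\n".join(retrieve(question))
prompt = f"Context:\n{context}\n\nQuestion: {question}"
# `prompt` would then go to the language model for grounded generation.
```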
All of that is fantastic, but none of it is fast memory. That’s affordable memory; that’s large amounts of accessible memory; but none of it is fast memory. What we need is something like what Grace Hopper does. We need a terabyte per second of access to 0.5 terabyte of data. And if we have a terabyte per second of access to 0.5 terabyte of data, then if you wanted to hold a petabyte of data in a distributed computing system, just imagine how much bandwidth we’re bringing to bear. And so this approach of very high speed, very high capacity data processing is exactly what Grace Hopper was designed to do.
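To make that bandwidth point concrete, here is a small illustrative calculation. The per-module figures are the ones cited above; the petabyte-scale cluster size is an assumption that simply follows from them.

```python
# Illustrative bandwidth arithmetic for "large, fast data" (assumed scaling).

module_bw_tb_s = 1.0   # ~1 TB/s of memory bandwidth per Grace Hopper module
module_cap_tb = 0.5    # ~0.5 TB of memory per module (figures cited above)

# One module can sweep its entire memory in half a second.
sweep_seconds = module_cap_tb / module_bw_tb_s          # 0.5 s

# Holding 1 PB of hot data implies ~2,000 modules, and the aggregate
# bandwidth scales with the module count.
dataset_pb = 1.0
modules = dataset_pb * 1000 / module_cap_tb             # 2,000 modules
aggregate_bw_tb_s = modules * module_bw_tb_s            # 2,000 TB/s = 2 PB/s

print(f"{modules:.0f} modules, {aggregate_bw_tb_s / 1000:.0f} PB/s aggregate")
```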
Jensen Huang
I really appreciate that. I believe that data centers in the next 5 to 10 years, if we start from 10 years out and work our way back, or even 5 years and work our way back, will basically look like this. There will be an AI factory inside, and that AI factory is working 24/7. That AI factory will take data in, refine the data and transform the data into intelligence. That AI factory is not a data center; it’s a factory. And the reason it’s a factory is that it’s doing 1 job.
That 1 job is refining, improving and enhancing a large language model or a foundation model or a recommender system. And so that factory is doing the same job every single day. Engineers are constantly improving it, enhancing it, giving it new models and new data to create new intelligence. So every data center will have, number one, an AI factory. It will also have an inference fleet. That inference fleet will have to support a diverse set of workloads, and the reason for that is because we know that video represents some 80% of the world’s Internet today. So video has to be processed. It has to generate text. It has to generate images. It has to generate 3D graphics.
The images and 3D graphics will populate virtual worlds, and those virtual worlds will run on diverse types of computers. These Omniverse computers will, of course, simulate all of the physics inside. They will simulate all the autonomous agents inside. They will enable and connect different applications and different tools, and they will be able to do essentially virtual integration of plants, digital twins of fleets of computers, self-driving cars, so on and so forth.
And so there will be these types of virtual world simulation computers. All of these types of inferencing systems, whether it’s 3D inferencing in the case of Omniverse or physics inferencing in the case of Omniverse, through all of the different domains of generative AI that we do: each one of the configurations will be optimal for its domain, but most of them will be fungible, meaning that each one of the architectures should be able to receive work offloaded from something that’s over-provisioned or oversubscribed and pick up some of the workload, okay? So the second part is the inference workloads. Every single one of the nodes will have SmartNICs on it, like a DPU, a data center operating system processing unit. And that is going to offload and isolate.
It’s really important to isolate, because you can’t simply trust the tenants of the computer, who are all basically inside. You have to think about the world in the future as zero trust. And so all of the applications and all of the communications have to be isolated from each other. They’re isolated either by encryption or by virtualization. And the control plane is separated from the compute plane: the control plane, the operating system of the data center, will run, offloaded and accelerated, on the DPU, on BlueField, okay? So that’s another characteristic.
And then lastly, whatever is left that’s not possible to accelerate, because the code is just ultimately single-threaded, you need to run on a CPU that is the most energy efficient you can possibly make, not at the CPU level only, but really at the level of the entire compute node. And the reason for that is because people don’t operate CPUs, they operate computers. And so it’s nice if the CPU is energy efficient at the core, but if the rest of it, the data processing and the I/O and the memory, consumes a lot of power, then what’s the point?
And so the entire compute node has to be energy efficient. Many of those CPUs will be x86 and a lot of them will be ARM. I think these 2 CPU architectures will continue to grow in the world’s data centers because, ideally, we’ve reclaimed power through acceleration, which gives the world a lot more power to grow into. And so that accelerate, reclaim, then grow 3-step process is really vital to the future of data centers.
I think this represents a canonical data center, of course at different sizes and scales. This question kind of reveals our mental image of what a data center does. The one thing I forgot to say, which is really vital, is that all of this is connected by 2 types of networks. There’s 1 type of network that’s the computing fabric: NVLink and InfiniBand are computing fabrics. They’re really intended for distributed computing, moving a lot of data around, orchestrating the computation of all these different computers.
And then there’s another layer of networking, Ethernet, for example, for the control, for the multi-tenancy, for the orchestration and workload management, so on and so forth, and for the deployment of the service to the users. That’s done on Ethernet. The switches, the NICs: super sophisticated, some of it copper, some of it direct drive, some of it long-reach fiber. All of that layer, that fabric, is vitally important. Now you see why it is that we invest in what we do.
When we think at data center scale and we start from the computation, the acceleration of it, as we continue to advance it, at some point everything becomes a bottleneck. And whenever something becomes a bottleneck, and we have a very specific viewpoint about the future, and nobody else is building it in that way or nobody else could build it in that way, we tackle the endeavor and go remove that bottleneck for the computing industry.
One of those important bottlenecks, of course, is addressed by NVLink; another by InfiniBand; another by the DPU, BlueField. I just talked to you about Grace and how it removes bottlenecks for single-threaded code and very large data processing code. And so this entire mental model of computing, I think, will to some degree be implemented very, very quickly in the world’s CSPs. And the reason for that is very, very clear.
There are 2 fundamental drivers of computing in the near future. One of them is sustainability; acceleration is vital to that. And the second is generative AI; AI computing is vital to that. I want to thank all of you for joining GTC. We had a lot of news for you to consume, and I appreciate all the excellent questions. And very importantly, I want to thank all the researchers and scientists who took the risk, who had faith in the platform that we were building over the last 2.5 decades as we continued to advance accelerated computing, and who used this technology and this computing platform to do groundbreaking work.
It’s because of you and all of your amazing work that the rest of the world has been inspired to jump on to accelerated computing. I also want to thank all of the amazing employees of NVIDIA for the incredible company that you’ve helped build and the ecosystem that you’ve built. Thank you, everybody. Have a great night.