1
00:00:00,400 --> 00:00:07,359
If I were a shovel manufacturer during the gold rush, what sort of shovel would

2
00:00:04,720 --> 00:00:12,639
I build for myself? Well, that's exactly the question that I'm answering today as

3
00:00:09,440 --> 00:00:15,120
we tour the Nyx supercomputer at NVIDIA

4
00:00:12,639 --> 00:00:22,000
headquarters here in Santa Clara. Behind this door are 1,192

5
00:00:18,560 --> 00:00:24,960
B200 GPUs. Absolute monsters rated for

6
00:00:22,000 --> 00:00:30,240
up to 1,200 watts of power each. And they've been used for everything from AI

7
00:00:27,199 --> 00:00:33,040
research to DLSS upscaling for gamers to

8
00:00:30,240 --> 00:00:36,079
good old-fashioned MLPerf drag racing. You know, to put the other AI upstarts in

9
00:00:34,880 --> 00:00:40,719
their place. So, what are we waiting for? Let's gear up and get inside. Uh

10
00:00:38,559 --> 00:00:49,760
oh, shoot. What was the password again? Oh, that's right. S-E-G [music] U-E, for

11
00:00:45,840 --> 00:00:52,320
our sponsor, MSI. Their Crosshair A16 HX

12
00:00:49,760 --> 00:00:57,920
AI is an 18-inch high-end gaming laptop that comes with a 50-series card and a

13
00:00:54,719 --> 00:00:59,600
240 Hz display. Experience smooth gaming

14
00:00:57,920 --> 00:01:04,720
on the go with the link in the video description. Are you tired of boring

15
00:01:01,680 --> 00:01:06,799
screwdrivers? Boom. Four new colors of

16
00:01:04,720 --> 00:01:13,520
the brand-new LTT transparent screwdriver. Plasma purple, cryo teal,

17
00:01:10,560 --> 00:01:16,640
molten orange, and carbon black. They're see-through. They're awesome. And

18
00:01:15,119 --> 00:01:21,040
they're probably going to sell out, because for a limited time only, if you

19
00:01:19,040 --> 00:01:25,600
buy through YouTube shopping at the link in the video description, you will get

20
00:01:22,960 --> 00:01:29,840
20% off your driver, but only until December 30th. Collect them all or maybe

21
00:01:28,000 --> 00:01:34,000
pair them with our brand-new bit case and all-over print hoodie. This

22
00:01:31,920 --> 00:01:38,560
particular data center space is used for a combination of R&D and NVIDIA internal

23
00:01:37,119 --> 00:01:42,400
production, meaning that it gets reconfigured on a pretty regular basis

24
00:01:40,871 --> 00:01:47,600
[music] as NVIDIA works through validation of new chip or server

25
00:01:44,960 --> 00:01:51,759
designs. That's why the raised floor has these plumbing access hatches for

26
00:01:49,520 --> 00:01:55,920
high-density water-cooled deployments, even though at the moment it's set up

27
00:01:53,280 --> 00:02:00,320
for air cooling to accommodate the DGX B200 racks that currently line the

28
00:01:58,240 --> 00:02:04,880
floor. We're in the first of two rooms that are linked by this fiber optic

29
00:02:02,159 --> 00:02:09,599
cabling, each of which contains two CACs, or cold aisle containment units. We're

30
00:02:07,920 --> 00:02:13,360
going to go in one of them in a second, but before we do, I want to talk a

31
00:02:11,680 --> 00:02:18,319
little bit about some of the details that would be easy to miss in here, like

32
00:02:16,000 --> 00:02:22,239
this decibel meter here that probably illustrates why it's better for us to

33
00:02:20,000 --> 00:02:26,720
talk out here. The deployment is absolutely peppered [music] with

34
00:02:23,760 --> 00:02:31,440
sensors: obvious ones like temperature on both sides of the rack, like this one

35
00:02:28,640 --> 00:02:38,400
right here and also less obvious ones like humidity and air pressure. Humidity

36
00:02:35,440 --> 00:02:43,360
management is very important. Too low and static electricity becomes a

37
00:02:40,160 --> 00:02:45,519
problem. Too high and your servers start

38
00:02:43,360 --> 00:02:52,239
to sweat, and then you're going to start to sweat. Air pressure, or maybe more

39
00:02:49,440 --> 00:02:57,599
accurately airflow, comes down to cooling. The DGX B200 systems here are

40
00:02:56,000 --> 00:03:01,920
of course equipped [music] with their own cooling fans and the same goes for

41
00:02:59,920 --> 00:03:08,239
the accompanying advanced networking equipment. But when we are

42
00:03:04,560 --> 00:03:10,400
talking about up to 14,000

43
00:03:08,239 --> 00:03:14,800
watts of power consumption in each of these units.

44
00:03:12,239 --> 00:03:20,000
Let's just say that these fans can use all the help that they can get. So,

45
00:03:17,040 --> 00:03:24,959
fresh air gets actively blown up through the floor, creating positive pressure

46
00:03:22,640 --> 00:03:32,239
inside this room where it gets pushed through our DGXs and out into the hot

47
00:03:28,239 --> 00:03:35,040
aisle where uh things definitely get a

48
00:03:32,239 --> 00:03:40,640
lot hotter. I mean, I was getting kind of chilly in there, but here I would be

49
00:03:38,000 --> 00:03:44,560
sweating in a matter of minutes. But in spite of the challenges, the systems are

50
00:03:42,640 --> 00:03:49,200
running perfectly. And NVIDIA pointed out that some of these nodes are running

51
00:03:46,319 --> 00:03:54,000
full tilt on undisclosed workloads as we speak. Now, because NVIDIA uses spaces

52
00:03:51,920 --> 00:03:58,319
like these for validation, it's important that they get all the little

53
00:03:55,760 --> 00:04:02,720
details right so customers can just take this blueprint, copy it, and paste it

54
00:04:00,640 --> 00:04:06,080
into their own facility and trust that it's going to run at scale. The

55
00:04:04,319 --> 00:04:10,480
networking racks, for instance, are strategically positioned to minimize the

56
00:04:08,400 --> 00:04:15,360
length of fiber optic cabling between the four CACs that make up this cluster.

57
00:04:13,200 --> 00:04:19,519
Two in this room and two in the next one over. This is both to maintain signal

58
00:04:17,440 --> 00:04:24,720
integrity (they found anything longer than 100 to 150 m can be problematic at these

59
00:04:22,400 --> 00:04:29,120
speeds) and to maintain the integrity of the ceiling. See, it turns

60
00:04:26,800 --> 00:04:33,759
out that when you start bundling up hundreds of fiber optic cables, they can

61
00:04:31,759 --> 00:04:38,639
get pretty heavy. And the longer the runs, the more strain they put on the

62
00:04:35,759 --> 00:04:44,160
ceiling. It also helps save on cost. This I don't have much to say about

63
00:04:40,320 --> 00:04:45,680
other than, "Oh my god, isn't it

64
00:04:44,160 --> 00:04:49,919
beautiful?" Okay, I lied. I do have some stuff to

65
00:04:47,440 --> 00:04:55,280
say. This cable management is not just for looks, but rather to maintain air

66
00:04:52,320 --> 00:05:00,240
flow. When you have this many cables, you actually do need to bundle them. And

67
00:04:57,759 --> 00:05:05,919
it also helps facilitate maintenance in the event of a broken or damaged cable.

68
00:05:03,120 --> 00:05:11,520
They run extra fiber in each bundle that can be terminated as needed, and it is a

69
00:05:08,639 --> 00:05:16,320
lot easier to find that when things are organized. [music] Careful thought goes

70
00:05:13,680 --> 00:05:19,919
into rack serviceability, too. While the networking racks use traditional

71
00:05:18,160 --> 00:05:25,440
vertical PDUs for their power distribution, the DGX racks use three

72
00:05:22,800 --> 00:05:28,880
top-mounted PDUs that help balance the three [music] phases of power coming in.

73
00:05:27,039 --> 00:05:32,479
On the subject of power, we skipped over these at the SFU data center, but uh

74
00:05:31,120 --> 00:05:36,720
have you ever wondered what a >> 415-volt, 100-amp

75
00:05:34,240 --> 00:05:40,240
>> power plug looks like? Wonder no longer. Another key advantage of putting all the

76
00:05:38,479 --> 00:05:45,600
PDUs up top is it makes it a little easier to get the DGX units in and out.

77
00:05:42,960 --> 00:05:49,280
They, uh, turned this one off and agreed to pull it out of the deployment just to

78
00:05:47,120 --> 00:05:52,639
show us how it's done. But when I asked if we could tear it down, they were

79
00:05:50,880 --> 00:05:56,320
like, "Huh?" Not because they have anything to hide,

80
00:05:54,240 --> 00:06:00,400
but just cuz they thought this soon-to-be-decommissioned and recycled

81
00:05:57,919 --> 00:06:07,120
engineering unit that we can really get our greasy meat mitts into would be a

82
00:06:02,800 --> 00:06:09,520
lot more [music] fun. Woo!

83
00:06:07,120 --> 00:06:18,639
If you've ever wondered what the heat sink would look like on a 1,200 W GPU, wonder

84
00:06:14,880 --> 00:06:20,720
no more. How many heat pipes is this?

85
00:06:18,639 --> 00:06:25,280
It's like a forest of heat pipes in there. Oh, they're flat. That makes so

86
00:06:23,600 --> 00:06:29,120
much sense cuz you'll get more airflow through the fins. And I mean, that's

87
00:06:26,880 --> 00:06:35,440
some pretty hard working air dealing with not one but two rows of these GPUs.

88
00:06:32,800 --> 00:06:40,960
Not to mention this middle heat sink here that seems to be for the NVLink

89
00:06:38,400 --> 00:06:48,319
switching equipment. I mean, I knew you'd need something beefy to allow these GPUs to

90
00:06:43,840 --> 00:06:50,880
pull the 192 gigs of HBM3e memory that

91
00:06:48,319 --> 00:06:55,199
they have, but still, I'd never seen the cooling for it. Another thing I've never

92
00:06:53,199 --> 00:06:59,840
gotten to see, or at least never gotten to film, is NVIDIA's proprietary SXM

93
00:06:57,919 --> 00:07:03,280
interface. I know because I was looking for a B-roll shot of one of these a

94
00:07:01,599 --> 00:07:07,680
little while ago, and I realized I've never actually held one up on camera.

95
00:07:05,120 --> 00:07:12,639
So, now I've done it. And they also gave us this to show what... okay, not the same

96
00:07:10,080 --> 00:07:16,880
generation, but a similar GPU might look like without the heat sink. Oh, we've

97
00:07:15,280 --> 00:07:20,240
also got to look at the SXM interface that goes back to the rest of the system

98
00:07:18,560 --> 00:07:27,280
and also carries power for these [music] GPUs. These are 1,200 W each, but these

99
00:07:24,000 --> 00:07:30,080
cards go as high as 1,400 W. I mean, how

100
00:07:27,280 --> 00:07:34,560
high would the cooler be at that point? JK, it would be water-cooled. We were

101
00:07:32,800 --> 00:07:39,280
hoping to see some water cooled machines today, but I guess NVIDIA has to save

102
00:07:37,195 --> 00:07:44,319
[music] something for next time. I got to say though, if the new stuff is even

103
00:07:41,440 --> 00:07:48,240
just as cool as the last-gen H100 water-cooled machines we saw at the SFU

104
00:07:46,319 --> 00:07:53,039
Fir supercomputer, I'm sure it's absolutely mind-blowing. All of which is

105
00:07:50,880 --> 00:07:56,879
really cool, but what do they need all of this for? Well, the folks who are

106
00:07:55,360 --> 00:08:00,639
responsible for getting us access to everything today are actually from the

107
00:07:58,400 --> 00:08:04,960
GeForce gaming team. So, one of their big uses for these internal compute

108
00:08:02,479 --> 00:08:10,479
resources is, of course, Deep Learning Super Sampling, or DLSS. In a

109
00:08:07,599 --> 00:08:15,919
nutshell, DLSS allows your GPU to render your game at a lower resolution, then

110
00:08:12,960 --> 00:08:19,759
use deep learning or AI to upscale each output frame to your monitor's

111
00:08:17,440 --> 00:08:23,440
resolution. The benefit of this is that you can run at a higher frame rate for

112
00:08:21,520 --> 00:08:27,759
improved animation smoothness. But the drawback is that you can't create

113
00:08:25,680 --> 00:08:31,840
something from nothing and upscaled images struggle to achieve the same

114
00:08:29,280 --> 00:08:36,240
fidelity as a native rendered image. With that said, DLSS has improved a lot

115
00:08:34,479 --> 00:08:40,000
over the years, and I got a chance to sit down and chat with Edward Liu from that

116
00:08:38,159 --> 00:08:43,839
team who talked us through some of the processes that they use to develop new

117
00:08:41,680 --> 00:08:49,200
features and new fixes. The first thing he pointed out is that while this hot

118
00:08:46,560 --> 00:08:54,160
new AI model took a thousand GPUs two weeks to train or whatever, makes for a

119
00:08:51,519 --> 00:08:58,399
good headline, it overlooks most of the actual cost and time, which is in the

120
00:08:56,720 --> 00:09:02,880
test training runs that take place before the hero run. His team is

121
00:09:01,040 --> 00:09:06,320
constantly iterating on the data they're feeding into DLSS and [music] the

122
00:09:04,560 --> 00:09:11,600
weighting, and they're evaluating new innovations in the AI space. Sometimes

123
00:09:09,120 --> 00:09:15,680
it's more surgical, like, "Oh, hey, uh, we noticed Cyberpunk has an issue with

124
00:09:13,360 --> 00:09:18,880
cars having like three or four bumpers as you're driving around. How can we

125
00:09:17,519 --> 00:09:24,560
address this with the current model?" That kind of thing can be turned around sometimes in a matter of weeks or months

126
00:09:22,560 --> 00:09:29,040
using the resources that we just saw. Though, uh, he was quick to point out

127
00:09:26,640 --> 00:09:33,600
that Jensen, if you're watching, there's no limit to how many GPUs his team could

128
00:09:31,279 --> 00:09:37,360
use to speed up the process. I told him I'd say that because the faster they can

129
00:09:35,440 --> 00:09:43,360
test a new data set, the faster they can tune it and ship it. Other times, the

130
00:09:41,040 --> 00:09:46,959
changes are more transformative, pun intended, like the move from a

131
00:09:44,959 --> 00:09:51,360
convolutional neural network to the more accurate transformer model that runs

132
00:09:48,959 --> 00:09:55,600
best on NVIDIA's newest cards. This can require basically a complete tear down

133
00:09:53,360 --> 00:10:00,000
and do-over of the entire pipeline and can take a year or more. But I mean,

134
00:09:58,080 --> 00:10:03,360
hey, NVIDIA has bet pretty much their entire future on being able to stay at

135
00:10:01,760 --> 00:10:07,760
the forefront of their 21st century shovel technology. So, I guess that's

136
00:10:05,120 --> 00:10:11,040
the price you pay. But what does all this mean for traditional rendering?

137
00:10:09,360 --> 00:10:15,707
Well, the folks here believe very strongly that in time DLSS will not only

138
00:10:13,440 --> 00:10:19,440
be as good, it will be better than [music] traditional. In fact, Edward

139
00:10:17,600 --> 00:10:23,760
pointed to these slides from a talk that he gave a number of years ago, showing

140
00:10:21,360 --> 00:10:29,279
that what we think of as native rendering is already a pretty imperfect

141
00:10:26,640 --> 00:10:33,200
approximation. Here it is compared to a ground truth image, which is rendered at

142
00:10:31,519 --> 00:10:38,959
a much higher resolution and then downsampled to 1080p. Native? It's

143
00:10:36,480 --> 00:10:44,320
clearly worse. And as you can see in this comparison, even back then with

144
00:10:40,959 --> 00:10:46,240
DLSS 2.0, there were situations where

145
00:10:44,320 --> 00:10:52,000
a model trained on these higher resolution or more accurately higher

146
00:10:49,040 --> 00:10:56,972
sampled ground truth images could result in the GPU reconstructing an output with

147
00:10:54,800 --> 00:11:01,440
DLSS that's closer to ground truth [music] than the native image was. I

148
00:10:59,440 --> 00:11:04,560
asked, by the way, how are these ground truth images created? [music] And he

149
00:11:02,959 --> 00:11:09,519
said that typically it's actually just on run-of-the-mill gaming hardware, but

150
00:11:06,720 --> 00:11:13,839
instead of running at 60 or 120 frames per second, they'll sometimes be running

151
00:11:11,440 --> 00:11:19,839
tens of thousands of samples per pixel, to the point where we're talking more like

152
00:11:15,839 --> 00:11:22,480
60 to 120 seconds per frame or even

153
00:11:19,839 --> 00:11:26,320
more. Man, though, if DLSS could consistently reconstruct that ground

154
00:11:24,800 --> 00:11:30,399
truth image from a lower resolution input every time, I'm sure no one would

155
00:11:28,480 --> 00:11:34,640
ever turn it off. Unfortunately for Edward and his team though, no one

156
00:11:32,320 --> 00:11:38,720
notices when DLSS is working perfectly. It's when it

157
00:11:36,640 --> 00:11:44,560
trips over itself that we tend to notice, and it still does do so

158
00:11:42,160 --> 00:11:48,720
fairly regularly. However, with that said, from my own experience, it has

159
00:11:46,880 --> 00:11:53,040
continued to improve at a pretty solid clip since its debut, and the rest of

160
00:11:50,640 --> 00:11:57,839
the industry has pretty much accepted that AI accelerated image enhancement is

161
00:11:55,760 --> 00:12:03,839
the future, whether every gamer wants it or not. So sales of shovels will likely

162
00:12:01,279 --> 00:12:07,760
continue until gamer morale improves. Good luck everyone. I hope it's not a

163
00:12:05,760 --> 00:12:13,200
bubble. This is a nice office. I'd hate for something to happen to it. Just like

164
00:12:10,160 --> 00:12:15,120
this is a nice segue to our sponsor,

165
00:12:13,200 --> 00:12:19,040
Squarespace. If you're building a brand or a business, it's important to have a

166
00:12:17,120 --> 00:12:23,279
website. Designing your own site can feel like a bit of a daunting task, but

167
00:12:21,120 --> 00:12:27,040
it really doesn't have to. Squarespace makes it easy to get your message across

168
00:12:24,959 --> 00:12:31,760
to potential customers and subscribers in a clean and digestible way. Their

169
00:12:29,360 --> 00:12:35,920
design intelligence tool utilizes AI and works with your own creativity to create

170
00:12:33,600 --> 00:12:40,399
a theme and style for your site that matches your personality. You can also

171
00:12:38,320 --> 00:12:46,079
use Squarespace to directly invoice your customers with payment options like a

172
00:12:42,560 --> 00:12:48,000
direct debit, Apple Pay, Klarna, and more.

173
00:12:46,079 --> 00:12:53,200
And use their analytic tools to track sales, strategize, and continue to build

174
00:12:50,399 --> 00:12:56,959
your brand. There are millions of URLs available, and Squarespace's domain tool

175
00:12:55,279 --> 00:13:01,200
will help you search for the right address just for you. We've even used

176
00:12:59,279 --> 00:13:05,040
Squarespace for some of our own websites here. Start building your website today,

177
00:13:03,200 --> 00:13:10,160
and you'll receive 10% off your first purchase by visiting squarespace.com/LTT.

178
00:13:08,399 --> 00:13:15,519
If you guys enjoyed this video, maybe go check out the time we, I don't know,

179
00:13:13,279 --> 00:13:20,079
did something NVIDIA-related. Let's do a throwback video. How about

180
00:13:17,519 --> 00:13:26,480
when I checked out the launch of G-Sync? The production values were lower, but

181
00:13:23,040 --> 00:13:26,480
hey, it was fun.
