1
00:00:00,000 --> 00:00:07,240
when you make as many videos as we do you need a lot of fast reliable storage

2
00:00:04,040 --> 00:00:09,760
and our main editing server Whonnock has

3
00:00:07,240 --> 00:00:13,400
checked all of those boxes for years it's a great little server it's built

4
00:00:11,639 --> 00:00:18,439
out of high quality components and it even looks cool but as our team is grown

5
00:00:16,279 --> 00:00:25,160
we've reached the point where even a minute one single minute of downtime

6
00:00:21,640 --> 00:00:27,800
costs over $50 and that's just in

7
00:00:25,160 --> 00:00:32,880
payroll now practically speaking the way to mitigate that is by adding redundant

8
00:00:30,679 --> 00:00:37,559
now our drives are already redundant we've got 20 drives in there with data

9
00:00:34,680 --> 00:00:43,280
striping but the problem is they all sit in one single server I'm sure you can

10
00:00:40,920 --> 00:00:48,920
see where this is going it's been over a year in the making but it's finally here

11
00:00:45,960 --> 00:00:56,000
Whonnock final form and I'm calling it Whonnock 10 because it's the last Whonnock ever

12
00:00:52,399 --> 00:00:58,120
availability Whonnock you this like 10 times

13
00:00:56,000 --> 00:01:03,320
nobody even knows what high availability means it means it's Linus just go ahead

14
00:01:00,480 --> 00:01:06,439
unplug one do it go for it well okay I should probably tell you the stakes

15
00:01:04,839 --> 00:01:10,400
before you do that each of these two GrandTwin boxes has four entire servers

16
00:01:09,119 --> 00:01:17,080
inside of them that were provided by Supermicro who sponsored this whole thing and they're set up with WEKA a

17
00:01:14,080 --> 00:01:19,720
redundant NVMe first file system in this

18
00:01:17,080 --> 00:01:24,159
config it should sustain two entire servers dropping out without anyone even

19
00:01:21,960 --> 00:01:27,840
noticing except that we moved the entire team onto it last night without telling

20
00:01:25,840 --> 00:01:31,720
anyone and it's the middle of the work day with a ton of high priority videos

21
00:01:29,400 --> 00:01:37,520
in progress do you really want to test it right now I like I haven't tried that

22
00:01:33,880 --> 00:01:39,960
all right here we go okay what could go

23
00:01:37,520 --> 00:01:43,260
wrong I mean a

24
00:01:47,880 --> 00:01:54,680
lot naturally a huge part of a project like this is the software the stuff

25
00:01:52,840 --> 00:02:01,240
that's going to handle distributing all of ourish terabytes of video projects

26
00:01:58,000 --> 00:02:02,799
Word documents and Linux ISOs to the

27
00:02:01,240 --> 00:02:08,239
multiple machines that we just showed you but we can't install any software

28
00:02:05,680 --> 00:02:15,480
until we have some hardware so why don't we start there meet the Supermicro

29
00:02:10,679 --> 00:02:18,200
GrandTwin A+ Server AS-2115GT-

30
00:02:15,480 --> 00:02:21,920
HNTR despite its sort of ordinary looking appearance and unexciting

31
00:02:20,080 --> 00:02:28,800
sounding name it is anything but ordinary and it is very

32
00:02:25,120 --> 00:02:31,720
exciting because inside this 2u is four

33
00:02:28,800 --> 00:02:39,200
independent computers but for what we're doing four nodes please we want

34
00:02:36,280 --> 00:02:45,400
eight inside each of these is a completely independent motherboard 384

35
00:02:43,040 --> 00:02:54,159
gigs of memory an AMD EPYC Genoa processor with 64 cores dual M.2 slots

36
00:02:49,519 --> 00:02:57,560
for redundant boot drives six PCIe Gen 5

37
00:02:54,159 --> 00:03:01,599
2.5 inch NVMe slots up front and

38
00:02:57,560 --> 00:03:03,120
we've got IO in the rear now this bit

39
00:03:01,599 --> 00:03:10,200
here could be a little confusing at first glance but that is because not

40
00:03:06,440 --> 00:03:13,920
only do we have USB but we have two full

41
00:03:10,200 --> 00:03:16,480
Gen 5 x16 PCIe connections back here along

42
00:03:13,920 --> 00:03:22,319
with display output and power for the entire server this whole thing slides

43
00:03:20,040 --> 00:03:26,159
into the chassis which holds a really cool modular backplane assembly that

44
00:03:24,519 --> 00:03:31,280
we'll take a look at in a minute and then passes through thank you Jake ah to

45
00:03:29,360 --> 00:03:36,200
the back of the server where you've got a management port a single USB port for

46
00:03:34,159 --> 00:03:42,120
each server nope it's two and they're shared what the I was about to ask cuz

47
00:03:39,640 --> 00:03:47,439
we've also got a single VGA you see the button for two servers there no way this

48
00:03:45,000 --> 00:03:54,079
button toggles yeah and okay before we talk about that

49
00:03:50,640 --> 00:03:57,280
a little bit more look at these power

50
00:03:54,079 --> 00:04:00,959
supplies each of these is

51
00:03:57,280 --> 00:04:02,680
2200 watts 80 Plus Titanium which

52
00:04:00,959 --> 00:04:07,959
sounds like a lot but when you're potentially handling four 400 watt EPYC

53
00:04:05,920 --> 00:04:12,720
Genoa CPUs along with a bunch of RAM up to 24 NVMe drives and eight network

54
00:04:10,159 --> 00:04:17,239
cards well it seems downright reasonable doesn't it is it 24 drives can't be 6

55
00:04:15,560 --> 00:04:23,800
yes 6 * 4 is 24 and of course that's just one of them

56
00:04:21,280 --> 00:04:28,000
we've got two of those and that means that in the event that one of these dies

57
00:04:25,919 --> 00:04:33,320
the system should be able to continue to operate uninterrupted which is a big

58
00:04:30,680 --> 00:04:38,360
part of the high availability goal that we have for this deployment speaking of

59
00:04:36,440 --> 00:04:44,800
high availability let's move on to our network cards each of those PCIe Gen 5 x16

60
00:04:43,000 --> 00:04:50,520
slots I showed you guys before terminates in one of these OCP 3.0 small

61
00:04:47,759 --> 00:04:56,080
form factor mezzanine slots and what we're putting in them is these ConnectX

62
00:04:53,240 --> 00:05:03,639
6 200 gigabit cards from Mellanox excuse me from NVIDIA that okay

63
00:05:00,680 --> 00:05:08,160
these are the older Gen 4 ones so they're going to be limited by the slot

64
00:05:05,160 --> 00:05:10,639
speed of around 250 gigabit per second but

65
00:05:08,160 --> 00:05:16,880
if we had newer cards that means that each of these nodes could do 200 plus

66
00:05:14,240 --> 00:05:21,680
another 200 400 up to 800 gigabit which would of course be a

67
00:05:19,600 --> 00:05:28,360
complete waste for us a because our workload can't take advantage of it and

68
00:05:23,400 --> 00:05:30,919
B because our switch is only 100 gigabit

69
00:05:28,360 --> 00:05:35,600
sorry of course the two ports are still helpful we do have redundant

70
00:05:33,360 --> 00:05:39,600
switches except there's kind of a problem here that's still a single point

71
00:05:37,440 --> 00:05:45,120
of failure in a perfect world we would have two single port NICs so if a NIC

72
00:05:42,160 --> 00:05:49,919
were to die it would still be okay but because we have so many nodes we're not

73
00:05:47,919 --> 00:05:54,039
really worried about an individual node you know they could have one boot drive

74
00:05:51,800 --> 00:05:59,160
and it die or one NIC and it die we still have an extra backup how many

75
00:05:56,800 --> 00:06:04,759
nines do you want I mean I don't know like one would be good 9% which

76
00:06:02,479 --> 00:06:09,440
jokes aside is a really good point if we were architecting this properly there

77
00:06:06,919 --> 00:06:13,360
are so many more considerations that we would need to make like the power coming

78
00:06:11,680 --> 00:06:18,000
into the rack would have to come from two independent backed up sources the

79
00:06:16,000 --> 00:06:22,240
connectivity to our clients would have to be redundant as well the connectivity

80
00:06:20,639 --> 00:06:25,560
between all of the systems would have to be architected in such a way that no

81
00:06:23,639 --> 00:06:30,479
matter what fails everything will stay up and realistically for us we're not

82
00:06:28,800 --> 00:06:35,199
going to get that deep into it because our goal is better than we had before

83
00:06:33,280 --> 00:06:39,520
which was a single machine with its own built-in redundancies but other than

84
00:06:37,319 --> 00:06:43,319
that nothing now at least we should be able to lose a full machine out of these

85
00:06:41,680 --> 00:06:48,759
eight we can restart one of our core switches totally fine two machines out

86
00:06:45,479 --> 00:06:50,919
of these eight and we can still be

87
00:06:48,759 --> 00:06:54,639
limping along I mean limping is a bit of a stretch it's going to be very fast now

88
00:06:53,240 --> 00:06:59,720
normally if you buy a Supermicro machine they're going to pre-build it for you they're going to validate it for

89
00:06:57,039 --> 00:07:04,479
you you can even have them pre-build an entire rack or racks of these things and

90
00:07:02,639 --> 00:07:08,520
then validate your application on it before it ships to you in fact we've got

91
00:07:07,039 --> 00:07:16,680
a whole video that we did about that that was sponsored by Supermicro a little while back of course this is LTT

92
00:07:13,319 --> 00:07:18,879
my friends so we will be assembling this

93
00:07:16,680 --> 00:07:23,160
one ourselves do you like that spin of the screwdriver above the server don't

94
00:07:20,639 --> 00:07:26,919
worry I won't miss I'll never miss see I could do this a hundred times and I

95
00:07:24,360 --> 00:07:31,639
would never miss why no it's fine it's good it's okay we have seven more any

96
00:07:29,160 --> 00:07:36,120
who for our CPU we've gone with an EPYC Genoa

97
00:07:32,639 --> 00:07:40,800
9534 this is a 64 core

98
00:07:36,120 --> 00:07:44,280
128 thread monster of a CPU it'll do 3.7

99
00:07:40,800 --> 00:07:47,159
GHz max boost it has a quarter gigabyte of

100
00:07:44,280 --> 00:07:55,120
level three cache a 300 watt TDP it supports DDR5 memory up to 12 channels

101
00:07:51,240 --> 00:07:58,800
and it supports a whopping 128 lanes of

102
00:07:55,120 --> 00:08:01,400
PCIe Gen 5 originally we were intending

103
00:07:58,800 --> 00:08:06,800
to go with 32 core chips but they were out of stock so free upgrade lucky us

104
00:08:04,879 --> 00:08:12,840
compared to previous generation AMD EPYC CPUs Genoa is a big step up in terms of

105
00:08:09,919 --> 00:08:16,840
IO performance which makes it perfect for this application and in the long

106
00:08:15,120 --> 00:08:21,759
term I mean if we've got all the extra CPU cores and a whole bunch of RAM

107
00:08:19,039 --> 00:08:26,199
anyway why run WEKA on the bare metal when we could install Proxmox and then

108
00:08:23,919 --> 00:08:31,360
use the other cores for I don't know high

109
00:08:27,319 --> 00:08:33,080
availability Plex server yeah Linux ISOs

110
00:08:31,360 --> 00:08:36,599
more realistically it would be something like Active Directory yeah which we

111
00:08:35,320 --> 00:08:40,479
don't really want to do right now because if you run Active Directory on

112
00:08:38,479 --> 00:08:45,399
one server and it goes down you're going to have a really really bad time but if

113
00:08:42,719 --> 00:08:50,040
you run it on a bunch of servers yeah it's good great so normally server CPU

114
00:08:48,399 --> 00:08:53,920
coolers would come with their own thermal paste pre-applied but since

115
00:08:51,880 --> 00:08:57,480
we're doing this ourselves and uh if you look carefully it's not the first time

116
00:08:55,640 --> 00:09:04,079
that it's been installed we are going to be using okay thank you for that a piece

117
00:09:00,200 --> 00:09:07,000
of Honeywell PTM7950 this stuff is

118
00:09:04,079 --> 00:09:12,320
freaking awesome it has great thermal transfer properties and it can handle

119
00:09:09,480 --> 00:09:16,839
varying temperatures like seriously I don't remember many not even just

120
00:09:13,880 --> 00:09:21,600
varying but like a lot of huge cycles for a very very long time now available

121
00:09:19,600 --> 00:09:26,320
lttstore.com is that big enough does that cover all of the CCDs and

122
00:09:23,880 --> 00:09:29,680
CCXs oh there's a second piece of PL am I stupid is there a second piece of

123
00:09:28,040 --> 00:09:32,800
plastic no there isn't should I put one in the fridge no no no it's totally fine

124
00:09:31,279 --> 00:09:36,800
I've done this like a bunch of times yeah oh she's mint look at that see all

125
00:09:35,000 --> 00:09:40,480
right easy I would recommend putting it in the fridge before you use it all

126
00:09:38,880 --> 00:09:45,120
right to ensure we're making the absolute most of our CPU especially in

127
00:09:42,959 --> 00:09:50,399
this high throughput storage workload we're going to be populating all 12 of

128
00:09:47,480 --> 00:09:57,560
our memory channels with 32 gig DIMMs of DDR5 ECC running at 4800 mega

129
00:09:53,440 --> 00:10:02,600
transfers per second that's a total

130
00:09:57,560 --> 00:10:05,279
of 384 three terabytes of memory what

131
00:10:02,600 --> 00:10:11,040
across all eight oh each of the cables Jake is removing

132
00:10:07,680 --> 00:10:13,160
right now is a PCIe x8 cable that feeds

133
00:10:11,040 --> 00:10:18,200
two of the drive bays in the front but the reason he's taking them out is that

134
00:10:15,079 --> 00:10:20,600
we can install our boot drives these are

135
00:10:18,200 --> 00:10:26,399
consumer grade each system is getting two Sabrent 512 gig Gen 3 Rocket drives and

136
00:10:24,760 --> 00:10:30,560
it's not because they're particularly special in any meaningful way they're

137
00:10:28,519 --> 00:10:35,800
not even that fast by modern standards but what they are is from our experience

138
00:10:32,839 --> 00:10:39,320
reliable enough and they are fast enough for what we're going to be doing which

139
00:10:36,880 --> 00:10:43,399
is just booting our operating system off of them movie magic all of the other

140
00:10:41,240 --> 00:10:46,800
nodes are already built so what do you mean movie magic Supermicro built them

141
00:10:45,120 --> 00:10:51,200
Oh I thought you built them Supermicro builds them for you I took it apart okay

142
00:10:49,079 --> 00:10:55,279
fine I took that one apart no secrets left anymore yep no intrigue no mystery

143
00:10:53,959 --> 00:11:01,440
you know what is still mysterious is inside of here I've actually never opened this before Oh okay let's have a

144
00:10:57,959 --> 00:11:02,920
look woo holy oh that's power supplies

145
00:11:01,440 --> 00:11:07,279
yeah this is so cool so the whole computer is cooled by four fans no way

146
00:11:05,839 --> 00:11:12,760
there's the two power supply fans and then these fans in their what do they call this like IO module I think is what

147
00:11:10,639 --> 00:11:16,600
they call it look at the blades on this thing counter rotating you're serious

148
00:11:14,720 --> 00:11:22,000
that's what you're looking at not this the most delicate of spaghet oh my God

149
00:11:19,560 --> 00:11:28,320
there's not even connectors every one of these wires is soldered directly to the

150
00:11:24,720 --> 00:11:29,959
back of the OCP 3.0 what yeah for

151
00:11:28,320 --> 00:11:37,839
storage we're installing two of Kioxia's speedy CD6 Gen 4 NVMe drives in

152
00:11:34,040 --> 00:11:40,320
each node so we've got one that is 7

153
00:11:37,839 --> 00:11:44,240
terabytes and another one that is 15 terabytes they're kind of placeholders

154
00:11:42,760 --> 00:11:50,079
for now and in the long term we're going to switch to something in the neighborhood of about four 15 terabyte drives

155
00:11:48,480 --> 00:11:55,920
per node but the drives we want to use are currently occupied by oh that

156
00:11:52,720 --> 00:11:57,880
project by a top secret pastry related

157
00:11:55,920 --> 00:12:02,240
project so that's going to have to wait the good news is that when those drives

158
00:11:59,800 --> 00:12:06,160
become available WEKA supports live upgrading and downgrading so we can just

159
00:12:04,560 --> 00:12:10,800
pull these drives swap in the new ones pull swap pull swap pull swap as long as

160
00:12:08,079 --> 00:12:13,920
we uh don't do it all at once are we ready to fire these things up okay

161
00:12:12,320 --> 00:12:17,959
there's a lot going on here what is that is that a switch y hey look you can see

162
00:12:15,600 --> 00:12:25,560
the button now oh that's cool what you're hearing so far is just

163
00:12:21,680 --> 00:12:28,079
the NVIDIA SN3700 32 port 200 gig

164
00:12:25,560 --> 00:12:32,120
switch oh my God it even says Mellanox on the front I know maybe it's an old like

165
00:12:30,000 --> 00:12:35,800
review sample demo unit we got it with the $1 million PC and I'm pretty sure

166
00:12:34,240 --> 00:12:42,000
that that was already NVIDIA at that point can you hear that you hear it getting louder yeah

167
00:12:39,160 --> 00:12:46,880
who well that one's just excited to see this is the WEKA dashboard maybe if I go

168
00:12:44,440 --> 00:12:52,360
over here cluster servers we can see all of our servers we have two drives per

169
00:12:50,880 --> 00:12:56,800
and then cores this is a very interesting part of how WEKA works it's

170
00:12:54,600 --> 00:13:01,399
not like TrueNAS let's say where it just uses the whole CPU for whatever you're

171
00:12:58,560 --> 00:13:06,880
trying to do they dedicate and like fence off specific cores for specific

172
00:13:04,199 --> 00:13:13,320
tasks for instance each drive gets a core so we've got two drive containers

173
00:13:09,279 --> 00:13:16,600
that means two a full core per drive

174
00:13:13,320 --> 00:13:19,320
yeah damn yeah you also have compute

175
00:13:16,600 --> 00:13:22,880
cores which do like the parity calculation and intercluster communication and then

176
00:13:21,639 --> 00:13:27,680
there's front end which you don't necessarily always have frontend cores

177
00:13:25,000 --> 00:13:31,440
manage connecting to a file system so if you just had drives and compute

178
00:13:29,839 --> 00:13:34,880
you wouldn't be able to access the files on this machine so you would have your

179
00:13:32,839 --> 00:13:39,839
backend servers right those would run drives and compute which is the cluster

180
00:13:37,680 --> 00:13:43,399
and then on your like GPU box you would run just the front end and that would

181
00:13:41,639 --> 00:13:48,079
allow the GPU box to connect to the backend cluster servers oh the back-end

182
00:13:46,399 --> 00:13:54,360
cluster servers don't need to run a front end unless you want to be able to

183
00:13:50,920 --> 00:13:56,839
access the files on that machine or from

184
00:13:54,360 --> 00:14:02,120
that machine which we want to cuz we're using SMB we're using it as a file

185
00:13:59,560 --> 00:14:07,279
server stupid NAS for our stupid Windows machines yeah you can also have a

186
00:14:05,000 --> 00:14:10,360
dedicated front end machine yes so if you had like a 100 backend servers but

187
00:14:09,120 --> 00:14:15,399
then that's adding a single point of failure which is what we're trying to avoid you could have multiple of them

188
00:14:13,399 --> 00:14:20,680
okay you thought they thought of that yeah I set it up so every single machine

189
00:14:18,519 --> 00:14:26,480
in the cluster all eight of them are part of our SMB cluster which means it

190
00:14:23,600 --> 00:14:30,079
cannot go down realistically there are a ton of other file systems out there that

191
00:14:28,399 --> 00:14:35,279
you could use for something like this TrueNAS has their scale out setup for

192
00:14:32,279 --> 00:14:37,079
clustered ZFS which only requires three

193
00:14:35,279 --> 00:14:40,880
nodes and is something we'd be quite interested in trying out or if you're

194
00:14:39,120 --> 00:14:45,560
looking for object storage there's a million options but the main open-

195
00:14:42,920 --> 00:14:49,560
source one MinIO requires only four nodes though when we saw how nuts WEKA

196
00:14:48,160 --> 00:14:57,240
was when we set up the million dollar server cluster I mean we had to try it

197
00:14:52,880 --> 00:15:01,600
out for ourselves and try it out we did

198
00:14:57,240 --> 00:15:04,079
so this is each node no holy

199
00:15:01,600 --> 00:15:09,480
sh look okay the crazy thing is look at the read latency now guys look look hold

200
00:15:06,079 --> 00:15:12,399
on hold on hold on at 70 gigabytes a second

201
00:15:09,480 --> 00:15:17,399
we've seen numbers like this before but we're talking with in some cases double

202
00:15:15,000 --> 00:15:22,560
the number of drives and no file system without a file system like raw to each

203
00:15:19,680 --> 00:15:29,040
drive this is with a file system with a file system over a network and we're

204
00:15:25,600 --> 00:15:30,360
only using 100 gig ports like usually

205
00:15:29,040 --> 00:15:36,319
with a WEKA setup like this you'd probably use 200 yeah cuz we oh my God

206
00:15:33,800 --> 00:15:41,000
we didn't know cuz we didn't even have networking as a factor last time all the

207
00:15:39,399 --> 00:15:45,759
drives were in one box I know this is networking too and the crazy part is

208
00:15:43,319 --> 00:15:52,319
we're not using RDMA this is like um some fancy uh what's it called DPDK I

209
00:15:48,759 --> 00:15:55,959
think is the library this is wild yeah

210
00:15:52,319 --> 00:15:59,399
look at that so read latency 131 microseconds

211
00:15:55,959 --> 00:16:02,040
that's 4 million read IOPS with

212
00:15:59,399 --> 00:16:07,199
a latency of 1 millisecond average are we able to keep using WEKA FS like this

213
00:16:04,639 --> 00:16:12,040
is a trial okay this software is quite expensive this is unreal 4 million IOPS

214
00:16:09,920 --> 00:16:17,720
this is like it is unreal it's way more than we could possibly ever need but

215
00:16:15,240 --> 00:16:20,639
it's cool it's so cool don't they support tiering and everything oh yeah

216
00:16:19,600 --> 00:16:27,920
here I'll show you actually what that looks like this is on mother vault which I think right now has 400 tebibytes

217
00:16:25,279 --> 00:16:33,199
left so let's say max capacity is 400 terabytes now once we run out of the 100

218
00:16:31,279 --> 00:16:38,000
terabytes of SSD capacity which you can see here it'll just it'll tier I mean it

219
00:16:35,959 --> 00:16:42,199
automatically tiers anyways and you do need to make sure that your object store

220
00:16:39,880 --> 00:16:46,160
is at least the same size as the flash or bigger because they're going to

221
00:16:44,160 --> 00:16:53,199
automatically tier everything to it that makes sense so in theory we

222
00:16:48,600 --> 00:16:55,319
move manually copy everything from Vault

223
00:16:53,199 --> 00:17:00,319
one time to WEKA one time because it stores in like 64 megabyte chunks and

224
00:16:58,560 --> 00:17:03,920
then it just stays there forever stays there forever and then we just have one

225
00:17:01,720 --> 00:17:08,120
network share and when something needs to get vaulted you just you just move it

226
00:17:06,280 --> 00:17:11,480
from allow it to Decay yeah you would probably move it from pending projects

227
00:17:09,720 --> 00:17:15,439
to like done or something like that we make a folder for done yeah sure um and

228
00:17:13,640 --> 00:17:20,240
then it will just do it automatically wow or if it's a video that like

229
00:17:18,160 --> 00:17:23,880
somebody was working on and then you know it's been on hold for 3 months and

230
00:17:21,839 --> 00:17:27,600
we shot you know a terabyte of footage it will just and then when we're ready to

231
00:17:25,199 --> 00:17:31,840
work on it it'll promote it back up holy we could net boot off of this

232
00:17:29,480 --> 00:17:36,799
followup video yeah I mean why not it's so fast you literally could not we we

233
00:17:35,080 --> 00:17:41,640
couldn't saturate this now a lot of you at this point must be thinking gosh

234
00:17:39,120 --> 00:17:48,160
Mister that's an awful lot of computers for high availability couldn't you do

235
00:17:44,160 --> 00:17:51,400
this with two and you're not that far

236
00:17:48,160 --> 00:17:53,200
off the old school high availability NetApp

237
00:17:51,400 --> 00:17:59,760
storage appliances like that one we looked at recently did have just two

238
00:17:56,480 --> 00:18:03,400
machines but those were both connected

239
00:17:59,760 --> 00:18:05,960
to the same storage drives if each

240
00:18:03,400 --> 00:18:10,200
system has its own drives then things can get out of sync like let's say if

241
00:18:08,000 --> 00:18:15,120
one machine has downtime you can run into a situation where each system

242
00:18:12,640 --> 00:18:21,600
believes with all the conviction in its heart that it has the correct data and

243
00:18:18,440 --> 00:18:24,159
then if all you have is two how will

244
00:18:21,600 --> 00:18:29,240
they decide who's right this is typically referred to as split brain and

245
00:18:27,200 --> 00:18:35,120
that's why the majority of high availability systems have at bare

246
00:18:31,720 --> 00:18:37,200
minimum three servers this allows the

247
00:18:35,120 --> 00:18:44,159
third system to be a tie breaker of sorts in the case of a disagreement now

248
00:18:40,360 --> 00:18:46,240
in our case WEKA that stupid ultra fast

249
00:18:44,159 --> 00:18:50,679
file system that we're using which unlike anything that we've used before

250
00:18:48,320 --> 00:18:56,919
has been built specifically for NVMe drives not hard drives well it requires

251
00:18:53,880 --> 00:18:59,760
a minimum of six nodes with a

252
00:18:56,919 --> 00:19:04,919
recommendation of eight but running WEKA can still be an advantage video editing

253
00:19:02,159 --> 00:19:09,520
with Adobe Premiere like we use is very latency sensitive and even a small delay

254
00:19:07,880 --> 00:19:15,240
when going to access a clip can be enough to make the software crash so any

255
00:19:12,039 --> 00:19:17,880
Improvement there is huge not to mention

256
00:19:15,240 --> 00:19:25,840
that a pair of these GrandTwins specced out to the max with 128 core EPYC Bergamo

257
00:19:21,159 --> 00:19:29,080
CPUs would get you just four rack units

258
00:19:25,840 --> 00:19:33,200
with 1,000 CPU cores actually a

259
00:19:29,080 --> 00:19:36,559
little more 24 terabytes of DDR5 and up to 3

260
00:19:33,200 --> 00:19:38,799
petabytes of NVMe storage I mean h that

261
00:19:36,559 --> 00:19:43,600
makes our setup seem downright reasonable now the average WEKA customers

262
00:19:41,520 --> 00:19:49,280
are going to be a little more demanding than us visual effects studios AI

263
00:19:46,240 --> 00:19:51,679
developers genomics Labs all the folks

264
00:19:49,280 --> 00:19:54,960
out there that need stupid fast low latency storage and WEKA showed us

265
00:19:53,360 --> 00:20:00,240
screenshots of clusters that were reading in excess of 1 terabyte per second

266
00:19:58,000 --> 00:20:04,760
consistently obviously that was a bigger cluster but it shows you what can be

267
00:20:02,320 --> 00:20:12,080
achieved with this kind of hardware running on I mean what used to be the

268
00:20:07,600 --> 00:20:14,679
crappier option software RAID man I feel

269
00:20:12,080 --> 00:20:20,200
bad even calling it that these days I had an interesting idea with the Supermicro

270
00:20:17,919 --> 00:20:25,600
folks so you know how we have like two petabytes of 13 years worth of

271
00:20:22,840 --> 00:20:31,200
footage thousands and thousands of hours of footage thousands it's really cool

272
00:20:28,280 --> 00:20:36,760
that we have it but it's really hard to use unless you just happen to know what

273
00:20:34,000 --> 00:20:41,760
video the thing you were looking for is in well what if you could just like

274
00:20:38,760 --> 00:20:43,240
search for something Linus Sebastian I

275
00:20:41,760 --> 00:20:47,919
want every clip with Linus Sebastian in it wow bam look at that shot up and

276
00:20:46,159 --> 00:20:51,440
let's say you know there's this one that's uh detected that it's you

277
00:20:49,559 --> 00:20:58,720
throughout the entire clip yeah you're in a chair so you could search for clips

278
00:20:54,559 --> 00:21:00,000
of Linus sitting down with a keyboard

279
00:20:58,720 --> 00:21:05,360
yeah like we're going to be able to actually find stuff yeah right now there

280
00:21:02,799 --> 00:21:09,640
is a finite amount of objects that are trained I mean chihuahua let me scroll

281
00:21:07,799 --> 00:21:13,120
through this it's a lot eventually you'll be able to train it and tell it

282
00:21:11,559 --> 00:21:17,679
hey this is what a computer fan looks like or this is what an SSD looks like

283
00:21:15,080 --> 00:21:23,520
oh my God that is so cool so wait is this running on these extra CPU cores or

284
00:21:21,039 --> 00:21:29,240
okay no not right now faces and logos are running on CPU yeah objects OCR and

285
00:21:26,559 --> 00:21:32,640
scenes run on GPU got it but they're not running on any of those machines they're

286
00:21:30,720 --> 00:21:36,679
running on a GPU workstation that super micro sent that's sitting at my desk um

287
00:21:34,440 --> 00:21:42,520
it was Heavy anyways what is happening on that new server is proxies because if

288
00:21:39,600 --> 00:21:47,200
we were to analyze the original Clips oh AAL formatting is a huge problem when

289
00:21:45,279 --> 00:21:51,200
you go into an AI model it might not necessarily support the codec that

290
00:21:48,679 --> 00:21:56,080
you're filming in sure but also Clips are like hundreds of megabytes a second

291
00:21:53,840 --> 00:21:59,520
potentially that would take forever so instead it generates proxies of

292
00:21:58,000 --> 00:22:04,360
everything first first which we're dumping to that new server and then we

293
00:22:02,279 --> 00:22:11,400
can take advantage of the Lightning Fast storage yeah you can we have 2.6 massive

294
00:22:08,799 --> 00:22:18,200
compute and we can basically create like a proxy map of what everything is in the

295
00:22:15,360 --> 00:22:23,120
main archive right that is so cool so far I've generated 2.6 terab of proxies

296
00:22:20,880 --> 00:22:27,760
which might not sound like a lot but they're only 5 megabit so it's actually

297
00:22:25,960 --> 00:22:34,880
like a lot this is going to be a flipping game changer news

298
00:22:30,960 --> 00:22:37,679
sports can you imagine you're CNN you want

299
00:22:34,880 --> 00:22:42,279
that person wearing a red tie yeah but right now we've done 25,000 so 2.6

300
00:22:39,840 --> 00:22:46,799
terabytes is 25,000 Pro okay well let's try and find something oh hold on once

301
00:22:44,679 --> 00:22:52,600
you've generated a proxy you have to then analyze it right ah so the analysis

302
00:22:50,400 --> 00:22:56,760
is not done no not even close I've analyzed 22 Clips okay everything with

303
00:22:54,880 --> 00:23:01,159
Elijah Elijah and this is the every clip that

304
00:22:59,120 --> 00:23:07,360
Elijah's in and you can even see this is so cool this is the actual MAM as they

305
00:23:04,039 --> 00:23:09,480
call it media asset manager the Axle AI

306
00:23:07,360 --> 00:23:12,960
guys built this before it was like AI as far as I'm aware back when you would

307
00:23:11,240 --> 00:23:18,320
have had to make comments like this manually now it's just AI so all of the

308
00:23:16,200 --> 00:23:23,760
data is in here now and we can see here's Adam and Elijah oh that's so cool

309
00:23:21,799 --> 00:23:28,559
here's all the different objects chair flower pot microphone oh let me show you

310
00:23:27,120 --> 00:23:32,840
the scene understanding thing cuz that is so cool this is like brand new thing

311
00:23:30,559 --> 00:23:38,720
they barely even worked it in but it basic it basically takes a snapshot

312
00:23:35,760 --> 00:23:43,159
every seconds two men are working on a project in a room there is a speaker

313
00:23:41,240 --> 00:23:47,279
stereo equipment there's a faucet there's a tripod there's the tripod some

314
00:23:45,200 --> 00:23:51,919
of these are a little less accurate two men are working on a robot in a room it

315
00:23:49,720 --> 00:23:55,600
kind of looks like a robot you I mean yeah sure two men are in a workshop

316
00:23:53,640 --> 00:23:59,840
looking at a laptop computer looking at a machine there is person Alex Clark so

317
00:23:58,120 --> 00:24:06,520
this is just running right now in real time like more stuff is getting processed as see here processing logos 9

318
00:24:04,640 --> 00:24:09,520
there it is processing logos and faces it's going to take a while yeah it's

319
00:24:08,320 --> 00:24:15,279
going to take forever they're still working on making it function on

320
00:24:11,799 --> 00:24:17,440
multiple GPUs so once we can get it

321
00:24:15,279 --> 00:24:22,080
running on like four GPUs say one GPU is doing face detection one's doing scene

322
00:24:19,919 --> 00:24:25,200
analysis one's doing object detection or something like that we'll be able to go

323
00:24:23,520 --> 00:24:30,360
a lot faster but right now it's just one GPU got it but this is so cool all

324
00:24:27,919 --> 00:24:33,679
that's left is to deploy it Linus had to run away to do some other stuff so I've

325
00:24:31,720 --> 00:24:38,200
hired some backup cavalry Sean our infrastructure administrator except

326
00:24:36,159 --> 00:24:42,880
we've run into a bit of a problem Linus and me in our infinite wisdom while we

327
00:24:40,080 --> 00:24:46,720
were making this rack so much better ran a bunch of cables right where we need to

328
00:24:45,039 --> 00:24:51,399
put the server did we just start unplugging no yeah how are we even going

329
00:24:49,159 --> 00:24:54,520
to do this we have to like part the seas exactly I started to try to move some of

330
00:24:53,080 --> 00:24:59,159
the cables out of the way but they're all twisted together so hopefully the

331
00:24:56,919 --> 00:25:04,600
LTT cable management thing which you can finally get at lttstore.com will save

332
00:25:01,720 --> 00:25:11,600
us beautiful cable managed we can slide a server in there now I hope you're in

333
00:25:08,520 --> 00:25:14,880
yeah it's on ow ow ow ow ow ow ow okay

334
00:25:11,600 --> 00:25:18,120
you're good just go that wasn't so bad

335
00:25:14,880 --> 00:25:18,120
like made for it

336
00:25:18,320 --> 00:25:24,840
next hey we're in now we just have to

337
00:25:22,039 --> 00:25:28,000
run a million cables uh-oh do you notice anything different well it's loud most

338
00:25:26,480 --> 00:25:31,279
of that's actually just the vent is on one of the air conditioners is broken

339
00:25:29,799 --> 00:25:34,840
again but do you notice anything different I mean the sticker's here that

340
00:25:33,360 --> 00:25:39,600
that sticker's been there for years seriously you haven't noticed anything else well you guys uh screwed something

341
00:25:37,640 --> 00:25:44,640
onto the oh did you put Sonopan behind it yeah but I thought this is supposed

342
00:25:41,559 --> 00:25:46,799
to be a vented door my original plan was

343
00:25:44,640 --> 00:25:52,159
to get rid of the vent that you put in but that vent was there as a backup in

344
00:25:49,000 --> 00:25:53,679
case the HVAC ever failed so that fan is

345
00:25:52,159 --> 00:25:57,480
the exhaust and that's the intake you see all the gaps F God there gaps but do

346
00:25:55,799 --> 00:26:01,559
you notice the sound difference yeah it's a big difference it's huge but that

347
00:25:59,360 --> 00:26:05,039
server is so loud we basically ended up where we

348
00:26:03,240 --> 00:26:11,159
started yeah but that's okay I was just trying to normalize I just mean I didn't

349
00:26:06,880 --> 00:26:14,799
make it worse it's not that okay look at

350
00:26:11,159 --> 00:26:17,679
that woo cute right God that's a lot of

351
00:26:14,799 --> 00:26:22,640
metal if all goes to plan we could get rid of this and this and just have these

352
00:26:20,200 --> 00:26:27,000
so no more additional rack taken up which is nice

353
00:26:27,000 --> 00:26:32,440
wow it should sustain two entire servers

354
00:26:30,679 --> 00:26:36,760
dropping out without anyone even noticing do you really want to test it

355
00:26:34,919 --> 00:26:42,559
right now I like I haven't tried that all right here we go what could go

356
00:26:39,039 --> 00:26:44,720
wrong uh I mean a lot the fact that all

357
00:26:42,559 --> 00:26:49,159
the fans just like turned down a bit is a little scary let's go see if anyone

358
00:26:46,840 --> 00:26:53,760
noticed oh hi Mark hi I'm holding your file server how's your edit going uh

359
00:26:51,480 --> 00:26:57,799
what huh is it working it's working is this on Wi-Fi hey Emily hey how's your

360
00:26:56,159 --> 00:27:03,120
edit going I'm holding your server that's cool is it working are you sure yeah Hoffman

361
00:27:01,760 --> 00:27:06,840
what's up how's your edit going this is your server right here it's amazing look

362
00:27:04,880 --> 00:27:09,320
feel it it's still warm wow yeah it's it's still warm how well how's it

363
00:27:07,919 --> 00:27:14,760
working it's great you know I'm editing the video that we're shooting you are yeah uh we're going to pull another one

364
00:27:12,159 --> 00:27:17,880
wait no l you forgot one yeah here here's another here's another one of

365
00:27:16,080 --> 00:27:23,360
your servers is it working it's great though huhuh for reference you're not

366
00:27:20,799 --> 00:27:28,520
supposed to do this you should power off the system first but we're just trying

367
00:27:25,360 --> 00:27:31,440
to simulate it failing yeah a terrible

368
00:27:28,520 --> 00:27:35,279
catastrophic failure I can't believe how smoothly it handled that see all the

369
00:27:33,159 --> 00:27:38,480
lights they never stopped blinking big thanks to Supermicro for these awesome

370
00:27:36,760 --> 00:27:42,600
servers thanks to WEKA for making this crazy software thanks to Axle for the

371
00:27:40,640 --> 00:27:45,960
awesome AI detection if you like this video maybe check out the video series

372
00:27:44,480 --> 00:27:49,840
of us building our nearly three petabytes of archival storage which we

373
00:27:47,960 --> 00:27:54,279
call the Mother Vault that thing is awesome and we showed it to you and it's

374
00:27:51,919 --> 00:27:58,080
faster now oh and thanks to you for being an awesome viewer
