1
00:00:00,160 --> 00:00:07,200
when i signed off at the end of the video about our amazing fast new

2
00:00:04,400 --> 00:00:11,759
all SSD storage server i thought it was as simple as okay let's

3
00:00:09,760 --> 00:00:14,880
load the final os on this thing chuck it in the server room we're ready to start

4
00:00:13,040 --> 00:00:20,240
editing off of it but it wasn't so our story begins with

5
00:00:18,080 --> 00:00:24,480
some short video clips that i sent over to wendell from Level1Techs

6
00:00:21,760 --> 00:00:32,880
complaining hey about Windows storage spaces on our new 24 drive NVMe server

7
00:00:29,840 --> 00:00:35,120
machine here because what was happening

8
00:00:32,880 --> 00:00:40,239
was while i was copying files to what should be one of the fastest storage

9
00:00:37,520 --> 00:00:44,960
servers on the freaking planet i was getting great performance sometimes and

10
00:00:42,879 --> 00:00:50,320
then rock bottom performance others we're talking like

11
00:00:46,640 --> 00:00:52,480
10 20 30 megabytes a second so wendell

12
00:00:50,320 --> 00:00:58,079
dug into the system logs and discovered that there was some kind of a problem at

13
00:00:54,719 --> 00:01:01,680
the driver or pci express level where it

14
00:00:58,079 --> 00:01:04,159
was actually resetting individual drives

15
00:01:01,680 --> 00:01:09,680
like they were effectively timing out for seconds at a time while the data was

16
00:01:07,439 --> 00:01:14,000
in flight and then the poor array would be sitting there trying to figure out

17
00:01:11,280 --> 00:01:18,720
what to do while a drive is effectively MIA then the drive reset would finish

18
00:01:17,040 --> 00:01:22,400
which is essentially like if you were pulling a drive out for like two seconds

19
00:01:20,320 --> 00:01:27,040
and then popping it back in and then the transfer would roll along at multiple

20
00:01:24,640 --> 00:01:32,159
hundreds of megabytes a second or we even saw at times numbers as high as 20

21
00:01:30,080 --> 00:01:35,200
plus gigabytes a second in CrystalDiskMark

22
00:01:33,360 --> 00:01:40,400
then it would hitch again rinse and repeat obviously i can't deploy it like

23
00:01:37,759 --> 00:01:44,960
that so i thought it was my knowledge of Windows storage spaces or lack thereof

24
00:01:42,880 --> 00:01:50,799
and that i had configured it wrong but then the mystery deepened so this dropping

25
00:01:48,640 --> 00:01:56,159
out behavior actually happened with a simple Windows software raid with just

26
00:01:53,840 --> 00:02:00,719
four devices in it i mean that's a relatively pedestrian

27
00:01:58,079 --> 00:02:03,920
16 gigabytes a second by the way guys our sponsor for this

28
00:02:02,240 --> 00:02:08,479
video Pulseway with Pulseway you can remotely monitor manage and control all

29
00:02:05,840 --> 00:02:12,959
your Windows mac and Linux machines from one app create your free account today

30
00:02:10,560 --> 00:02:16,640
at the link below so we tried all the usual things we tried updating the

31
00:02:15,040 --> 00:02:21,120
drivers it was using the microsoft drivers we put the latest Intel drivers

32
00:02:18,879 --> 00:02:24,640
for these NVMe devices onto the system that didn't work we tried tweaking the

33
00:02:22,800 --> 00:02:29,440
power management to prevent the pci express lanes from switching to lower

34
00:02:27,599 --> 00:02:33,840
speeds when we were accessing all the drives and that could be a desirable

35
00:02:32,000 --> 00:02:37,760
behavior because there's so many drives in here that you're going to run into

36
00:02:35,519 --> 00:02:42,319
other system bottlenecks before you could possibly hope to use all the

37
00:02:39,200 --> 00:02:44,879
bandwidth of even a pci gen 3 link so

38
00:02:42,319 --> 00:02:49,280
gen 2 could be a pretty good bet but when it's happening automatically this

39
00:02:46,879 --> 00:02:53,840
speed switching takes time and that could be part of

40
00:02:51,599 --> 00:02:57,120
what's causing the problems but neither of those things or both of them were

41
00:02:55,360 --> 00:03:02,159
able to solve the problem and we only got a small improvement in the behavior

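for reference the kind of check being described here looks roughly like this on Linux — reading the negotiated versus maximum pci express link speed for each NVMe drive plus the ASPM link power management policy (a rough sketch, not the exact steps from the video; device paths may vary):

#!/usr/bin/env python3
# rough sketch (not from the video): show the negotiated vs maximum
# PCIe link speed for each NVMe controller, plus the ASPM policy
from pathlib import Path

# the active ASPM policy is the bracketed entry,
# e.g. "[default] performance powersave powersupersave"
aspm = Path("/sys/module/pcie_aspm/parameters/policy")
if aspm.exists():
    print("ASPM policy:", aspm.read_text().strip())

for ctrl in sorted(Path("/sys/class/nvme").glob("nvme*")):
    pci = ctrl / "device"  # symlink to the underlying PCI device
    try:
        cur = (pci / "current_link_speed").read_text().strip()
        mx = (pci / "max_link_speed").read_text().strip()
    except OSError:
        continue
    print(f"{ctrl.name}: running at {cur} (max {mx})")
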
42
00:02:59,040 --> 00:03:05,120
so wendell suggested gee why don't we go

43
00:03:02,159 --> 00:03:09,760
over to Linux as he tends to do but then get this we got the same dropouts on

44
00:03:08,480 --> 00:03:14,480
Linux that seemed to suggest a hardware issue

45
00:03:12,480 --> 00:03:18,080
of some sort so guys this is why i ultimately made this

46
00:03:16,159 --> 00:03:21,840
video about it because this is pretty dry technical stuff for a lot of people

47
00:03:19,920 --> 00:03:28,480
but i thought it was fascinating NVMe is already so fast that a lot of

48
00:03:25,760 --> 00:03:32,000
stuff particularly software is not engineered for it which is turning out

49
00:03:30,159 --> 00:03:36,560
to be a bit of an industry-wide problem and when you take 24 of these drives

50
00:03:34,560 --> 00:03:42,560
that are capable of multiple gigabytes a second on paper that is now 24 times the

51
00:03:40,239 --> 00:03:47,040
problem think about it this way even with eight channels of memory which is

52
00:03:44,640 --> 00:03:53,120
pretty impressive the theoretical maximum memory bandwidth of our system

53
00:03:49,599 --> 00:03:54,959
here is around 200 gigabytes a second

54
00:03:53,120 --> 00:04:00,319
and real world you're looking at more like 100 to 150 gigabytes a second

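to put numbers on that here's the back-of-napkin math for this section as a quick sketch (assuming DDR4-3200 memory and the roughly four gigabyte per second on-paper per-drive read figure that comes up next — those two inputs are my assumptions, not exact figures from the video):

# back-of-napkin bandwidth math (assumptions: DDR4-3200,
# ~4.2 GB/s on-paper reads per NVMe drive)
channels = 8
mem_bw = channels * 3200e6 * 8   # 3200 MT/s x 8 bytes/transfer ~= 204.8 GB/s
drives = 24
array_bw = drives * 4.2e9        # ~= 100 GB/s aggregate on-paper reads
print(f"memory (theoretical): {mem_bw / 1e9:.0f} GB/s")
print(f"array (on paper):     {array_bw / 1e9:.0f} GB/s")
print(f"array vs memory:      {array_bw / mem_bw:.0%}")  # nearly half
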
55
00:03:58,560 --> 00:04:05,599
now let's talk about this storage array here this

56
00:04:02,000 --> 00:04:08,720
is capable on paper of about a hundred

57
00:04:05,599 --> 00:04:12,239
gigabytes a second in reads so we would

58
00:04:08,720 --> 00:04:13,680
need assuming perfect efficiency which

59
00:04:12,239 --> 00:04:19,519
obviously never happens in the real world nearly half of our memory bandwidth just

60
00:04:18,400 --> 00:04:24,639
to handle shifting data around when we're reading

61
00:04:22,000 --> 00:04:28,880
or writing to our storage array that's ridiculous and even the Linux

62
00:04:27,520 --> 00:04:33,600
kernel is going to be on the struggle bus when you're talking about that much

63
00:04:31,040 --> 00:04:37,360
data as wendell so succinctly put it because

64
00:04:34,400 --> 00:04:42,720
here's the way it's supposed to work the operating system kernel asks for

65
00:04:40,080 --> 00:04:47,120
some chunk of data let's say a lewd of your waifu to enjoy on your lunch break

66
00:04:44,400 --> 00:04:51,440
all right the disk says yep no problem but NAND flash is pretty slow so i'm

67
00:04:49,360 --> 00:04:55,280
going to need a sec to load that into my buffer i'll let you know when it's ready

68
00:04:53,520 --> 00:05:00,320
the disk gets everything ready loaded into the buffer and then it sends what's

69
00:04:57,280 --> 00:05:02,479
called an interrupt to the CPU to say

70
00:05:00,320 --> 00:05:06,080
hey all right it's chill you can swing by and grab that data now

71
00:05:04,800 --> 00:05:12,400
but here's the problem we're running into if the CPU core that the interrupt was

72
00:05:09,840 --> 00:05:17,440
intended for is too busy doing something else or it gets put to sleep or it gets

73
00:05:15,360 --> 00:05:22,400
reassigned to some other task in the middle of this process which can be

74
00:05:19,600 --> 00:05:28,160
quite common on multi-core cpus that interrupt never arrives your processor

75
00:05:25,759 --> 00:05:35,039
never goes and gets the data and the whole train comes to a screeching halt

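if you want to see this interrupt business for yourself, a quick sketch that tallies the per-core completion interrupt counts for each nvme queue out of /proc/interrupts (Linux only; the standard field layout is assumed):

#!/usr/bin/env python3
# sketch: sum NVMe completion interrupts per IRQ line from /proc/interrupts
# to see which cores are fielding them (standard Linux layout assumed)
with open("/proc/interrupts") as f:
    cpus = f.readline().split()  # header row: CPU0 CPU1 ...
    for line in f:
        if "nvme" not in line:
            continue
        fields = line.split()
        irq = fields[0].rstrip(":")
        counts = [int(x) for x in fields[1:1 + len(cpus)]]
        busiest = max(range(len(counts)), key=counts.__getitem__)
        print(f"irq {irq}: {sum(counts)} total, busiest core {cpus[busiest]}")
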
76
00:05:31,520 --> 00:05:38,000
and that is why we had no issues last

77
00:05:35,039 --> 00:05:43,199
video slamming the individual drives with data but then as soon as we put a

78
00:05:41,759 --> 00:05:47,520
file system you know as soon as we started running a

79
00:05:44,880 --> 00:05:50,880
zfs raid and our CPU was doing parity calculations while we were reading and

80
00:05:49,039 --> 00:05:57,280
writing to the array making the CPU actually do any work we were getting

81
00:05:53,199 --> 00:06:00,720
crippling errors all over the place

82
00:05:57,280 --> 00:06:03,120
so AWS just rolled out NVMe and there

83
00:06:00,720 --> 00:06:07,360
are a ton of threads about issues under heavy loads suggesting that this appears

84
00:06:05,280 --> 00:06:12,240
to be an industry-wide problem and the dumbest part of this is that i don't

85
00:06:09,759 --> 00:06:17,199
actually even need my server to be this fast i'm only hitting it with a 40

86
00:06:15,039 --> 00:06:22,080
gigabit connection here that's only four gigabytes a second maximum so wendell

87
00:06:19,680 --> 00:06:26,960
actually even thought of turning down the pci express links to gen 2 and just

88
00:06:24,800 --> 00:06:31,039
leaving them there Gigabyte meanwhile the makers of this server were like sorry

89
00:06:28,960 --> 00:06:34,160
wait you want a speed limiter on this thing but then wendell ended up finding

90
00:06:32,720 --> 00:06:39,039
a software way to do it but then it turned out there was a kernel bug something something something ultimately

91
00:06:37,039 --> 00:06:43,199
it didn't pan out and it didn't work anyway that's okay because Linux already

92
00:06:41,919 --> 00:06:49,840
has kind of a solution to this now very very high

93
00:06:47,919 --> 00:06:54,960
speed devices like RAM based caching devices operate in a

94
00:06:52,479 --> 00:07:00,080
completely different mode called polling where the kernel essentially assumes

95
00:06:57,280 --> 00:07:04,479
that the device is so fast that the data is going to be ready right away and it

96
00:07:02,319 --> 00:07:07,599
would add a lot of overhead to do this on slower drives because there'd be a

97
00:07:05,840 --> 00:07:13,199
lot of pointless hey are you done yet hey are you done yet so a single NVMe

98
00:07:10,080 --> 00:07:14,960
doesn't need to be polled but 24

99
00:07:13,199 --> 00:07:18,080
oh there's an argument to be made for operating in that mode

100
00:07:16,479 --> 00:07:22,800
so here's the mitigation that wendell implemented when possible the kernel is

101
00:07:20,800 --> 00:07:26,880
going to wait for the interrupt because that's the most efficient thing but if

102
00:07:24,720 --> 00:07:30,960
it waits for too long the queuing algorithm will just have the CPU poll

103
00:07:28,720 --> 00:07:36,319
the drive rapidly and say hey do you have that do you have that okay

104
00:07:32,880 --> 00:07:36,319
great i'm going to take that now

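the exact tuning wendell did isn't shown on screen, but the related block-layer knobs on Linux look roughly like this — io_poll turns on polled completions and io_poll_delay picks the flavor (a sketch only; it assumes a kernel with nvme poll queues configured, e.g. via the nvme.poll_queues module parameter, and the device name is a placeholder):

#!/usr/bin/env python3
# sketch of the interrupt-vs-polling knobs in the Linux block layer
# (assumes nvme poll queues are configured; nvme0n1 is a placeholder)
from pathlib import Path

q = Path("/sys/block/nvme0n1/queue")
print("io_poll:      ", (q / "io_poll").read_text().strip())
print("io_poll_delay:", (q / "io_poll_delay").read_text().strip())

# io_poll_delay: -1 = classic busy-poll, 0 = adaptive "hybrid" polling
# (sleep roughly the expected completion time, then poll),
# >0 = sleep that many nanoseconds before polling
(q / "io_poll").write_text("1")
(q / "io_poll_delay").write_text("0")
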
105
00:07:37,360 --> 00:07:42,240
all that tweaking and learning means that our final config ended up being

106
00:07:40,639 --> 00:07:48,319
quite different from the initial intention so we're using the latest version of proxmox a Linux distro that's

107
00:07:46,160 --> 00:07:53,840
designed for virtualization with zfs support out of the box and while we had

108
00:07:50,639 --> 00:07:57,440
actually initially intended to use zfs

109
00:07:53,840 --> 00:08:00,639
we were hitting 100% utilization on a 24

110
00:07:57,440 --> 00:08:03,919
core 48 thread CPU and doing

111
00:08:00,639 --> 00:08:05,199
best case scenario assuming that the bug

112
00:08:03,919 --> 00:08:09,759
didn't surface 10 gigs a second reads 4 gigs a second

113
00:08:07,599 --> 00:08:13,120
writes which would have actually been fine remember we've only got a 40

114
00:08:11,280 --> 00:08:17,360
gigabit network connection except that the access latency was not really

115
00:08:15,199 --> 00:08:23,759
suitable for a multi-user video editing environment it was over 150 microseconds

116
00:08:21,360 --> 00:08:29,120
and the craziest part of that is that we actually hit those numbers even with

117
00:08:26,319 --> 00:08:34,399
some pretty esoteric tweaks like disabling arc compression i mean most

118
00:08:31,759 --> 00:08:39,279
seasoned zfs users would freak out about doing that but the problem is that arc

119
00:08:37,200 --> 00:08:43,760
compression makes three copies of the data in memory while you are writing and

120
00:08:42,240 --> 00:08:47,839
remember how much left over memory bandwidth we have

121
00:08:45,360 --> 00:08:52,560
so yeah tripling the load there ain't gonna fly so new plan

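quick aside before the new plan: that arc compression switch most likely maps to OpenZFS's compressed-arc module parameter on Linux — that's my assumption, the video doesn't show the command — so the experiment would have looked something like this:

#!/usr/bin/env python3
# sketch: disable OpenZFS compressed ARC (assumption: this is the
# "arc compression" tweak described; Linux OpenZFS, run as root)
from pathlib import Path

param = Path("/sys/module/zfs/parameters/zfs_compressed_arc_enabled")
print("compressed ARC enabled:", param.read_text().strip())
param.write_text("0")  # 0 = keep ARC buffers uncompressed in memory
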
122
00:08:50,560 --> 00:08:57,279
Linux md (multi-disk) ain't perfect it's Linux's own built-in

123
00:08:55,040 --> 00:09:02,080
software raid and the main disadvantage is that in the event of an unexpected

124
00:08:59,120 --> 00:09:06,399
shutdown it'll be really slow for the 30 minutes or so that it takes to resync

125
00:09:04,240 --> 00:09:11,680
four terabytes of data but that should be fine i mean that's what

126
00:09:08,399 --> 00:09:13,440
the seventeen thousand dollar battery

127
00:09:11,680 --> 00:09:19,920
backup in this room is supposed to be for so we settled on four striped

128
00:09:17,360 --> 00:09:24,800
software raid fives and the next experiment was to play

129
00:09:22,240 --> 00:09:29,440
around with the chunk size so that's how the blocks of data are broken up on the

130
00:09:27,120 --> 00:09:34,560
raid as well as the block size which is on the file system level so the default

131
00:09:31,600 --> 00:09:38,320
raid chunk size is 512k and the file system is 64k

132
00:09:36,640 --> 00:09:43,680
but when we were running benchmarks based on an editor usage pattern we

133
00:09:40,640 --> 00:09:45,920
actually found that the 512k chunks were

134
00:09:43,680 --> 00:09:50,399
a little bit higher latency than we'd like to see which is really really

135
00:09:48,800 --> 00:09:54,480
important when you're you know scrubbing through files on a timeline so we

136
00:09:52,000 --> 00:09:59,519
actually ended up using 128k for both which happens to line up with

137
00:09:56,560 --> 00:10:02,160
the buffer size on these devices perfect

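for anyone trying to picture the layout, here's a sketch of building it with mdadm — four raid 5 sets with a 128k chunk, striped together into one raid 0 (the device names and the even six-drive split are my assumptions; these are not the exact commands from the video):

#!/usr/bin/env python3
# sketch of the array described: four RAID5 sets striped into a RAID0
# ("RAID50") with 128K chunks; names and grouping are placeholders
import subprocess

def run(cmd):
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

nvme = [f"/dev/nvme{i}n1" for i in range(24)]

for g in range(4):  # four 6-drive RAID5s with a 128K chunk
    members = nvme[g * 6:(g + 1) * 6]
    run(["mdadm", "--create", f"/dev/md{g}", "--level=5",
         "--raid-devices=6", "--chunk=128"] + members)

# stripe the four RAID5s together into one volume
run(["mdadm", "--create", "/dev/md4", "--level=0",
     "--raid-devices=4", "--chunk=128",
     "/dev/md0", "/dev/md1", "/dev/md2", "/dev/md3"])
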
138
00:10:00,560 --> 00:10:06,720
now the conventional wisdom for accessing very large files over a

139
00:10:04,240 --> 00:10:12,000
network share would actually be to use a very large chunk size like even one

140
00:10:09,680 --> 00:10:17,519
megabyte but while that would be great for ingesting like big batches of new

141
00:10:15,040 --> 00:10:21,760
footage when we're skipping around rather than reading them sequentially

142
00:10:19,360 --> 00:10:27,040
with many users doing that all at the same time it actually makes sense that this

143
00:10:24,800 --> 00:10:31,760
would work well and experimentally so far it seems pretty good i just

144
00:10:30,160 --> 00:10:35,519
realized i forgot a drive upstairs i'm gonna go grab that

145
00:10:33,760 --> 00:10:40,399
with multi-disc we ended up with a maximum throughput of around 16

146
00:10:37,519 --> 00:10:43,839
gigabytes per second reads and eight gigabytes per second writes which

147
00:10:42,160 --> 00:10:48,079
obviously is way less than the maximum this hardware

148
00:10:46,320 --> 00:10:51,360
can theoretically do but there's a lot of overhead to contend

149
00:10:49,839 --> 00:10:56,399
with and besides that doesn't mean that there's no benefit to having all of this

150
00:10:54,480 --> 00:11:00,399
performance in reserve so the latency advantage is something

151
00:10:58,640 --> 00:11:04,240
that we've already talked about we've actually seen that high latency storage

152
00:11:02,560 --> 00:11:09,839
can cause instability in the video editing software that's accessing it but

153
00:11:06,480 --> 00:11:13,519
another benefit counter intuitively is

154
00:11:09,839 --> 00:11:16,480
that because the storage is so fast

155
00:11:13,519 --> 00:11:22,240
no one chiplet on our CPU can keep up with it a disadvantage of a chiplet

156
00:11:19,279 --> 00:11:28,959
design is that it's got huge horsepower but it's hard to harness all of it for

157
00:11:25,200 --> 00:11:31,440
one single task like a file copy from

158
00:11:28,959 --> 00:11:36,720
one user over the network with that said it's great as a multi-user experience

159
00:11:34,320 --> 00:11:41,360
because each discrete user like let's say a camera operator who's dumping RED

160
00:11:39,040 --> 00:11:46,000
footage and a video editor who also has to work at the same time

161
00:11:43,040 --> 00:11:51,200
end up having their access spread over multiple chiplets that are individually

162
00:11:49,120 --> 00:11:57,680
kind of limited so we had remember guys 150 gigabytes

163
00:11:55,200 --> 00:12:02,160
per second of memory bandwidth one chiplet can't get at all of it so when

164
00:12:00,160 --> 00:12:07,760
we have one user copying a file over the network that user can only get to one or

165
00:12:05,279 --> 00:12:11,839
two cores so there's no way that user can monopolize all the resources on the

166
00:12:10,079 --> 00:12:16,399
system because of the way the whole thing is architected all of this in

167
00:12:14,240 --> 00:12:19,680
theory so far we haven't actually thrown our editors at it so let's go see if

168
00:12:17,839 --> 00:12:22,800
it's booted up and get them to try it

169
00:12:21,040 --> 00:12:27,040
taran was about to eat but now he has something very important to do what um

170
00:12:25,519 --> 00:12:31,200
you laughed dennis but you need to help too

171
00:12:28,800 --> 00:12:37,760
uh okay we're off to a good start is this new whonnock yes this is new whonnock

172
00:12:33,680 --> 00:12:39,839
hi alex hi how's it going

173
00:12:37,760 --> 00:12:44,240
hi alex i have a new server to log you into i wouldn't even do that at this

174
00:12:41,680 --> 00:12:50,160
point nope what would you do no no this is not real i'm acting out come on

175
00:12:48,160 --> 00:12:53,920
and i'm not acknowledging it i hate you too wait are we supposed to work off of

176
00:12:51,839 --> 00:12:57,120
this or just just just i just want to know if it works

177
00:12:55,200 --> 00:13:01,680
so you're supposed to work but like not important work i'm gonna mirror old

178
00:12:59,440 --> 00:13:04,880
whonnock over to this one one more time okay so anything you do here will be

179
00:13:03,200 --> 00:13:08,800
overwritten so we're not supposed to use it look do you want me to do this

180
00:13:06,959 --> 00:13:12,000
so do them but then we're going to just wipe it out okay when are you going to

181
00:13:10,000 --> 00:13:15,360
wipe it what part of test is not clear just open up a project

182
00:13:13,519 --> 00:13:18,720
how's it going oh seems fine

183
00:13:17,120 --> 00:13:22,639
it's you know let's see if we can pump it up to full

184
00:13:20,480 --> 00:13:26,480
res well that's less of a network bottleneck thing and more of a you know

185
00:13:24,480 --> 00:13:30,320
the rest of the system but okay it's playing it though

186
00:13:28,000 --> 00:13:34,959
which is kind of surprising what

187
00:13:31,360 --> 00:13:36,959
well Linus uh with you wanting to do

188
00:13:34,959 --> 00:13:40,560
increasingly ambitious projects i appreciate that we now have more space

189
00:13:39,040 --> 00:13:43,360
for them us running out of space has been a large

190
00:13:42,240 --> 00:13:48,079
large large problem good work it's not broken

191
00:13:46,000 --> 00:13:53,519
so this does this feel any different than it was before it might be a little

192
00:13:50,480 --> 00:13:55,600
snippier snappier better you don't have

193
00:13:53,519 --> 00:13:59,680
to lie to me but i don't know i mean it i don't really see much difference this

194
00:13:57,760 --> 00:14:05,839
is at 1/8 res though what if you crank it a bit um okay thank you

195
00:14:03,360 --> 00:14:09,199
but is it better why are you asking me i'm asking you that's the whole point of

196
00:14:07,760 --> 00:14:12,720
this exercise you can't do anything without participating oh

197
00:14:11,040 --> 00:14:18,000
okay fine from what i can tell it's actually

198
00:14:15,360 --> 00:14:21,760
a lot snappier than what i remember the editors say it's good enough and

199
00:14:20,160 --> 00:14:24,800
we're not getting any data corruption and the performance is

200
00:14:23,920 --> 00:14:31,199
fine but every one of these line items is an

201
00:14:27,199 --> 00:14:32,639
NVMe device timing out and we actually

202
00:14:31,199 --> 00:14:36,720
did some troubleshooting that i haven't talked about yet so one of the first

203
00:14:35,120 --> 00:14:41,920
things that we did was we swapped out the 24 core CPU that i originally

204
00:14:38,800 --> 00:14:44,480
configured the server with for a 64 core

205
00:14:41,920 --> 00:14:49,519
one because we found that with the 24 core the CPU during heavy reads and

206
00:14:47,040 --> 00:14:55,120
writes was getting hit with 50 or more buffer flushing tasks that were each

207
00:14:51,839 --> 00:14:58,000
pulling 20% usage of a single core just

208
00:14:55,120 --> 00:15:01,760
choking the poor thing and 64 cores did help significantly

209
00:14:59,839 --> 00:15:06,560
but i also didn't want to allocate a four or five thousand dollar CPU to the

210
00:15:03,839 --> 00:15:12,320
server so we dialed back to 32 and that ended up being a big improvement as well

211
00:15:09,360 --> 00:15:17,519
so bottom line the 32 core so adding just another eight cores and then

212
00:15:14,399 --> 00:15:20,000
tweaking the timing between going from

213
00:15:17,519 --> 00:15:24,720
interrupt-based to polling-based access to the drives gave us good enough

214
00:15:22,639 --> 00:15:28,639
performance that we've seen three gigabytes a second when we're hitting it

215
00:15:26,639 --> 00:15:32,160
with three different clients at a time in the real world without any

216
00:15:30,800 --> 00:15:36,800
significant jumps in access latency or dips in

217
00:15:34,399 --> 00:15:40,480
transfer speeds so we're rolling with it but there's something to be said for

218
00:15:38,480 --> 00:15:46,399
like a dual socket approach to this with more spare pci express lanes and even

219
00:15:43,120 --> 00:15:48,399
more CPU cores or oh i don't know AMD

220
00:15:46,399 --> 00:15:52,639
working with their OEMs to make sure that you know when you actually hit

221
00:15:50,160 --> 00:15:57,040
their pci express lanes it doesn't cause a bunch of traffic jams elsewhere in the

222
00:15:54,720 --> 00:16:00,720
CPU a massive shout out to wendell from Level1Techs by the way that guy's

223
00:15:58,959 --> 00:16:04,560
anything but level one i would strongly recommend going and subscribing to him

224
00:16:02,639 --> 00:16:08,560
if you love this kind of deep dive server stuff linode provides virtual

225
00:16:06,800 --> 00:16:13,279
servers that make it easy and affordable to host your own app site service or

226
00:16:11,040 --> 00:16:16,399
whatever in the cloud other entry-level hosting works when you start up but

227
00:16:14,959 --> 00:16:20,959
you'll eventually want to get something powerful customizable and easy to use

228
00:16:19,040 --> 00:16:24,800
for cloud computing they've got a diy option if you want a full custom setup

229
00:16:22,639 --> 00:16:30,000
or you can easily set up your own server with their one-click apps you can deploy

230
00:16:26,959 --> 00:16:32,000
minecraft cs go servers wordpress and

231
00:16:30,000 --> 00:16:36,399
much more and you can even spin up your own vpn and have plenty of space to host

232
00:16:34,079 --> 00:16:40,240
a website app or game server they've got affordable pricing with no hidden fees

233
00:16:38,320 --> 00:16:46,480
that try to sneak onto your monthly bill and they've got 100% human 24/7/365

234
00:16:44,000 --> 00:16:49,920
customer service via phone or support tickets get twenty dollars in free

235
00:16:48,320 --> 00:16:54,000
credit on your new account with code LINUS20 or by clicking the link in the

236
00:16:51,920 --> 00:16:57,519
video description so thanks for watching guys if you're looking for another

237
00:16:55,360 --> 00:17:01,360
server video to check out maybe uh have a look at our petabyte project update

238
00:16:59,920 --> 00:17:05,439
and actually we're going to have another petabyte project coming soon so make

239
00:17:03,120 --> 00:17:11,199
sure you're subscribed so you don't miss it and remember how much memory

240
00:17:08,160 --> 00:17:12,000
no i need to scroll down okay no problem

241
00:17:11,199 --> 00:17:16,959
but give me a second um

242
00:17:15,039 --> 00:17:21,520
[ __ ] off and remember how much leftover memory

243
00:17:18,720 --> 00:17:24,799
bandwidth we have so yeah [ __ ] off

244
00:17:23,280 --> 00:17:29,280
why then your CPU goes [ __ ] off why isn't

245
00:17:27,439 --> 00:17:35,840
this working we're using the latest version of proxmox a Linux distribute

246
00:17:32,640 --> 00:17:38,080
[ __ ] off i need this to work what the

247
00:17:35,840 --> 00:17:41,080
[ __ ] okay
