WEBVTT

00:00:00.160 --> 00:00:07.200
when i signed off at the end of the video about our amazingly fast new

00:00:04.400 --> 00:00:11.759
all SSD storage server i thought it was as simple as okay let's

00:00:09.760 --> 00:00:14.880
load the final OS on this thing chuck it in the server room we're ready to start

00:00:13.040 --> 00:00:20.240
editing off of it but it wasn't so our story begins with

00:00:18.080 --> 00:00:24.480
some short video clips that i sent over to Wendell from Level1Techs

00:00:21.760 --> 00:00:32.880
complaining about Windows Storage Spaces on our new 24-drive NVMe server

00:00:29.840 --> 00:00:35.120
machine here because what was happening

00:00:32.880 --> 00:00:40.239
was while i was copying files to what should be one of the fastest storage

00:00:37.520 --> 00:00:44.960
servers on the freaking planet i was getting great performance sometimes and

00:00:42.879 --> 00:00:50.320
then rock-bottom performance at others we're talking like

00:00:46.640 --> 00:00:52.480
10 20 30 megabytes a second so wendell

00:00:50.320 --> 00:00:58.079
dug into the system logs and discovered that there was some kind of a problem at

00:00:54.719 --> 00:01:01.680
the driver or pci express level where it

00:00:58.079 --> 00:01:04.159
was actually resetting individual drives

00:01:01.680 --> 00:01:09.680
like they were effectively timing out for seconds at a time while the data was

00:01:07.439 --> 00:01:14.000
in flight and then the poor array would be sitting there trying to figure out

00:01:11.280 --> 00:01:18.720
what to do while a drive is effectively MIA then the drive reset would finish

00:01:17.040 --> 00:01:22.400
which is essentially like if you were pulling a drive out for like two seconds

00:01:20.320 --> 00:01:27.040
and then popping it back in and then the transfer would roll along at multiple

00:01:24.640 --> 00:01:32.159
hundreds of megabytes a second or we even saw at times numbers as high as 20

00:01:30.080 --> 00:01:35.200
plus gigabytes a second in CrystalDiskMark

00:01:33.360 --> 00:01:40.400
then it would hitch again rinse and repeat obviously i can't deploy it like

00:01:37.759 --> 00:01:44.960
that so i thought it was my knowledge of Windows storage spaces or lack thereof

00:01:42.880 --> 00:01:50.799
and that i had configured it wrong but then the mystery deepened so this dropping

00:01:48.640 --> 00:01:56.159
out behavior actually happened with a simple Windows software raid with just

00:01:53.840 --> 00:02:00.719
four devices in it i mean that's a relatively pedestrian

00:01:58.079 --> 00:02:03.920
16 gigabytes a second by the way guys our sponsor for this

00:02:02.240 --> 00:02:08.479
video Pulseway with Pulseway you can remotely monitor manage and control all

00:02:05.840 --> 00:02:12.959
your Windows mac and Linux machines from one app create your free account today

00:02:10.560 --> 00:02:16.640
at the link below so we tried all the usual things we tried updating the

00:02:15.040 --> 00:02:21.120
drivers it was using the microsoft drivers we put the latest Intel drivers

00:02:18.879 --> 00:02:24.640
for these NVMe devices onto the system that didn't work we tried tweaking the

00:02:22.800 --> 00:02:29.440
power management to prevent the pci express lanes from switching to lower

00:02:27.599 --> 00:02:33.840
speeds when we were accessing all the drives and that could be a desirable

00:02:32.000 --> 00:02:37.760
behavior because there's so many drives in here that you're going to run into

00:02:35.519 --> 00:02:42.319
other system bottlenecks before you could possibly hope to use all the

00:02:39.200 --> 00:02:44.879
bandwidth of even a pci gen 3 link so

00:02:42.319 --> 00:02:49.280
gen 2 could be a pretty good bet but when it's happening automatically this

00:02:46.879 --> 00:02:53.840
speed switching takes time and that could be part of

00:02:51.599 --> 00:02:57.120
what's causing the problems but neither of those things or both of them were

00:02:55.360 --> 00:03:02.159
able to solve the problem and we only got a small improvement in the behavior
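
NOTE
for reference, one way to sanity-check what link speed a drive has actually negotiated is standard lspci/sysfs output (the device address 0000:01:00.0 below is just a placeholder, not the real slot in this server):
  lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'            # capable vs currently negotiated speed and width
  cat /sys/bus/pci/devices/0000:01:00.0/max_link_speed
  cat /sys/bus/pci/devices/0000:01:00.0/current_link_speed
and if you suspect the power-management downshifting described here, ASPM can be disabled globally with the pcie_aspm=off kernel parameter.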

00:02:59.040 --> 00:03:05.120
so wendell suggested gee why don't we go

00:03:02.159 --> 00:03:09.760
over to Linux as he tends to do but then get this we got the same dropouts on

00:03:08.480 --> 00:03:14.480
Linux that seemed to suggest a hardware issue

00:03:12.480 --> 00:03:18.080
of some sort so guys this is why i ultimately made this

00:03:16.159 --> 00:03:21.840
video about it because this is pretty dry technical stuff for a lot of people

00:03:19.920 --> 00:03:28.480
but i thought it was fascinating NVMe is already so fast that a lot of

00:03:25.760 --> 00:03:32.000
stuff particularly software is not engineered for it which is turning out

00:03:30.159 --> 00:03:36.560
to be a bit of an industry-wide problem and when you take 24 of these drives

00:03:34.560 --> 00:03:42.560
that are capable of multiple gigabytes a second on paper that is now 24 times the

00:03:40.239 --> 00:03:47.040
problem think about it this way even with eight channels of memory which is

00:03:44.640 --> 00:03:53.120
pretty impressive the theoretical maximum memory bandwidth of our system

00:03:49.599 --> 00:03:54.959
here is around 200 gigabytes a second

00:03:53.120 --> 00:04:00.319
and real world you're looking at more like 100 to 150 gigabytes a second
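
NOTE
rough math behind that figure, assuming DDR4-3200 memory (the exact speed isn't stated in the video): 8 channels x 3200 MT/s x 8 bytes per transfer = about 205 GB/s of theoretical peak bandwidth, which is where the "around 200 gigabytes a second" ballpark comes from before any real-world losses.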

00:03:58.560 --> 00:04:05.599
now let's talk about this storage array here this

00:04:02.000 --> 00:04:08.720
is capable on paper of about a hundred

00:04:05.599 --> 00:04:12.239
gigabytes a second in reads so we would

00:04:08.720 --> 00:04:13.680
need assuming perfect efficiency which

00:04:12.239 --> 00:04:19.519
obviously never happens in the real world nearly half of our memory bandwidth just

00:04:18.400 --> 00:04:24.639
to handle shifting data around when we're reading

00:04:22.000 --> 00:04:28.880
or writing to our storage array that's ridiculous and even the Linux

00:04:27.520 --> 00:04:33.600
kernel is going to be on the struggle bus when you're talking about that much

00:04:31.040 --> 00:04:37.360
data as wendell so succinctly put it because

00:04:34.400 --> 00:04:42.720
here's the way it's supposed to work the operating system kernel asks for

00:04:40.080 --> 00:04:47.120
some chunk of data let's say a lewd of your waifu to enjoy on your lunch break

00:04:44.400 --> 00:04:51.440
all right the disk says yep no problem but NAND flash is pretty slow so i'm

00:04:49.360 --> 00:04:55.280
going to need a sec to load that into my buffer i'll let you know when it's ready

00:04:53.520 --> 00:05:00.320
the disk gets everything ready loaded into the buffer and then it sends what's

00:04:57.280 --> 00:05:02.479
called an interrupt to the CPU to say

00:05:00.320 --> 00:05:06.080
hey all right it's chill you can swing by and grab that data now

00:05:04.800 --> 00:05:12.400
but here's the problem we're running into if the CPU core that the interrupt was

00:05:09.840 --> 00:05:17.440
intended for is too busy doing something else or it gets put to sleep or it gets

00:05:15.360 --> 00:05:22.400
reassigned to some other task in the middle of this process which can be

00:05:19.600 --> 00:05:28.160
quite common on multi-core cpus that interrupt never arrives your processor

00:05:25.759 --> 00:05:35.039
never goes and gets the data and the whole train comes to a screeching halt
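
NOTE
a quick way to see which cores are actually fielding those NVMe completion interrupts (standard procfs, nothing vendor-specific assumed):
  grep -i nvme /proc/interrupts                    # one row per queue interrupt, with per-CPU counts
  cat /proc/irq/<irq-number>/smp_affinity_list     # which cores that interrupt is allowed to land on
if the counts pile up on cores that are busy doing other work, that matches the stalling behavior described here.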

00:05:31.520 --> 00:05:38.000
and that is why we had no issues last

00:05:35.039 --> 00:05:43.199
video slamming the individual drives with data but then as soon as we put a

00:05:41.759 --> 00:05:47.520
file system you know as soon as we started running a

00:05:44.880 --> 00:05:50.880
zfs raid and our CPU was doing parity calculations while we were reading and

00:05:49.039 --> 00:05:57.280
writing to the array making the CPU actually do any work we were getting

00:05:53.199 --> 00:06:00.720
crippling errors all over the place

00:05:57.280 --> 00:06:03.120
so AWS just rolled out NVMe and there

00:06:00.720 --> 00:06:07.360
are a ton of threads about issues under heavy loads suggesting that this appears

00:06:05.280 --> 00:06:12.240
to be an industry-wide problem and the dumbest part of this is that i don't

00:06:09.759 --> 00:06:17.199
actually even need my server to be this fast i'm only hitting it with a 40

00:06:15.039 --> 00:06:22.080
gigabit connection here that's only four gigabytes a second maximum so wendell

00:06:19.680 --> 00:06:26.960
actually even thought of turning down the pci express links to gen 2 and just

00:06:24.800 --> 00:06:31.039
leaving them there Gigabyte meanwhile the makers of this server were like sorry

00:06:28.960 --> 00:06:34.160
wait you want a speed limiter on this thing but then wendell ended up finding

00:06:32.720 --> 00:06:39.039
a software way to do it but then it turned out there was a kernel bug something something something ultimately

00:06:37.039 --> 00:06:43.199
it didn't pan out and it didn't work anyway that's okay because Linux already

00:06:41.919 --> 00:06:49.840
has kind of a solution to this now very very high

00:06:47.919 --> 00:06:54.960
speed devices like RAM based caching devices operate in a

00:06:52.479 --> 00:07:00.080
completely different mode called polling where the kernel essentially assumes

00:06:57.280 --> 00:07:04.479
that the device is so fast that the data is going to be ready right away and it

00:07:02.319 --> 00:07:07.599
would add a lot of overhead to do this on slower drives because there'd be a

00:07:05.840 --> 00:07:13.199
lot of pointless hey are you done yet hey are you done yet so a single NVMe

00:07:10.080 --> 00:07:14.960
doesn't need to be polled but 24

00:07:13.199 --> 00:07:18.080
oh there's an argument to be made for operating in that mode

00:07:16.479 --> 00:07:22.800
so here's the mitigation that wendell implemented when possible the kernel is

00:07:20.800 --> 00:07:26.880
going to wait for the interrupt because that's the most efficient thing but if

00:07:24.720 --> 00:07:30.960
it waits for too long the queuing algorithm will just have the CPU poll

00:07:28.720 --> 00:07:36.319
the drive rapidly and say hey do you have that do you have that okay

00:07:32.880 --> 00:07:36.319
great i'm going to take that now
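
NOTE
on Linux the per-device polling knobs live in block-layer sysfs; a sketch of what enabling hybrid polling can look like (these attribute names are real, but exact behavior varies by kernel version, and nvme0n1 is just an example device):
  echo 1 > /sys/block/nvme0n1/queue/io_poll          # allow polled completions on this queue
  echo 0 > /sys/block/nvme0n1/queue/io_poll_delay    # 0 = hybrid poll (sleep a bit, then poll); -1 = classic busy-poll
the NVMe driver also needs dedicated poll queues (for example booting with the nvme.poll_queues=4 module parameter), and I/O is only polled when it's submitted with a high-priority flag such as RWF_HIPRI.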

00:07:37.360 --> 00:07:42.240
all that tweaking and learning means that our final config ended up being

00:07:40.639 --> 00:07:48.319
quite different from the initial intention so we're using the latest version of proxmox a Linux distro that's

00:07:46.160 --> 00:07:53.840
designed for virtualization with zfs support out of the box and while we had

00:07:50.639 --> 00:07:57.440
actually initially intended to use zfs

00:07:53.840 --> 00:08:00.639
we were hitting 100% utilization on a 24

00:07:57.440 --> 00:08:03.919
core 48 thread CPU and doing

00:08:00.639 --> 00:08:05.199
best case scenario assuming that the bug

00:08:03.919 --> 00:08:09.759
didn't surface 10 gigs a second reads 4 gigs a second

00:08:07.599 --> 00:08:13.120
writes which would have actually been fine remember we've only got a 40

00:08:11.280 --> 00:08:17.360
gigabit network connection except that the access latency was not really

00:08:15.199 --> 00:08:23.759
suitable for a multi-user video editing environment it was over 150 microseconds
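
NOTE
access-latency numbers like that typically come from a tool like fio; a minimal sketch of a latency-focused random-read job (the file path and job parameters here are assumptions, not the exact workload used in the video):
  fio --name=editlat --filename=/mnt/array/testfile --size=8G --rw=randread --bs=128k --iodepth=1 --numjobs=1 --ioengine=libaio --direct=1 --runtime=60 --time_based
the completion-latency (clat) percentiles in the output are the numbers to watch for an editing workload.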

00:08:21.360 --> 00:08:29.120
and the craziest part of that is that we actually hit those numbers even with

00:08:26.319 --> 00:08:34.399
some pretty esoteric tweaks like disabling arc compression i mean most

00:08:31.759 --> 00:08:39.279
seasoned zfs users would freak out about doing that but the problem is that arc

00:08:37.200 --> 00:08:43.760
compression makes three copies of the data in memory while you are writing and

00:08:42.240 --> 00:08:47.839
remember how much left over memory bandwidth we have
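
NOTE
for reference, compressed ARC is an OpenZFS module parameter rather than a per-dataset setting; turning it off looks roughly like this (a sketch only, and as noted most ZFS users would advise against it):
  echo 0 > /sys/module/zfs/parameters/zfs_compressed_arc_enabled
  # or persistently: "options zfs zfs_compressed_arc_enabled=0" in /etc/modprobe.d/zfs.conf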

00:08:45.360 --> 00:08:52.560
so yeah tripling the load there ain't gonna fly so new plan

00:08:50.560 --> 00:08:57.279
Linux md (multi-disk) ain't perfect it's Linux's own built-in

00:08:55.040 --> 00:09:02.080
software raid and the main disadvantage is that in the event of an unexpected

00:08:59.120 --> 00:09:06.399
shutdown it'll be really slow for the 30 minutes or so that it takes to resync

00:09:04.240 --> 00:09:11.680
four terabytes of data but that should be fine i mean that's what

00:09:08.399 --> 00:09:13.440
the seventeen thousand dollar battery

00:09:11.680 --> 00:09:19.920
backup in this room is supposed to be for so we settled on four striped

00:09:17.360 --> 00:09:24.800
software raid fives and the next experiment was to play

00:09:22.240 --> 00:09:29.440
around with the chunk size so that's how the blocks of data are broken up on the

00:09:27.120 --> 00:09:34.560
raid as well as the block size which is on the file system level so the default

00:09:31.600 --> 00:09:38.320
raid chunk size is 512K and the file system block size is 64K

00:09:36.640 --> 00:09:43.680
but when we were running benchmarks based on an editor usage pattern we

00:09:40.640 --> 00:09:45.920
actually found that the 512k chunks were

00:09:43.680 --> 00:09:50.399
a little bit higher latency than we'd like to see which is really really

00:09:48.800 --> 00:09:54.480
important when you're you know scrubbing through files on a timeline so we

00:09:52.000 --> 00:09:59.519
actually ended up using 128k for both which happens to line up with

00:09:56.560 --> 00:10:02.160
the buffer size on these devices perfect
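
NOTE
a sketch of what that layout could look like with mdadm (the device names and the six-drives-per-group split are assumptions; the video only specifies four RAID 5s across 24 drives, striped together, with a 128K chunk):
  mdadm --create /dev/md0 --level=5 --raid-devices=6 --chunk=128K /dev/nvme[0-5]n1
  # ...repeat for md1, md2 and md3 with the remaining three groups of six drives...
  mdadm --create /dev/md10 --level=0 --raid-devices=4 --chunk=128K /dev/md0 /dev/md1 /dev/md2 /dev/md3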

00:10:00.560 --> 00:10:06.720
now the conventional wisdom for accessing very large files over a

00:10:04.240 --> 00:10:12.000
network share would actually be to use a very large chunk size like even one

00:10:09.680 --> 00:10:17.519
megabyte but while that would be great for ingesting like big batches of new

00:10:15.040 --> 00:10:21.760
footage when we're skipping around rather than reading them sequentially

00:10:19.360 --> 00:10:27.040
with many users doing that all the same time it actually makes sense that this

00:10:24.800 --> 00:10:31.760
would work well and experimentally so far it seems pretty good i just

00:10:30.160 --> 00:10:35.519
realized i forgot a drive upstairs i'm gonna go grab that

00:10:33.760 --> 00:10:40.399
with md we ended up with a maximum throughput of around 16

00:10:37.519 --> 00:10:43.839
gigabytes per second reads and eight gigabytes per second writes which

00:10:42.160 --> 00:10:48.079
obviously is way less than the maximum this hardware

00:10:46.320 --> 00:10:51.360
can theoretically do but there's a lot of overhead to contend

00:10:49.839 --> 00:10:56.399
with and besides that doesn't mean that there's no benefit to having all of this

00:10:54.480 --> 00:11:00.399
performance in reserve so the latency advantage is something

00:10:58.640 --> 00:11:04.240
that we've already talked about we've actually seen that high latency storage

00:11:02.560 --> 00:11:09.839
can cause instability in the video editing software that's accessing it but

00:11:06.480 --> 00:11:13.519
another benefit counter intuitively is

00:11:09.839 --> 00:11:16.480
that because the storage is so fast

00:11:13.519 --> 00:11:22.240
no one chiplet on our CPU can keep up with it a disadvantage of a chiplet

00:11:19.279 --> 00:11:28.959
design is that it's got huge horsepower but it's hard to harness all of it for

00:11:25.200 --> 00:11:31.440
one single task like a file copy from

00:11:28.959 --> 00:11:36.720
one user over the network with that said it's great as a multi-user experience

00:11:34.320 --> 00:11:41.360
because each discrete user like let's say a camera operator who's dumping RED

00:11:39.040 --> 00:11:46.000
footage and a video editor who also has to work at the same time

00:11:43.040 --> 00:11:51.200
end up having their access spread over multiple chiplets that are individually

00:11:49.120 --> 00:11:57.680
kind of limited so we had remember guys 150 gigabytes

00:11:55.200 --> 00:12:02.160
per second of memory bandwidth one chiplet can't get at all of it so when

00:12:00.160 --> 00:12:07.760
we have one user copying a file over the network that user can only get to one or

00:12:05.279 --> 00:12:11.839
two cores so there's no way that user can monopolize all the resources on the

00:12:10.079 --> 00:12:16.399
system because of the way the whole thing is architected all of this in

00:12:14.240 --> 00:12:19.680
theory so far we haven't actually thrown our editors at it so let's go see if

00:12:17.839 --> 00:12:22.800
it's booted up and get them to try it

00:12:21.040 --> 00:12:27.040
taran was about to eat but now he has something very important to do what um

00:12:25.519 --> 00:12:31.200
you laughed dennis but you need to help too

00:12:28.800 --> 00:12:37.760
uh okay we're off to a good start is this new Whonnock yes this is new Whonnock

00:12:33.680 --> 00:12:39.839
hi alex hi how's it going

00:12:37.760 --> 00:12:44.240
hi alex i have a new server to log you into i wouldn't even do that at this

00:12:41.680 --> 00:12:50.160
point nope what would you do no no this is not real i'm acting out come on

00:12:48.160 --> 00:12:53.920
and i'm not acknowledging it i add you too wait are we supposed to work off of

00:12:51.839 --> 00:12:57.120
this or just just just i just want to know if it works

00:12:55.200 --> 00:13:01.680
so you're supposed to work but like not important work i'm gonna mirror old

00:12:59.440 --> 00:13:04.880
Whonnock over to this one one more time okay so anything you do here will be

00:13:03.200 --> 00:13:08.800
overwritten so we're not supposed to use it look do you want me to do this

00:13:06.959 --> 00:13:12.000
so do them but then we're going to just wipe it out okay when are you going to

00:13:10.000 --> 00:13:15.360
wipe it what part of test is not clear just open up a project

00:13:13.519 --> 00:13:18.720
how's it going oh seems fine

00:13:17.120 --> 00:13:22.639
it's you know let's see if we can pump it up to full

00:13:20.480 --> 00:13:26.480
res well that's less of a network bottleneck thing and more of a you know

00:13:24.480 --> 00:13:30.320
the rest of the system but okay it's playing it though

00:13:28.000 --> 00:13:34.959
which is kind of surprising what

00:13:31.360 --> 00:13:36.959
well Linus uh with you wanting to do

00:13:34.959 --> 00:13:40.560
increasingly ambitious projects i appreciate that we now have more space

00:13:39.040 --> 00:13:43.360
for them us running out of space has been a large

00:13:42.240 --> 00:13:48.079
large large problem good work it's not broken

00:13:46.000 --> 00:13:53.519
so this does this feel any different than it was before it might be a little

00:13:50.480 --> 00:13:55.600
snippier snappier better you don't have

00:13:53.519 --> 00:13:59.680
to lie to me but i don't know i mean it i don't really see much difference this

00:13:57.760 --> 00:14:05.839
is at 1/8 res though what if you crank it a bit um okay thank you

00:14:03.360 --> 00:14:09.199
but is it better why are you asking me i'm asking you that's the whole point of

00:14:07.760 --> 00:14:12.720
this exercise you can't do anything without participating oh

00:14:11.040 --> 00:14:18.000
okay fine from what i can tell it's actually

00:14:15.360 --> 00:14:21.760
a lot snappier than what i remember the editors say it's good enough and

00:14:20.160 --> 00:14:24.800
we're not getting any data corruption and the performance is

00:14:23.920 --> 00:14:31.199
fine but every one of these line items is an

00:14:27.199 --> 00:14:32.639
NVMe device timing out and we actually

00:14:31.199 --> 00:14:36.720
did some troubleshooting that i haven't talked about yet so one of the first

00:14:35.120 --> 00:14:41.920
things that we did was we swapped out the 24 core CPU that i originally

00:14:38.800 --> 00:14:44.480
configured the server with for a 64 core

00:14:41.920 --> 00:14:49.519
one because we found that with the 24 core the CPU during heavy reads and

00:14:47.040 --> 00:14:55.120
writes was getting hit with 50 or more buffer flushing tasks that were each

00:14:51.839 --> 00:14:58.000
pulling 20% usage of a single core just

00:14:55.120 --> 00:15:01.760
choking the poor thing and 64 cores did help significantly

00:14:59.839 --> 00:15:06.560
but i also didn't want to allocate a four or five thousand dollar CPU to the

00:15:03.839 --> 00:15:12.320
server so we dialed back to 32 and that ended up being a big improvement as well

00:15:09.360 --> 00:15:17.519
so bottom line the 32 core so adding just another eight cores and then

00:15:14.399 --> 00:15:20.000
tweaking the timing between going from

00:15:17.519 --> 00:15:24.720
interrupt-based to polling-based access to the drives gave us good enough

00:15:22.639 --> 00:15:28.639
performance that we've seen three gigabytes a second when we're hitting it

00:15:26.639 --> 00:15:32.160
with three different clients at a time in the real world without any

00:15:30.800 --> 00:15:36.800
significant jumps in access latency or dips in

00:15:34.399 --> 00:15:40.480
transfer speeds so we're rolling with it but there's something to be said for

00:15:38.480 --> 00:15:46.399
like a dual socket approach to this with more spare pci express lanes and even

00:15:43.120 --> 00:15:48.399
more CPU cores or oh i don't know AMD

00:15:46.399 --> 00:15:52.639
working with their OEMs to make sure that you know when you actually hit

00:15:50.160 --> 00:15:57.040
their pci express lanes it doesn't cause a bunch of traffic jams elsewhere in the

00:15:54.720 --> 00:16:00.720
CPU a massive shout out to Wendell from Level1Techs by the way that guy's

00:15:58.959 --> 00:16:04.560
anything but level one i would strongly recommend going and subscribing to him

00:16:02.639 --> 00:16:08.560
if you love this kind of deep dive server stuff linode provides virtual

00:16:06.800 --> 00:16:13.279
servers that make it easy and affordable to host your own app site service or

00:16:11.040 --> 00:16:16.399
whatever in the cloud other entry-level hostings work when you start up but

00:16:14.959 --> 00:16:20.959
you'll eventually want to get something powerful customizable and easy to use

00:16:19.040 --> 00:16:24.800
for cloud computing they've got a diy option if you want a full custom setup

00:16:22.639 --> 00:16:30.000
or you can easily set up your own server with their one-click apps you can deploy

00:16:26.959 --> 00:16:32.000
minecraft cs go servers wordpress and

00:16:30.000 --> 00:16:36.399
much more and you can even spin up your own vpn and have plenty of space to host

00:16:34.079 --> 00:16:40.240
a website app or game server they've got affordable pricing with no hidden fees

00:16:38.320 --> 00:16:46.480
that try to sneak onto your monthly bill and they've got 100% human 24/7/365

00:16:44.000 --> 00:16:49.920
customer service via phone or support tickets get twenty dollars in free

00:16:48.320 --> 00:16:54.000
credit on your new account with code LINUS20 or by clicking the link in the

00:16:51.920 --> 00:16:57.519
video description so thanks for watching guys if you're looking for another

00:16:55.360 --> 00:17:01.360
server video to check out maybe uh have a look at our petabyte project update

00:16:59.920 --> 00:17:05.439
and actually we're going to have another petabyte project coming soon so make

00:17:03.120 --> 00:17:11.199
sure you're subscribed so you don't miss it and remember how much memory

00:17:08.160 --> 00:17:12.000
no i need to scroll down okay no problem

00:17:11.199 --> 00:17:16.959
but give me a second um

00:17:15.039 --> 00:17:21.520
[ __ ] off and remember how much leftover memory

00:17:18.720 --> 00:17:24.799
bandwidth we have so yeah [ __ ] off

00:17:23.280 --> 00:17:29.280
why then your CPU goes [ __ ] off why isn't

00:17:27.439 --> 00:17:35.840
this working we're using the latest version of proxmox a Linux distro

00:17:32.640 --> 00:17:38.080
[ __ ] off i need this to work what the

00:17:35.840 --> 00:17:41.080
[ __ ] okay
