WEBVTT

00:00:00.080 --> 00:00:05.120
You ever spend a lot of money on something, only to have it arrive broken

00:00:03.280 --> 00:00:08.720
or otherwise defective? It's the pits.

00:00:06.560 --> 00:00:13.840
Even worse is opening up the box for your replacement, only to find out that

00:00:10.880 --> 00:00:17.760
they didn't actually fix the problem. And I think that's what just happened to me

00:00:16.240 --> 00:00:23.359
with over ten thousand dollars' worth of Intel's four

00:00:19.920 --> 00:00:25.119
terabyte P4500 SSDs. You guys remember

00:00:23.359 --> 00:00:29.519
these? We tried to roll them out in a new server deployment over a year ago, only

00:00:27.439 --> 00:00:34.640
to conclude that they were fundamentally flawed and could never work in it. Intel

00:00:32.559 --> 00:00:38.879
graciously agreed to swap them all out for us.

00:00:36.480 --> 00:00:43.440
But here's the thing, Intel: I thought it was understood that I

00:00:40.480 --> 00:00:48.879
didn't want the exact same broken thing back. And yet,

00:00:46.559 --> 00:00:53.280
here we are, telling you about our sponsor, Smart

00:00:50.559 --> 00:00:57.840
Deploy! SmartDeploy enables IT admins to manage PCs from the cloud. You can push

00:00:55.280 --> 00:01:02.079
Windows apps and security patches to any device, anywhere, without leaving your

00:00:59.600 --> 00:01:05.519
desk. Get your exclusive software, worth over eight hundred dollars, for free at

00:01:03.680 --> 00:01:08.519
smartdeploy.com/Linus.

00:01:15.600 --> 00:01:21.119
Getting Intel to agree to replace these took a very long time, and wasn't

00:01:19.280 --> 00:01:25.200
straightforward. As far as I can tell, their initial plan was to do nothing

00:01:23.280 --> 00:01:29.520
about my problem until I pulled strings with my media contacts.

00:01:27.119 --> 00:01:34.320
In fairness to Intel, these are second-hand drives I bought on eBay. But also, in

00:01:32.159 --> 00:01:38.479
fairness to me, they are still within their warranty period, so my request

00:01:36.560 --> 00:01:43.520
isn't totally unreasonable. And besides, while Intel has never

00:01:41.200 --> 00:01:48.240
publicly acknowledged any design flaw with the P4500, I've heard of other

00:01:45.920 --> 00:01:51.520
customers even getting cash refunds as compensation for their trouble. And if

00:01:50.240 --> 00:01:55.439
you've ever dealt with a computer hardware returns department, you'll know

00:01:53.439 --> 00:01:59.360
that cash refunds ain't the kind of thing that happens when it's the

00:01:56.719 --> 00:02:01.600
customer that's screwed it up. So,

00:02:00.240 --> 00:02:06.079
let's take a closer look at what they sent. I don't actually know what to say here.

00:02:04.240 --> 00:02:10.640
As far as I can tell, everything about these is identical to the drives that I

00:02:08.800 --> 00:02:13.680
sent back to Intel: model number, capacity...

00:02:12.560 --> 00:02:18.720
Yeah, it's... they're the same thing. Look at this:

00:02:16.640 --> 00:02:22.560
these actually have marks on them that would seem to indicate that they've been

00:02:20.400 --> 00:02:26.160
installed in drive sleds before. I'm not actually mad about that, though;

00:02:24.239 --> 00:02:30.239
it's pretty typical to get a refurbished product back when you send in an RMA, and

00:02:27.920 --> 00:02:35.200
it's not like they still make this drive or anything. It's just

00:02:32.239 --> 00:02:39.120
not exactly confidence-inspiring. And come to think of it, do I have any way of

00:02:36.879 --> 00:02:43.120
knowing that they actually sent back different drives and didn't just

00:02:41.440 --> 00:02:46.800
box the same drives back up and send them back to me? You dealt with the RMA for this, right?

00:02:45.760 --> 00:02:52.239
Did they ask any questions about the issues we were having? They just processed the RMA

00:02:50.319 --> 00:02:56.400
and sent us back those. Do we even know these are new drives?

00:02:54.480 --> 00:02:59.280
I have no clue, these came from Latin America. Oh,

00:02:57.680 --> 00:03:04.879
right, of course. We sent them a whole list of serial numbers before we shipped

00:03:01.120 --> 00:03:07.040
the drives, so... oh yeah, 720 versus 807.

00:03:04.879 --> 00:03:09.360
These would have been manufactured way later,

00:03:07.920 --> 00:03:13.680
assuming that Intel serial numbers are sequential.

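NOTE
If you keep the serial-number lists from both ends of an RMA, checking whether any of the same units came back is a single set operation. A minimal Python sketch with made-up placeholder serials (the real lists went to Intel before shipping):
    # Hypothetical serials recorded before shipping the RMA
    shipped = {"PHLF720_AAA", "PHLF720_BBB", "PHLF720_CCC"}
    # Hypothetical serials read off the replacement drives
    received = {"PHLF807_DDD", "PHLF807_EEE", "PHLF807_FFF"}
    if shipped & received:
        print("some of the same drives came back:", shipped & received)
    else:
        print("every replacement serial is new")
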
00:03:11.120 --> 00:03:16.720
Okay, so they are different drives, at least. But

00:03:14.959 --> 00:03:20.720
after double-checking with Wendell from Level1Techs, who helped us diagnose the

00:03:18.640 --> 00:03:24.640
problem in the first place, he seems to be under the impression that

00:03:22.480 --> 00:03:29.360
nothing about these new units should change the problems that we had before.

00:03:26.959 --> 00:03:34.400
And they were big problems. Individually, the drives behaved fine. But

00:03:32.480 --> 00:03:39.200
as soon as you built them into a larger array of some sort and then hit it with

00:03:36.640 --> 00:03:42.560
any kind of sustained load, even just copying a file, it would cause individual

00:03:41.360 --> 00:03:48.560
drives to randomly fritz out or drop out for a few

00:03:45.680 --> 00:03:52.879
seconds at a time and then reappear, absolutely tanking the performance of

00:03:50.560 --> 00:03:59.280
the whole array. And we tried everything: Windows, Linux, ZFS, dropping down to PCI

00:03:56.400 --> 00:04:03.840
Express Gen 2 speeds in the BIOS. Wendell eventually narrowed it down to

00:04:00.959 --> 00:04:07.280
the way this drive issues interrupts to the CPU.

00:04:05.360 --> 00:04:11.599
If you don't know, an interrupt is basically a way for a device to tell

00:04:09.280 --> 00:04:15.920
your computer's processor: "Hey, I've got something for you, come pay attention to

00:04:13.439 --> 00:04:19.120
me now!" Since this is a storage device, chances are that the thing it's got is

00:04:17.840 --> 00:04:25.440
some data. But, small problem: if the CPU were to

00:04:22.479 --> 00:04:30.240
respond to that interrupt too quickly, which is what we saw in Linux, there's a

00:04:27.840 --> 00:04:33.440
chance that the data won't be in the drive's buffer yet.

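NOTE
A toy Python model of the race just described; purely illustrative, no real NVMe involved. The drive raises its interrupt a beat before the data actually lands in its buffer, so a CPU that responds instantly reads nothing:
    import collections
    buffer = collections.deque()        # stand-in for the drive's data buffer
    def service_interrupt():            # the CPU's interrupt handler
        return buffer.popleft() if buffer else None
    # Drive fires the interrupt early; a fast CPU services it immediately:
    print(service_interrupt())          # -> None: looks like missing data
    buffer.append(b"sector data")       # the data arrives a moment later
    print(service_interrupt())          # -> b'sector data': a retry finds it
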
00:04:32.000 --> 00:04:39.199
Missing data? Oh, so the Linux kernel would keep retrying

00:04:37.120 --> 00:04:45.360
the drive to see if the data turns up, and lo and behold, it would, totally

00:04:42.560 --> 00:04:49.520
uncorrupted. Which is good, but with the aforementioned performance

00:04:47.520 --> 00:04:53.840
problems, which are bad. Now, Wendell's hacky workaround was to

00:04:51.759 --> 00:04:58.960
tell the operating system to poll the drives, essentially having the CPU

00:04:56.240 --> 00:05:02.880
checking constantly: "Got anything for me? Got anything for me yet? How about now? Do

00:05:01.120 --> 00:05:07.440
you got anything for me?" He's some kind of genius. And it did work,

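NOTE
A minimal sketch of that polling idea in Python: instead of sleeping until an interrupt arrives, the CPU keeps asking the drive for data. (On Linux, polled NVMe I/O can be enabled via, for example, the nvme.poll_queues module parameter; exactly how Wendell wired it up isn't shown here, so treat this as an illustration of polling, not his actual fix.)
    import time
    def poll_for_data(drive_buffer, timeout_s=1.0):
        # Busy-poll: "got anything for me? how about now?"
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if drive_buffer:              # data finally showed up
                return drive_buffer.pop()
            time.sleep(0.001)             # tiny pause, then ask again
        raise TimeoutError("drive never produced data")
    print(poll_for_data([b"sector data"]))  # returns immediately
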
00:05:05.520 --> 00:05:12.240
but that approach comes with a bit of a performance penalty as well. Think of it

00:05:09.680 --> 00:05:16.720
kind of like the postal service: it's way more efficient for you to schedule a pick-

00:05:15.199 --> 00:05:21.520
up with FedEx whenever you want to send a package, instead of just

00:05:18.880 --> 00:05:26.320
having a courier show up every hour of every day on the off chance that you

00:05:22.880 --> 00:05:28.560
have something to ship. So, a poor CPU

00:05:26.320 --> 00:05:31.199
checking in with 24 of these drives constantly?

00:05:29.840 --> 00:05:36.000
Not great. And obviously, most drives don't have

00:05:33.919 --> 00:05:43.280
this problem; it's just that these ones do. And it wasn't just one of

00:05:38.560 --> 00:05:44.560
them, it was the entire batch of all 26

00:05:43.280 --> 00:05:50.479
drives. So why would they send us back the same

00:05:47.360 --> 00:05:52.639
bloody thing that wasn't working? And if

00:05:50.479 --> 00:05:55.680
that's not what happened, then what on earth did they change about

00:05:54.320 --> 00:05:59.600
these? The plot thickens.

00:05:57.680 --> 00:06:03.360
This AMD EPYC Rome server has been sitting completely untouched since the

00:06:01.759 --> 00:06:08.319
last time that we tried to roll it out and gave up, ultimately replacing it with

00:06:06.400 --> 00:06:11.360
a Dell loaded up with Liqid Honey Badgers. But

00:06:09.520 --> 00:06:14.080
just because we found another solution doesn't mean we couldn't use more

00:06:12.720 --> 00:06:20.560
capacity. And wouldn't we like to freaking use it? It's 96

00:06:17.280 --> 00:06:23.919
raw terabytes of NVMe storage, each with

00:06:20.560 --> 00:06:25.919
their own dedicated PCI Express links,

00:06:23.919 --> 00:06:29.759
and the EPYC is a beast. I want to freaking use the thing!

00:06:28.880 --> 00:06:35.600
So let's fire it up, shall we? Let's see if we can figure out what they

00:06:33.360 --> 00:06:38.880
changed. Obviously, something they could have changed would be the firmware on

00:06:37.120 --> 00:06:42.960
the drives. But conveniently, because I haven't touched this thing, I've got the

00:06:40.720 --> 00:06:46.840
exact same version of Solid State Drive Toolbox from Intel that I had before, and

00:06:45.440 --> 00:06:51.919
I can check if it allows any kind of firmware

00:06:49.919 --> 00:06:55.919
update. Here we go... nope.

00:06:53.360 --> 00:06:58.800
Exactly the same firmware. Even though they're running the same

00:06:56.960 --> 00:07:03.280
firmware as the ones I shipped out, maybe there's been an update. We can check that.

00:07:01.840 --> 00:07:05.919
Um... no.

00:07:05.120 --> 00:07:12.080
Okay. No. No firmware change, then.

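NOTE
For what it's worth, this check doesn't need Intel's Windows toolbox. On a Linux box, the kernel exposes each NVMe controller's firmware revision in sysfs; a sketch, assuming the drives enumerate as /sys/class/nvme/nvme*:
    import glob, pathlib
    for dev in sorted(glob.glob("/sys/class/nvme/nvme*")):
        p = pathlib.Path(dev)
        attr = lambda name: (p / name).read_text().strip()
        # Identical model + firmware_rev on old and new drives = no firmware change
        print(p.name, attr("model"), attr("serial"), attr("firmware_rev"))
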
00:07:09.840 --> 00:07:16.880
I have mostly given up on this doing anything, but

00:07:14.400 --> 00:07:21.759
let's at least give it a chance. We can use this custom view,

00:07:19.440 --> 00:07:27.680
Administrative Events, to see the drives dropping out, or not. You'll get

00:07:24.960 --> 00:07:33.520
a little, uh, yellow warning. "A fatal hardware error has occurred"... in memory?

00:07:30.080 --> 00:07:33.520
We'll have to deal with that later.

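NOTE
If you'd rather script this than stare at Event Viewer, Windows' built-in wevtutil can dump the System log as text, and you can search it for the dropout messages quoted later in the video. A rough Python sketch; the exact message strings to match are my assumption:
    import subprocess
    text = subprocess.run(
        ["wevtutil", "qe", "System", "/c:200", "/rd:true", "/f:text"],
        capture_output=True, text=True, check=True).stdout
    hits = [e for e in text.split("Event[")[1:]           # text-format separator
            if "surprise removed" in e or "Reset to device" in e]
    print(len(hits), "possible drive dropout event(s) in the last 200 entries")
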
00:07:33.840 --> 00:07:38.880
Performance is as expected for a Gen 3 drive.

00:07:37.039 --> 00:07:42.840
That's somewhat hopeful. That's promising. Nine runs,

00:07:41.599 --> 00:07:48.080
zero dropouts. Of course, as I mentioned before,

00:07:46.240 --> 00:07:50.960
we saw the biggest problems with multiple drives running at the same time.

00:07:50.080 --> 00:07:56.479
I mean, you could kind of imagine that's how something like this makes it past QC: on

00:07:55.280 --> 00:08:01.840
AMD's, Gigabyte's, and other server manufacturers' side as

00:07:58.879 --> 00:08:04.720
well, and Intel's side, everything works fine in isolation,

00:08:03.520 --> 00:08:09.120
but you open up all these, load them up with

00:08:06.960 --> 00:08:12.319
drives, and slam the PCI Express controller with enough high-speed

00:08:10.639 --> 00:08:16.560
storage that you're getting dangerously close to RAM-like speeds, and

00:08:15.440 --> 00:08:22.960
well, the wings start to burn up a little, don't they?

00:08:19.599 --> 00:08:22.960
Let's try four drives next.

00:08:24.840 --> 00:08:30.319
Okay, uh, I think I'm probably gonna have to

00:08:28.720 --> 00:08:36.560
reboot. Theoretically they're hot-swappable, but, like...

00:08:32.959 --> 00:08:38.640
No dropouts so far. I'm feeling good.

00:08:36.560 --> 00:08:43.919
Hey, you guys want a free tech tip? You notice that in a mirror, your read speeds

00:08:41.360 --> 00:08:47.920
scale with your number of drives, but you only get the write speed of half of your

00:08:45.680 --> 00:08:52.640
number of drives. That's because you have to write two mirrored copies across the

00:08:50.399 --> 00:08:57.200
entire array, so you're effectively splitting your writes in half. But with

00:08:54.800 --> 00:09:02.720
reads, you can actually take advantage of all the drives in parallel. Neat, huh?

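NOTE
Back-of-envelope math for that tech tip, plus the parity layout used in the next test. The per-drive figures below are illustrative placeholders, not measurements:
    def mirror_throughput(n, read_one, write_one):
        # Two-way mirror: reads fan out to every drive, but each write
        # lands on two drives, so you only get n/2 drives' worth of writes
        return n * read_one, (n // 2) * write_one
    def usable_tb(n, size_tb, layout):
        # Mirror stores every byte twice; single parity gives up one
        # drive's worth of space to parity data instead
        return n * size_tb // 2 if layout == "mirror" else (n - 1) * size_tb
    print(mirror_throughput(8, 3.0, 2.0))  # -> (24.0, 8.0) GB/s read/write
    print(usable_tb(8, 4, "mirror"))       # -> 16 TB usable
    print(usable_tb(8, 4, "parity"))       # -> 28 TB usable
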
00:09:01.279 --> 00:09:08.000
Okay, big test time. That's promising. We've got

00:09:05.839 --> 00:09:12.399
eight drives running in parity mode in Storage Spaces, which I prefer because it

00:09:09.760 --> 00:09:15.519
gives way more capacity than mirror. Now, you'll see a couple of these are not

00:09:13.760 --> 00:09:20.000
blinking, but that's more likely just due to the fact that, in

00:09:18.080 --> 00:09:24.480
parity mode, it writes in a very different way compared to mirror. These

00:09:22.640 --> 00:09:27.839
write speeds suck for how many drives are in there, but from our experience, it

00:09:26.160 --> 00:09:31.920
seems to be a weird interaction between CrystalDiskMark and Storage Spaces,

00:09:29.839 --> 00:09:38.480
because we've seen better real-world performance than what it measures. But

00:09:34.240 --> 00:09:42.880
this defies all logic. I don't understand

00:09:38.480 --> 00:09:42.880
how... there's... how it's working. I mean,

00:09:43.040 --> 00:09:48.560
to be clear, I'm happy. Also,

00:09:46.720 --> 00:09:53.760
we haven't actually loaded this thing up yet. Should we try 16?

00:09:51.279 --> 00:09:53.760
Oh yeah!

00:09:54.320 --> 00:10:00.880
No, wait, wait. "Disk 10 has been surprise removed."

00:09:58.399 --> 00:10:00.880
Wait, what?

00:10:02.640 --> 00:10:08.320
Hold on a second, I got one of those before. I haven't removed any;

00:10:06.800 --> 00:10:12.399
I just thought it was something to do with while I was putting drives in. But I

00:10:10.320 --> 00:10:16.560
didn't take any drives out, did I? There it is:

00:10:13.360 --> 00:10:18.560
stornvme, "Reset to device," blah blah,

00:10:16.560 --> 00:10:25.200
"RaidPort15, was issued." This is exactly the problem

00:10:22.320 --> 00:10:30.200
we had before. Need a drink?

00:10:27.200 --> 00:10:30.200
lttstore.com

00:10:30.240 --> 00:10:35.600
Prove it's water in here, or prove it's not water. Uh...

00:10:36.800 --> 00:10:43.680
Go buy a water bottle, they're insulated, they're really nice. Let's go back for a second to the

00:10:41.600 --> 00:10:47.040
questions I set out to answer today. Number one:

00:10:44.959 --> 00:10:51.120
what did they change? Okay, well, we've got that one:

00:10:49.519 --> 00:10:55.200
nothing. Question number two:

00:10:53.440 --> 00:10:59.519
what was up with this replacement process, when they've indicated, at least

00:10:57.600 --> 00:11:04.160
to other customers, that they're aware of this problem,

00:11:01.040 --> 00:11:06.320
and they evidently know that they can't,

00:11:04.160 --> 00:11:10.240
they have no way of fixing it. Because here's the thing:

00:11:07.839 --> 00:11:14.399
any other Intel drive, at least according to Wendell, would have worked fine. The

00:11:12.480 --> 00:11:19.120
slightly newer P4501 doesn't have this problem. The P4511, which

00:11:17.040 --> 00:11:24.079
is fundamentally the same drive but in their ruler form factor:

00:11:21.519 --> 00:11:28.399
same thing. Why send me

00:11:25.680 --> 00:11:33.519
stacks of broken drives when you know how much... like, it's not free, you know? The

00:11:32.079 --> 00:11:37.680
cost would be the same to send me something that works.

00:11:36.320 --> 00:11:42.160
Unless... they do work.

00:11:40.240 --> 00:11:45.839
One of the most basic troubleshooting steps when you suspect a piece of

00:11:43.680 --> 00:11:49.120
hardware is defective is to take it and put it in another system to isolate your

00:11:47.920 --> 00:11:52.560
variables. Unfortunately, testing just one of them

00:11:51.040 --> 00:11:58.639
at a time wouldn't have told me anything, because they worked fine. And

00:11:55.360 --> 00:12:01.839
NVMe servers that can take 16 or 24

00:11:58.639 --> 00:12:03.519
drives don't exactly grow on trees, so it

00:12:01.839 --> 00:12:11.200
wasn't an option for me. Fortunately, now that old Whonnock has been replaced,

00:12:08.399 --> 00:12:16.480
I can actually take these new drives, chuck them in here, and see if we see the

00:12:13.760 --> 00:12:16.480
same behavior.

00:12:16.800 --> 00:12:21.360
Locked and loaded. All 24 drives.

00:12:21.680 --> 00:12:26.560
Let's see if it works.

00:12:24.800 --> 00:12:31.440
Performance is not as good... um.

00:12:29.120 --> 00:12:38.240
Even at its best, this server has a fraction of the PCI Express connectivity,

00:12:34.880 --> 00:12:40.720
and it's Gen 3, compared to the AMD EPYC. But

00:12:38.240 --> 00:12:45.279
if the drives don't drop out and we get that consistency, that's

00:12:43.279 --> 00:12:49.600
way more important because we're only accessing this for video editing over a

00:12:47.120 --> 00:12:54.399
network. You don't actually need, you know, 20 or 100 gigabytes a second;

00:12:52.560 --> 00:12:58.079
that's just, like, genital measuring at that point.

00:12:57.279 --> 00:13:04.399
Wait, there's a disk... oh. "Disk 26 has been surprise removed."

00:13:01.519 --> 00:13:07.920
Okay, yep, yep, that's fine, that actually did happen. So that means the whole thing

00:13:06.720 --> 00:13:11.279
ran: no errors,

00:13:09.839 --> 00:13:15.040
worked perfectly. Slowly, but

00:13:13.600 --> 00:13:22.720
perfectly. So then, now, Intel's behavior looks

00:13:19.040 --> 00:13:23.760
actually more generous than stupid. So a

00:13:22.720 --> 00:13:30.000
tech, probably without understanding the whole backstory, saw these perfectly functional

00:13:27.760 --> 00:13:35.200
drives in the RMA pool and went, "I don't know, the customer's always right," and

00:13:32.320 --> 00:13:40.079
replaced them, not realizing that the problem was because of this edge-case

00:13:37.839 --> 00:13:44.880
error that only shows up on some platforms under

00:13:42.320 --> 00:13:47.440
some workloads. So,

00:13:45.600 --> 00:13:55.040
thanks, Intel. It's a specific incompatibility between

00:13:50.880 --> 00:13:56.800
this drive and second-gen AMD EPYC CPUs.

00:13:55.040 --> 00:14:00.000
Dell has already fixed the timing behavior through a firmware update on at

00:13:58.320 --> 00:14:04.160
least some of their boards. Maybe one or more of their

00:14:02.320 --> 00:14:07.760
customers put in a big order for new servers and then had a bunch of these

00:14:06.160 --> 00:14:12.399
they didn't want to replace. And curiously, AMD's Daytona test platform

00:14:10.560 --> 00:14:17.120
from Quanta doesn't have the problem either. So,

00:14:14.480 --> 00:14:22.880
maybe Gigabyte just couldn't be arsed to fix this. But then, in fairness to them,

00:14:20.160 --> 00:14:26.800
it's just this one outdated drive from Intel and therefore it probably doesn't

00:14:25.120 --> 00:14:30.399
affect any of their actual customers. Like, remember, guys:

00:14:28.160 --> 00:14:34.720
this right here is an engineering sample unit, meaning I didn't pay for it, and

00:14:32.639 --> 00:14:37.760
it's still technically the property of Gigabyte.

00:14:36.399 --> 00:14:43.040
So it's not really anyone's fault.

00:14:40.720 --> 00:14:48.240
But that doesn't change the fact that I spent over ten thousand dollars on drives, and

00:14:45.839 --> 00:14:53.600
my server doesn't work, and I want it to work. So what I've decided to do is just

00:14:51.199 --> 00:14:58.320
eat the performance penalty and keep these new drives in my old Supermicro

00:14:56.399 --> 00:15:02.560
box, even though the limited number of PCI Express lanes means they are

00:14:59.920 --> 00:15:05.920
bottlenecked to hell and back. I mean, while we waited on a solution to

00:15:04.240 --> 00:15:10.560
this, Liqid stepped up and set us up with their Honey Badger den, which is

00:15:07.920 --> 00:15:15.839
faster anyway and has been rock solid. So thanks, Liqid. Thanks,

00:15:13.040 --> 00:15:21.519
Intel... I think? And, uh, now this will be, like, the

00:15:17.920 --> 00:15:23.199
capacity SSD server. There you go.

00:15:21.519 --> 00:15:27.440
And this will be my segue to our sponsor, Drop. Check out their ENTR

00:15:25.600 --> 00:15:31.920
keyboard. It's made with enthusiasts in mind, making it easy to swap out keycaps

00:15:29.759 --> 00:15:36.639
and even key switches. It's got an aluminum top plate and plastic bottom

00:15:33.839 --> 00:15:40.800
plate that feels great, and white LEDs for visibility in dark conditions. The

00:15:38.639 --> 00:15:45.759
PBT keycaps are doubleshot, so they shine through, and it weighs

00:15:42.839 --> 00:15:49.199
964 grams, so you could basically kill a man with it. Not that I would recommend

00:15:47.360 --> 00:15:52.480
that. It's available in three colors with your choice of mechanical switches, and

00:15:50.880 --> 00:15:55.839
you can buy yours today at the link in the video description. If you guys are

00:15:54.399 --> 00:15:59.680
looking for another server video to watch, you can check out the original

00:15:57.440 --> 00:16:04.000
deployment nightmare with... sorry, not this, with this one. Or you can check out, we

00:16:02.320 --> 00:16:07.759
actually did a few videos on the Honey Badger den. That thing is so freaking

00:16:05.600 --> 00:16:13.240
fast. Like, a hundred gigabytes a second, raw. It's dangerously close to memory speeds.
