WEBVTT

00:00:00.000 --> 00:00:07.500
when I say we store and handle a lot of data for a YouTube channel I mean it I

00:00:05.279 --> 00:00:12.360
mean we've built some sick 100 plus terabyte servers for some of our fellow

00:00:09.300 --> 00:00:15.599
YouTubers but those are nothing compared

00:00:12.360 --> 00:00:17.340
to the two plus petabytes of archival

00:00:15.599 --> 00:00:21.720
storage that we currently have in production in our server room that is

00:00:19.199 --> 00:00:28.859
storing all the footage for every video we have ever made at full quality

00:00:25.439 --> 00:00:31.800
for the uninitiated that is over

00:00:28.859 --> 00:00:36.000
11,000 Warzone installs worth of data but with great power comes great

00:00:33.540 --> 00:00:41.700
responsibility and we weren't responsible

00:00:38.280 --> 00:00:44.100
despite our super dope Hardware we made

00:00:41.700 --> 00:00:48.719
a little oopsie that resulted in us permanently losing data that we don't

00:00:46.140 --> 00:00:53.280
have any backup for we still don't know how much but what we do know is what

00:00:51.180 --> 00:00:58.559
went wrong and we've got a plan to recover what we can but it is going to

00:00:55.680 --> 00:01:03.420
take some work and some money thanks to our sponsor Hetzner Hetzner offers high

00:01:01.260 --> 00:01:07.260
performance Cloud servers for an amazing price with their new US location in

00:01:05.580 --> 00:01:10.500
Ashburn Virginia you can deploy Cloud servers in four different locations and

00:01:09.299 --> 00:01:15.420
benefit from features like load balancers block storage and more use

00:01:12.659 --> 00:01:18.860
code ltt22 at the link below for twenty dollars off

00:01:25.040 --> 00:01:31.680
let's start with a bit of background on our servers our archival storage is

00:01:29.220 --> 00:01:37.619
composed of two discrete GlusterFS clusters both of them spread across two

00:01:34.500 --> 00:01:40.619
45Drives Storinator servers each with

00:01:37.619 --> 00:01:43.259
60 hard drives the original petabyte

00:01:40.619 --> 00:01:49.200
project is made up of the Delta 1 and Delta 2 servers and goes by the moniker

00:01:45.659 --> 00:01:52.320
old Vault petabyte project 2 or the new

00:01:49.200 --> 00:01:53.880
vault is Delta 3 and Delta 4 now

00:01:52.320 --> 00:01:58.560
because of the nature of our content most of our employees are pretty Tech

00:01:56.460 --> 00:02:03.360
literate with many of them even falling into the tech wizard category so we've

00:02:00.899 --> 00:02:07.560
always had substantially lower need for tech support than the average company

00:02:05.040 --> 00:02:12.959
and as a result we have never hired a full-time it person despite the handful

00:02:10.259 --> 00:02:17.940
of times perhaps including this one that it probably would have been helpful

00:02:14.879 --> 00:02:20.099
so in the early days I managed the

00:02:17.940 --> 00:02:24.020
infrastructure and since then I've had some help from both outside sources

00:02:24.360 --> 00:02:28.220
and other members of the writing team

00:02:29.340 --> 00:02:34.920
we all have different strengths but what

00:02:32.340 --> 00:02:40.140
we all have in common is that we have other jobs to do meaning that it's never

00:02:37.140 --> 00:02:41.400
really been clear who exactly is

00:02:40.140 --> 00:02:46.680
supposed to be accountable when something slips through the cracks

00:02:43.680 --> 00:02:49.260
and unfortunately while obvious issues

00:02:46.680 --> 00:02:53.580
like a replacement power cable and a handful of failed drives over the years

00:02:50.879 --> 00:02:57.660
were handled by Anthony we never really tasked anyone with performing

00:02:55.200 --> 00:03:01.560
preventative maintenance on our precious petabyte servers a quick point of

00:03:00.000 --> 00:03:05.700
clarification before we get into the rest of this nothing that happened is

00:03:03.599 --> 00:03:10.980
the result of anything other than us messing up the hardware both from

00:03:08.519 --> 00:03:15.480
45Drives and from Seagate who provided the bulk of what makes up our petabyte

00:03:12.900 --> 00:03:19.500
project servers has performed beyond our expectations and we would recommend

00:03:17.340 --> 00:03:23.159
checking out both of them if you or your business has serious data storage needs

00:03:21.420 --> 00:03:28.260
we're going to have links to them down below but even the best hardware in the

00:03:25.920 --> 00:03:32.879
world can be let down by misconfigured software and Jake who tasked himself

00:03:31.080 --> 00:03:37.200
with auditing our current infrastructure found just such a thing

00:03:35.459 --> 00:03:41.099
everything was actually going pretty well he was setting up monitoring and

00:03:39.180 --> 00:03:44.640
alerts verifying that every machine would gracefully shut down when the

00:03:42.959 --> 00:03:48.360
power goes out which happens a lot here for some reason but he eventually worked

00:03:46.860 --> 00:03:53.879
his way around to the petabyte project servers and checked the status of the

00:03:50.280 --> 00:03:56.819
ZFS pools or zpools on each of them and

00:03:53.879 --> 00:04:02.580
this is where the caca hit the fan

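for anyone following along at home this kind of check is a one-liner and it's the five-second health check that would have caught all of this much earlier, a minimal sketch in Python assuming the standard zpool command-line tool:

```python
# Sketch: quick ZFS health check, suitable for cron or a monitoring agent.
# `zpool status -x` prints "all pools are healthy" when nothing is wrong
# and details only the troubled pools otherwise.
import subprocess

out = subprocess.run(["zpool", "status", "-x"],
                     capture_output=True, text=True).stdout
if "all pools are healthy" not in out:
    print("pool trouble detected:\n" + out)  # wire this into alerting
```
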
00:03:59.819 --> 00:04:07.500
right off the bat Delta 1 had two of its 60 drives faulted in the same vdev and you can think of a vdev kind of like

00:04:04.440 --> 00:04:10.799
its own mini RAID array within a larger

00:04:07.500 --> 00:04:12.480
pool of multiple RAID arrays so in our

00:04:10.799 --> 00:04:18.720
configuration where we're running RAID-Z2 if another disk out of our 15-drive

00:04:16.079 --> 00:04:23.940
vdev were to have any kind of problem we would incur irrecoverable data loss

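to make the stakes concrete here's a tiny illustrative sketch (the vdev layout numbers are hypothetical, not pulled from our actual pools) of how much headroom a RAID-Z2 pool has left as drives fail:

```python
# Illustration: RAID-Z2 keeps two parity blocks per stripe, so each vdev
# survives up to two failed drives. Because ZFS stripes data across all
# vdevs, losing any single vdev loses the entire pool.
RAIDZ2_PARITY = 2

def pool_headroom(failed_per_vdev):
    """failed_per_vdev: list of failed-drive counts, one entry per vdev."""
    for i, failed in enumerate(failed_per_vdev):
        margin = RAIDZ2_PARITY - failed
        if margin < 0:
            return f"vdev {i} has {failed} failures: POOL LOST"
        print(f"vdev {i}: {failed} failed, {margin} more before data loss")
    return "pool degraded but still recoverable"

# Roughly Delta 1 as found: four 15-drive vdevs, two failures in one vdev
print(pool_headroom([2, 0, 0, 0]))
```
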
00:04:21.959 --> 00:04:28.199
upon further inspection both of the drives were completely dead which does happen

00:04:25.860 --> 00:04:33.360
with mechanical devices and had dropped from the system so we replaced them and

00:04:30.960 --> 00:04:37.259
let the array start rebuilding that's pretty scary but not in and of itself a

00:04:36.180 --> 00:04:41.940
lost cause more on that later though far scarier

00:04:39.600 --> 00:04:47.699
was when Delta 3 which is part of the new Vault cluster had five drives in a

00:04:45.240 --> 00:04:52.440
faulted state with two of the vdevs having two drives down that's very

00:04:51.180 --> 00:04:57.120
dangerous interestingly these drives weren't

00:04:54.720 --> 00:05:02.580
actually dead instead they had just faulted due to having too many errors so

00:05:00.840 --> 00:05:06.780
read and write errors like this are usually caused by a faulty cable or

00:05:04.560 --> 00:05:10.800
connection but they can also be the sign of a dying drive in our case these

00:05:09.120 --> 00:05:15.180
errors probably cropped up due to a sudden power loss or due to naturally

00:05:12.960 --> 00:05:18.780
occurring bit rot as they were never configured to shut down nicely while on

00:05:17.160 --> 00:05:22.020
backup power in the case of an outage and we've had quite a few of those over

00:05:20.280 --> 00:05:26.280
the years now storage systems are usually designed to

00:05:24.419 --> 00:05:30.720
be able to recover from such an event especially ZFS which is known for being

00:05:28.500 --> 00:05:35.400
one of the most resilient ones out there after booting back up from a power loss

00:05:32.639 --> 00:05:39.720
ZFS pools and most other RAID or RAID-like storage arrays should do

00:05:37.259 --> 00:05:43.800
something called a scrub or a resync which in the case of ZFS means that

00:05:42.060 --> 00:05:47.460
every block of data gets checked to ensure that there are no errors and if

00:05:45.539 --> 00:05:51.240
there are any errors these errors are automatically fixed with the parity data

00:05:49.500 --> 00:05:57.360
that is stored in the array

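by hand that process looks roughly like this, a sketch using the standard zpool commands (the pool name "tank" is a placeholder, not one of our real pools):

```python
# Sketch: start a scrub and poll until it finishes. During the scrub every
# block is read, checksummed, and repaired from parity where possible.
import subprocess
import time

POOL = "tank"  # placeholder pool name

subprocess.run(["zpool", "scrub", POOL], check=True)  # returns immediately

while True:
    status = subprocess.run(["zpool", "status", POOL],
                            capture_output=True, text=True,
                            check=True).stdout
    if "scrub in progress" not in status:
        break
    time.sleep(60)  # scrubbing a pool this size takes a long while

print(status)  # final report: repaired bytes and any unrecoverable errors
```
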
00:05:53.940 --> 00:05:59.400
on most NAS operating systems like TrueNAS, Unraid, or any pre-built NAS this

00:05:57.360 --> 00:06:03.300
process should just happen automatically and even if nothing goes wrong they

00:06:01.380 --> 00:06:10.979
should also run a scheduled scrub every month or so but our servers were set up

00:06:05.639 --> 00:06:14.340
by us a long time ago on CentOS and

00:06:10.979 --> 00:06:16.680
never updated so neither a scheduled nor

00:06:14.340 --> 00:06:20.460
a power on recovery scrub was ever configured meaning the only time data

00:06:19.199 --> 00:06:26.100
Integrity would have been checked on these arrays is when a block of data got

00:06:23.699 --> 00:06:32.100
read this function should theoretically protect against bit rot but since we have

00:06:28.680 --> 00:06:34.740
thousands of old videos of which a very

00:06:32.100 --> 00:06:40.319
very small portion ever actually gets accessed the rest were essentially left

00:06:37.259 --> 00:06:43.440
to slowly rot and power loss themselves

00:06:40.319 --> 00:06:44.940
into an unrecoverable mess

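for reference the missing piece was something as small as this, a sketch of a monthly scrub job you'd point cron at (pool names are placeholders):

```python
#!/usr/bin/env python3
# Sketch: the scheduled scrub our servers never had.
# Example cron entry:  0 3 1 * * /usr/local/bin/scrub_all.py
import subprocess

POOLS = ["tank"]  # placeholder; list every pool on the machine

for pool in POOLS:
    result = subprocess.run(["zpool", "scrub", pool])
    if result.returncode != 0:
        # a scrub may already be running or the pool may be faulted;
        # either way it's worth an alert, not a silent failure
        print(f"could not start scrub on {pool}")
```
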
00:06:43.440 --> 00:06:49.380
when we found the drive issues we weren't even aware of all this yet and even though the five

00:06:47.100 --> 00:06:53.580
drives weren't technically dead we erred on the side of caution and started a

00:06:50.880 --> 00:06:57.360
replace operation on all of them it was while we were rebuilding the array on

00:06:55.259 --> 00:07:00.960
Delta 3 with the new disks that we started to uncover the absolute mess of

00:06:59.819 --> 00:07:06.479
data errors ZFS has reported around

00:07:03.080 --> 00:07:09.600
169 million errors at the time of

00:07:06.479 --> 00:07:11.400
recording this and no it's not nice

00:07:09.600 --> 00:07:16.500
in fact there are so many errors on Delta 3 that with two faulted drives in

00:07:14.160 --> 00:07:21.660
both of the first vdevs there is not enough parity data to fix the errors and

00:07:19.380 --> 00:07:26.520
this caused the array to offline itself to protect against further degradation

00:07:24.180 --> 00:07:30.300
and unfortunately much further along in the process the same thing happened on

00:07:28.680 --> 00:07:36.300
Delta 1 that means that both the original and

00:07:33.000 --> 00:07:41.160
new petabyte projects old and new vault

00:07:36.300 --> 00:07:43.080
have suffered non-recoverable data loss

00:07:41.160 --> 00:07:47.340
so now what do we do in regards to the corrupted and lost

00:07:45.120 --> 00:07:51.900
data honestly nothing I mean it's very likely that even with

00:07:49.199 --> 00:07:57.660
169 million data errors we still have virtually all of the original bits in

00:07:54.840 --> 00:08:02.940
the right places but as far as we know there's no way to just tell ZFS yo dog

00:08:00.780 --> 00:08:07.979
ignore those errors you know pretend like they never happened EZ ZFS or

00:08:05.520 --> 00:08:13.380
something instead the plan is to build a new properly configured 1.2

00:08:11.280 --> 00:08:17.400
petabyte server featuring Seagate's shiny new 20 terabyte drives which we're

00:08:15.720 --> 00:08:22.080
really excited about like these things are almost as shiny as our reflective

00:08:19.020 --> 00:08:24.000
hard drive shirt lttstore.com

00:08:22.080 --> 00:08:29.699
and once that's complete we intend to move all of the data from the new Vault

00:08:26.160 --> 00:08:31.259
cluster onto this new new vault

00:08:29.699 --> 00:08:36.419
new new vault then we'll re-set up new Vault ensure all

00:08:34.800 --> 00:08:42.300
the drives are good and repeat the process to move old Vault data onto it

00:08:39.300 --> 00:08:45.000
then we can reformat old Vault probably

00:08:42.300 --> 00:08:49.140
upgrade it a bit and use it for new data maybe we'll rename it to new new Vault

00:08:47.040 --> 00:08:52.560
get subscribed so you don't miss any of that we'll hopefully be building that

00:08:50.880 --> 00:08:56.760
new server this week now if everything were set up properly

00:08:54.480 --> 00:09:01.740
with regularly scheduled and post-power-loss scrubs this entire problem would

00:08:59.519 --> 00:09:06.060
probably have never happened and if we had a backup of that data we would be

00:09:04.080 --> 00:09:10.260
able to simply restore from that but here's the thing

00:09:07.560 --> 00:09:14.760
backing up over a petabyte of data is really expensive either we would need to

00:09:12.600 --> 00:09:19.080
build a duplicate server array to back up to or we could back up to the cloud but

00:09:17.519 --> 00:09:25.260
even using the economical option Backblaze B2 it would cost us somewhere

00:09:21.420 --> 00:09:28.080
between $5,000 and $10,000 USD per

00:09:25.260 --> 00:09:31.860
month to store that kind of data

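the napkin math behind that range, assuming B2's list price of roughly five dollars per terabyte per month around the time of recording:

```python
# Back-of-the-envelope cloud backup cost. The $5/TB/month rate is an
# assumption based on Backblaze B2's published pricing at the time.
PRICE_PER_TB_MONTH = 5.00  # USD

for petabytes in (1, 2):
    terabytes = petabytes * 1000
    cost = terabytes * PRICE_PER_TB_MONTH
    print(f"{petabytes} PB ≈ ${cost:,.0f} per month")
# 1 PB ≈ $5,000 per month, 2 PB ≈ $10,000 per month
```
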
00:09:30.060 --> 00:09:36.540
now if it were mission critical then by all means it should have been backed up in both of those ways but having all of our

00:09:34.560 --> 00:09:41.700
archival footage from day one of the channel has always been a nice to have

00:09:39.300 --> 00:09:45.060
and an excuse for us to explore really cool Tech that we otherwise wouldn't

00:09:43.080 --> 00:09:49.320
have any reason to play with I mean it takes a little bit more effort and it

00:09:46.500 --> 00:09:52.920
yields lower quality results but we have a backup of all of our old videos it's

00:09:51.360 --> 00:09:58.200
called downloading them off of YouTube or Floatplane if we wanted a higher

00:09:55.260 --> 00:10:02.940
quality copy so the good news is that our production monix server is running

00:10:00.300 --> 00:10:05.880
great with proper backups configured and this isn't going to have any kind of

00:10:04.140 --> 00:10:09.660
lasting effect on our business but I am still hopeful that if all goes

00:10:08.040 --> 00:10:13.500
well with the recovery efforts we'll be able to get back the majority of the

00:10:11.399 --> 00:10:18.480
data mostly error free but only time will tell a lot of time

00:10:16.080 --> 00:10:22.560
because transferring all those petabytes of data off of hard drives to other hard

00:10:20.519 --> 00:10:27.800
drives is going to take weeks or even months

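here's the back-of-the-envelope on why it takes that long (the sustained transfer rates are assumptions, one for a clean 10 gigabit link and one for what a busy degraded pool might actually manage):

```python
# Rough transfer-time math for moving ~2 PB between servers.
DATA_TB = 2000  # about two petabytes

# assumed sustained rates in GB/s: 10 Gb/s line rate vs. a degraded pool
for rate_gbs in (1.25, 0.3):
    seconds = DATA_TB * 1000 / rate_gbs
    print(f"at {rate_gbs} GB/s: about {seconds / 86400:.0f} days")
# ~19 days at full 10-gig line rate, ~77 days at the pessimistic rate
```
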
00:10:25.200 --> 00:10:31.920
so let this be a lesson follow proper storage practices have a backup and probably hire someone to take care

00:10:30.600 --> 00:10:37.260
of your data if you don't have the time especially if you measure it in anything

00:10:33.600 --> 00:10:39.180
other than tens of terabytes or you

00:10:37.260 --> 00:10:43.200
might lose all of it but you won't lose our sponsor Lambda are you training deep

00:10:41.940 --> 00:10:48.240
learning models for the next big breakthrough in artificial intelligence then you should know about Lambda the

00:10:46.440 --> 00:10:51.959
deep learning company founded by machine learning engineers Lambda builds GPU

00:10:50.160 --> 00:10:55.320
workstations servers and Cloud infrastructure for creating deep

00:10:53.459 --> 00:10:59.579
learning models they've helped all five of the big tech companies and 47 of the

00:10:57.540 --> 00:11:02.820
top 50 research universities accelerate their machine learning workflows

00:11:00.779 --> 00:11:07.140
Lambda's easy to use configurators let you spec out exactly the hardware you

00:11:04.740 --> 00:11:10.980
need from GPU laptops and workstations all the way up to custom server clusters

00:11:09.120 --> 00:11:14.399
and all Lambda machines come pre-installed with Lambda stack keeping

00:11:13.079 --> 00:11:18.660
your Linux machine learning environment up to date and out of dependency hell

00:11:16.560 --> 00:11:22.980
and with Lambda Cloud you can spin up a virtual machine in minutes train models

00:11:20.519 --> 00:11:26.399
with four NVIDIA A6000s at just a fraction of the cost of the big

00:11:24.360 --> 00:11:30.959
cloud providers so go to lambdalabs.com/Linus to configure your own workstation

00:11:28.560 --> 00:11:35.459
or try out Lambda Cloud today if you like this video maybe check out the time

00:11:32.459 --> 00:11:39.839
I almost lost all of our active projects

00:11:35.459 --> 00:11:42.000
when the OG Whonnock server failed that was a

00:11:39.839 --> 00:11:45.240
far more stressful situation I'm actually like

00:11:43.440 --> 00:11:48.120
I'm actually pretty relaxed right now for someone with this much data on the

00:11:47.160 --> 00:11:51.839
line yeah I'm I'm doing okay thanks for

00:11:50.579 --> 00:11:56.240
asking I mean I'd prefer to get it back you

00:11:54.180 --> 00:11:56.240
know
