1
00:00:00,000 --> 00:00:07,500
when I say we store and handle a lot of data for a YouTube channel I mean it I

2
00:00:05,279 --> 00:00:12,360
mean we've built some sick 100 plus terabyte servers for some of our fellow

3
00:00:09,300 --> 00:00:15,599
YouTubers but those are nothing compared

4
00:00:12,360 --> 00:00:17,340
to the two plus petabytes of archival

5
00:00:15,599 --> 00:00:21,720
storage that we currently have in production in our server room that is

6
00:00:19,199 --> 00:00:28,859
storing all the footage for every video we have ever made at full quality

7
00:00:25,439 --> 00:00:31,800
for the uninitiated that is over 11,000

8
00:00:28,859 --> 00:00:36,000
Warzone installs worth of data but with great power comes great

9
00:00:33,540 --> 00:00:41,700
responsibility and we weren't responsible

10
00:00:38,280 --> 00:00:44,100
despite our super dope Hardware we made

11
00:00:41,700 --> 00:00:48,719
a little oopsie that resulted in us permanently losing data that we don't

12
00:00:46,140 --> 00:00:53,280
have any backup for we still don't know how much but what we do know is what

13
00:00:51,180 --> 00:00:58,559
went wrong and we've got a plan to recover what we can but it is going to

14
00:00:55,680 --> 00:01:03,420
take some work and some money thanks to our sponsor Hetzner. Hetzner offers high

15
00:01:01,260 --> 00:01:07,260
performance cloud servers for an amazing price with their new US location in

16
00:01:05,580 --> 00:01:10,500
Ashburn Virginia you can deploy Cloud servers in four different locations and

17
00:01:09,299 --> 00:01:15,420
benefit from features like load balancers block storage and more use

18
00:01:12,659 --> 00:01:18,860
code ltt22 at the link below for twenty dollars off

20
00:01:25,040 --> 00:01:31,680
let's start with a bit of background on our servers our archival storage is

21
00:01:29,220 --> 00:01:37,619
composed of two discrete GlusterFS clusters both of them spread across two

22
00:01:34,500 --> 00:01:40,619
45Drives Storinator servers each with

23
00:01:37,619 --> 00:01:43,259
60 hard drives the original petabyte

24
00:01:40,619 --> 00:01:49,200
project is made up of the Delta 1 and Delta 2 servers and goes by the moniker

25
00:01:45,659 --> 00:01:52,320
old Vault petabyte project 2 or the new

26
00:01:49,200 --> 00:01:53,880
vault is Delta 3 and Delta 4 now

27
00:01:52,320 --> 00:01:58,560
because of the nature of our content most of our employees are pretty Tech

28
00:01:56,460 --> 00:02:03,360
literate with many of them even falling into the tech wizard category so we've

29
00:02:00,899 --> 00:02:07,560
always had substantially lower need for tech support than the average company

30
00:02:05,040 --> 00:02:12,959
and as a result we have never hired a full-time IT person despite the handful

31
00:02:10,259 --> 00:02:17,940
of times perhaps including this one that it probably would have been helpful

32
00:02:14,879 --> 00:02:20,099
so in the early days I managed the

33
00:02:17,940 --> 00:02:24,020
infrastructure and since then I've had some help from both outside sources

34
00:02:24,360 --> 00:02:28,220
and other members of the writing team

35
00:02:29,340 --> 00:02:34,920
we all have different strengths but what

36
00:02:32,340 --> 00:02:40,140
we all have in common is that we have other jobs to do meaning that it's never

37
00:02:37,140 --> 00:02:41,400
really been clear who exactly is

38
00:02:40,140 --> 00:02:46,680
supposed to be accountable when something slips through the cracks

39
00:02:43,680 --> 00:02:49,260
and unfortunately while obvious issues

40
00:02:46,680 --> 00:02:53,580
like a replacement power cable and a handful of failed drives over the years

41
00:02:50,879 --> 00:02:57,660
were handled by Anthony we never really tasked anyone with performing

42
00:02:55,200 --> 00:03:01,560
preventative maintenance on our precious petabyte servers a quick point of

43
00:03:00,000 --> 00:03:05,700
clarification before we get into the rest of this nothing that happened is

44
00:03:03,599 --> 00:03:10,980
the result of anything other than us messing up. The hardware, both from 45Drives

45
00:03:08,519 --> 00:03:15,480
and from Seagate, who provided the bulk of what makes up our petabyte

46
00:03:12,900 --> 00:03:19,500
project servers has performed beyond our expectations and we would recommend

47
00:03:17,340 --> 00:03:23,159
checking out both of them if you or your business has serious data storage needs

48
00:03:21,420 --> 00:03:28,260
we're going to have links to them down below but even the Best Hardware in the

49
00:03:25,920 --> 00:03:32,879
world can be let down by misconfigured software and Jake who tasked himself

50
00:03:31,080 --> 00:03:37,200
with auditing our current infrastructure found just such a thing

51
00:03:35,459 --> 00:03:41,099
everything was actually going pretty well he was setting up monitoring and

52
00:03:39,180 --> 00:03:44,640
alerts verifying that every machine would gracefully shut down when the

53
00:03:42,959 --> 00:03:48,360
power goes out which happens a lot here for some reason but he eventually worked

54
00:03:46,860 --> 00:03:53,879
his way around to the petabyte project servers and checked the status of the

55
00:03:50,280 --> 00:03:56,819
ZFS pools or zpools on each of them

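As a rough illustration of that kind of check (this is a sketch, not the actual monitoring Jake set up; the alerting hook is an assumption), a minimal Python script can shell out to zpool status -x and flag anything unhealthy:

    import subprocess

    def pools_healthy() -> bool:
        # "zpool status -x" prints "all pools are healthy" when nothing is wrong
        out = subprocess.run(["zpool", "status", "-x"],
                             capture_output=True, text=True, check=True).stdout
        return "all pools are healthy" in out.lower()

    if __name__ == "__main__":
        if not pools_healthy():
            # hook this into whatever alerting you use (email, chat webhook, etc.)
            print("WARNING: at least one zpool is degraded or faulted")
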
56
00:03:53,879 --> 00:04:02,580
And this is where the caca hit the fan right off the bat Delta 1 had two of its 60

57
00:03:59,819 --> 00:04:07,500
drives faulted in the same vdev and you can think of a vdev kind of like

58
00:04:04,440 --> 00:04:10,799
its own mini RAID array within a larger

59
00:04:07,500 --> 00:04:12,480
pool of multiple RAID arrays so in our

60
00:04:10,799 --> 00:04:18,720
configuration where we're running RAID-Z2 if another disk out of our 15 drive

61
00:04:16,079 --> 00:04:23,940
vdev was to have any kind of problem we would incur irrecoverable data loss

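To put rough numbers on that layout (the per-drive capacity here is an assumption for illustration, not the servers' actual drives), here is a quick Python back-of-the-envelope for one 60-drive server split into four 15-wide RAID-Z2 vdevs:

    DRIVES_PER_VDEV = 15
    VDEVS = 4                 # 60 drives per server / 15 drives per vdev
    PARITY_PER_VDEV = 2       # RAID-Z2 keeps two drives' worth of parity per vdev
    DRIVE_TB = 16             # assumed drive size, illustration only

    usable_tb = VDEVS * (DRIVES_PER_VDEV - PARITY_PER_VDEV) * DRIVE_TB
    print(f"rough usable capacity: ~{usable_tb} TB before filesystem overhead")

    # Each vdev tolerates any two failed drives; a third failure in the SAME
    # vdev takes the whole pool with it, which is why two faulted drives in
    # one vdev is an emergency.
    print("drive failures tolerated per vdev:", PARITY_PER_VDEV)
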
62
00:04:21,959 --> 00:04:28,199
Upon further inspection both of the drives were completely dead which does happen

63
00:04:25,860 --> 00:04:33,360
with mechanical devices and had dropped from the system so we replaced them and

64
00:04:30,960 --> 00:04:37,259
let the array start rebuilding that's pretty scary but not in and of itself a lost cause

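For anyone curious what a replacement like that looks like, a hedged Python sketch (pool and device names are placeholders, not the actual servers) that kicks off zpool replace and waits for the resilver to finish might be:

    import subprocess
    import time

    POOL = "delta1"           # placeholder pool name
    FAILED_DEV = "sdx"        # placeholder: the faulted drive
    NEW_DEV = "sdy"           # placeholder: the freshly installed drive

    # Tell ZFS to rebuild the failed drive's data onto the new one
    subprocess.run(["zpool", "replace", POOL, FAILED_DEV, NEW_DEV], check=True)

    # Poll until the resilver no longer shows up in the pool status
    while True:
        status = subprocess.run(["zpool", "status", POOL],
                                capture_output=True, text=True).stdout
        if "resilver in progress" not in status:
            break
        time.sleep(600)       # re-check every 10 minutes
    print("resilver complete")
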
65
00:04:36,180 --> 00:04:41,940
More on that later though far scarier

66
00:04:39,600 --> 00:04:47,699
was when Delta 3 which is part of the new Vault cluster had five drives in a

67
00:04:45,240 --> 00:04:52,440
faulted state with two of the vdevs having two drives down that's very

68
00:04:51,180 --> 00:04:57,120
dangerous interestingly these drives weren't

69
00:04:54,720 --> 00:05:02,580
actually dead instead they had just faulted due to having too many errors so

70
00:05:00,840 --> 00:05:06,780
read and write errors like this are usually caused by a faulty cable or

71
00:05:04,560 --> 00:05:10,800
connection but they can also be the sign of a Dying Drive in our case these

72
00:05:09,120 --> 00:05:15,180
errors probably cropped up due to a sudden power loss or due to naturally

73
00:05:12,960 --> 00:05:18,780
occurring bit rot as they were never configured to shut down nicely while on

74
00:05:17,160 --> 00:05:22,020
backup power in the case of an outage and we've had quite a few of those over

75
00:05:20,280 --> 00:05:26,280
the years now storage systems are usually designed to

76
00:05:24,419 --> 00:05:30,720
be able to recover from such an event especially ZFS which is known for being

77
00:05:28,500 --> 00:05:35,400
one of the most resilient ones out there after booting back up from a power loss

78
00:05:32,639 --> 00:05:39,720
ZFS pools and most other RAID or RAID-like storage arrays should do

79
00:05:37,259 --> 00:05:43,800
something called a scrub or a resync which in the case of ZFS means that

80
00:05:42,060 --> 00:05:47,460
every block of data gets checked to ensure that there are no errors and if

81
00:05:45,539 --> 00:05:51,240
there are any errors these errors are automatically fixed with the parity data

82
00:05:49,500 --> 00:05:57,360
that is stored in the array on most NAS operating systems like

83
00:05:53,940 --> 00:05:59,400
TrueNAS, Unraid, or any pre-built NAS this

84
00:05:57,360 --> 00:06:03,300
process should just happen automatically and even if nothing goes wrong they

85
00:06:01,380 --> 00:06:10,979
should also run a scheduled scrub every month or so but our servers were set up

86
00:06:05,639 --> 00:06:14,340
by us a long time ago on CentOS and

87
00:06:10,979 --> 00:06:16,680
never updated so neither a scheduled nor

88
00:06:14,340 --> 00:06:20,460
a power on recovery scrub was ever configured

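The missing piece was exactly that kind of scheduled scrub. As a sketch only (the real fix would more likely be a simple cron entry or a systemd timer), a small Python script run monthly from cron could start a scrub on every imported pool:

    import subprocess

    # List every imported pool by name ("-H" drops the header row)
    pools = subprocess.run(["zpool", "list", "-H", "-o", "name"],
                           capture_output=True, text=True, check=True).stdout.split()

    for pool in pools:
        # A scrub walks every block and repairs anything fixable from parity
        subprocess.run(["zpool", "scrub", pool], check=True)
        print(f"scrub started on {pool}")
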
89
00:06:19,199 --> 00:06:26,100
That means the only time data integrity would have been checked on these arrays is when a block of data got

90
00:06:23,699 --> 00:06:32,100
read this function should theoretically protect against bit rot but since we have

91
00:06:28,680 --> 00:06:34,740
thousands of old videos of which a very

92
00:06:32,100 --> 00:06:40,319
very small portion ever actually gets accessed the rest were essentially left

93
00:06:37,259 --> 00:06:43,440
to slowly rot and power loss themselves

94
00:06:40,319 --> 00:06:44,940
into an unrecoverable mess when we found

95
00:06:43,440 --> 00:06:49,380
the drive issues we weren't even aware of all this yet and even though the five

96
00:06:47,100 --> 00:06:53,580
drives weren't technically dead we erred on the side of caution and started a

97
00:06:50,880 --> 00:06:57,360
replace operation on all of them it was while we were rebuilding the array on

98
00:06:55,259 --> 00:07:00,960
Delta 3 with the new disks that we started to uncover the absolute mess of

99
00:06:59,819 --> 00:07:06,479
data errors ZFS has reported around

100
00:07:03,080 --> 00:07:09,600
169 million errors at the time of

101
00:07:06,479 --> 00:07:11,400
recording this and no it's not nice

102
00:07:09,600 --> 00:07:16,500
in fact there are so many errors on Delta 3 that with two faulted drives in

103
00:07:14,160 --> 00:07:21,660
both of the first vdevs there is not enough parity data to fix the errors and

104
00:07:19,380 --> 00:07:26,520
this caused the array to offline itself to protect against further degradation

105
00:07:24,180 --> 00:07:30,300
and unfortunately much further along in the process the same thing happened on

106
00:07:28,680 --> 00:07:36,300
Delta 1. that means that both the original and

107
00:07:33,000 --> 00:07:41,160
new petabyte projects old and new vault

108
00:07:36,300 --> 00:07:43,080
have suffered non-recoverable data loss

109
00:07:41,160 --> 00:07:47,340
so now what do we do in regards to the corrupted and lost

110
00:07:45,120 --> 00:07:51,900
data honestly nothing I mean it's very likely that even with

111
00:07:49,199 --> 00:07:57,660
169 million data errors we still have virtually all of the original bits in

112
00:07:54,840 --> 00:08:02,940
the right places but as far as we know there's no way to just tell ZFS yo dog

113
00:08:00,780 --> 00:08:07,979
ignore those errors you know pretend like they never happened too easy ZFS or

114
00:08:05,520 --> 00:08:13,380
something instead then the plan is to build a new properly configured 1.2

115
00:08:11,280 --> 00:08:17,400
petabyte server featuring Seagate's shiny new 20 terabyte drives which we're

116
00:08:15,720 --> 00:08:22,080
really excited about like these things are almost as shiny as our reflective

117
00:08:19,020 --> 00:08:24,000
hard drive shirt lttstore.com

118
00:08:22,080 --> 00:08:29,699
and once that's complete we intend to move all of the data from the new Vault

119
00:08:26,160 --> 00:08:31,259
cluster onto this new new vault

120
00:08:29,699 --> 00:08:36,419
new new vault then we'll re-set up new Vault ensure all

121
00:08:34,800 --> 00:08:42,300
the drives are good and repeat the process to move old Vault data onto it

122
00:08:39,300 --> 00:08:45,000
then we can reformat old Vault probably

123
00:08:42,300 --> 00:08:49,140
upgrade it a bit and use it for new data maybe we'll rename it to new new Vault

124
00:08:47,040 --> 00:08:52,560
get subscribed so you don't miss any of that we'll hopefully be building that

125
00:08:50,880 --> 00:08:56,760
new server this week now if everything were set up properly

126
00:08:54,480 --> 00:09:01,740
with regularly scheduled and post power loss scrubs this entire problem would

127
00:08:59,519 --> 00:09:06,060
probably have never happened and if we had a backup of that data we would be

128
00:09:04,080 --> 00:09:10,260
able to Simply restore from that but here's the thing

129
00:09:07,560 --> 00:09:14,760
backing up over a petabyte of data is really expensive either we would need to

130
00:09:12,600 --> 00:09:19,080
build a duplicate server array to back up to or we could back up to the cloud but

131
00:09:17,519 --> 00:09:25,260
even using the economical option Backblaze B2 it would cost us somewhere

132
00:09:21,420 --> 00:09:28,080
between 5 and 10 thousand US dollars per month to store that kind of data

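That range is easy to sanity-check. Assuming B2's roughly five dollars per terabyte per month storage pricing at the time (an assumption, and ignoring egress fees), the arithmetic in Python:

    RATE_USD_PER_TB_MONTH = 5        # assumed Backblaze B2 storage rate

    for tb in (1000, 2000):          # roughly 1 PB on the low end, 2 PB on the high end
        print(f"{tb} TB -> about ${tb * RATE_USD_PER_TB_MONTH:,} per month")
    # -> about $5,000 to $10,000 per month, the range quoted above
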
133
00:09:25,260 --> 00:09:31,860
Now if it was mission critical then by all

134
00:09:30,060 --> 00:09:36,540
means it should have been backed up in both of those ways but having all of our

135
00:09:34,560 --> 00:09:41,700
archival footage from day one of the channel has always been a nice to have

136
00:09:39,300 --> 00:09:45,060
and an excuse for us to explore really cool Tech that we otherwise wouldn't

137
00:09:43,080 --> 00:09:49,320
have any reason to play with I mean it takes a little bit more effort and it

138
00:09:46,500 --> 00:09:52,920
yields lower quality results but we have a backup of all of our old videos it's

139
00:09:51,360 --> 00:09:58,200
called downloading them off of YouTube or Floatplane if we wanted a higher

140
00:09:55,260 --> 00:10:02,940
quality copy so the good news is that our production Whonnock server is running

141
00:10:00,300 --> 00:10:05,880
great with proper backups configured and this isn't going to have any kind of

142
00:10:04,140 --> 00:10:09,660
lasting effect on our business but I am still hopeful that if all goes

143
00:10:08,040 --> 00:10:13,500
well with the recovery efforts we'll be able to get back the majority of the

144
00:10:11,399 --> 00:10:18,480
data mostly error free but only time will tell a lot of time

145
00:10:16,080 --> 00:10:22,560
because transferring all those petabytes of data off of hard drives to other hard

146
00:10:20,519 --> 00:10:27,800
drives is going to take weeks or even months so let this be a lesson follow

147
00:10:25,200 --> 00:10:31,920
Proper Storage practices have a backup and probably hire someone to take care

148
00:10:30,600 --> 00:10:37,260
of your data if you don't have the time especially if you measure it in anything

149
00:10:33,600 --> 00:10:39,180
other than tens of terabytes or you

150
00:10:37,260 --> 00:10:43,200
might lose all of it but you won't lose our sponsor Lambda are you training deep

151
00:10:41,940 --> 00:10:48,240
learning models for the next big breakthrough in artificial intelligence then you should know about Lambda the

152
00:10:46,440 --> 00:10:51,959
Deep learning company founded by Machine learning Engineers Lambda builds GPU

153
00:10:50,160 --> 00:10:55,320
workstations servers and Cloud infrastructure for creating deep

154
00:10:53,459 --> 00:10:59,579
learning models they've helped all five of the big tech companies and 47 of the

155
00:10:57,540 --> 00:11:02,820
top 50 research universities accelerate their machine learning workflows

156
00:11:00,779 --> 00:11:07,140
lambda's easy to use configurators let you spec out exactly the hardware you

157
00:11:04,740 --> 00:11:10,980
need from GPU laptops and workstations all the way up to custom server clusters

158
00:11:09,120 --> 00:11:14,399
and all Lambda machines come pre-installed with Lambda stack keeping

159
00:11:13,079 --> 00:11:18,660
your Linux machine learning environment up to date and out of dependency hell

160
00:11:16,560 --> 00:11:22,980
and with Lambda Cloud you can spin up a virtual machine in minutes train models

161
00:11:20,519 --> 00:11:26,399
with four NVIDIA A6000s at just a fraction of the cost of the big

162
00:11:24,360 --> 00:11:30,959
cloud providers so go to lambdalabs.com/Linus to configure your own workstation

163
00:11:28,560 --> 00:11:35,459
or try out Lambda Cloud today if you like this video maybe check out the time

164
00:11:32,459 --> 00:11:39,839
I almost lost all of our active projects

165
00:11:35,459 --> 00:11:42,000
when the OG 1X server failed that was a

166
00:11:39,839 --> 00:11:45,240
far more stressful situation I'm actually like

167
00:11:43,440 --> 00:11:48,120
I'm actually pretty relaxed right now for someone with this much data on the

168
00:11:47,160 --> 00:11:51,839
line yeah I'm I'm doing okay thanks for

169
00:11:50,579 --> 00:11:56,240
asking I mean I'd prefer to get it back you

170
00:11:54,180 --> 00:11:56,240
know
