{"video_id":"Npu7jkJk5nM","title":"Our data is GONE... Again - Petabyte Project Recovery Part 1","channel":"Linus Tech Tips","show":"Linus Tech Tips","published_at":"2022-05-05T14:53:45Z","duration_s":715,"segments":[{"start_s":0.0,"end_s":7.5,"text":"when I say we store and handle a lot of data for a YouTube channel I mean it I","speaker":null,"is_sponsor":0},{"start_s":5.279,"end_s":12.36,"text":"mean we've built some sick 100 plus terabyte servers for some of our fellow","speaker":null,"is_sponsor":0},{"start_s":9.3,"end_s":15.599,"text":"YouTubers but those are nothing compared","speaker":null,"is_sponsor":0},{"start_s":12.36,"end_s":17.34,"text":"to the two plus petabytes of archival","speaker":null,"is_sponsor":0},{"start_s":15.599,"end_s":21.72,"text":"storage that we currently have in production in our server room that is","speaker":null,"is_sponsor":0},{"start_s":19.199,"end_s":28.859,"text":"storing all the footage for every video we have ever made at full quality","speaker":null,"is_sponsor":0},{"start_s":25.439,"end_s":31.8,"text":"for the uninitiated that is over 11","speaker":null,"is_sponsor":0},{"start_s":28.859,"end_s":36.0,"text":"000 Warzone installs worth of data but with great power comes great","speaker":null,"is_sponsor":0},{"start_s":33.54,"end_s":41.7,"text":"responsibility and we weren't responsible","speaker":null,"is_sponsor":0},{"start_s":38.28,"end_s":44.1,"text":"despite our super dope Hardware we made","speaker":null,"is_sponsor":0},{"start_s":41.7,"end_s":48.719,"text":"a little oopsie that resulted in us permanently losing data that we don't","speaker":null,"is_sponsor":0},{"start_s":46.14,"end_s":53.28,"text":"have any backup for we still don't know how much but what we do know is what","speaker":null,"is_sponsor":0},{"start_s":51.18,"end_s":58.559,"text":"went wrong and we've got a plan to recover what we can but it is going to","speaker":null,"is_sponsor":0},{"start_s":55.68,"end_s":63.42,"text":"take some work and some money 
thanks to our sponsor hetzner hetzner offers high","speaker":null,"is_sponsor":0},{"start_s":61.26,"end_s":67.26,"text":"performance Cloud servers for an amazing price with their new US location in","speaker":null,"is_sponsor":1},{"start_s":65.58,"end_s":70.5,"text":"Ashburn Virginia you can deploy Cloud servers in four different locations and","speaker":null,"is_sponsor":1},{"start_s":69.299,"end_s":75.42,"text":"benefit from features like load balancers block storage and more use","speaker":null,"is_sponsor":1},{"start_s":72.659,"end_s":78.86,"text":"code ltt22 at the link below for twenty dollars off","speaker":null,"is_sponsor":1},{"start_s":76.68,"end_s":78.86,"text":"foreign","speaker":null,"is_sponsor":0},{"start_s":85.04,"end_s":91.68,"text":"let's start with a bit of background on our servers our archival storage is","speaker":null,"is_sponsor":0},{"start_s":89.22,"end_s":97.619,"text":"composed of two discrete GlusterFS clusters both of them spread across two","speaker":null,"is_sponsor":0},{"start_s":94.5,"end_s":100.619,"text":"45 drives storinator servers each with","speaker":null,"is_sponsor":0},{"start_s":97.619,"end_s":103.259,"text":"60 hard drives the original petabyte","speaker":null,"is_sponsor":0},{"start_s":100.619,"end_s":109.2,"text":"project is made up of the Delta 1 and Delta 2 servers and goes by the moniker","speaker":null,"is_sponsor":0},{"start_s":105.659,"end_s":112.32,"text":"old Vault petabyte project 2 or the new","speaker":null,"is_sponsor":0},{"start_s":109.2,"end_s":113.88,"text":"vault is Delta 3 and Delta 4 now","speaker":null,"is_sponsor":0},{"start_s":112.32,"end_s":118.56,"text":"because of the nature of our content most of our employees are pretty Tech","speaker":null,"is_sponsor":0},{"start_s":116.46,"end_s":123.36,"text":"literate with many of them even falling into the tech wizard category so we've","speaker":null,"is_sponsor":0},{"start_s":120.899,"end_s":127.56,"text":"always had substantially lower need for tech 
support than the average company","speaker":null,"is_sponsor":0},{"start_s":125.04,"end_s":132.959,"text":"and as a result we have never hired a full-time it person despite the handful","speaker":null,"is_sponsor":0},{"start_s":130.259,"end_s":137.94,"text":"of times perhaps including this one that it probably would have been helpful","speaker":null,"is_sponsor":0},{"start_s":134.879,"end_s":140.099,"text":"so in the early days I managed the","speaker":null,"is_sponsor":0},{"start_s":137.94,"end_s":144.02,"text":"infrastructure and since then I've had some help from both outside sources","speaker":null,"is_sponsor":0},{"start_s":144.36,"end_s":148.22,"text":"and other members of the writing team","speaker":null,"is_sponsor":0},{"start_s":149.34,"end_s":154.92,"text":"we all have different strengths but what","speaker":null,"is_sponsor":0},{"start_s":152.34,"end_s":160.14,"text":"we all have in common is that we have other jobs to do meaning that it's never","speaker":null,"is_sponsor":0},{"start_s":157.14,"end_s":161.4,"text":"really been clear who exactly is","speaker":null,"is_sponsor":0},{"start_s":160.14,"end_s":166.68,"text":"supposed to be accountable when something slips through the cracks","speaker":null,"is_sponsor":0},{"start_s":163.68,"end_s":169.26,"text":"and unfortunately while obvious issues","speaker":null,"is_sponsor":0},{"start_s":166.68,"end_s":173.58,"text":"like a replacement power cable and a handful of failed drives over the years","speaker":null,"is_sponsor":0},{"start_s":170.879,"end_s":177.66,"text":"were handled by Anthony we never really tasked anyone with performing","speaker":null,"is_sponsor":0},{"start_s":175.2,"end_s":181.56,"text":"preventative maintenance on our precious petabyte servers a quick point of","speaker":null,"is_sponsor":0},{"start_s":180.0,"end_s":185.7,"text":"clarification before we get into the rest of this nothing that happened is","speaker":null,"is_sponsor":0},{"start_s":183.599,"end_s":190.98,"text":"the result 
of anything other than us messing up the hardware both from 45","speaker":null,"is_sponsor":0},{"start_s":188.519,"end_s":195.48,"text":"drives and from Seagate who provided the bulk of what makes up our petabyte","speaker":null,"is_sponsor":1},{"start_s":192.9,"end_s":199.5,"text":"project servers has performed beyond our expectations and we would recommend","speaker":null,"is_sponsor":1},{"start_s":197.34,"end_s":203.159,"text":"checking out both of them if you or your business has serious data storage needs","speaker":null,"is_sponsor":1},{"start_s":201.42,"end_s":208.26,"text":"we're going to have links to them down below but even the Best Hardware in the","speaker":null,"is_sponsor":1},{"start_s":205.92,"end_s":212.879,"text":"world can be let down by misconfigured software and Jake who tasked himself","speaker":null,"is_sponsor":0},{"start_s":211.08,"end_s":217.2,"text":"with auditing our current infrastructure found just such a thing","speaker":null,"is_sponsor":0},{"start_s":215.459,"end_s":221.099,"text":"everything was actually going pretty well he was setting up monitoring and","speaker":null,"is_sponsor":0},{"start_s":219.18,"end_s":224.64,"text":"alerts verifying that every machine would gracefully shut down when the","speaker":null,"is_sponsor":0},{"start_s":222.959,"end_s":228.36,"text":"power goes out which happens a lot here for some reason but he eventually worked","speaker":null,"is_sponsor":0},{"start_s":226.86,"end_s":233.879,"text":"his way around to the petabyte project servers and checked the status of the","speaker":null,"is_sponsor":0},{"start_s":230.28,"end_s":236.819,"text":"ZFS pools or zpools on each of them and","speaker":null,"is_sponsor":0},{"start_s":233.879,"end_s":242.58,"text":"this is where the Kaka hit the fan right off the bat Delta 1 had two of its 60","speaker":null,"is_sponsor":0},{"start_s":239.819,"end_s":247.5,"text":"drives faulted in the same vdev and you can think of a vdev kind of 
like","speaker":null,"is_sponsor":0},{"start_s":244.44,"end_s":250.799,"text":"its own mini raid array within a larger","speaker":null,"is_sponsor":0},{"start_s":247.5,"end_s":252.48,"text":"pool of multiple raid arrays so in our","speaker":null,"is_sponsor":0},{"start_s":250.799,"end_s":258.72,"text":"configuration where we're running raid Z2 if another disc out of our 15 drive","speaker":null,"is_sponsor":0},{"start_s":256.079,"end_s":263.94,"text":"vdev was to have any kind of problem we would incur irrecoverable data loss upon","speaker":null,"is_sponsor":0},{"start_s":261.959,"end_s":268.199,"text":"further inspection both of the drives were completely dead which does happen","speaker":null,"is_sponsor":0},{"start_s":265.86,"end_s":273.36,"text":"with mechanical devices and had dropped from the system so we replaced them and","speaker":null,"is_sponsor":0},{"start_s":270.96,"end_s":277.259,"text":"let the array start rebuilding that's pretty scary but not in and of itself a","speaker":null,"is_sponsor":0},{"start_s":276.18,"end_s":281.94,"text":"lost cause more on that later though far scarier","speaker":null,"is_sponsor":0},{"start_s":279.6,"end_s":287.699,"text":"was when Delta 3 which is part of the new Vault cluster had five drives in a","speaker":null,"is_sponsor":0},{"start_s":285.24,"end_s":292.44,"text":"faulted state with two of the vdevs having two drives down that's very","speaker":null,"is_sponsor":0},{"start_s":291.18,"end_s":297.12,"text":"dangerous interestingly these drives weren't","speaker":null,"is_sponsor":0},{"start_s":294.72,"end_s":302.58,"text":"actually dead instead they had just faulted due to having too many errors so","speaker":null,"is_sponsor":0},{"start_s":300.84,"end_s":306.78,"text":"read and write errors like this are usually caused by a faulty cable or","speaker":null,"is_sponsor":0},{"start_s":304.56,"end_s":310.8,"text":"connection but they can also be the sign of a Dying Drive in our case 
these","speaker":null,"is_sponsor":0},{"start_s":309.12,"end_s":315.18,"text":"errors probably cropped up due to a sudden power loss or due to naturally","speaker":null,"is_sponsor":0},{"start_s":312.96,"end_s":318.78,"text":"occurring bit rot as they were never configured to shut down nicely while on","speaker":null,"is_sponsor":0},{"start_s":317.16,"end_s":322.02,"text":"backup power in the case of an outage and we've had quite a few of those over","speaker":null,"is_sponsor":0},{"start_s":320.28,"end_s":326.28,"text":"the years now storage systems are usually designed to","speaker":null,"is_sponsor":0},{"start_s":324.419,"end_s":330.72,"text":"be able to recover from such an event especially ZFS which is known for being","speaker":null,"is_sponsor":0},{"start_s":328.5,"end_s":335.4,"text":"one of the most resilient ones out there after booting back up from a power loss","speaker":null,"is_sponsor":0},{"start_s":332.639,"end_s":339.72,"text":"ZFS pools and most other raid or raid-like storage arrays should do","speaker":null,"is_sponsor":0},{"start_s":337.259,"end_s":343.8,"text":"something called a scrub or a resync which in the case of ZFS means that","speaker":null,"is_sponsor":0},{"start_s":342.06,"end_s":347.46,"text":"every block of data gets checked to ensure that there are no errors and if","speaker":null,"is_sponsor":0},{"start_s":345.539,"end_s":351.24,"text":"there are any errors these errors are automatically fixed with the parity data","speaker":null,"is_sponsor":0},{"start_s":349.5,"end_s":357.36,"text":"that is stored in the array on most NAS operating systems like True","speaker":null,"is_sponsor":0},{"start_s":353.94,"end_s":359.4,"text":"NAS Unraid or any pre-built NAS this","speaker":null,"is_sponsor":0},{"start_s":357.36,"end_s":363.3,"text":"process should just happen automatically and even if nothing goes wrong they","speaker":null,"is_sponsor":0},{"start_s":361.38,"end_s":370.979,"text":"should also run a scheduled scrub every month or so 
but our servers were set up","speaker":null,"is_sponsor":0},{"start_s":365.639,"end_s":374.34,"text":"by us a long time ago on CentOS and","speaker":null,"is_sponsor":0},{"start_s":370.979,"end_s":376.68,"text":"never updated so neither a scheduled nor","speaker":null,"is_sponsor":0},{"start_s":374.34,"end_s":380.46,"text":"a power on recovery scrub was ever configured meaning the only time data","speaker":null,"is_sponsor":0},{"start_s":379.199,"end_s":386.1,"text":"Integrity would have been checked on these arrays is when a block of data got","speaker":null,"is_sponsor":0},{"start_s":383.699,"end_s":392.1,"text":"read this function should theoretically protect against bit rot but since we have","speaker":null,"is_sponsor":0},{"start_s":388.68,"end_s":394.74,"text":"thousands of old videos of which a very","speaker":null,"is_sponsor":0},{"start_s":392.1,"end_s":400.319,"text":"very small portion ever actually gets accessed the rest were essentially left","speaker":null,"is_sponsor":0},{"start_s":397.259,"end_s":403.44,"text":"to slowly rot and power loss themselves","speaker":null,"is_sponsor":0},{"start_s":400.319,"end_s":404.94,"text":"into an unrecoverable mess when we found","speaker":null,"is_sponsor":0},{"start_s":403.44,"end_s":409.38,"text":"the drive issues we weren't even aware of all this yet and even though the five","speaker":null,"is_sponsor":0},{"start_s":407.1,"end_s":413.58,"text":"drives weren't technically dead we erred on the side of caution and started a","speaker":null,"is_sponsor":0},{"start_s":410.88,"end_s":417.36,"text":"replace operation on all of them it was while we were rebuilding the array on","speaker":null,"is_sponsor":0},{"start_s":415.259,"end_s":420.96,"text":"Delta 3 with the new disks that we started to uncover the absolute mess of","speaker":null,"is_sponsor":0},{"start_s":419.819,"end_s":426.479,"text":"data errors ZFS has reported around","speaker":null,"is_sponsor":0},{"start_s":423.08,"end_s":429.6,"text":"169 million 
errors at the time of","speaker":null,"is_sponsor":0},{"start_s":426.479,"end_s":431.4,"text":"recording this and no it's not nice","speaker":null,"is_sponsor":0},{"start_s":429.6,"end_s":436.5,"text":"in fact there are so many errors on Delta 3 that with two faulted drives in","speaker":null,"is_sponsor":0},{"start_s":434.16,"end_s":441.66,"text":"both of the first vdevs there is not enough parity data to fix the errors and","speaker":null,"is_sponsor":0},{"start_s":439.38,"end_s":446.52,"text":"this caused the array to offline itself to protect against further degradation","speaker":null,"is_sponsor":0},{"start_s":444.18,"end_s":450.3,"text":"and unfortunately much further along in the process the same thing happened on","speaker":null,"is_sponsor":0},{"start_s":448.68,"end_s":456.3,"text":"Delta 1. that means that both the original and","speaker":null,"is_sponsor":0},{"start_s":453.0,"end_s":461.16,"text":"new petabyte projects old and new vault","speaker":null,"is_sponsor":0},{"start_s":456.3,"end_s":463.08,"text":"have suffered non-recoverable data loss","speaker":null,"is_sponsor":0},{"start_s":461.16,"end_s":467.34,"text":"so now what do we do in regards to the corrupted and lost","speaker":null,"is_sponsor":0},{"start_s":465.12,"end_s":471.9,"text":"data honestly nothing I mean it's very likely that even with","speaker":null,"is_sponsor":0},{"start_s":469.199,"end_s":477.66,"text":"169 million data errors we still have virtually all of the original bits in","speaker":null,"is_sponsor":0},{"start_s":474.84,"end_s":482.94,"text":"the right places but as far as we know there's no way to just tell ZFS yo dog","speaker":null,"is_sponsor":0},{"start_s":480.78,"end_s":487.979,"text":"ignore those errors you know pretend like they never happened to easy ZFS or","speaker":null,"is_sponsor":0},{"start_s":485.52,"end_s":493.38,"text":"something instead then the plan is to build a new properly configured 
1.2","speaker":null,"is_sponsor":0},{"start_s":491.28,"end_s":497.4,"text":"petabyte server featuring Seagate's shiny new 20 terabyte drives which we're","speaker":null,"is_sponsor":0},{"start_s":495.72,"end_s":502.08,"text":"really excited about like these things are almost as shiny as our reflective","speaker":null,"is_sponsor":1},{"start_s":499.02,"end_s":504.0,"text":"hard drive shirt lttstore.com","speaker":null,"is_sponsor":1},{"start_s":502.08,"end_s":509.699,"text":"and once that's complete we intend to move all of the data from the new Vault","speaker":null,"is_sponsor":1},{"start_s":506.16,"end_s":511.259,"text":"cluster onto this new new vault","speaker":null,"is_sponsor":0},{"start_s":509.699,"end_s":516.419,"text":"new new vault then we'll reset up new Vault ensure all","speaker":null,"is_sponsor":0},{"start_s":514.8,"end_s":522.3,"text":"the drives are good and repeat the process to move old Vault data onto it","speaker":null,"is_sponsor":0},{"start_s":519.3,"end_s":525.0,"text":"then we can reformat old Vault probably","speaker":null,"is_sponsor":0},{"start_s":522.3,"end_s":529.14,"text":"upgrade it a bit and use it for new data maybe we'll rename it to new new Vault","speaker":null,"is_sponsor":0},{"start_s":527.04,"end_s":532.56,"text":"get subscribed so you don't miss any of that we'll hopefully be building that","speaker":null,"is_sponsor":1},{"start_s":530.88,"end_s":536.76,"text":"new server this week now if everything were set up properly","speaker":null,"is_sponsor":1},{"start_s":534.48,"end_s":541.74,"text":"with regularly scheduled and post power loss scrubs this entire problem would","speaker":null,"is_sponsor":0},{"start_s":539.519,"end_s":546.06,"text":"probably have never happened and if we had a backup of that data we would be","speaker":null,"is_sponsor":0},{"start_s":544.08,"end_s":550.26,"text":"able to Simply restore from that but here's the thing","speaker":null,"is_sponsor":0},{"start_s":547.56,"end_s":554.76,"text":"backing up 
over a petabyte of data is really expensive either we would need to","speaker":null,"is_sponsor":0},{"start_s":552.6,"end_s":559.08,"text":"build a duplicate server array to backup to or we could back up to the cloud but","speaker":null,"is_sponsor":0},{"start_s":557.519,"end_s":565.26,"text":"even using the economical option Backblaze B2 it would cost us somewhere","speaker":null,"is_sponsor":0},{"start_s":561.42,"end_s":568.08,"text":"between 5 and 10 000 US dollars per","speaker":null,"is_sponsor":0},{"start_s":565.26,"end_s":571.86,"text":"month to store that kind of data now if it was mission critical then by all","speaker":null,"is_sponsor":0},{"start_s":570.06,"end_s":576.54,"text":"means it should have been backed up in both of those ways but having all of our","speaker":null,"is_sponsor":0},{"start_s":574.56,"end_s":581.7,"text":"archival footage from day one of the channel has always been a nice to have","speaker":null,"is_sponsor":0},{"start_s":579.3,"end_s":585.06,"text":"and an excuse for us to explore really cool Tech that we otherwise wouldn't","speaker":null,"is_sponsor":0},{"start_s":583.08,"end_s":589.32,"text":"have any reason to play with I mean it takes a little bit more effort and it","speaker":null,"is_sponsor":0},{"start_s":586.5,"end_s":592.92,"text":"yields lower quality results but we have a backup of all of our old videos it's","speaker":null,"is_sponsor":0},{"start_s":591.36,"end_s":598.2,"text":"called downloading them off of YouTube or Floatplane if we wanted a higher","speaker":null,"is_sponsor":0},{"start_s":595.26,"end_s":602.94,"text":"quality copy so the good news is that our production Whonnock server is running","speaker":null,"is_sponsor":0},{"start_s":600.3,"end_s":605.88,"text":"great with proper backups configured and this isn't going to have any kind of","speaker":null,"is_sponsor":0},{"start_s":604.14,"end_s":609.66,"text":"lasting effect on our business but I am still hopeful that if all 
goes","speaker":null,"is_sponsor":0},{"start_s":608.04,"end_s":613.5,"text":"well with the recovery efforts we'll be able to get back the majority of the","speaker":null,"is_sponsor":0},{"start_s":611.399,"end_s":618.48,"text":"data mostly error free but only time will tell a lot of time","speaker":null,"is_sponsor":0},{"start_s":616.08,"end_s":622.56,"text":"because transferring all those petabytes of data off of hard drives to other hard","speaker":null,"is_sponsor":0},{"start_s":620.519,"end_s":627.8,"text":"drives is going to take weeks or even months so let this be a lesson follow","speaker":null,"is_sponsor":0},{"start_s":625.2,"end_s":631.92,"text":"Proper Storage practices have a backup and probably hire someone to take care","speaker":null,"is_sponsor":0},{"start_s":630.6,"end_s":637.26,"text":"of your data if you don't have the time especially if you measure it in anything","speaker":null,"is_sponsor":0},{"start_s":633.6,"end_s":639.18,"text":"other than tens of terabytes or you","speaker":null,"is_sponsor":0},{"start_s":637.26,"end_s":643.2,"text":"might lose all of it but you won't lose our sponsor Lambda are you training deep","speaker":null,"is_sponsor":0},{"start_s":641.94,"end_s":648.24,"text":"learning models for the next big breakthrough in artificial intelligence then you should know about Lambda the","speaker":null,"is_sponsor":1},{"start_s":646.44,"end_s":651.959,"text":"Deep learning company founded by Machine learning Engineers Lambda builds GPU","speaker":null,"is_sponsor":1},{"start_s":650.16,"end_s":655.32,"text":"workstations servers and Cloud infrastructure for creating deep","speaker":null,"is_sponsor":1},{"start_s":653.459,"end_s":659.579,"text":"learning models they've helped all five of the big tech companies and 47 of the","speaker":null,"is_sponsor":1},{"start_s":657.54,"end_s":662.82,"text":"top 50 research universities accelerate their machine learning 
workflows","speaker":null,"is_sponsor":1},{"start_s":660.779,"end_s":667.14,"text":"lambda's easy to use configurators let you spec out exactly the hardware you","speaker":null,"is_sponsor":1},{"start_s":664.74,"end_s":670.98,"text":"need from GPU laptops and workstations all the way up to custom server clusters","speaker":null,"is_sponsor":1},{"start_s":669.12,"end_s":674.399,"text":"and all Lambda machines come pre-installed with Lambda stack keeping","speaker":null,"is_sponsor":1},{"start_s":673.079,"end_s":678.66,"text":"your Linux machine learning environment up to date and out of dependency hell","speaker":null,"is_sponsor":1},{"start_s":676.56,"end_s":682.98,"text":"and with Lambda Cloud you can spin up a virtual machine in minutes train models","speaker":null,"is_sponsor":1},{"start_s":680.519,"end_s":686.399,"text":"with four NVIDIA a6000s at just a fraction of the cost of the big","speaker":null,"is_sponsor":1},{"start_s":684.36,"end_s":690.959,"text":"cloud providers so go to lambdalabs.com Linus to configure your own workstation","speaker":null,"is_sponsor":1},{"start_s":688.56,"end_s":695.459,"text":"or try out Lambda Cloud today if you like this video maybe check out the time","speaker":null,"is_sponsor":1},{"start_s":692.459,"end_s":699.839,"text":"I almost lost all of our active projects","speaker":null,"is_sponsor":0},{"start_s":695.459,"end_s":702.0,"text":"when the OG Whonnock server failed that was a","speaker":null,"is_sponsor":0},{"start_s":699.839,"end_s":705.24,"text":"far more stressful situation I'm actually like","speaker":null,"is_sponsor":0},{"start_s":703.44,"end_s":708.12,"text":"I'm actually pretty relaxed right now for someone with this much data on the","speaker":null,"is_sponsor":0},{"start_s":707.16,"end_s":711.839,"text":"line yeah I'm I'm doing okay thanks for","speaker":null,"is_sponsor":0},{"start_s":710.579,"end_s":716.24,"text":"asking I mean I'd prefer to get it back 
you","speaker":null,"is_sponsor":0},{"start_s":714.18,"end_s":716.24,"text":"know","speaker":null,"is_sponsor":0}],"full_text":"when I say we store and handle a lot of data for a YouTube channel I mean it I mean we've built some sick 100 plus terabyte servers for some of our fellow YouTubers but those are nothing compared to the two plus petabytes of archival storage that we currently have in production in our server room that is storing all the footage for every video we have ever made at full quality for the uninitiated that is over 11 000 Warzone installs worth of data but with great power comes great responsibility and we weren't responsible despite our super dope Hardware we made a little oopsie that resulted in us permanently losing data that we don't have any backup for we still don't know how much but what we do know is what went wrong and we've got a plan to recover what we can but it is going to take some work and some money thanks to our sponsor hetzner hetzner offers high performance Cloud servers for an amazing price with their new US location in Ashburn Virginia you can deploy Cloud servers in four different locations and benefit from features like load balancers block storage and more use code ltt22 at the link below for twenty dollars off foreign let's start with a bit of background on our servers our archival storage is composed of two discrete GlusterFS clusters both of them spread across two 45 drives storinator servers each with 60 hard drives the original petabyte project is made up of the Delta 1 and Delta 2 servers and goes by the moniker old Vault petabyte project 2 or the new vault is Delta 3 and Delta 4 now because of the nature of our content most of our employees are pretty Tech literate with many of them even falling into the tech wizard category so we've always had substantially lower need for tech support than the average company and as a result we have never hired a full-time it person despite the handful of times perhaps 
including this one that it probably would have been helpful so in the early days I managed the infrastructure and since then I've had some help from both outside sources and other members of the writing team we all have different strengths but what we all have in common is that we have other jobs to do meaning that it's never really been clear who exactly is supposed to be accountable when something slips through the cracks and unfortunately while obvious issues like a replacement power cable and a handful of failed drives over the years were handled by Anthony we never really tasked anyone with performing preventative maintenance on our precious petabyte servers a quick point of clarification before we get into the rest of this nothing that happened is the result of anything other than us messing up the hardware both from 45 drives and from Seagate who provided the bulk of what makes up our petabyte project servers has performed beyond our expectations and we would recommend checking out both of them if you or your business has serious data storage needs we're going to have links to them down below but even the Best Hardware in the world can be let down by misconfigured software and Jake who tasked himself with auditing our current infrastructure found just such a thing everything was actually going pretty well he was setting up monitoring and alerts verifying that every machine would gracefully shut down when the power goes out which happens a lot here for some reason but he eventually worked his way around to the petabyte project servers and checked the status of the ZFS pools or zpools on each of them and this is where the Kaka hit the fan right off the bat Delta 1 had two of its 60 drives faulted in the same vdev and you can think of a vdev kind of like its own mini raid array within a larger pool of multiple raid arrays so in our configuration where we're running raid Z2 if another disc out of our 15 drive vdev was to have any kind of problem we would 
incur irrecoverable data loss upon further inspection both of the drives were completely dead which does happen with mechanical devices and had dropped from the system so we replaced them and let the array start rebuilding that's pretty scary but not in and of itself a lost cause more on that later though far scarier was when Delta 3 which is part of the new Vault cluster had five drives in a faulted state with two of the vdevs having two drives down that's very dangerous interestingly these drives weren't actually dead instead they had just faulted due to having too many errors so read and write errors like this are usually caused by a faulty cable or connection but they can also be the sign of a Dying Drive in our case these errors probably cropped up due to a sudden power loss or due to naturally occurring bit rot as they were never configured to shut down nicely while on backup power in the case of an outage and we've had quite a few of those over the years now storage systems are usually designed to be able to recover from such an event especially ZFS which is known for being one of the most resilient ones out there after booting back up from a power loss ZFS pools and most other raid or raid-like storage arrays should do something called a scrub or a resync which in the case of ZFS means that every block of data gets checked to ensure that there are no errors and if there are any errors these errors are automatically fixed with the parity data that is stored in the array on most NAS operating systems like TrueNAS Unraid or any pre-built NAS this process should just happen automatically and even if nothing goes wrong they should also run a scheduled scrub every month or so but our servers were set up by us a long time ago on CentOS and never updated so neither a scheduled nor a power on recovery scrub was ever configured meaning the only time data Integrity would have been checked on these arrays is when a block of data got read this function should 
theoretically protect against bit rot but since we have thousands of old videos of which a very very small portion ever actually gets accessed the rest were essentially left to slowly rot and power loss themselves into an unrecoverable mess when we found the drive issues we weren't even aware of all this yet and even though the five drives weren't technically dead we erred on the side of caution and started a replace operation on all of them it was while we were rebuilding the array on Delta 3 with the new disks that we started to uncover the absolute mess of data errors ZFS has reported around 169 million errors at the time of recording this and no it's not nice in fact there are so many errors on Delta 3 that with two faulted drives in both of the first vdevs there is not enough parity data to fix the errors and this caused the array to offline itself to protect against further degradation and unfortunately much further along in the process the same thing happened on Delta 1. that means that both the original and new petabyte projects old and new vault have suffered non-recoverable data loss so now what do we do in regards to the corrupted and lost data honestly nothing I mean it's very likely that even with 169 million data errors we still have virtually all of the original bits in the right places but as far as we know there's no way to just tell ZFS yo dog ignore those errors you know pretend like they never happened to easy ZFS or something instead then the plan is to build a new properly configured 1.2 petabyte server featuring Seagate's shiny new 20 terabyte drives which we're really excited about like these things are almost as shiny as our reflective hard drive shirt lttstore.com and once that's complete we intend to move all of the data from the new Vault cluster onto this new new vault new new vault then we'll reset up new Vault ensure all the drives are good and repeat the process to move old Vault data onto it then we can reformat old Vault probably 
upgrade it a bit and use it for new data maybe we'll rename it to new new Vault get subscribed so you don't miss any of that we'll hopefully be building that new server this week now if everything were set up properly with regularly scheduled and post power loss scrubs this entire problem would probably have never happened and if we had a backup of that data we would be able to Simply restore from that but here's the thing backing up over a petabyte of data is really expensive either we would need to build a duplicate server array to backup to or we could back up to the cloud but even using the economical option Backblaze B2 it would cost us somewhere between 5 and 10 000 US dollars per month to store that kind of data now if it was mission critical then by all means it should have been backed up in both of those ways but having all of our archival footage from day one of the channel has always been a nice to have and an excuse for us to explore really cool Tech that we otherwise wouldn't have any reason to play with I mean it takes a little bit more effort and it yields lower quality results but we have a backup of all of our old videos it's called downloading them off of YouTube or Floatplane if we wanted a higher quality copy so the good news is that our production Whonnock server is running great with proper backups configured and this isn't going to have any kind of lasting effect on our business but I am still hopeful that if all goes well with the recovery efforts we'll be able to get back the majority of the data mostly error free but only time will tell a lot of time because transferring all those petabytes of data off of hard drives to other hard drives is going to take weeks or even months so let this be a lesson follow Proper Storage practices have a backup and probably hire someone to take care of your data if you don't have the time especially if you measure it in anything other than tens of terabytes or you might lose all of it but you won't lose our 
sponsor Lambda are you training deep learning models for the next big breakthrough in artificial intelligence then you should know about Lambda the Deep learning company founded by Machine learning Engineers Lambda builds GPU workstations servers and Cloud infrastructure for creating deep learning models they've helped all five of the big tech companies and 47 of the top 50 research universities accelerate their machine learning workflows lambda's easy to use configurators let you spec out exactly the hardware you need from GPU laptops and workstations all the way up to custom server clusters and all Lambda machines come pre-installed with Lambda stack keeping your Linux machine learning environment up to date and out of dependency hell and with Lambda Cloud you can spin up a virtual machine in minutes train models with four NVIDIA a6000s at just a fraction of the cost of the big cloud providers so go to lambdalabs.com Linus to configure your own workstation or try out Lambda Cloud today if you like this video maybe check out the time I almost lost all of our active projects when the OG Whonnock server failed that was a far more stressful situation I'm actually like I'm actually pretty relaxed right now for someone with this much data on the line yeah I'm I'm doing okay thanks for asking I mean I'd prefer to get it back you know"}