WEBVTT

00:00:00.240 --> 00:00:06.400
this is an SSD

00:00:03.520 --> 00:00:11.040
no no seriously this whole thing is a complete SSD and it's designed for

00:00:09.200 --> 00:00:16.720
massive data centers like what you might find at the likes of meta so

00:00:13.920 --> 00:00:20.320
how come i've never heard of it well pure storage wonders that too and

00:00:18.480 --> 00:00:25.439
they've sponsored this video for us to take a tour of their facilities here in

00:00:23.119 --> 00:00:30.400
sunny mountain view california and dig deeper into just how these things work

00:00:28.160 --> 00:00:33.600
and maybe while they're at it they can explain a little bit about how

00:00:32.399 --> 00:00:40.960
they can make the slowest flash used in ssds today run

00:00:37.360 --> 00:00:44.440
twice as fast as last year's good stuff

00:00:40.960 --> 00:00:44.440
let's take a look

00:00:51.360 --> 00:00:57.440
from the beginning pure storage's secret sauce has been in their approach to

00:00:55.440 --> 00:01:01.359
software while their solution is proprietary with it they're able to

00:00:59.280 --> 00:01:06.240
tightly control each individual nand flash chip in an array potentially the

00:01:03.440 --> 00:01:09.920
size of a whole rack but

00:01:07.360 --> 00:01:14.799
there's no way that i know of to drill down to that level of detail with any

00:01:12.159 --> 00:01:19.840
normal SSD like even enterprise grade sas and u.2 drives have controllers that

00:01:17.360 --> 00:01:26.000
make the drive show up as disks like you don't get direct access to the chips so

00:01:22.880 --> 00:01:30.000
what's up with that this

00:01:26.000 --> 00:01:32.479
is a direct flash module or dfm for

00:01:30.000 --> 00:01:36.560
short it's a strange looking module it's a lot larger than any SSD form factor

00:01:34.880 --> 00:01:39.840
i've seen before and there's a lot of nand flash on there

00:01:38.240 --> 00:01:44.720
up to 48 terabytes in the current generation along with super capacitors

00:01:42.400 --> 00:01:48.799
to make sure writes complete properly in the case of a sudden power loss

00:01:46.640 --> 00:01:53.600
pretty standard for data center SSD yes but while you'd expect loads of dram

00:01:51.200 --> 00:01:58.799
cache to make it fast there's no dram on here that is a

00:01:56.560 --> 00:02:03.840
terrible thing for performance under normal circumstances just because of the

00:02:00.880 --> 00:02:08.399
way ssds work whenever an individual die is busy performing some action the data

00:02:06.320 --> 00:02:12.480
on it is completely inaccessible and you have to wait dram caches collect the

00:02:10.800 --> 00:02:15.760
data and delay writing it until there's enough of it that it actually makes

00:02:14.080 --> 00:02:20.319
sense to spend the time actually writing it out because in order for an SSD to

00:02:18.160 --> 00:02:25.200
write that data any blocks that aren't totally empty first have to be erased
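the write-buffering behavior described here can be sketched in a few lines. this is a toy model, not Pure's design: the page size, block size, and flush threshold are illustrative, and the classes are hypothetical names of mine.

```python
# Toy model of the erase-before-program constraint and DRAM write buffering.
# Sizes and thresholds are illustrative, not taken from any real SSD.
PAGE = 4096            # bytes per flash page
PAGES_PER_BLOCK = 64   # pages in one erase block

class ToyFlashBlock:
    def __init__(self):
        self.erased = True
        self.pages_used = 0

    def program(self, n_pages):
        # flash can only be programmed into an erased block
        if not self.erased:
            raise RuntimeError("must erase block before programming")
        self.pages_used += n_pages
        if self.pages_used >= PAGES_PER_BLOCK:
            self.erased = False  # block full; next write needs an erase first

class WriteBuffer:
    """Batches small writes -- the job DRAM does in a conventional SSD."""
    def __init__(self, flush_threshold=PAGE):
        self.pending = 0
        self.threshold = flush_threshold
        self.flushes = 0

    def write(self, nbytes, block):
        self.pending += nbytes
        if self.pending >= self.threshold:
            block.program(max(self.pending // PAGE, 1))
            self.pending %= PAGE
            self.flushes += 1

buf, blk = WriteBuffer(), ToyFlashBlock()
for _ in range(100):       # 100 tiny 64-byte writes...
    buf.write(64, blk)
print(buf.flushes)         # ...collapse into a single flash program
```

the point is that without some layer doing this batching (DRAM in a normal drive, the blade's software here), every tiny write would cost a full page program.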

00:02:23.520 --> 00:02:28.959
as you add additional levels of storage to a cell so from single to multi to

00:02:27.520 --> 00:02:33.920
triple to quad the time it takes to actually do this

00:02:31.200 --> 00:02:38.239
increases exponentially and if you remember from our review of Intel's

00:02:35.840 --> 00:02:44.319
early dram-less qlc drives you would know that it can be slower than a hard

00:02:40.640 --> 00:02:46.400
drive this could be catastrophic for

00:02:44.319 --> 00:02:50.480
real-time applications if your dram-less array decided that it was time for some

00:02:48.720 --> 00:02:53.920
spring cleaning so have pure storage just gone

00:02:52.239 --> 00:02:58.959
completely mad if so well

00:02:56.640 --> 00:03:02.560
there may be a method to their madness up to four of these dfms can slot into a

00:03:01.360 --> 00:03:09.200
blade like this one which itself

00:03:06.239 --> 00:03:15.599
is a self-contained server running a socketed xeon processor

00:03:12.000 --> 00:03:18.560
there's the dram yes this xeon CPU is

00:03:15.599 --> 00:03:22.720
acting as a glorified SSD controller now it does other array management stuff too

00:03:20.560 --> 00:03:26.879
but to talk to the storage it actually runs a very low level interface

00:03:24.959 --> 00:03:31.519
bootstrapped from Linux that talks to a custom controller on each dfm

00:03:29.280 --> 00:03:34.959
these controllers do one very simple task

00:03:32.480 --> 00:03:37.920
provide an interface to all the flash on the module

00:03:36.159 --> 00:03:42.480
it doesn't really do anything that you'd expect from a typical SSD controller all

00:03:40.319 --> 00:03:46.879
of that is handled by the CPU while it does use the NVMe protocol and

00:03:44.560 --> 00:03:50.640
a u.2 connection it won't show up as a disk in any other system simply because

00:03:48.879 --> 00:03:53.840
it doesn't have the brain this arrangement lets them directly

00:03:52.400 --> 00:03:59.680
control when where and how new data gets stored like

00:03:57.920 --> 00:04:03.200
it's kind of like apple silicon in a way except they can handle multiple vendors

00:04:01.439 --> 00:04:07.599
of flash in the same array more importantly there's no pretending

00:04:04.959 --> 00:04:11.280
to be a hard drive so while block sizes can vary between vendors you're not

00:04:09.760 --> 00:04:15.680
forced to write out a whole four kilobytes each time you need to write a

00:04:12.799 --> 00:04:20.079
couple of bytes so because the blade knows what the whole array looks like

00:04:18.079 --> 00:04:23.520
data that's changed very frequently can be grouped together to prevent a

00:04:21.359 --> 00:04:27.680
situation known as tombstoning where deleted but not erased data can stick

00:04:26.080 --> 00:04:32.639
around because it's uneconomical to evict it which means that that cell is

00:04:30.160 --> 00:04:36.960
effectively never touched again that's partly why over-provisioned space which is

00:04:35.040 --> 00:04:40.400
where a percentage of the raw capacity is reserved for

00:04:38.240 --> 00:04:43.600
system use is important for SSD performance a workaround that pure

00:04:42.560 --> 00:04:48.080
storage doesn't really need in a nutshell each

00:04:46.240 --> 00:04:52.880
of these blades is basically an SSD with extra steps
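the over-provisioning mentioned just above is simple arithmetic: some fraction of the raw flash is hidden from the user so the controller always has spare erased blocks. these percentages are typical industry figures, not numbers from the video.

```python
# Illustrative over-provisioning arithmetic; the 7% and 28% figures are
# common consumer/enterprise ratios, not specific to any product shown.
def usable_capacity(raw_gb, op_fraction):
    """Capacity left for the user after reserving op_fraction for the controller."""
    return raw_gb * (1 - op_fraction)

print(usable_capacity(512, 0.07))   # light OP: ~476 GB visible from 512 GB raw
print(usable_capacity(512, 0.28))   # heavy enterprise OP: ~369 GB visible
```

since the blade software sees every chip directly, it can treat all free space as one pool instead of carving out a fixed reserve per drive, which is why the transcript calls OP a workaround pure storage doesn't really need.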

00:04:50.800 --> 00:04:56.479
because of that pure storage is able to control things like

00:04:54.320 --> 00:05:00.880
when and where to start wear leveling which is incredibly important to qlc

00:04:58.960 --> 00:05:05.680
flash because it's typically only rated for less than a thousand programming

00:05:02.479 --> 00:05:07.759
race cycles so obviously there's a lot

00:05:05.680 --> 00:05:12.400
of incentive to never have to cycle a cell if you don't have to to help with

00:05:10.560 --> 00:05:15.280
that data is compressed wherever possible
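the wear-leveling idea described here can be sketched as "always write into the least-worn block." the 1000-cycle budget matches the qlc figure mentioned above; everything else (class names, block count) is my illustration, not Pure's actual allocator.

```python
# Least-worn-first wear leveling sketch. QLC_PE_LIMIT echoes the ~1000
# program/erase cycle rating mentioned in the video; the rest is made up.
QLC_PE_LIMIT = 1000

class Block:
    def __init__(self, bid):
        self.bid = bid
        self.pe_cycles = 0

def pick_block(blocks):
    """Write into the block with the fewest program/erase cycles so far."""
    candidates = [b for b in blocks if b.pe_cycles < QLC_PE_LIMIT]
    if not candidates:
        raise RuntimeError("all blocks worn out")
    return min(candidates, key=lambda b: b.pe_cycles)

blocks = [Block(i) for i in range(4)]
for _ in range(100):
    pick_block(blocks).pe_cycles += 1   # one program/erase cycle each write

print([b.pe_cycles for b in blocks])    # wear spreads evenly: [25, 25, 25, 25]
```

with only ~1000 cycles per cell, spreading wear evenly like this is the difference between a drive that lasts years and one that burns out a few hot blocks in weeks.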

00:05:13.759 --> 00:05:20.240
there's another component to this too because you can only ever read or write

00:05:18.000 --> 00:05:23.840
flash memory not both the more you can avoid writing to

00:05:21.759 --> 00:05:27.759
frequently read cells or overwriting frequently written data to those cells

00:05:25.919 --> 00:05:31.759
the less often you're stuck waiting for the SSD to come around and actually spit

00:05:29.919 --> 00:05:36.720
out the data you need and that's especially important with qlc because of

00:05:34.720 --> 00:05:41.039
how long it takes to erase and rewrite data on that type of flash remember at

00:05:39.039 --> 00:05:45.440
this kind of scale you're talking databases that are massive enough and

00:05:42.880 --> 00:05:51.199
accessed often enough that even a delay of tens of milliseconds could be a major

00:05:48.800 --> 00:05:54.400
problem and of course they've thought about that

00:05:53.600 --> 00:06:01.680
too what if i told you that in the time it takes for a write cycle to end the

00:05:58.639 --> 00:06:03.600
inaccessible bits can be rebuilt from

00:06:01.680 --> 00:06:07.600
parity data yeah that's not something you can do
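rebuilding a busy die's data from parity works like classic RAID: xor the surviving stripes together and the missing one falls out. a minimal sketch, assuming simple xor parity (real arrays typically use fancier erasure codes, but the principle is the same):

```python
# Rebuilding an unreadable stripe from XOR parity, RAID-style.
# Three "dies" of data plus one parity die; all names are illustrative.
def xor_bytes(*chunks):
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"   # data striped across three dies
parity = xor_bytes(d0, d1, d2)            # parity stored on a fourth die

# die 1 is mid-write and temporarily unreadable;
# reconstruct its stripe from the others instead of waiting
rebuilt = xor_bytes(d0, d2, parity)
print(rebuilt == d1)                      # True
```

because the blade software knows exactly which die is busy, it can serve the read from parity instead of stalling, which is the trick a conventional SSD controller can't pull off.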

00:06:05.199 --> 00:06:11.840
with a traditional SSD the controllers that pretend that they're still spinning

00:06:09.280 --> 00:06:15.420
discs for the pc's sake quickly become the bottleneck let's meet a complete

00:06:13.840 --> 00:06:22.639
system and see it at work this is a flashblade s chassis pure

00:06:19.680 --> 00:06:29.680
storage's newest product each of these can hold 10 blades with 4 dfms each

00:06:26.400 --> 00:06:31.400
which at up to 48 terabytes per dfm is a

00:06:29.680 --> 00:06:36.479
staggering 1.9

00:06:32.960 --> 00:06:37.520
petabytes of storage in a single 5u

00:06:36.479 --> 00:06:43.680
chassis remember the lengths we had to go to in order to get a petabyte of hard drives

00:06:41.759 --> 00:06:47.520
and then how many boxes a petabyte of flash took up
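the chassis capacity quoted above is just the multiplication of the numbers in the transcript:

```python
# The chassis math from the video: 10 blades x 4 DFMs x 48 TB per DFM.
blades, dfms_per_blade, tb_per_dfm = 10, 4, 48
total_tb = blades * dfms_per_blade * tb_per_dfm
print(total_tb, "TB =", total_tb / 1000, "PB")   # 1920 TB, i.e. ~1.9 PB in 5U
```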

00:06:45.199 --> 00:06:51.120
this is next level there are four power supplies on this

00:06:49.120 --> 00:06:56.240
thing that are all modular each with the capability of drawing 2.4 kilowatts from

00:06:54.400 --> 00:07:00.240
the wall and while they don't sell them right now multi-chassis configs with up

00:06:58.080 --> 00:07:04.960
to 10 of these things in a rack are in the pipeline a fully loaded

00:07:03.280 --> 00:07:09.360
flashblade s has enough redundancy that it can lose a

00:07:07.840 --> 00:07:14.400
whole blade and a dfm and another blade and be

00:07:12.160 --> 00:07:18.560
totally fine the blades connect to each other via the same backplane that the

00:07:16.000 --> 00:07:23.759
blades slot into called fabric i o modules or fioms for short each

00:07:21.280 --> 00:07:27.840
flashblade s has two fioms which automate managing the blades and

00:07:25.520 --> 00:07:32.960
chassis units with a total of eight 100 gigabit per second links

00:07:30.479 --> 00:07:36.160
only four of which are currently used they say that when they're ready to

00:07:34.319 --> 00:07:39.919
enable the others they'll be able to do it for free via a software update

00:07:38.319 --> 00:07:44.400
something they're able to do thanks to their evergreen model there's a lot to

00:07:42.000 --> 00:07:49.440
this but basically they don't charge for new software features and their modular

00:07:46.880 --> 00:07:54.639
approach to hardware means that you can continuously upgrade to the point where

00:07:51.680 --> 00:07:59.840
one of their alpha stage clients from 10 years ago has continuously upgraded

00:07:57.039 --> 00:08:03.120
their array with no downtime to this very day

00:08:01.280 --> 00:08:07.840
it's like the array of theseus none of it is still the same except for the data

00:08:05.280 --> 00:08:12.720
and even that might be different now but enough talk

00:08:09.599 --> 00:08:15.280
let's see them roar in the data center

00:08:12.720 --> 00:08:19.039
this is a rack full of the previous generation versions of what we were just

00:08:17.199 --> 00:08:23.520
looking at a little while ago these are all connected together

00:08:21.440 --> 00:08:28.240
with a fabric like this i think all fiber and they're all connected via

00:08:26.160 --> 00:08:31.680
these backplanes here so this top one here is connected to all

00:08:30.319 --> 00:08:35.919
of the pink cables and this bottom one is

00:08:34.080 --> 00:08:39.440
connected to mostly the blue ones there's some teal here as well

00:08:37.360 --> 00:08:42.719
and all of these are communicating with each other in a maximum layout so this

00:08:41.599 --> 00:08:47.600
whole rack is acting as basically a single SSD as

00:08:45.920 --> 00:08:51.600
you can see all of these power connectors

00:08:48.959 --> 00:08:57.040
are modular the actual power cables themselves have tabs on them so you

00:08:54.640 --> 00:09:01.440
can't pull them out so yeah this is all of their last gen stuff this over here

00:08:59.519 --> 00:09:05.600
is a rack full of their new gen stuff this is flashblade s

00:09:03.680 --> 00:09:10.320
what we're looking at here is more or less a setup of a bunch of

00:09:08.480 --> 00:09:14.560
standalone units rather than having everything built

00:09:11.680 --> 00:09:18.720
together into one giant array each one of these is a smaller array

00:09:16.640 --> 00:09:21.680
compared to spinning disks this is actually pretty good power

00:09:20.560 --> 00:09:26.720
density like for the amount of storage you're

00:09:24.880 --> 00:09:32.640
actually getting here it's a significant saving but it's something

00:09:30.000 --> 00:09:36.640
like 1.3 watts per terabyte or something like that this is one of the first

00:09:34.800 --> 00:09:41.839
modules that they actually built for this type of array

00:09:38.480 --> 00:09:43.839
it actually has an fpga on it that

00:09:41.839 --> 00:09:49.360
takes all of its information from an sd card here and that is a mini usb port

00:09:46.959 --> 00:09:56.320
that tells you how long ago this was but this was NVMe before pretty much

00:09:52.560 --> 00:09:57.920
anybody was using NVMe which uses a u.2

00:09:56.320 --> 00:10:01.920
connector here to connect directly to the backplane

00:09:59.440 --> 00:10:07.519
basically they had to create a pci express backplane for this because

00:10:04.399 --> 00:10:09.680
actual controllers hbas didn't really

00:10:07.519 --> 00:10:13.200
exist at the time for this kind of storage and i was just handed one of these this

00:10:12.160 --> 00:10:18.240
is a trayless version it's been taken out of

00:10:15.839 --> 00:10:22.399
the tray of the dfm that's currently being used

00:10:19.760 --> 00:10:26.240
we can see that there are

00:10:23.920 --> 00:10:30.399
super capacitors on here in order to make sure that

00:10:28.000 --> 00:10:33.839
in case uh say for example one of these were unceremoniously yanked out of the

00:10:32.720 --> 00:10:38.079
server any data that was already being written

00:10:36.560 --> 00:10:42.399
will be written so it won't be partially written you

00:10:39.920 --> 00:10:47.279
won't get corruption in that way also on the back we can see here that we

00:10:44.640 --> 00:10:53.600
have extra nand flash chips this is a 48 terabyte module

00:10:51.800 --> 00:10:58.560
48 terabytes that is a massive amount of storage no

00:10:56.560 --> 00:11:02.320
hard drive could possibly even come close to this within the next

00:11:00.959 --> 00:11:06.079
i don't i don't know how long it would take because currently at the rate of

00:11:04.240 --> 00:11:10.399
growth for hard drives it would probably be another 10 years or

00:11:08.079 --> 00:11:13.839
so assuming hard drives are still relevant at that point in time

00:11:12.320 --> 00:11:17.839
which is something that pure storage is trying to actually avoid

00:11:15.839 --> 00:11:23.680
they feel that flash storage is inherently superior to spinning disks

00:11:21.200 --> 00:11:26.839
and quite frankly from what i've seen so far it definitely kind of looks like it

00:11:25.839 --> 00:11:30.800
you get better density

00:11:29.120 --> 00:11:34.000
better power draw and you get

00:11:32.720 --> 00:11:39.040
a much much lower level ability to like just

00:11:36.560 --> 00:11:42.399
manage the data on the drive than you would get from a traditional type of

00:11:40.959 --> 00:11:47.120
hard drive now you may well be asking yourself who

00:11:44.320 --> 00:11:50.880
could possibly need this much storage well the answer oh there's rails and

00:11:49.120 --> 00:11:56.000
stuff here so oh this this is great so the answer is

00:11:54.399 --> 00:12:01.680
anybody who's into deep learning so like this is an NVIDIA dgx

00:11:59.440 --> 00:12:04.560
this is one of the things that pure storage is actually partnered with

00:12:03.200 --> 00:12:10.240
NVIDIA for uh meta recently partnered with uh

00:12:07.839 --> 00:12:15.519
pure storage and i think also NVIDIA for their ai uh deep learning platform so

00:12:13.040 --> 00:12:20.240
the dgx is powering the brains whereas pure storage is powering all of

00:12:17.920 --> 00:12:22.800
the data sets that they need to crunch through

00:12:21.120 --> 00:12:27.519
so you're never going to get

00:12:24.480 --> 00:12:30.000
a GPU with petabytes of

00:12:27.519 --> 00:12:34.880
video memory at least not in the next uh i don't know several decades at least

00:12:33.680 --> 00:12:38.959
so what we're looking at here is an array

00:12:36.959 --> 00:12:43.120
that's fast enough to keep up with those deep learning workloads on those gpus to

00:12:41.120 --> 00:12:47.120
keep them fed so that they can actually be doing their job more often

00:12:45.600 --> 00:12:51.200
otherwise what you would need is a massive amount of memory for the system

00:12:49.279 --> 00:12:57.360
to cache that kind of thing which is just impractical so here we have nand

00:12:53.760 --> 00:13:00.480
flash doing that exact job it is very

00:12:57.360 --> 00:13:03.839
responsive and very dense

00:13:00.480 --> 00:13:06.160
so for those reasons basically like

00:13:03.839 --> 00:13:10.160
if you have a really deep learning workload this is pretty much the

00:13:08.320 --> 00:13:15.120
premier solution right now as far as i can tell both NVIDIA and pure storage seem to

00:13:13.200 --> 00:13:21.360
think so what we're looking at here is an example of an NVIDIA dgx a100 in action

00:13:18.800 --> 00:13:26.800
right now this is actually computing deep learning data and it's connected

00:13:23.920 --> 00:13:29.279
via 100 gigabit per second links through these switches here

00:13:28.320 --> 00:13:35.040
to the pure storage arrays so there's the

00:13:33.360 --> 00:13:38.320
previous generation flashblades and i believe there's also a flashblade s

00:13:37.279 --> 00:13:43.680
there as well which is the current generation i think they're doing performance testing on those right now

00:13:41.519 --> 00:13:50.320
to see which is faster and how to you know how to basically optimize for these

00:13:46.839 --> 00:13:52.000
workloads it's pretty amazing that

00:13:50.320 --> 00:13:56.399
we're basically at a point where ssds aren't fast enough and yet

00:13:53.920 --> 00:14:00.480
the ssds they're replacing them with are technically slower

00:13:57.920 --> 00:14:05.279
because they're using qlc memory instead of tlc which is just mind-boggling it's

00:14:03.360 --> 00:14:08.480
supposed to be an order of magnitude less efficient and yet they're making it

00:14:06.880 --> 00:14:13.440
work now you might be thinking to yourself these are full computers in

00:14:10.720 --> 00:14:17.199
these blades and server chassis that are basically just being used as SSD arrays

00:14:16.320 --> 00:14:21.120
like what happens after the SSD array is

00:14:19.360 --> 00:14:24.880
retired like when you no longer need it or when you upgrade to a new one is

00:14:22.800 --> 00:14:30.560
the whole chassis that's kind of thrown out well no what pure storage is doing

00:14:27.760 --> 00:14:34.160
is they have a whole bunch of completely empty chassis here

00:14:32.399 --> 00:14:38.320
running virtual machines and other workloads

00:14:36.639 --> 00:14:43.600
that don't require that kind of storage so they're repurposing those xeon

00:14:40.800 --> 00:14:47.519
processors that would otherwise just be i don't know like if you if you retired

00:14:45.839 --> 00:14:51.440
an SSD and put it on a shelf what does that make it you know

00:14:49.600 --> 00:14:57.120
so in this case these live on even if the flash has died

00:14:54.160 --> 00:15:02.160
or it's been upgraded because it was too uh small in capacity and in fact behind

00:15:00.399 --> 00:15:06.399
you there are a whole bunch of processors and RAM and

00:15:04.639 --> 00:15:10.399
other stuff that's just sitting there waiting to be tested or

00:15:09.279 --> 00:15:17.519
reused now you'll be wondering how you even manage to like talk to these things

00:15:14.959 --> 00:15:22.240
well the file system that they use i say file system very loosely it's actually

00:15:20.160 --> 00:15:27.120
more like a database that you can actually then create

00:15:24.160 --> 00:15:32.000
uh what they call uh authorities on top of it with 128 of them split

00:15:29.440 --> 00:15:36.560
across the entire array the larger the array is the faster they go all of those

00:15:34.800 --> 00:15:43.279
authorities can be used to do things like create object stores like amazon s3

00:15:40.079 --> 00:15:44.560
and from there you can also create smb

00:15:43.279 --> 00:15:50.160
so samba Windows file sharing

00:15:46.320 --> 00:15:51.839
or nfs for Linux file sharing support
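the "128 authorities split across the array" idea can be sketched as hash-based sharding: every key lands on exactly one authority, so adding blades adds parallel workers. this is purely my illustration of the concept, with hypothetical names; it is not Pure's actual design, and the protocol front-ends (s3, smb, nfs) would all sit on top of the same shards.

```python
# Hypothetical sketch of hash-sharding keys across 128 "authorities".
# All names and structure are illustrative, not Pure's implementation.
import hashlib

N_AUTHORITIES = 128

def authority_for(key: str) -> int:
    """Deterministically map an object key to one of the 128 authorities."""
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % N_AUTHORITIES

shards = [dict() for _ in range(N_AUTHORITIES)]

def put(key, value):
    # an s3-style PUT, an smb write, and an nfs write would all funnel
    # through the same authority for a given key
    shards[authority_for(key)][key] = value

def get(key):
    return shards[authority_for(key)][key]

put("datasets/train.bin", b"...")
print(get("datasets/train.bin") == b"...")   # True
```

because the mapping is deterministic, any blade can compute which authority owns a key without a central lookup, which is roughly why the transcript says the larger the array is the faster they go.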

00:15:50.160 --> 00:15:56.079
so pretty much it doesn't look any different to the end

00:15:53.519 --> 00:15:59.600
user you can use it however you feel like you need to use your data so you

00:15:58.079 --> 00:16:03.759
can pull stuff directly down through amazon for your cloud services

00:16:01.120 --> 00:16:07.759
deployments or you can just use it as network storage if that's what you

00:16:05.040 --> 00:16:13.680
really want and what's really cool they probably wouldn't let me do it

00:16:10.160 --> 00:16:15.519
but if you take out one of the dfms

00:16:13.680 --> 00:16:20.079
uh and like rearrange it or something within 10 minutes it picks right back up

00:16:18.240 --> 00:16:24.800
without having to do anything special just you slot it back in and it's like

00:16:22.079 --> 00:16:29.199
nothing happened so you can say for example if you've got a blade that's

00:16:26.800 --> 00:16:32.880
misbehaving or you want to upgrade it you can completely migrate over without

00:16:31.360 --> 00:16:37.839
having to change anything about your configuration your users basically won't

00:16:35.279 --> 00:16:40.959
know what happened because nothing will have happened

00:16:38.880 --> 00:16:45.440
it's it's kind of magic it's really really warm back here but

00:16:44.160 --> 00:16:48.800
these switches we were looking at earlier with all of these chassis

00:16:47.199 --> 00:16:52.560
plugged in this is one of them not only is this a

00:16:51.199 --> 00:16:57.360
network switch but it's also basically the same type of

00:16:54.880 --> 00:17:02.800
thing you find in the fiom in the back of a flashblade s

00:17:00.079 --> 00:17:07.520
so it's got an x86 processor in here that handles all of the communications

00:17:05.360 --> 00:17:11.600
and in fact when you plug in multiple flashblade chassis

00:17:09.199 --> 00:17:16.000
this thing takes over and actually orchestrates the entire array

00:17:13.919 --> 00:17:20.480
so they're no longer doing their own individual arrays this thing takes over

00:17:18.799 --> 00:17:25.439
automatically also

00:17:22.160 --> 00:17:27.199
each one of these ports can do 40 or 100

00:17:25.439 --> 00:17:31.440
gigabits per second depending on the flashblade's capability the older models

00:17:29.200 --> 00:17:36.480
could only do 40 whereas the newer model flashblade s can do 100

00:17:35.360 --> 00:17:40.640
so it's a massive amount of data that will

00:17:38.400 --> 00:17:44.720
flow between this thing and the rest of the rack now that you've seen

00:17:42.640 --> 00:17:50.080
the tech let's talk about who these guys even are pure storage was founded in

00:17:47.600 --> 00:17:53.840
stealth mode back in 2009 and debuted in 2011 as one of the first companies to

00:17:52.080 --> 00:17:57.360
introduce all flash infrastructure solutions in the industry

00:17:55.679 --> 00:18:01.520
in the early days they used consumer grade ssds which

00:17:59.679 --> 00:18:05.760
if you look back on the state of ssds back then

00:18:02.799 --> 00:18:10.559
was pretty ballsy but the solution was always software driven and they quickly

00:18:08.000 --> 00:18:15.840
began developing their own flash modules which started shipping in 2015. fast

00:18:13.840 --> 00:18:20.400
forward to today and they partnered with companies like cisco and NVIDIA with

00:18:17.679 --> 00:18:24.000
clients across the globe so big thanks to pure storage for sponsoring this

00:18:21.840 --> 00:18:27.919
video and letting us show off their gear you can learn more and maybe deploy one

00:18:25.919 --> 00:18:32.000
of these for yourselves if you're a straight baller or if you're an i.t

00:18:29.600 --> 00:18:35.360
manager at the links below thanks for watching guys maybe go check out one of

00:18:33.280 --> 00:18:39.280
the Intel design center tours that Linus did a little while ago like those are

00:18:38.000 --> 00:18:44.320
really really next level in terms of how like

00:18:42.080 --> 00:18:48.720
behind the scenes we're seeing like Intel basically never lets anybody see

00:18:46.640 --> 00:18:52.480
that kind of stuff and i'm glad that we got to see this kind of stuff here
