-
Re: Suitability of HDFS for live file store
Brian Bockelman 2012-10-15, 20:35
Hi,
We use HDFS to process data for the LHC - a somewhat similar case. Our files are a bit larger, our total local data size is ~1PB logical, and we "bring our own" batch system, so no MapReduce. We perform many random reads, so we are quite sensitive to underlying latency.
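To illustrate reading a byte range from HDFS without any MapReduce involved, here is a minimal sketch that builds a WebHDFS request URL. It assumes WebHDFS is enabled on the cluster; the host name, the port 50070 and the file path are placeholders chosen for illustration, not details from this thread. The `op=OPEN`, `offset` and `length` parameters are part of the WebHDFS REST API.

```python
from urllib.parse import quote, urlencode


def webhdfs_open_url(namenode, path, offset=0, length=None, port=50070):
    """Build a WebHDFS OPEN URL for a byte range of one file.

    namenode and port are assumptions about your cluster; path must be
    an absolute HDFS path.
    """
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    params = [("op", "OPEN"), ("offset", offset)]
    if length is not None:
        # Without a length, the read runs from offset to end of file.
        params.append(("length", length))
    return "http://%s:%d/webhdfs/v1%s?%s" % (
        namenode, port, quote(path), urlencode(params))


if __name__ == "__main__":
    # Hypothetical host and file, for illustration only.
    print(webhdfs_open_url("namenode.example.org",
                           "/archive/scan 0001.tif",
                           offset=4096, length=65536))
```

Fetching that URL with any HTTP client that follows redirects returns just the requested bytes, which is one simple way to measure random-read latency during an evaluation.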
I don't see any obvious mismatch between your requirements and HDFS capabilities that would let you eliminate it as a candidate without an evaluation. Do note that HDFS does not provide complete POSIX semantics - but you don't appear to need them?
IMHO, if you are looking for the following requirements:
1) Proven petascale data store (never want to be on the bleeding edge of your filesystem's scaling!).
2) Has self-healing semantics (can recover from the loss of RAIDs or entire storage targets).
3) Open source (but do consider commercial companies - your time is worth something!).
You end up looking at a very small number of candidates. Other filesystems that should be on your list:
1) Gluster. A quite viable alternative. Like HDFS, you can buy commercial support. I personally don't know enough to provide a pros/cons list, but we keep it on our radar.
2) Ceph. Not as proven IMHO. I don't know of multiple petascale deployments. Requires a quite recent kernel. Quite good on-paper design.
3) Lustre. I think you'd be disappointed with the self-healing. A very "traditional" HPC/clustered filesystem design.
For us, HDFS wins. I think it has the possibility of being a winner in your case too.
Brian
On Oct 15, 2012, at 3:21 PM, Jay Vyas <[EMAIL PROTECTED]> wrote:
> Seems like a heavyweight solution unless you are actually processing the images?
>
> Wow, no mapreduce, no streaming writes, and relatively small files. I'm surprised that you are considering hadoop at all?
>
> I'm surprised there isn't a simpler solution that uses redundancy without all the daemons and name nodes and task trackers and stuff.
>
> Might make it kind of awkward as a normal file system.
>
> On Mon, Oct 15, 2012 at 4:08 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> Hey Matt,
>
> What do you mean by 'real-time' though? While HDFS has pretty good
> contiguous data read speeds (and you get N x replicas to read from),
> if you're looking to "cache" frequently accessed files into memory
> then HDFS does not natively have support for that. Otherwise, I agree
> with Brock, seems like you could make it work with HDFS (sans
> MapReduce - no need to run it if you don't need it).
>
> The presence of NameNode audit logging will help your file access
> analysis requirement.
>
> On Tue, Oct 16, 2012 at 1:17 AM, Matt Painter <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I am a new Hadoop user, and would really appreciate your opinions on whether
> > Hadoop is the right tool for what I'm thinking of using it for.
> >
> > I am investigating options for scaling an archive of around 100TB of image
> > data. These images are typically TIFF files of around 50-100MB each and need
> > to be made available online in realtime. Access to the files will be
> > sporadic and occasional, but writing the files will be a daily activity.
> > Speed of write is not particularly important.
> >
> > Our previous solution was a monolithic, expensive - and very full - SAN so I
> > am excited by Hadoop's distributed, extensible, redundant architecture.
> >
> > My concern is that a lot of the discussion on and use cases for Hadoop is
> > regarding data processing with MapReduce and - from what I understand -
> > using HDFS for the purpose of input for MapReduce jobs. My other concern is
> > vague indication that it's not a 'real-time' system. We may be using
> > MapReduce in small components of the application, but it will most likely be
> > in file access analysis rather than any processing on the files themselves.
> >
> > In other words, what I really want is a distributed, resilient, scalable
> > filesystem.
> >
> > Is Hadoop suitable if we just use this facility, or would I be misusing it
> > and inviting grief?
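As a rough illustration of the file access analysis that NameNode audit logging makes possible (Harsh's point in the quoted text), the sketch below counts `open` operations per path. It assumes audit lines carry `key=value` fields such as `cmd=open` and `src=/path`, as in the sample lines, which are made up for this example; check the format your own NameNode writes before relying on it. Paths containing whitespace would need a stricter parser.

```python
import re
from collections import Counter

# Matches key=value pairs such as cmd=open or src=/archive/a.tif
PAIR = re.compile(r"(\w+)=(\S+)")


def count_opens(lines):
    """Return a Counter mapping each opened path to its number of opens."""
    opens = Counter()
    for line in lines:
        fields = dict(PAIR.findall(line))
        if fields.get("cmd") == "open" and "src" in fields:
            opens[fields["src"]] += 1
    return opens


# Invented sample lines in the assumed format, for illustration only.
SAMPLE = [
    "2012-10-15 20:35:01,123 INFO FSNamesystem.audit: ugi=web ip=/10.0.0.5 "
    "cmd=open src=/archive/a.tif dst=null perm=null",
    "2012-10-15 20:35:02,456 INFO FSNamesystem.audit: ugi=web ip=/10.0.0.5 "
    "cmd=open src=/archive/a.tif dst=null perm=null",
    "2012-10-15 20:36:10,789 INFO FSNamesystem.audit: ugi=ingest ip=/10.0.0.9 "
    "cmd=create src=/archive/b.tif dst=null perm=ingest:staff:rw-r--r--",
]

if __name__ == "__main__":
    for path, hits in count_opens(SAMPLE).most_common():
        print(hits, path)
```

For an archive with sporadic reads, a small script like this run over the audit log may be enough, with no MapReduce job needed.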
-
Re: Suitability of HDFS for live file store
Matt Painter 2012-10-15, 20:59
Sorry, I should have provided a bit more detail. Currently our data set consists of 50-100MB TIFF files. In the near future we'd like to store and process preservation-quality digitised film, where individual files will exceed this size by orders of magnitude (and which has so far been in the "too-hard" basket with our current infrastructure). In general, our thinking thus far has been very much based on what our current infrastructure can provide - so I'm excited to have alternatives available.
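For a sense of scale, the figures in this thread (around 100TB of 50-100MB files) can be turned into a back-of-envelope estimate. The sketch below assumes a 75MB average file, a 64MB block size and 3x replication (the stock defaults in Hadoop 1.x), and roughly 150 bytes of NameNode heap per file or block object, which is a commonly quoted rule of thumb rather than a measured value. Treat every number as an assumption to replace with your own.

```python
import math

TB = 10**12
MB = 10**6


def estimate(total_bytes, avg_file_bytes,
             block_bytes=64 * 2**20,   # assumed block size (64 MiB)
             replication=3,            # assumed replication factor
             bytes_per_object=150):    # rule-of-thumb NameNode heap per object
    """Rough HDFS sizing: file count, block count, raw disk, NameNode heap."""
    files = total_bytes // avg_file_bytes
    blocks_per_file = math.ceil(avg_file_bytes / block_bytes)
    blocks = files * blocks_per_file
    return {
        "files": files,
        "blocks": blocks,
        "raw_bytes": total_bytes * replication,
        "namenode_heap_bytes": (files + blocks) * bytes_per_object,
    }


if __name__ == "__main__":
    est = estimate(100 * TB, 75 * MB)
    print("files:         ", est["files"])
    print("blocks:        ", est["blocks"])
    print("raw disk (TB): ", est["raw_bytes"] / TB)
    print("NN heap (GB):  ", round(est["namenode_heap_bytes"] / 10**9, 2))
```

Under these assumptions the archive comes to roughly 1.3 million files and 300TB of raw disk, with well under 1GB of NameNode heap for metadata. Much larger film files would raise the block count per file but lower the file count for the same volume.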
There will also be thumbnail generation as well as generation of the screen-resolution JPEGs that I alluded to, and PDF generation. Whether the JPEG/PDF derivatives are stored in HDFS remains to be seen - these can be easily regenerated at any stage and their total size will be relatively small, so it may not be the best fit for storage of these guys.
M
-
Re: Suitability of HDFS for live file store
Goldstone, Robin J. 2012-10-15, 21:35
If the goal is simply an alternative to a SAN for cost-effective storage of large files, you might want to take a look at Gluster. It is an open source, scale-out distributed filesystem that can utilize local storage. It also has distributed metadata and a POSIX interface, and can be accessed through a number of clients, including FUSE, NFS and CIFS. Supposedly you can even run Hadoop on top of Gluster.
I hope I don't start any sort of flame war by mentioning Gluster on a Hadoop mailing list. Note I have no vested interest in this particular solution, although I am in the process of evaluating it myself.
-
Re: Suitability of HDFS for live file store
Ted Dunning 2012-10-15, 22:17
If you are going to mention commercial distros, you should include MapR as well. It is Hadoop-compatible, very scalable, and handles very large numbers of files in a POSIX-ish environment.
-
Re: Suitability of HDFS for live file store
Vinod Kumar Vavilapalli 2012-10-15, 23:25
For your original use case, HDFS did sound like overkill. But once you start thinking of thumbnail generation, PDFs, etc., MapReduce obviously fits the bill.
If you wish to do stuff like streaming the stored digital films, clearly, you may want to move your serving somewhere else that works in tandem with Hadoop.
Thanks, +Vinod