-
Suitability of HDFS for live file store
Matt Painter 2012-10-15, 19:47
Hi,
I am a new Hadoop user, and would really appreciate your opinions on whether Hadoop is the right tool for what I'm thinking of using it for.
I am investigating options for scaling an archive of around 100 TB of image data. These images are typically TIFF files of around 50-100 MB each and need to be made available online in real time. Access to the files will be sporadic and occasional, but writing the files will be a daily activity. Speed of write is not particularly important.
Our previous solution was a monolithic, expensive - and very full - SAN so I am excited by Hadoop's distributed, extensible, redundant architecture.
My concern is that much of the discussion of Hadoop, and most of its use cases, concern data processing with MapReduce and, from what I understand, using HDFS as the input for MapReduce jobs. My other concern is the vague indication that it's not a 'real-time' system. We may be using MapReduce in small components of the application, but it will most likely be in file access analysis rather than any processing on the files themselves.
In other words, what I really want is a distributed, resilient, scalable filesystem.
Is Hadoop suitable if we just use this facility, or would I be misusing it and inviting grief?
M
-
Re: Suitability of HDFS for live file store
Brock Noland 2012-10-15, 20:05
Hi,
Generally I do not see a problem with your plan of using HDFS to store these files, assuming they are updated rarely if ever. Hadoop is traditionally a batch system, and MapReduce largely remains one: minimum job latencies are in the "seconds" range. HDFS, however, has real-time systems built on top of it, like HBase.
The main issue to be concerned with when using HDFS purely as storage is file size. As HDFS keeps its metadata in RAM (on the NameNode), you don't want to create tremendous numbers of "small" files. With 50-100MB files you should be fine.
Cheers,
Brock
-- Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
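To put rough numbers on the file-size point, here is a back-of-the-envelope sketch in Python. It assumes about 150 bytes of NameNode heap per file and per block, which is a commonly quoted rule of thumb and not a figure from this thread, and a 64 MB block size; the function name and the 75 MB average file size are illustrative choices only.

```python
import math

# Assumption: each file and each block costs roughly 150 bytes of
# NameNode heap. Real usage varies with version and path length.
BYTES_PER_OBJECT = 150

MB = 1024 ** 2
TB = 1024 ** 4


def namenode_heap_estimate(total_bytes, avg_file_bytes, block_bytes=64 * MB):
    """Return (file_count, block_count, estimated_heap_bytes)."""
    files = math.ceil(total_bytes / avg_file_bytes)
    blocks_per_file = math.ceil(avg_file_bytes / block_bytes)
    blocks = files * blocks_per_file
    heap = (files + blocks) * BYTES_PER_OBJECT
    return files, blocks, heap


# The archive from this thread: ~100 TB in files of ~75 MB each.
files, blocks, heap = namenode_heap_estimate(100 * TB, 75 * MB)
print(f"{files:,} files, {blocks:,} blocks, ~{heap / MB:.0f} MB of heap")

# The same volume stored as 100 KB files, for contrast.
files_s, blocks_s, heap_s = namenode_heap_estimate(100 * TB, 100 * 1024)
print(f"{files_s:,} files, {blocks_s:,} blocks, ~{heap_s / MB:.0f} MB of heap")
```

Under these assumptions the large-file layout needs well under a gigabyte of metadata heap, while the small-file layout needs hundreds of times more, which is why the number of files matters more than the total volume.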
-
Re: Suitability of HDFS for live file store
Harsh J 2012-10-15, 20:08
Hey Matt,
What do you mean by 'real-time' though? While HDFS has pretty good contiguous data read speeds (and you get N x replicas to read from), if you're looking to "cache" frequently accessed files into memory then HDFS does not natively have support for that. Otherwise, I agree with Brock, seems like you could make it work with HDFS (sans MapReduce - no need to run it if you don't need it).
The presence of NameNode audit logging will help your file access analysis requirement.
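As a rough illustration of that kind of analysis, the sketch below counts `open` operations per path from audit log lines. The line layout (`key=value` fields such as `ugi`, `ip`, `cmd`, `src`) is assumed here and should be checked against the log your own NameNode produces; the sample lines and the `count_opens` helper are made up for the example.

```python
import re
from collections import Counter

# Assumed layout: "<timestamp> INFO FSNamesystem.audit: ugi=... ip=... cmd=... src=... dst=... perm=..."
# Limitation: \S+ stops at whitespace, so paths containing spaces are truncated.
FIELD = re.compile(r"(\w+)=(\S+)")


def count_opens(lines):
    """Count cmd=open audit events per source path."""
    opens = Counter()
    for line in lines:
        fields = dict(FIELD.findall(line))
        if fields.get("cmd") == "open" and "src" in fields:
            opens[fields["src"]] += 1
    return opens


# Hypothetical sample lines, not taken from a real cluster.
SAMPLE = [
    "2012-10-15 20:08:01,123 INFO FSNamesystem.audit: ugi=web\tip=/10.0.0.5\tcmd=open\tsrc=/archive/a.tif\tdst=null\tperm=null",
    "2012-10-15 20:09:14,456 INFO FSNamesystem.audit: ugi=web\tip=/10.0.0.5\tcmd=open\tsrc=/archive/a.tif\tdst=null\tperm=null",
    "2012-10-15 20:10:30,789 INFO FSNamesystem.audit: ugi=web\tip=/10.0.0.6\tcmd=open\tsrc=/archive/b.tif\tdst=null\tperm=null",
    "2012-10-15 20:11:02,001 INFO FSNamesystem.audit: ugi=ingest\tip=/10.0.0.7\tcmd=create\tsrc=/archive/c.tif\tdst=null\tperm=ingest:users:rw-r--r--",
]

opens = count_opens(SAMPLE)
for path, n in opens.most_common():
    print(n, path)
```

Because the counting needs nothing but the log file, it could run as a small script rather than a MapReduce job while the log volume stays modest.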
-- Harsh J
-
Re: Suitability of HDFS for live file store
Matt Painter 2012-10-15, 20:17
Thanks guys; really appreciated.
I was deliberately vague about the notion of real-time because I didn't know which metrics lead to Hadoop being considered a batch system - if that makes sense!
Essentially, the speed of access to the files stored in HDFS needs to be comparable to that of files read off a native file system, since they are for end-user download. While the bulk of the data on disk will be TIFF files, we will also be including JPEG derivatives, which we intend to display inline in a web-based application.
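For the download path, one option that needs no MapReduce is the WebHDFS REST interface, which serves file contents over plain HTTP. The sketch below is a minimal illustration under several assumptions: WebHDFS is enabled on the cluster, the NameNode HTTP port is 50070, and the host name, user and file path are placeholders. The helper names are invented for the example.

```python
import shutil
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def webhdfs_open_url(namenode, hdfs_path, user=None, port=50070):
    """Build the WebHDFS URL that opens hdfs_path (an absolute path) for reading."""
    query = {"op": "OPEN"}
    if user:
        query["user.name"] = user
    return f"http://{namenode}:{port}/webhdfs/v1{quote(hdfs_path)}?{urlencode(query)}"


def download(namenode, hdfs_path, local_path, user=None):
    """Stream one HDFS file to local disk; needs a reachable cluster."""
    # The NameNode answers OPEN with a redirect to a DataNode holding the
    # data; urlopen follows that redirect automatically.
    url = webhdfs_open_url(namenode, hdfs_path, user)
    with urlopen(url) as resp, open(local_path, "wb") as out:
        shutil.copyfileobj(resp, out)


# Placeholder host, path and user, for illustration only.
url = webhdfs_open_url("namenode.example.com", "/archive/2012/scan 0001.tif", user="web")
print(url)
```

A web application could stream the response straight through to the browser instead of writing to disk; whether the latency is acceptable for inline JPEGs would need measuring on real hardware.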
We typically have sparse access metrics - we have millions of files, but each file may be viewed only zero or one time over a year. Therefore, native in-memory caching isn't so much of an issue.
M
-- Matt Painter [EMAIL PROTECTED] +64 21 115 9378
-
Re: Suitability of HDFS for live file store
Brock Noland 2012-10-15, 20:18
Hi,
Harsh makes a good point: there is no explicit way to say "these files should remain in memory". However, I would note that, given available RAM on the datanodes, the operating system will cache recently accessed blocks.
Brock
-
Re: Suitability of HDFS for live file store
Jay Vyas 2012-10-15, 20:21
Seems like a heavyweight solution unless you are actually processing the images?
Wow, no MapReduce, no streaming writes, and relatively small files. I'm surprised that you are considering Hadoop at all.
I'm surprised there isn't a simpler solution that provides redundancy without all the daemons, NameNodes, TaskTrackers and so on.
It might be kind of awkward to use as a normal file system.
-- Jay Vyas MMSB/UCHC