Your understanding is right: Hadoop works well with large volumes of data. But not every file needs to be in the range of gigabytes, terabytes or petabytes. When people say Hadoop processes terabytes of data, they mostly mean the total data processed by a MapReduce job (or rather jobs, since most use cases need more than one MapReduce job). That data can be made up of 10K files.
Why not a large number of small files? The overhead on the NameNode of housekeeping all that metadata (file and block information) would be huge, and there are definite limits to it. You can, however, store smaller files together in splittable compressed formats.
In general it is better to keep your file sizes equal to or larger than your HDFS block size. The default is 64 MB, but larger clusters use higher values, in multiples of 64. If your HDFS block size or your file sizes are smaller than the MapReduce input split size, it is better to use an InputFormat like CombineFileInputFormat for your MR jobs. Usually the MR input split size is equal to your HDFS block size. In short, as a good practice, a single file should be at least one HDFS block in size.
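To put rough numbers on the small-files point, here is a back-of-the-envelope sketch in plain Python (no Hadoop calls). It assumes the name node tracks one entry per file plus one per block, and that a plain MR job gets roughly one map task per block. Both are simplifications, and the function names are made up for illustration only.

```python
MB = 1024 * 1024
BLOCK_SIZE = 64 * MB  # default HDFS block size mentioned above


def blocks_for(file_size, block_size=BLOCK_SIZE):
    """Blocks occupied by one file; a small file still takes a whole block entry."""
    return (file_size + block_size - 1) // block_size  # integer ceiling


def namenode_entries(file_sizes, block_size=BLOCK_SIZE):
    """Assumption: one metadata entry per file plus one per block."""
    return sum(1 + blocks_for(size, block_size) for size in file_sizes)


def map_tasks_per_block(file_sizes, block_size=BLOCK_SIZE):
    """Assumption: split size == block size, and splits never span files."""
    return sum(blocks_for(size, block_size) for size in file_sizes)


def map_tasks_combined(file_sizes, split_size=BLOCK_SIZE):
    """Rough idea of combining small files: pack the total bytes into full splits."""
    return (sum(file_sizes) + split_size - 1) // split_size


# The same 10 GB of data, stored two different ways.
small_files = [1 * MB] * 10240   # 10,240 files of 1 MB each
large_files = [64 * MB] * 160    # 160 files of 64 MB each

if __name__ == "__main__":
    for name, files in (("small", small_files), ("large", large_files)):
        print(name,
              "entries:", namenode_entries(files),
              "map tasks:", map_tasks_per_block(files),
              "combined:", map_tasks_combined(files))
```

Under these assumptions the small-file layout needs 64 times as many metadata entries and map tasks for the same data, which is the gap that combining inputs is meant to close.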
As for keeping a file open for a long time for writing while reading it in parallel with a MapReduce job, I fear that, AFAIK, it won't work. While a write is going on, some blocks or the file itself would be locked; I am not really sure whether the full file is locked or not. In short, some blocks wouldn't be available to the concurrent MapReduce program during its processing.
In your case, a quick solution that comes to mind is to keep writing your real-time data into a Flume queue/buffer. Set it to a desired size; once the queue gets full, the data is dumped into HDFS. Then you can kick off your jobs as your requirements dictate. If you run MR jobs at a very high frequency, make sure that every run has enough data to process, and choose your maximum number of mappers and reducers effectively and efficiently.
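The buffer-then-dump idea can be sketched in a few lines of plain Python. This is not Flume's API, just an illustration of rolling on a size threshold; the class name and the flush callable are hypothetical, and in a real setup the flush step would be whatever writes a closed file into HDFS.

```python
class SizeRollingBuffer:
    """Collects records in memory and hands them off as one batch when full."""

    def __init__(self, threshold_bytes, flush):
        self.threshold_bytes = threshold_bytes
        self.flush = flush        # hypothetical callable: receives one batch of bytes
        self._records = []
        self._size = 0

    def append(self, record):
        """Add one record (bytes); roll the buffer once the threshold is reached."""
        self._records.append(record)
        self._size += len(record)
        if self._size >= self.threshold_bytes:
            self.roll()

    def roll(self):
        """Hand the buffered records to flush as a single batch and start over."""
        if not self._records:
            return
        batch = b"".join(self._records)
        self._records = []
        self._size = 0
        self.flush(batch)


if __name__ == "__main__":
    written = []  # stands in for files landed in HDFS
    buf = SizeRollingBuffer(threshold_bytes=10, flush=written.append)
    for rec in (b"aaaa\n", b"bbbb\n", b"cccc\n"):
        buf.append(rec)
    buf.roll()  # push out whatever is left before kicking off a job
    print(written)
```

In practice the threshold would be set near the HDFS block size so that each dumped file is big enough to be worth a map task.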
Lastly, I don't think that in normal cases you need to dump your large volume of data into the local file system (LFS) and then do a copyFromLocal into HDFS. Tools like Flume are built for those purposes, I guess. I'm not an expert on Flume, so you may need to do more reading on it before implementing.
This is what I feel about your use case. But let's leave it open for the experts to comment.
Hope it helps.
Bejoy K S
From: Sam Seigal <[EMAIL PROTECTED]>
Sender: [EMAIL PROTECTED]
Date: Sat, 1 Oct 2011 15:50:46
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Re: incremental loads into hadoop
Thanks for the response.
While reading about Hadoop, I have come across threads where people
claim that Hadoop is not a good fit for a large amount of small files.
It is good for files that are gigabytes/petabytes in size.
If I am doing incremental loads, let's say every hour. Do I need to
wait until maybe at the end of the day when enough data has been
collected to start off a MapReduce job ? I am wondering if an open
file that is continuously being written to can at the same time be
used as an input to an M/R job ...
Also, let's say I did not want to do a load straight off the DB. The
service, when committing a transaction to the OLTP system, sends a
message for that transaction to a Hadoop Service that then writes the
transaction into HDFS (the services are connected to each other via a
persisted queue, hence are eventually consistent, but that is not a
big deal) .. What should I keep in mind while designing a service like
Should the files first be written to local disk and, once they reach a
large enough size (let us say the cutoff is 100G), be uploaded into
the cluster using put? Or can they be written directly into an HDFS
file as the data is streaming in?
Thank you for your help.
On Sat, Oct 1, 2011 at 12:19 PM, Bejoy KS <[EMAIL PROTECTED]> wrote: