We have a very serious small-file problem. I created an M/R job that combines files, as it seemed best to use all the resources of the cluster rather than opening a stream and combining files single-threaded, or trying to do something via the command line.
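A minimal sketch of the shape of such a combining job, assuming the newer Avro org.apache.avro.mapreduce and Hadoop 2.x APIs; the class name, schema file, and argument paths are illustrative placeholders, not the exact job described above:

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.mapreduce.AvroJob;
    import org.apache.avro.mapreduce.AvroKeyInputFormat;
    import org.apache.avro.mapreduce.AvroKeyOutputFormat;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineAvroFiles {
      public static void main(String[] args) throws Exception {
        // args: <input glob> <output dir> <schema file>
        Schema schema = new Schema.Parser().parse(new File(args[2]));

        Job job = Job.getInstance(new Configuration(), "combine-avro");
        job.setJarByClass(CombineAvroFiles.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setInputFormatClass(AvroKeyInputFormat.class);
        job.setOutputFormatClass(AvroKeyOutputFormat.class);
        AvroJob.setInputKeySchema(job, schema);
        AvroJob.setMapOutputKeySchema(job, schema);
        AvroJob.setOutputKeySchema(job, schema);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputValueClass(NullWritable.class);

        // The default identity Mapper and Reducer pass each record through
        // unchanged; a single reducer funnels everything into one output
        // file. (Records come out sorted by the schema's record order as a
        // side effect of the shuffle.)
        job.setNumReduceTasks(1);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }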
From: Frank Grimes [mailto:[EMAIL PROTECTED]]
Sent: Friday, January 06, 2012 2:56 PM
To: [EMAIL PROTECTED]
Subject: Re: Combining AVRO files efficiently within HDFS
That's a very good suggestion and might suit us just fine.
However, many of the files will be much smaller than the HDFS block size.
That could affect the performance of the MapReduce jobs, correct?
Also, from my understanding, it would put more of a burden on the name node (memory usage) than is necessary.
Assuming we did want to combine the actual files... how would you suggest we might go about it?
On 2012-01-06, at 1:05 PM, Joey Echeverria wrote:
> I would do it by staging the machine data into a temporary directory
> and then renaming the directory when it's been verified. So, data
> would first be written into a staging directory such as 2012-01/02/00/stage.
> After verification, you'd rename the 2012-01/02/00/stage directory to
> 2012-01/02/00/done. Since renaming a directory in HDFS is an atomic
> operation, you get the guarantee that you're looking for without having
> to do extra IO. There shouldn't be a benefit to merging the individual
> files unless they're too small.
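A minimal sketch of that verify-then-rename step, using the Hadoop FileSystem API; the paths and the verification step are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PublishHour {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path stage = new Path("/logs/2012-01/02/00/stage");
        Path done = new Path("/logs/2012-01/02/00/done");

        // ... verify the staged files here (expected machines, counts, sizes) ...

        // HDFS renames a directory atomically: readers see either the whole
        // "done" directory or none of it.
        if (!fs.rename(stage, done)) {
          throw new IllegalStateException("rename failed for " + stage);
        }
      }
    }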
> On Fri, Jan 6, 2012 at 9:21 AM, Frank Grimes <[EMAIL PROTECTED]> wrote:
>> Hi Bobby,
>> Actually, the problem we're trying to solve is one of completeness.
>> Say we have 3 machines generating log events and putting them into HDFS
>> on an hourly basis.
>> Sometime after the hour, we would have a scheduled job verify that
>> all the expected machines' log files are present and complete in HDFS.
>> Before launching MapReduce jobs for a given date range, we want to
>> verify that the job will run over complete data.
>> If not, the query would error out.
>> We want our query/MapReduce layer not to be aware of logs at the
>> machine level, only of the presence or absence of an hour's worth of logs.
>> We were thinking that after verifying all the individual log files for
>> an hour, they could be combined into 2012-01/01/00/log.avro.
>> The presence of 2012-01/01/00/log.avro would be all that needs to be checked.
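A sketch of how that completeness gate might look before launching a job; the layout under /logs and the class name are assumptions:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HourGate {
      // Refuse to run a query over an hour whose combined file is absent.
      static void requireHour(FileSystem fs, String hour) throws IOException {
        Path combined = new Path("/logs/" + hour + "/log.avro");
        if (!fs.exists(combined)) {
          throw new IllegalStateException("hour incomplete, aborting: " + combined);
        }
      }

      public static void main(String[] args) throws IOException {
        requireHour(FileSystem.get(new Configuration()), "2012-01/01/00");
      }
    }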
>> However, we're new to both Avro and Hadoop, so we're not sure of the most
>> efficient (and reliable) way to accomplish this.
>> Frank Grimes
>> On 2012-01-06, at 11:46 AM, Robert Evans wrote:
>> That depends on what you mean by combining. It sounds like you are
>> trying to aggregate data from several days, which may involve doing a
>> join, so I would say a MapReduce job is your best bet. If you are not
>> going to do any processing at all then why are you trying to combine
>> them? Is there something that requires them all to be part of a
>> single file? MapReduce processing should be able to handle reading
>> in multiple files just as well as reading in a single file.
>> --Bobby Evans
>> On 1/6/12 9:55 AM, "Frank Grimes" <[EMAIL PROTECTED]> wrote:
>> Hi All,
>> I was wondering if there was an easy way to combine multiple .avro
>> files efficiently.
>> e.g. combining multiple hours of logs into a daily aggregate
>> Note that our Avro schema might evolve to have new (nullable) fields
>> added, but no fields will be removed.
>> I'd like to avoid needing to pull the data down for combining and
>> subsequent "hadoop dfs -put".
>> Would https://issues.apache.org/jira/browse/HDFS-222 be able to
>> handle that automatically?
>> FYI, the following seems to indicate that Avro files might be easily combined.
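One way to combine the files entirely within HDFS, without pulling the data down: newer Avro releases have DataFileWriter.appendAllFrom(), which copies whole blocks from one container file into another without decoding records. A minimal sketch under that assumption; the paths are placeholders, and note this block-level copy requires every input to share the same schema, so files written under an older schema version would instead need record-by-record re-encoding against the new reader schema:

    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MergeAvroInHdfs {
      public static void main(String[] args) throws Exception {
        // args: <input glob, e.g. /logs/2012-01/01/*/*.avro> <output file>
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus[] inputs = fs.globStatus(new Path(args[0]));
        if (inputs == null || inputs.length == 0) return; // nothing to merge
        DataFileWriter<GenericRecord> out = null;
        try {
          for (FileStatus status : inputs) {
            DataFileStream<GenericRecord> in = new DataFileStream<GenericRecord>(
                fs.open(status.getPath()), new GenericDatumReader<GenericRecord>());
            if (out == null) {
              // The first input's schema becomes the combined file's schema.
              out = new DataFileWriter<GenericRecord>(
                  new GenericDatumWriter<GenericRecord>());
              out.create(in.getSchema(), fs.create(new Path(args[1])));
            }
            // Copies compressed blocks without decoding individual records;
            // pass true to re-encode if the inputs use different codecs.
            out.appendAllFrom(in, false);
            in.close();
          }
        } finally {
          if (out != null) out.close();
        }
      }
    }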