HDFS >> mail # user >> Combining AVRO files efficiently within HDFS

Frank Grimes 2012-01-06, 15:55
Re: Combining AVRO files efficiently within HDFS

That depends on what you mean by combining. It sounds like you are trying to aggregate data from several days, which may involve a join, so I would say a MapReduce job is your best bet. If you are not going to do any processing at all, why are you trying to combine them? Is there something that requires them all to be part of a single file? MapReduce processing should be able to handle reading multiple files just as well as reading a single file.

--Bobby Evans

On 1/6/12 9:55 AM, "Frank Grimes" <[EMAIL PROTECTED]> wrote:

Hi All,

I was wondering if there is an easy way to combine multiple .avro files efficiently,
e.g. combining multiple hours of logs into a daily aggregate.

Note that our Avro schema might evolve to have new (nullable) fields added but no fields will be removed.
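
To make the schema-evolution constraint above concrete, here is a minimal sketch in plain Python. It does not use the Avro library: dicts stand in for records and lists of field names stand in for schemas, and the field names (ts, url, referrer) and helper names are hypothetical. It only illustrates the idea that, when fields are added but never removed, older records can be read under the newest schema by filling the added nullable fields with null. In real Avro this generally relies on the added fields declaring a null default in the reader schema.

```python
# Plain-Python illustration only: dicts stand in for Avro records, and
# lists of field names stand in for Avro schemas. No Avro library is used.

def merge_fields(schemas):
    """Return the union of field names, preserving first-seen order."""
    merged = []
    for fields in schemas:
        for name in fields:
            if name not in merged:
                merged.append(name)
    return merged


def combine(batches):
    """Combine (fields, records) batches into one list under a merged schema.

    Fields absent from an older batch are filled with None, mirroring how
    old data reads under a newer schema whose added fields are nullable.
    """
    merged = merge_fields(fields for fields, _ in batches)
    combined = []
    for _, records in batches:
        for record in records:
            combined.append({name: record.get(name) for name in merged})
    return merged, combined


# Hypothetical hourly batches; the second hour adds a nullable "referrer".
hour_00 = (["ts", "url"], [{"ts": 1, "url": "/a"}])
hour_01 = (["ts", "url", "referrer"], [{"ts": 2, "url": "/b", "referrer": "/a"}])

daily_fields, daily_records = combine([hour_00, hour_01])
```

Because fields are only ever added, the merged field list here ends up identical to the newest hourly schema, so a combining step would only need the latest schema rather than a computed union.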

I'd like to avoid pulling the data down, combining it locally, and then re-uploading it with "hadoop dfs -put".

Would https://issues.apache.org/jira/browse/HDFS-222 be able to handle that automatically?
FYI, the following seems to indicate that Avro files might be easily combinable: https://issues.apache.org/jira/browse/AVRO-127

Or is an M/R job the best way to go for this?


Frank Grimes

Frank Grimes 2012-01-06, 17:21
Joey Echeverria 2012-01-06, 18:05
Frank Grimes 2012-01-06, 19:55
Dave Shine 2012-01-06, 20:05
Steve Edison 2012-01-06, 20:45
Joey Echeverria 2012-01-06, 20:56
Frank Grimes 2012-01-11, 21:29
Frank Grimes 2012-01-12, 15:42