-
Combining AVRO files efficiently within HDFS
Frank Grimes 2012-01-06, 15:55
Hi All,

I was wondering if there was an easy way to combine multiple .avro files efficiently,
e.g. combining multiple hours of logs into a daily aggregate.

Note that our Avro schema might evolve to have new (nullable) fields added, but no fields will be removed.

I'd like to avoid needing to pull the data down for combining and a subsequent "hadoop dfs -put".

Would https://issues.apache.org/jira/browse/HDFS-222 be able to handle that automatically?
FYI, the following seems to indicate that Avro files might be easily combinable: https://issues.apache.org/jira/browse/AVRO-127

Or is an M/R job the best way to go for this?

Thanks,

Frank Grimes
-
Re: Combining AVRO files efficiently within HDFS
Robert Evans 2012-01-06, 16:46
Frank,

That depends on what you mean by combining. It sounds like you are trying to aggregate data from several days, which may involve doing a join, so I would say a MapReduce job is your best bet. If you are not going to do any processing at all, then why are you trying to combine them? Is there something that requires them all to be part of a single file? MapReduce processing should be able to handle reading in multiple files just as well as reading in a single file.

--Bobby Evans
-
Re: Combining AVRO files efficiently within HDFS
Frank Grimes 2012-01-06, 17:21
Hi Bobby,

Actually, the problem we're trying to solve is one of completeness.

Say we have 3 machines generating log events and putting them to HDFS on an hourly basis, e.g.
2012-01/01/00/machine1.log.avro
2012-01/01/00/machine2.log.avro
2012-01/01/00/machine3.log.avro

Sometime after the hour, we would have a scheduled job verify that all the expected machines' log files are present and complete in HDFS.

Before launching MapReduce jobs for a given date range, we want to verify that the job will run over complete data. If not, the query would error out.

We want our query/MapReduce layer to not need to be aware of logs at the machine level, only the presence or not of an hour's worth of logs.

We were thinking that after verifying all the individual log files for an hour, they could be combined into 2012-01/01/00/log.avro. The presence of 2012-01-01-00.log.avro would be all that needs to be verified.

However, we're new to both Avro and Hadoop, so we're not sure of the most efficient (and reliable) way to accomplish this.

Thanks,

Frank Grimes
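The hourly presence check described in this message can be sketched in plain Java. This is only an illustration under assumptions: the class `HourCompleteness` and the method `missingMachines` are made-up names, the `<machine>.log.avro` naming is taken from the example paths, and the directory listing is passed in as plain strings rather than fetched from HDFS. It checks presence only; whether each file is complete would need a separate check.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Hadoop or Avro.
final class HourCompleteness {
    private static final String SUFFIX = ".log.avro";

    // Returns the expected machines that have no "<machine>.log.avro" entry in the listing.
    static Set<String> missingMachines(Set<String> expectedMachines, List<String> fileNames) {
        Set<String> missing = new TreeSet<>(expectedMachines);
        for (String name : fileNames) {
            if (name.endsWith(SUFFIX)) {
                missing.remove(name.substring(0, name.length() - SUFFIX.length()));
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        Set<String> expected = new HashSet<>(Arrays.asList("machine1", "machine2", "machine3"));
        List<String> present = Arrays.asList("machine1.log.avro", "machine3.log.avro");
        // An empty result would mean the hour is complete at the machine level.
        System.out.println("missing: " + missingMachines(expected, present));
    }
}
```

A scheduled job could run a check like this per hour directory and only mark the hour as usable when the returned set is empty.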
-
Re: Combining AVRO files efficiently within HDFS
Joey Echeverria 2012-01-06, 18:05
I would do it by staging the machine data into a temporary directory and then renaming the directory once it's been verified. So, data would be written into directories like this:

2012-01/02/00/stage/machine1.log.avro
2012-01/02/00/stage/machine2.log.avro
2012-01/02/00/stage/machine3.log.avro

After verification, you'd rename the 2012-01/02/00/stage directory to 2012-01/02/00/done. Since renaming a directory in HDFS is an atomic operation, you get the guarantee that you're looking for without having to do extra IO. There shouldn't be a benefit to merging the individual files unless they're too small.

-Joey
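The stage-then-rename idea can be tried out with a local-filesystem analogy. The sketch below uses `java.nio.file` instead of HDFS so that it runs anywhere; `StagePromoter`, `promote`, and `isComplete` are hypothetical names. On HDFS the corresponding call would presumably be Hadoop's `FileSystem.rename(Path, Path)`, which is not shown here.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Local-filesystem analogy of the stage -> done rename; hypothetical helper.
final class StagePromoter {
    // Renames <hourDir>/stage to <hourDir>/done in a single atomic move.
    static Path promote(Path hourDir) {
        try {
            return Files.move(hourDir.resolve("stage"), hourDir.resolve("done"),
                    StandardCopyOption.ATOMIC_MOVE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Readers only need this one check; they never look at individual machine files.
    static boolean isComplete(Path hourDir) {
        return Files.isDirectory(hourDir.resolve("done"));
    }

    public static void main(String[] args) throws IOException {
        Path hourDir = Files.createTempDirectory("hour-00");
        Path stage = Files.createDirectories(hourDir.resolve("stage"));
        Files.createFile(stage.resolve("machine1.log.avro"));
        System.out.println("complete before rename: " + isComplete(hourDir));
        promote(hourDir); // would run only after verification succeeded
        System.out.println("complete after rename: " + isComplete(hourDir));
    }
}
```

The point of the design is that the query layer tests for a single directory name, and that name only appears once every machine file has been verified.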
-
Re: Combining AVRO files efficiently within HDFS
Frank Grimes 2012-01-06, 19:55
Hi Joey,

That's a very good suggestion and might suit us just fine.

However, many of the files will be much smaller than the HDFS block size. That could affect the performance of the MapReduce jobs, correct? Also, from my understanding it would put more burden on the name node (memory usage) than is necessary.

Assuming we did want to combine the actual files... how would you suggest we might go about it?

Thanks,

Frank Grimes
-
RE: Combining AVRO files efficiently within HDFS
Dave Shine 2012-01-06, 20:05
Frank,

We have a very serious small file problem. I created an M/R job that combines files, as it seemed best to use all the resources of the cluster rather than opening a stream and combining files single-threaded or trying to do something via the command line.

Dave
-
Re: Combining AVRO files efficiently within HDFS
Steve Edison 2012-01-06, 20:45
I was exploring .har-based Hadoop archive files for a similar small log file scenario I have. I have millions of log files which are less than 64MB each, and I want to put them into HDFS and run analysis. I'm still exploring whether HDFS is a good option. Traditionally, what I have learnt is that HDFS isn't good for small files.
-Steve
-
Re: Combining AVRO files efficiently within HDFS
Joey Echeverria 2012-01-06, 20:56
I would use a MapReduce job to merge them.

-Joey
-
Re: Combining AVRO files efficiently within HDFS
Frank Grimes 2012-01-11, 21:29
Ok, so I wrote a MapReduce job to merge the files and it appears to be working with a limited input set. Thanks again, BTW.
However, if I increase the amount of input data I start getting the following types of errors:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out/file.out
or
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/map_0.out
Are there any logs I should be looking at to determine the exact cause of these errors? Are there any settings I could/should be increasing?
Note that in order to avoid unnecessary sorting overhead, I made each key a constant (1L) so that the logs are combined but ordering isn't necessarily preserved. i.e.
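import java.io.IOException;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.Reporter;

// Emits every event under the same constant key (1L).
public static class AvroReachMapper extends AvroMapper<DeliveryLogEvent, Pair<Long, DeliveryLogEvent>> {
    @Override
    public void map(DeliveryLogEvent levent, AvroCollector<Pair<Long, DeliveryLogEvent>> collector, Reporter reporter)
            throws IOException {
        collector.collect(new Pair<Long, DeliveryLogEvent>(1L, levent));
    }
}

// Writes each event back out unchanged; the key is ignored.
public static class Reduce extends AvroReducer<Long, DeliveryLogEvent, DeliveryLogEvent> {
    @Override
    public void reduce(Long key, Iterable<DeliveryLogEvent> values,
            AvroCollector<DeliveryLogEvent> collector, Reporter reporter)
            throws IOException {
        for (DeliveryLogEvent event : values) {
            collector.collect(event);
        }
    }
}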
I've also noticed that /tmp/mapred seems to fill up and doesn't automatically get cleaned out. Is Hadoop itself supposed to clean up those old temporary work files, or do we need a cron job for that?
Thanks,
Frank Grimes
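One consequence of the constant key (1L) in the mapper is worth spelling out: MapReduce sends all values for the same key to the same reducer, so every record funnels through a single reduce task. The sketch below only illustrates that arithmetic. It assumes hash partitioning of the form used by Hadoop's default `HashPartitioner`; `ConstantKeyPartitioning` and `partitionFor` are hypothetical names, and the real Avro job may wrap the key differently.

```java
// Illustration only: mirrors the arithmetic of a hash partitioner on the map output key.
final class ConstantKeyPartitioning {
    // Chooses a reduce task for a key: same key, same task.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 4;
        for (int record = 0; record < 3; record++) {
            Long key = 1L; // same constant key as in the mapper
            System.out.println("record " + record + " -> reducer " + partitionFor(key, reducers));
        }
    }
}
```

Sending everything to one reducer does produce one merged output file, but all of the data still passes through the shuffle and is spilled to local disk on the way, which is worth keeping in mind when local directories run out of space.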
-
Re: Combining AVRO files efficiently within HDFS
Frank Grimes 2012-01-12, 15:42
As it turns out, this is due to our /tmp partition being too small. We'll either need to increase it or put hadoop.tmp.dir on a bigger partition.
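A quick way to confirm a diagnosis like this is to look at the usable space on the partition that holds the temporary directory. The sketch below is a hypothetical helper in plain Java (`TmpSpaceCheck` and `hasRoom` are made-up names). It does not read the Hadoop configuration, so the directory to check is passed as an argument and defaults to the JVM's own temp directory.

```java
import java.io.File;

// Hypothetical helper; pass the directory that hadoop.tmp.dir points to as the first argument.
final class TmpSpaceCheck {
    // True if the partition holding 'dir' has at least 'requiredBytes' usable.
    static boolean hasRoom(File dir, long requiredBytes) {
        return dir.getUsableSpace() >= requiredBytes;
    }

    public static void main(String[] args) {
        File tmp = new File(args.length > 0 ? args[0] : System.getProperty("java.io.tmpdir"));
        System.out.println(tmp + " usable bytes: " + tmp.getUsableSpace());
    }
}
```

Comparing that figure with the size of the job's input gives a rough idea of whether the partition can hold the intermediate map output.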