-
Preferred ways to specify input and output directories to Hadoop jobs
W.P. McNeill 2012-02-08, 18:00
How do you like to specify input and output directories to your Hadoop jobs?
I have been using positional arguments. All but the last argument are input directories and the last one is an output directory. These override any mapred.output.dir configuration parameter and augment any mapred.input.dir. I like positional arguments because it's a very natural UNIXy way of doing things. However, the more I use this convention, the more complex it seems to me. For instance, you have to decide what to do when there's only one positional argument. Or maybe there are scenarios in which you want the positional input directories to overwrite the configurational ones. More generally, you have to figure out how to reconcile positional and configurational arguments. Now I'm leaning towards only using the mapred.input.dir and mapred.output.dir parameters.
What do other people do?
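A minimal sketch of the convention described above, for illustration only. A plain Map stands in for Hadoop's job configuration so that the example runs without Hadoop on the classpath. The class name PositionalDirs and the method reconcile are hypothetical, and rejecting a lone positional argument is just one possible answer to the single-argument question. The sketch assumes mapred.input.dir holds a comma-separated list of paths and does not handle paths that themselves contain commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: a Map stands in for the Hadoop job configuration.
final class PositionalDirs {
    static final String INPUT_KEY = "mapred.input.dir";
    static final String OUTPUT_KEY = "mapred.output.dir";

    /**
     * Returns a copy of conf in which all but the last positional argument
     * are appended to mapred.input.dir and the last one replaces
     * mapred.output.dir.
     */
    public static Map<String, String> reconcile(Map<String, String> conf, String[] args) {
        Map<String, String> merged = new HashMap<>(conf);
        if (args.length == 0) {
            return merged; // no positional arguments: use the configuration unchanged
        }
        if (args.length == 1) {
            // Assumption: refuse to guess whether a lone argument is input or output.
            throw new IllegalArgumentException(
                "Expected at least one input directory followed by an output directory");
        }
        List<String> inputs = new ArrayList<>();
        String configured = merged.get(INPUT_KEY);
        if (configured != null && !configured.isEmpty()) {
            inputs.addAll(Arrays.asList(configured.split(",")));
        }
        // Positional inputs augment the configured ones.
        inputs.addAll(Arrays.asList(args).subList(0, args.length - 1));
        merged.put(INPUT_KEY, String.join(",", inputs));
        // The positional output overrides the configured one.
        merged.put(OUTPUT_KEY, args[args.length - 1]);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(INPUT_KEY, "/data/a");
        conf.put(OUTPUT_KEY, "/old/out");
        Map<String, String> merged =
            reconcile(conf, new String[] {"/data/b", "/data/c", "/new/out"});
        System.out.println(INPUT_KEY + " = " + merged.get(INPUT_KEY));
        System.out.println(OUTPUT_KEY + " = " + merged.get(OUTPUT_KEY));
    }
}
```

Writing the rule down this way makes the open questions visible: the single-argument branch and the choice to augment rather than overwrite the inputs are both policy decisions that the driver has to make explicitly.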
-
Re: Preferred ways to specify input and output directories to Hadoop jobs
bejoy.hadoop@... 2012-02-08, 18:13
Hi. When you pass the directories as arguments on the command line, it is your driver class that assigns them to mapred.input.dir and mapred.output.dir. I believe there is no default in the MapReduce framework that assigns positional arguments to the input and output directories. If you don't want your driver class to do this assignment, you can set the same properties from the CLI with -D mapred.input.dir=myInputDir and -D mapred.output.dir=myOutputDir. Either way the result is the same, so choose whichever is more comfortable for you. Regards, Bejoy K S
From handheld, Please excuse typos.
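To make the two styles concrete, here is a rough sketch of how a command line can be split into -D properties and remaining positional arguments. This is not Hadoop's GenericOptionsParser; it is a simplified stand-in in plain Java, and the class name DashDOptions and its fields are hypothetical. It only accepts the form -D key=value, where key=value is a single token with no spaces around the equals sign. In a real job, -D options are only interpreted if the driver passes its arguments through Hadoop's generic option handling (for example by running via ToolRunner).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for generic option parsing; not a Hadoop class.
final class DashDOptions {
    final Map<String, String> properties = new LinkedHashMap<>();
    final List<String> positional = new ArrayList<>();

    /** Splits args into "-D key=value" properties and everything else. */
    public static DashDOptions parse(String[] args) {
        DashDOptions result = new DashDOptions();
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-D")) {
                if (i + 1 >= args.length) {
                    throw new IllegalArgumentException("-D requires key=value");
                }
                String pair = args[++i];
                int eq = pair.indexOf('=');
                if (eq <= 0) {
                    // In this sketch, "key = value" written with spaces arrives
                    // as three tokens and fails here.
                    throw new IllegalArgumentException(
                        "Expected key=value after -D, got: " + pair);
                }
                result.properties.put(pair.substring(0, eq), pair.substring(eq + 1));
            } else {
                // Anything that is not a -D pair is left for the driver to interpret.
                result.positional.add(args[i]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        DashDOptions opts = parse(new String[] {
            "-D", "mapred.input.dir=myInputDir",
            "-D", "mapred.output.dir=myOutputDir"});
        System.out.println(opts.properties);
        System.out.println(opts.positional);
    }
}
```

Whatever ends up in the positional list is what a driver class would then have to map onto mapred.input.dir and mapred.output.dir itself, which is the assignment described above.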
-
Re: Preferred ways to specify input and output directories to Hadoop jobs
W.P. McNeill 2012-02-08, 18:17
Right. There is no default mechanism in Hadoop for using positional arguments as input/output directory parameters. What I'm wondering is whether other people have done what I did and written this mechanism themselves.
-
Re: Preferred ways to specify input and output directories to Hadoop jobs
bejoy.hadoop@... 2012-02-08, 18:21
From what I know, a lot of people, including us, provide the input and output directories as positional arguments to custom MapReduce jobs, along with whatever other positional arguments each job requires. Our custom driver class for each job assigns these positional arguments to mapred.input.dir, mapred.output.dir, and other mapred parameters. Regards, Bejoy K S
From handheld, Please excuse typos.