Hadoop >> mail # user >> bug in streaming?


Re: bug in streaming?
Yang,

I would not call it a bug; I would call it a potential optimization.  The
default input format for streaming tries to create one mapper per block,
but when the input has only one block it creates two mappers for it.
You can override the streaming input format to get different behavior.
The reality is that Hadoop does not do a very good job with small inputs;
for jobs this small it will be much slower than running without Hadoop.
We created the UberAM in Hadoop 2.0 to address some of this.  There is
still a lot of work to be done, and I am not sure what priority it has for
the different developers.  Feel free to file a JIRA if you want to.

--Bobby Evans

On 7/2/12 4:16 PM, "Yang" <[EMAIL PROTECTED]> wrote:

>if I set input to a file that contains just 1 line (which does not even
>contain "\n")
>
>and the mapper is
>
>-mapper " bash -c './a.sh' "
>
>and a.sh is
>
>echo -n "|"
>cat
>echo -n "|"
>
>I see 2 part- files generated in the output, which means 2 mappers were
>invoked, and one mapper consumed empty input, producing the output
>'||'
>but given such a small input file, we should definitely see only one
>mapper.
>
>this looks like a bug
>
>Yang
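
The '||' output described in the quoted message can be reproduced locally
without Hadoop.  The following is only a minimal sketch: it recreates a.sh
in a scratch directory and pipes input to it by hand, which approximates
but is not the same as a real streaming mapper.  The run_mapper helper and
the sample text 'hello' are made up for illustration, and the script is
run as "bash a.sh" rather than through -mapper " bash -c './a.sh' ".

```shell
#!/usr/bin/env bash
# Recreate a.sh from the quoted message in a scratch directory.
workdir=$(mktemp -d)
cat > "$workdir/a.sh" <<'EOF'
echo -n "|"
cat
echo -n "|"
EOF

# run_mapper is a hypothetical helper (not part of Hadoop): it feeds its
# stdin to a.sh, roughly the way a streaming mapper receives its split.
run_mapper() {
  bash "$workdir/a.sh"
}

# A mapper that receives the one-line input (no trailing newline).
with_input=$(printf 'hello' | run_mapper)

# A mapper that receives an empty split.
empty_input=$(printf '' | run_mapper)

echo "with input:  $with_input"
echo "empty split: $empty_input"

rm -rf "$workdir"
```

The second case shows why an extra mapper with nothing to read still
leaves a part- file containing '||': the two echo commands run whether or
not cat has any input to copy.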