MapReduce >> mail # user >> MultipleOutputs is not working properly when dfs.block.size is changed


Dino Kečo 2011-08-18, 08:30
Harsh J 2011-08-18, 10:09
Re: MultipleOutputs is not working properly when dfs.block.size is changed
Hi Harsh,

I am using CDH3_U0 (0.20.2 hadoop version).

I can't share my code because of company rules, but these are the steps I
perform:
CASE1:
 - Use the text input format to read content from the file
 - Perform the record transformation in the mapper
 - Write output using the text output format

While running this step I am passing the -Ddfs.block.size parameter via the
GenericOptionsParser.

In this case everything is working as expected.
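Concretely, the CASE1 invocation would look something like this (the jar name, driver class, block size, and paths are placeholders, not taken from the thread; GenericOptionsParser consumes the -D pairs before the job sees its own arguments):

```sh
# Hypothetical invocation; jar, driver class, and paths are placeholders.
# GenericOptionsParser strips -Dkey=value pairs ahead of the job arguments.
hadoop jar transform-job.jar MyDriver \
  -Ddfs.block.size=67108864 \
  /input/records /output/records
```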

CASE2:
 - Use the text input format to read content from the file
 - Perform the record transformation in the mapper
 - If the transformation succeeds, write the output to the "successful" file
using multiple outputs
 - If the transformation fails, write the output to the "failed" file using
multiple outputs

In the mapper's setup method I create an instance of MultipleOutputs
(MultipleOutputs outputs = new MultipleOutputs(context)). In the map method I
call outputs.write("successful", K, V) or outputs.write("failed", K, V) based
on the result of the transformation logic.
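For reference, a minimal sketch of a mapper along those lines (class and method names other than the named outputs are illustrative, and the transform() helper is hypothetical). One detail worth checking against the code: the new-API MultipleOutputs buffers per-output record writers, so close() must be called in cleanup(), otherwise trailing buffered data can be silently lost:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch of the mapper described above: records that transform cleanly go to
// the "successful" named output, the rest to "failed".
public class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private MultipleOutputs<Text, Text> outputs;

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<Text, Text>(context);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            Text transformed = transform(value);  // hypothetical helper
            outputs.write("successful", new Text(key.toString()), transformed);
        } catch (Exception e) {
            outputs.write("failed", new Text(key.toString()), value);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Without this close(), buffered output can be truncated.
        outputs.close();
    }

    private Text transform(Text value) {
        return value;  // placeholder for the real record transformation
    }
}
```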

I configure the multiple outputs using the GenericOptionsParser:

-Dmapreduce.inputformat.class=org.apache.hadoop.mapreduce.lib.input.TextInputFormat
-Dmapreduce.map.class=MyMapper
-Dmapreduce.multipleoutputs="successful failed"
-Dmapreduce.multipleoutputs.namedOutput.successful.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
-Dmapreduce.multipleoutputs.namedOutput.successful.key=org.apache.hadoop.io.Text
-Dmapreduce.multipleoutputs.namedOutput.successful.value=org.apache.hadoop.io.Text
-Dmapreduce.multipleoutputs.namedOutput.failed.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat
-Dmapreduce.multipleoutputs.namedOutput.failed.key=org.apache.hadoop.io.Text
-Dmapreduce.multipleoutputs.namedOutput.failed.value=org.apache.hadoop.io.Text
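Those properties mirror what MultipleOutputs.addNamedOutput() writes into the job configuration; declared programmatically in the driver, the equivalent would look roughly like this (the driver class, job name, and surrounding setup are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "record-transform");  // illustrative job name
        // Register the two named outputs used by the mapper.
        MultipleOutputs.addNamedOutput(job, "successful",
                TextOutputFormat.class, Text.class, Text.class);
        MultipleOutputs.addNamedOutput(job, "failed",
                TextOutputFormat.class, Text.class, Text.class);
        // ... input/output paths, mapper class, etc. would be set here ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```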

While running this step I am passing the -Ddfs.block.size parameter via the
GenericOptionsParser. Depending on the block size I lose data in the output
file: in some cases half of a line is missing, in other cases the last couple
of lines. One thing I have noticed is that the file size is always exactly
<integer>*<block_size>; there is never a partially filled final block.
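That observation can be checked directly by printing the file length next to its block size (the path and part-file name below are placeholders; in the hadoop fs -stat format string, %b is the file length in bytes and %o the block size):

```sh
# If the report above holds, length % blocksize should be 0 for these files.
hadoop fs -stat "%b %o" /output/records/successful-m-00000
```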

Hope this helps.

thanks,
dino

On Thu, Aug 18, 2011 at 12:09 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> Dino,
>
> Need some more information:
> - Version of Hadoop?
> - Do you have a runnable sample test case to reproduce this? Or can
> you describe roughly the steps you are performing to create an output?
>
> FWIW, I ran the trunk's MO tests and those seem to pass for both APIs,
> but they do not change dfs.block.size, although I fail to see the
> relation between these.
>
> On Thu, Aug 18, 2011 at 2:00 PM, Dino Kečo <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I have been working on Hadoop jobs which write output into multiple
> > files. In the Hadoop API I found the class MultipleOutputs, which
> > implements this functionality.
> >
> > My use case is to change the HDFS block size in one job to increase
> > parallelism, and I am doing that using the dfs.block.size configuration
> > property. Part of the output file is missing when I change this property
> > (the last couple of lines in some cases, half of a line in others).
> >
> > I was doing some debugging, and everything looks fine before calling
> > outputs.write("successful", KEY, VALUE);
> >
> > For the output format I am using TextOutputFormat. When I remove
> > MultipleOutputs from my code everything works fine.
> >
> > Is there something I am doing wrong, or is there an issue with
> > MultipleOutputs?
> >
> > regards,
> > dino
> >
>
>
>
> --
> Harsh J
>