HDFS, mail # user - How to best decide mapper output/reducer input for a huge string?


Pavan Sudheendra 2013-09-21, 06:32
I need to improve my MR jobs, which use HBase as both source and sink.

Basically, I'm reading data from 3 HBase tables in the mapper, writing the records out as one huge string for the reducer to do some computation on and dump into an HBase table.

Table1 ~ 19 million rows.
Table2 ~ 2 million rows.
Table3 ~ 900,000 rows.

The output of the mapper is something like this:

HouseHoldId contentID name duration genre type channelId personId
televisionID timestamp

I'm interested in sorting on the basis of the HouseHoldId value, so I'm using this technique. I'm not interested in the V part of the pair, so I'm more or less ignoring it. My mapper class is defined as follows:

public static class AnalyzeMapper extends TableMapper<Text, IntWritable> { }
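For concreteness, here is a minimal plain-Java sketch of how the mapper's output key could be assembled from the fields in the record layout above. The `buildKey` helper and the sample field values are hypothetical; in the real mapper the values would come from HBase `Result` objects:

```java
import java.util.StringJoiner;

public class KeySketch {
    // Hypothetical helper: joins the fields from the record layout into one
    // tab-separated string, with HouseHoldId first so the shuffle sorts
    // records by household.
    static String buildKey(String houseHoldId, String contentId, String name,
                           long duration, String genre, String type,
                           String channelId, String personId,
                           String televisionId, long timestamp) {
        StringJoiner sj = new StringJoiner("\t");
        sj.add(houseHoldId).add(contentId).add(name)
          .add(Long.toString(duration)).add(genre).add(type)
          .add(channelId).add(personId).add(televisionId)
          .add(Long.toString(timestamp));
        return sj.toString();
    }

    public static void main(String[] args) {
        String key = buildKey("HH42", "C7", "news", 60L, "info", "live",
                              "ch9", "P1", "TV3", 1379744000L);
        System.out.println(key);
    }
}
```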

My MR job takes 22 hours to complete, which is not desirable at all. I'm supposed to optimize this somehow so it runs a lot faster.

scan.setCaching(750);
scan.setCacheBlocks(false);

TableMapReduceUtil.initTableMapperJob(
        Table1,                 // input HBase table name
        scan,
        AnalyzeMapper.class,    // mapper
        Text.class,             // mapper output key
        IntWritable.class,      // mapper output value
        job);

TableMapReduceUtil.initTableReducerJob(
        OutputTable,                // output table
        AnalyzeReducerTable.class,  // reducer class
        job);
job.setNumReduceTasks(RegionCount);

My HBase Table1 has 21 regions, so 21 mappers are spawned. We are running an 8-node Cloudera cluster.

Should I use a custom SortComparator or a GroupingComparator?
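To make the comparator question concrete, here is a plain-Java sketch of the comparison logic such a comparator would implement, assuming the map output key is a tab-separated string with HouseHoldId as the first field. (In a real Hadoop job this logic would live in a `WritableComparator` over `Text` registered via `job.setSortComparatorClass(...)` or `job.setGroupingComparatorClass(...)`; the class and key layout here are assumptions for illustration.)

```java
import java.util.*;

// Compares keys only by their first tab-separated field (HouseHoldId),
// so all records for one household sort/group together regardless of
// the rest of the line.
public class HouseHoldIdComparator implements Comparator<String> {
    private static String householdId(String key) {
        int tab = key.indexOf('\t');
        return tab < 0 ? key : key.substring(0, tab);
    }

    @Override
    public int compare(String a, String b) {
        return householdId(a).compareTo(householdId(b));
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(Arrays.asList(
                "HH2\tcontentB\t120",
                "HH1\tcontentC\t30",
                "HH1\tcontentA\t60"));
        keys.sort(new HouseHoldIdComparator());
        System.out.println(keys);
    }
}
```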
--
Regards-
Pavan