Re: MapReduce: Reducers partitions.
So.

I looked at the code, and I have one comment/suggestion here.

If the table we are outputting to has regions, then partitions are built
around them, and that's fine. But if the table is completely empty with a
single region, then even if we setNumReduceTasks to 2 or more, all the keys
will go to the same first reducer because of this:
    if (this.startKeys.length == 1){
      return 0;
    }
I think it would be better to return something like keycrc % numPartitions
instead. That would still allow the application to spread the reduce work
over multiple nodes (racks) even when there is only one region in the table.
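
As a sketch of that fallback idea (illustrative only, not the actual
HRegionPartitioner patch; the class and method names here are my own
assumptions), hashing the key with a CRC keeps the partition stable for a
given key while spreading keys across all reducers:

```java
import java.util.zip.CRC32;

public class SingleRegionFallback {
    // Hypothetical fallback: hash the row key so rows spread across all
    // reducers even when the output table has only one region.
    static int partitionForKey(byte[] key, int numPartitions) {
        CRC32 crc = new CRC32();
        crc.update(key);
        // getValue() returns the unsigned 32-bit CRC in a long, so it is
        // non-negative and the modulo is safe.
        return (int) (crc.getValue() % numPartitions);
    }

    public static void main(String[] args) {
        int p1 = partitionForKey("row-0001".getBytes(), 4);
        int p2 = partitionForKey("row-0001".getBytes(), 4);
        System.out.println(p1 == p2);      // same key -> same partition
        System.out.println(p1 >= 0 && p1 < 4);
    }
}
```

Because the CRC is deterministic, the MapReduce contract (all records with
the same key reach the same reducer) still holds; only the region-locality
property is given up, which is moot for a single-region table.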

In my use case, I have millions of lines producing some statistics. At the
end, I will have only about 600 lines, but it takes a lot of map and
reduce time to go from millions to 600, which is why I'm looking to have
more than one reducer. However, with only 600 lines, it's very difficult
to pre-split the table; the keys are all very close.

Does anyone see anything wrong with changing this default behaviour when
startKeys.length == 1? If not, I will open a JIRA and upload a patch.

JM

2013/4/10 Jean-Marc Spaggiari <[EMAIL PROTECTED]>

> Thanks Ted.
>
> That's exactly where I was looking just now. I was close. I will take a
> deeper look.
>
> Thanks Nitin for the link. I will read that too.
>
> JM
>
> 2013/4/10 Nitin Pawar <[EMAIL PROTECTED]>
>
>> To add what Ted said,
>>
>> the same discussion happened on the question Jean asked:
>>
>> https://issues.apache.org/jira/browse/HBASE-1287
>>
>>
>> On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>>
>> > Jean-Marc:
>> > Take a look at HRegionPartitioner which is in both mapred and mapreduce
>> > packages:
>> >
>> >  * This is used to partition the output keys into groups of keys.
>> >
>> >  * Keys are grouped according to the regions that currently exist
>> >
>> >  * so that each reducer fills a single region so load is distributed.
>> >
>> > Cheers
>> >
>> > On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
>> > [EMAIL PROTECTED]> wrote:
>> >
>> > > Hi Nitin,
>> > >
>> > > You got my question correctly.
>> > >
>> > > However, I'm wondering how it works when it's done in HBase. Do we
>> > > have default partitioners with the same guarantee that records
>> > > mapping to one key go to the same reducer? Or do we have to
>> > > implement this on our own?
>> > >
>> > > JM
>> > >
>> > > 2013/4/10 Nitin Pawar <[EMAIL PROTECTED]>:
>> > > > I hope I understood what you are asking; if not, then pardon me :)
>> > > > A few lines from the Hadoop developer handbook:
>> > > >
>> > > > The *Partitioner* class determines which partition a given
>> > > > (key, value) pair will go to. The default partitioner computes a
>> > > > hash value for the key and assigns the partition based on this
>> > > > result. It guarantees that all the records mapping to one key go
>> > > > to the same reducer.
>> > > >
>> > > > You can write your own custom partitioner as well. Here is the
>> > > > link:
>> > > >
>> > > > http://developer.yahoo.com/hadoop/tutorial/module5.html#partitioning
>> > > >
>> > > >
>> > > >
>> > > >
>> > > > On Wed, Apr 10, 2013 at 6:19 PM, Jean-Marc Spaggiari <
>> > > > [EMAIL PROTECTED]> wrote:
>> > > >
>> > > >> Hi,
>> > > >>
>> > > >> Quick question: how is the data from the map tasks partitioned
>> > > >> for the reducers?
>> > > >>
>> > > >> If there is 1 reducer, it's easy, but if there are more, are all
>> > > >> the same keys guaranteed to end up on the same reducer? Or not
>> > > >> necessarily? If they are not, how can we provide a partitioning
>> > > >> function?
>> > > >>
>> > > >> Thanks,
>> > > >>
>> > > >> JM
>> > > >>
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > Nitin Pawar
>> > >
>> >
>>
>>
>>
>> --
>> Nitin Pawar
>>
>
>
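
For reference, the default behaviour Nitin describes in the quoted thread
(equal keys always landing on the same reducer) can be sketched like this.
This is a minimal standalone sketch of what Hadoop's stock hash
partitioner does, not the actual Hadoop class:

```java
public class DefaultHashPartitionDemo {
    // Mirrors the default hash partitioning scheme: mask off the sign bit
    // of hashCode() so the result is non-negative, then take the modulo.
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition, so all records
        // for that key reach the same reducer.
        int a = getPartition("user-42", 5);
        int b = getPartition("user-42", 5);
        System.out.println(a == b);          // same key -> same partition
        System.out.println(a >= 0 && a < 5);
    }
}
```

HBase's HRegionPartitioner replaces this hash with a lookup against region
start keys, which is exactly why a single-region table collapses onto one
reducer, as discussed at the top of the thread.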