Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # user - MapReduce: Reducers partitions.


Copy link to this message
-
Re: MapReduce: Reducers partitions.
Jean-Marc Spaggiari 2013-04-10, 19:01
Hi Greame,

No. The reducer will simply write on the table the same way you are doing a
regular Put. If a split is required because of the size, then the region
will be split, but at the end, there will not necessary be any region
split.

In the usecase described below, all the 600 lines will "simply" go into the
only region in the table and no split will occur.

The goal is to partition the data for the reducer only. Not in the table.

JM

2013/4/10 Graeme Wallace <[EMAIL PROTECTED]>

> Whats the behavior then if you return hash % num_reducers and you have no
> splits defined. When the reducer writes to the table does the region server
> local to the reducer create a new region ?
>
> Graeme
>
>
> On Wed, Apr 10, 2013 at 1:26 PM, Jean-Marc Spaggiari <
> [EMAIL PROTECTED]> wrote:
>
> > So.
> >
> > I looked at the code, and I have one comment/suggestion here.
> >
> > If the table we are outputing to has regions, then partitions are build
> > around that, and that's fine. But if the table is totally empty with a
> > single region, even if we setNumReduceTasks to 2 or more, all the keys
> will
> > go on the same first reducer because of this:
> >     if (this.startKeys.length == 1){
> >       return 0;
> >     }
> > I think it will be better to return something like keycrc%numPartitions
> > instead. That still allow the application to spread the reducing process
> > over multinode(racks) even if there is only one region in the table.
> >
> > In my usecase, I have millions of lines producing some statistics. At the
> > end, I will have only about 600 lines, but it will take a lot of map and
> > reduce time to go from millions to 600, that's why I'm looking to have
> more
> > than one reducer. However, with only 600 lines, it's very difficult to
> > pre-split the table. Keys are all very close.
> >
> > Does anyone see anything wrong with changing this default behaviour when
> > startKeys.length == 1? If not, I will open a JIRA and upload a patch.
> >
> > JM
> >
> > 2013/4/10 Jean-Marc Spaggiari <[EMAIL PROTECTED]>
> >
> > > Thanks Ted.
> > >
> > > It's exactly where I was looking at now. I was close. I will take a
> > deeper
> > > look.
> > >
> > > Thanks Nitin for the link. I will read that too.
> > >
> > > JM
> > >
> > > 2013/4/10 Nitin Pawar <[EMAIL PROTECTED]>
> > >
> > >> To add what Ted said,
> > >>
> > >> the same discussion happened on the question Jean asked
> > >>
> > >> https://issues.apache.org/jira/browse/HBASE-1287
> > >>
> > >>
> > >> On Wed, Apr 10, 2013 at 7:28 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >>
> > >> > Jean-Marc:
> > >> > Take a look at HRegionPartitioner which is in both mapred and
> > mapreduce
> > >> > packages:
> > >> >
> > >> >  * This is used to partition the output keys into groups of keys.
> > >> >
> > >> >  * Keys are grouped according to the regions that currently exist
> > >> >
> > >> >  * so that each reducer fills a single region so load is
> distributed.
> > >> >
> > >> > Cheers
> > >> >
> > >> > On Wed, Apr 10, 2013 at 6:54 AM, Jean-Marc Spaggiari <
> > >> > [EMAIL PROTECTED]> wrote:
> > >> >
> > >> > > Hi Nitin,
> > >> > >
> > >> > > You got my question correctly.
> > >> > >
> > >> > > However, I'm wondering how it's working when it's done into HBase.
> > Do
> > >> > > we have defaults partionners so we have the same garantee that
> > records
> > >> > > mapping to one key go to the same reducer. Or do we have to
> > implement
> > >> > > this one our own.
> > >> > >
> > >> > > JM
> > >> > >
> > >> > > 2013/4/10 Nitin Pawar <[EMAIL PROTECTED]>:
> > >> > > > I hope i understood what you are asking is this . If not then
> > >> pardon me
> > >> > > :)
> > >> > > > from the hadoop developer handbook few lines
> > >> > > >
> > >> > > > The*Partitioner* class determines which partition a given (key,
> > >> value)
> > >> > > pair
> > >> > > > will go to. The default partitioner computes a hash value for
> the
> > >> key
> > >> > and
> > >> > > > assigns the partition based on this result. It garantees that