Hadoop >> mail # user >> Question on Key Grouping


RE: Question on Key Grouping
The first thing to be wary of is your use of the combiner. The combiner *might*
be run, it *might not* be run, and it *might be run multiple times*. It exists
only to reduce the amount of data going to the reducer, and Hadoop runs it only
*if and when* it deems that likely to be useful. Don't put logic in it that
your results depend on.
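
To make that concrete, here is a minimal plain-Java sketch (no Hadoop classes; the class and method names are made up for illustration) that simulates a combiner being applied zero, one, or several times. It assumes a simple sum as the operation. The point is that the final reduce output has to come out the same in every case, which only holds when the combiner does nothing beyond pre-aggregating:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation, not the Hadoop API: all names here are hypothetical.
final class CombinerSketch {

    // Combiner: collapses a list of partial sums into a single partial sum.
    static List<Integer> combine(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        List<Integer> out = new ArrayList<>();
        out.add(sum);
        return out;
    }

    // Reducer: this is where the real logic lives (here, the final sum).
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Applies the combiner `times` times (possibly zero) before reducing.
    static int run(List<Integer> mapOutput, int times) {
        List<Integer> current = mapOutput;
        for (int i = 0; i < times; i++) {
            current = combine(current);
        }
        return reduce(current);
    }

    public static void main(String[] args) {
        List<Integer> mapOutput = Arrays.asList(1, 2, 3);
        // The result must not depend on how often the combiner ran.
        System.out.println(run(mapOutput, 0)); // prints 6
        System.out.println(run(mapOutput, 1)); // prints 6
        System.out.println(run(mapOutput, 3)); // prints 6
    }
}
```

If the combiner did anything else, such as rewriting keys or dropping records, the three runs could disagree, and that is the situation to avoid.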

 

Although I didn't quite follow your example (it's not clear what your keys
and values are), I think what you need is simply two map/reduce jobs. The
first job groups on the first set of keys you need, reduces, and writes its
output to disk (probably HDFS); the second job reads that output as its input
and does the additional mapping and grouping you need. Most even modestly
complex applications chain multiple map/reduce phases to accomplish their
task. If all you need is two map phases, the first reduce phase can just be
the identity reducer (the base class org.apache.hadoop.mapreduce.Reducer),
which writes the results of the first map phase straight out.
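
As a rough illustration of the two-phase idea, here is a plain-Java sketch that simulates both phases in memory rather than using the Hadoop API. Since the keys and values in the question are unclear, it makes some loud assumptions: each record is (id, first name, last name), and a hypothetical alias table says that id 2 really belongs to id 1. The names TwoPhaseSketch, Rec, phaseOne and phaseTwo are all invented for this example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory simulation of two chained map/reduce phases (not Hadoop code).
final class TwoPhaseSketch {

    // Assumed record shape: (id, first name, last name).
    static final class Rec {
        final int id;
        final String first;
        final String last;

        Rec(int id, String first, String last) {
            this.id = id;
            this.first = first;
            this.last = last;
        }
    }

    // Phase 1: the map step rewrites each id through the (hypothetical) alias
    // table; the reduce step is the identity, so records pass straight through.
    static List<Rec> phaseOne(List<Rec> input, Map<Integer, Integer> aliases) {
        List<Rec> out = new ArrayList<>();
        for (Rec r : input) {
            int id = aliases.getOrDefault(r.id, r.id);
            out.add(new Rec(id, r.first, r.last));
        }
        return out;
    }

    // Phase 2: group by the corrected id and merge the non-empty fields.
    static Map<Integer, String> phaseTwo(List<Rec> input) {
        Map<Integer, String[]> merged = new TreeMap<>();
        for (Rec r : input) {
            String[] names = merged.computeIfAbsent(r.id, k -> new String[] {"", ""});
            if (!r.first.isEmpty()) {
                names[0] = r.first;
            }
            if (!r.last.isEmpty()) {
                names[1] = r.last;
            }
        }
        Map<Integer, String> out = new TreeMap<>();
        for (Map.Entry<Integer, String[]> e : merged.entrySet()) {
            out.put(e.getKey(), e.getValue()[0] + "," + e.getValue()[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> mapOutput = Arrays.asList(
                new Rec(2, "John", ""), new Rec(1, "", ""), new Rec(1, "", "Doe"));
        Map<Integer, Integer> aliases = Collections.singletonMap(2, 1);
        System.out.println(phaseTwo(phaseOne(mapOutput, aliases))); // prints {1=John,Doe}
    }
}
```

In a real job each phase would be its own Job, with the second job's input path pointed at the first job's output directory. The sketch only shows where the re-keying and the merging would each live.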

 

David

 

 

From: Joey Krabacher [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, December 05, 2012 6:37 AM
To: [EMAIL PROTECTED]
Subject: Question on Key Grouping

 

Is there a way to group keys a second time before sending results to the
reducer in the same job? I thought a combiner might do this for me, but it
just acts like a reducer, so I need an intermediate step that acts like
another mapper instead.

 

To visualize this, here is how I want it to work:

 

Map output:

 

<1, [{2, "John",""},{1, "",""},{1, "", "Doe"}]>

 

Combiner Output:

 

<1, [{1, "John",""},{1, "",""},{1, "", "Doe"}]>

 

Reduce Output:

 

<1, "John","Doe">

 

 

How it currently works:

 

Map output:

 

<1, [{2, "John",""},{1, "",""},{1, "", "Doe"}]>

 

Combiner Output:

 

<1, {1, "John",""}>

<1, {1, "",""}>

<1, {1, "", "Doe"}>

 

Reduce Output:

 

<1, "John","Doe">

<1, "John","Doe">

<1, "John","Doe">

 

 

So, basically, the issue is that even though the 2 in the first map record
should really be a 1, I still need to extract the value "John" and have it
included in the output for key 1.

 

Hope this makes sense.

 

Thanks in advance,

/* Joey */
