Question on Key Grouping
Joey Krabacher 2012-12-04, 23:37
Is there a way to group Keys a second time before sending results to the Reducer in the same job? I thought maybe a combiner would do this for me, but it just acts like a reducer, so I need an intermediate step that acts like another mapper instead.
To try to visualize this, how I want it to work:
Map output:
<1, [{2, "John",""},{1, "",""},{1, "", "Doe"}]>
Combiner Output:
<1, [{1, "John",""},{1, "",""},{1, "", "Doe"}]>
Reduce Output:
<1, "John","Doe">
How it currently works:
Map output:
<1, [{2, "John",""},{1, "",""},{1, "", "Doe"}]>
Combiner Output:
<1, {1, "John",""}>
<1, {1, "",""}>
<1, {1, "", "Doe"}>
Reduce Output:
<1, "John","Doe">
<1, "John","Doe">
<1, "John","Doe">
So, basically, the issue is that even though the 2 in the first map record should really be a 1, I still need to extract the value "John" and have it included in the output for key 1.
Hope this makes sense.
Thanks in advance,
/* Joey */
RE: Question on Key Grouping
David Parks 2012-12-05, 01:43
The first thing to be wary of is your use of the combiner. The combiner *might* be run, it *might not* be run, and it *might be run multiple times*. Its only purpose is to reduce the amount of data going to the reducer, and Hadoop runs it only *if and when* it deems that likely to be useful. Don't put logic in it that your job depends on.
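To make the combiner caveat concrete, here is a minimal sketch in plain Python (a simulation, not the Hadoop API). The function names `merge_fields`, `count_records`, and `run` are hypothetical, and the combiner is simplified to run over all of a key's values at once, whereas Hadoop would apply it to partial map output. It shows that a combiner is only safe when the final result is the same after zero, one, or several passes.

```python
def merge_fields(values):
    """Collapse (first, last) pairs into one pair, keeping the first non-empty value per field."""
    first = next((f for f, _ in values if f), "")
    last = next((l for _, l in values if l), "")
    return [(first, last)]


def count_records(values):
    """Count the values seen. Unsafe as a combiner: a second pass counts the counts."""
    return [len(values)]


def run(values, combiner, reducer, combiner_passes):
    """Simulate one key's values going through N combiner passes and then the reducer."""
    for _ in range(combiner_passes):
        values = combiner(values)
    return reducer(values)


# Values for key 1, loosely based on the example in the question.
values = [("John", ""), ("", ""), ("", "Doe")]

# Safe: merging fields gives the same answer however often the combiner runs.
safe = [run(values, merge_fields, merge_fields, n) for n in range(3)]

# Unsafe: the count changes as soon as the combiner runs at all.
unsafe = [run(values, count_records, count_records, n) for n in range(3)]

print(safe)
print(unsafe)
```

The field merge is stable because its output has the same shape as its input and applying it again changes nothing; the record count is not, so it belongs in the reducer only.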
Although I didn't quite follow your example (it's not clear what your keys and values are), I think what you need is simply two map/reduce phases. The first phase groups by the first set of keys you need, reduces, and writes its output to disk (probably HDFS); the second phase reads that output and does the mapping you need. Most even modestly complex applications go through multiple map/reduce phases to accomplish their task. If all you need is two map phases, the first reduce phase can just be the identity reducer (org.apache.hadoop.mapreduce.Reducer), which writes the results of the first map phase straight out.
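The two-phase idea can be sketched as a plain Python simulation (again, not the Hadoop API). Everything here is an assumption made for illustration: the record layout `(id, first, last)`, the helper names, and in particular the `ALIASES` table, which stands in for whatever rule decides that id 2 really belongs with id 1.

```python
from collections import defaultdict

# Hypothetical alias table: record id 2 is assumed to belong with id 1.
ALIASES = {2: 1}


def shuffle(pairs):
    """Group (key, value) pairs by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def run_phase(records, mapper, reducer):
    """One map/reduce phase: map every record, shuffle, then reduce each key group."""
    mapped = [pair for record in records for pair in mapper(record)]
    output = []
    for key, values in sorted(shuffle(mapped).items()):
        output.extend(reducer(key, values))
    return output


def first_mapper(record):
    """Phase 1 map: key each record by its own id."""
    rec_id, first, last = record
    yield rec_id, (first, last)


def identity_reducer(key, values):
    """Phase 1 reduce: pass every value through unchanged."""
    for value in values:
        yield key, value


def regroup_mapper(pair):
    """Phase 2 map: re-key each record under its canonical id."""
    key, value = pair
    yield ALIASES.get(key, key), value


def merge_reducer(key, values):
    """Phase 2 reduce: keep the first non-empty value of each field."""
    first = next((f for f, _ in values if f), "")
    last = next((l for _, l in values if l), "")
    yield key, first, last


records = [(2, "John", ""), (1, "", ""), (1, "", "Doe")]
intermediate = run_phase(records, first_mapper, identity_reducer)  # would be written to HDFS
final = run_phase(intermediate, regroup_mapper, merge_reducer)
print(final)
```

In a real job each `run_phase` call would be a separate Hadoop job, with the second job's input path pointing at the first job's output directory.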
David