
Alberto Cordioli 2012-10-15, 20:39
Dave Beech 2012-10-15, 20:49
Alberto Cordioli 2012-10-16, 08:42
Re: GroupingComparator
Great! Glad the problem is solved.

You're right - the object returned by iterator.next() is re-used too.
So yes, you would need to clone in this case and you'd have no choice
but to create new objects.

Please be sure though that you really do need to store values in a
list to do what you're trying to do. Keeping an in-memory collection
might not be very scalable. Obviously, if you've got loads of RAM or
not a lot of data (or both), then that's fine! Just something else to
think about...
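
To illustrate the point about the re-used value object, here is a minimal,
self-contained sketch. It does not use Hadoop: IntBox and reusingIterable are
hypothetical stand-ins for a mutable Writable and for the values Iterable
passed to reduce. Storing the reference returned by next() keeps the same
object many times over, while storing a copy keeps each value.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a mutable Hadoop value type such as IntWritable.
final class IntBox {
    private int value;
    IntBox(int value) { this.value = value; }
    int get() { return value; }
    void set(int value) { this.value = value; }
}

class ValueReuseSketch {
    // Mimics the re-use described above: one IntBox is refilled per element.
    static Iterable<IntBox> reusingIterable(int... data) {
        return () -> new Iterator<IntBox>() {
            private final IntBox shared = new IntBox(0);
            private int i = 0;
            public boolean hasNext() { return i < data.length; }
            public IntBox next() { shared.set(data[i++]); return shared; }
        };
    }

    // Keeps the references: every list entry is the same shared object.
    static List<Integer> storeReferences(Iterable<IntBox> values) {
        List<IntBox> kept = new ArrayList<>();
        for (IntBox v : values) kept.add(v);
        return unwrap(kept);
    }

    // Keeps copies: one new object per retained value.
    static List<Integer> storeCopies(Iterable<IntBox> values) {
        List<IntBox> kept = new ArrayList<>();
        for (IntBox v : values) kept.add(new IntBox(v.get()));
        return unwrap(kept);
    }

    private static List<Integer> unwrap(List<IntBox> boxes) {
        List<Integer> out = new ArrayList<>();
        for (IntBox b : boxes) out.add(b.get());
        return out;
    }

    public static void main(String[] args) {
        System.out.println(storeReferences(reusingIterable(1, 2, 3))); // [3, 3, 3]
        System.out.println(storeCopies(reusingIterable(1, 2, 3)));     // [1, 2, 3]
    }
}
```

The copy costs one allocation per stored value, which is why the list only
makes sense when the group is known to fit in memory.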


On 16 October 2012 09:42, Alberto Cordioli <[EMAIL PROTECTED]> wrote:
> Thanks Dave.
> You solved my problem. Just a little question about your tip:
> I suppose also the value returned by iterator.next() is re-used.
> So if I want to store some values of the Iterable list in the reducer, I
> should create a List and put cloned objects inside it.
> In this case there is no possibility to avoid the "new" operator, right?
> On 15 October 2012 22:49, Dave Beech <[EMAIL PROTECTED]> wrote:
>> Well, if all you need is the tag (the 1 or 2), why not just use a Text
>> or IntWritable instance variable? You wouldn't need to clone the whole
>> key.
>> Then, instead of tag = key.getSecondField() you'd say
>> tag.set(key.getSecondField().get());
>> I don't know what type of object tag is (if it's Text you'll say
>> toString() rather than get()), but you see what I mean.
>> Also - just a tip - try to avoid creating new objects wherever
>> possible. You'll get better performance if you create one Text object
>> as an instance variable and re-use it by setting the value instead of
>> calling new Text("") on every output.
>> Thanks,
>> Dave
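
The tip quoted above (copy the tag into an instance variable with set()
rather than keeping a reference or allocating a new object) can be sketched
without Hadoop as follows. IntHolder is a hypothetical stand-in for
IntWritable, and the keyTags array is an assumed way of modelling how the
framework overwrites the key's second field while the values are iterated.

```java
import java.util.Arrays;

// Hypothetical stand-in for a mutable Hadoop type such as IntWritable.
final class IntHolder {
    private int value;
    int get() { return value; }
    void set(int value) { this.value = value; }
}

class TagCopySketch {
    // Allocated once and refilled on every call, instead of "new" per record.
    private final IntHolder tag = new IntHolder();

    // keyTags[i] is the second key field that would be visible while the
    // i-th value of the group is current (assumed non-empty).
    // Returns the tag that would be emitted for each value.
    int[] reduce(IntHolder reusedKeyField, int[] keyTags) {
        int[] emitted = new int[keyTags.length];
        reusedKeyField.set(keyTags[0]);     // key state on entry to reduce
        tag.set(reusedKeyField.get());      // copy the value, not the reference
        for (int i = 0; i < keyTags.length; i++) {
            reusedKeyField.set(keyTags[i]); // the key is overwritten in place
            emitted[i] = tag.get();         // the copied tag is unaffected
        }
        return emitted;
    }

    public static void main(String[] args) {
        int[] out = new TagCopySketch().reduce(new IntHolder(), new int[] {1, 2, 2});
        System.out.println(Arrays.toString(out)); // [1, 1, 1]
    }
}
```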
>> On 15 October 2012 21:39, Alberto Cordioli <[EMAIL PROTECTED]> wrote:
>>> Hi Dave,
>>> thanks for your reply. Now it's clearer; in fact, the code that I
>>> wrote was inspired by the old API, where the behavior is different.
>>> So, how can I achieve the same behavior as the old API? I need the
>>> second field of the first key object to stay the same across the
>>> iterations, in order to compare it with other objects. Do I have to
>>> clone the object?
>>> Thanks.
>>> On 15 October 2012 21:27, Dave Beech <[EMAIL PROTECTED]> wrote:
>>>> Hi Alberto
>>>> The iterator you are looping over in your reduce method isn't a
>>>> self-contained list of values. What's actually happening is that
>>>> you're iterating through *part* of the sorted key/value set that was
>>>> sent to that reduce node, and it is the grouping comparator that
>>>> decides when to break that loop and call reduce again on the next key.
>>>> Moreover, the "key" object is re-used. So, as you're iterating through
>>>> the values, what's actually happening is this pointer to the
>>>> associated key data moves with it - and you're seeing it change.
>>>> This only happens in the new "mapreduce" API - in the older "mapred"
>>>> API you get the first key, and it appears to stay the same during the
>>>> loop.
>>>> It's sometimes useful behaviour, but it's confusing how the two APIs
>>>> don't act the same.
>>>> Hope that helps,
>>>> Dave
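
The explanation quoted above can be modelled with a small, self-contained
sketch. This is a simplified illustration, not Hadoop's actual
implementation: Rec, Key, GROUPING and run are hypothetical names. A sorted
stream of two-field keys is walked once, the grouping comparator (first field
only) decides where a new reduce call would start, and a single Key object is
refilled for every record, so its tag changes within one group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class GroupingSketch {
    // One sorted map-output record: "first" drives grouping, "tag" is 1 or 2.
    static final class Rec {
        final String first;
        final int tag;
        Rec(String first, int tag) { this.first = first; this.tag = tag; }
    }

    // Mutable key that is refilled in place, like the re-used key object.
    static final class Key {
        String first;
        int tag;
        void set(String first, int tag) { this.first = first; this.tag = tag; }
    }

    // Grouping comparator: only the first field decides group membership.
    static final Comparator<Rec> GROUPING = (a, b) -> a.first.compareTo(b.first);

    // Walks an already-sorted stream and logs where each reduce call would
    // begin and which tag the shared key shows for each value.
    static List<String> run(List<Rec> sorted) {
        List<String> log = new ArrayList<>();
        Key key = new Key();
        Rec previous = null;
        for (Rec r : sorted) {
            if (previous == null || GROUPING.compare(previous, r) != 0) {
                log.add("reduce(" + r.first + ")");
            }
            key.set(r.first, r.tag); // key now reflects the current record
            log.add("  value seen with key.tag=" + key.tag);
            previous = r;
        }
        return log;
    }

    public static void main(String[] args) {
        List<Rec> sorted = Arrays.asList(
            new Rec("a", 1), new Rec("a", 2), new Rec("b", 1));
        for (String line : run(sorted)) System.out.println(line);
    }
}
```

With the three records in main, the group for "a" covers two values, and the
shared key reports tag 1 for the first and tag 2 for the second.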
>>>> On 15 October 2012 20:11, Alberto Cordioli <[EMAIL PROTECTED]> wrote:
>>>>> Hi all,
>>>>> a very strange thing is happening with my Hadoop program.
>>>>> My map simply emits tuples with a custom object as key (which
>>>>> implements WritableComparable).
>>>>> The object is made of 2 fields, and I implemented my partitioner and
>>>>> grouping class in such a way that only the first field is taken into
>>>>> account.
>>>>> The second field is just a tag and could be 1 or 2.
>>>>> This is the reducer's snippet:
>>>>> tag = key.getSecondField();
>>>>> Iterator it1 = values.iterator();
>>>>> while(it1.hasNext()){
>>>>>         it1.next();
>>>>>         collector.emit(new Text("dummy"), tag);
>>>>> }
>>>>> I would expect in my output all the lines with:
>>>>> dummy       1
>>>>> ...
>>>>> dummy       1
Alberto Cordioli 2012-10-16, 09:45
Vinod Kumar Vavilapalli 2012-10-16, 18:44