MapReduce user mailing list

Alberto Cordioli 2012-10-15, 20:39
Dave Beech 2012-10-15, 20:49
Alberto Cordioli 2012-10-16, 08:42
Dave Beech 2012-10-16, 09:08
Re: GroupingComparator
Yes, I know that keeping an in-memory collection isn't a good idea.
The problem is that I need to perform a join, so there is no other
possibility! :(
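
For readers following along, the value re-use discussed in the quoted
messages below can be reproduced without Hadoop. This is only a sketch under
stated assumptions: Holder is a hypothetical stand-in for a mutable Writable
such as Text, and reusingIterator imitates a values iterator that returns the
same instance on every call to next(). Neither is part of the Hadoop API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class ValueReuseSketch {

    /** Hypothetical stand-in for a mutable Writable such as Text. */
    static final class Holder {
        private String value = "";
        Holder() {}
        Holder(Holder other) { this.value = other.value; } // copy constructor
        void set(String v) { this.value = v; }
        String get() { return value; }
    }

    /** Imitates a values iterator that re-uses one object for every element. */
    static Iterator<Holder> reusingIterator(List<String> data) {
        final Holder shared = new Holder();
        final Iterator<String> source = data.iterator();
        return new Iterator<Holder>() {
            @Override public boolean hasNext() { return source.hasNext(); }
            @Override public Holder next() {
                shared.set(source.next()); // overwrite the one shared instance
                return shared;
            }
        };
    }

    /** Stores the returned references directly: every entry is the same object. */
    public static List<String> storeReferences(List<String> data) {
        List<Holder> kept = new ArrayList<>();
        Iterator<Holder> it = reusingIterator(data);
        while (it.hasNext()) {
            kept.add(it.next());
        }
        return toStrings(kept);
    }

    /** Stores a fresh copy per element, which is where "new" becomes unavoidable. */
    public static List<String> storeCopies(List<String> data) {
        List<Holder> kept = new ArrayList<>();
        Iterator<Holder> it = reusingIterator(data);
        while (it.hasNext()) {
            kept.add(new Holder(it.next()));
        }
        return toStrings(kept);
    }

    private static List<String> toStrings(List<Holder> holders) {
        List<String> out = new ArrayList<>();
        for (Holder h : holders) {
            out.add(h.get());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> data = Arrays.asList("a", "b", "c");
        System.out.println(storeReferences(data)); // prints [c, c, c]
        System.out.println(storeCopies(data));     // prints [a, b, c]
    }
}
```

The list of copies grows with the number of values in the group, which is
the scalability concern raised in the reply quoted below.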


On 16 October 2012 11:08, Dave Beech <[EMAIL PROTECTED]> wrote:
> Great! Glad the problem is solved.
> You're right - the object returned by iterator.next() is re-used too.
> So yes, you would need to clone in this case and you'd have no choice
> but to create new objects.
> Please be sure though that you really do need to store values in a
> list to do what you're trying to do. Keeping an in-memory collection
> might not be very scalable. Obviously, if you've got loads of RAM or
> not a lot of data (or both), then that's fine! Just something else to
> think about...
> Cheers,
> Dave
> On 16 October 2012 09:42, Alberto Cordioli <[EMAIL PROTECTED]> wrote:
>> Thanks Dave.
>> You solved my problem. Just a little question about your tip:
>> I suppose the value returned by iterator.next() is also re-used.
>> So if I want to store some values of the Iterable list in the reducer, I
>> should create a List and put cloned objects inside it.
>> In this case there is no way to avoid the "new" operator, right?
>> On 15 October 2012 22:49, Dave Beech <[EMAIL PROTECTED]> wrote:
>>> Well, if all you need is the tag (the 1 or 2), why not just use a Text
>>> or IntWritable instance variable? You wouldn't need to clone the whole
>>> key.
>>> Then, instead of tag = key.getSecondField() you'd say
>>> tag.set(key.getSecondField().get());
>>> I don't know what type of object tag is (if it's Text you'll say
>>> toString() rather than get()), but you see what I mean.
>>> Also - just a tip - try to avoid creating new objects wherever
>>> possible. You'll get better performance if you create one Text object
>>> as an instance variable and re-use it by setting the value instead of
>>> calling new Text("") on every output.
>>> Thanks,
>>> Dave
>>> On 15 October 2012 21:39, Alberto Cordioli <[EMAIL PROTECTED]> wrote:
>>>> Hi Dave,
>>>> thanks for your reply. Now it's clearer; in fact, the code that I
>>>> wrote was inspired by the old API, where the behavior is different.
>>>> So, how can I achieve the same behavior as the old API? I need the
>>>> second field of the first key object to stay the same across the
>>>> iterations, in order to compare it with other objects. Do I have to
>>>> clone the object?
>>>> Thanks.
>>>> On 15 October 2012 21:27, Dave Beech <[EMAIL PROTECTED]> wrote:
>>>>> Hi Alberto
>>>>> The iterator you are looping over in your reduce method isn't a
>>>>> self-contained list of values. What's actually happening is that
>>>>> you're iterating through *part* of the sorted key/value set that was
>>>>> sent to that reduce node, and it is the grouping comparator that
>>>>> decides when to break that loop and call reduce again on the next key.
>>>>> Moreover, the "key" object is re-used. So, as you're iterating through
>>>>> the values, the pointer to the associated key data moves along with
>>>>> them - and you're seeing it change.
>>>>> This only happens in the new "mapreduce" API - in the older "mapred"
>>>>> API you get the first key, and it appears to stay the same during the
>>>>> loop.
>>>>> It's sometimes useful behaviour, but it's confusing how the two APIs
>>>>> don't act the same.
>>>>> Hope that helps,
>>>>> Dave
>>>>> On 15 October 2012 20:11, Alberto Cordioli <[EMAIL PROTECTED]> wrote:
>>>>>> Hi all,
>>>>>> a very strange thing is happening with my Hadoop program.
>>>>>> My map simply emits tuples with a custom object as key (which
>>>>>> implements WritableComparable).
>>>>>> The object is made of 2 fields, and I implemented my partitioner and
>>>>>> grouping comparator class in such a way that only the first field is
>>>>>> taken into account.
>>>>>> The second field is just a tag and could be 1 or 2.
>>>>>> This is the reducer's snippet:
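
The snippet itself is not included in this copy of the thread, and the
following is not Alberto's code. It is a Hadoop-free sketch of the key re-use
Dave describes and of his suggestion to copy only the tag into a re-usable
instance variable. PairKey, IntHolder (standing in for IntWritable),
ReducerSketch, and the values() helper are all hypothetical names invented
for illustration; only getSecondField() comes from the thread.

```java
import java.util.Iterator;

public class KeyReuseSketch {

    /** Hypothetical stand-in for IntWritable: a mutable int box. */
    static final class IntHolder {
        private int value;
        void set(int v) { value = v; }
        int get() { return value; }
    }

    /** Hypothetical two-field key; only the tag (second field) matters here. */
    static final class PairKey {
        private final IntHolder second = new IntHolder();
        IntHolder getSecondField() { return second; }
    }

    /** Values iterator that also overwrites the shared key on every step. */
    static Iterator<Integer> values(final PairKey key, final int[] tags) {
        return new Iterator<Integer>() {
            private int i = 0;
            @Override public boolean hasNext() { return i < tags.length; }
            @Override public Integer next() {
                key.getSecondField().set(tags[i]); // key now describes record i
                return i++;
            }
        };
    }

    static final class ReducerSketch {
        // Created once and re-used, as suggested in the thread.
        private final IntHolder tag = new IntHolder();

        /** Keeps a reference into the key, so the "first tag" changes with it. */
        int matchesWithAlias(PairKey key, Iterator<Integer> vals) {
            IntHolder alias = key.getSecondField();
            int matches = 0;
            while (vals.hasNext()) {
                vals.next();
                if (key.getSecondField().get() == alias.get()) matches++;
            }
            return matches;
        }

        /** Copies the int out of the key, so the first tag stays fixed. */
        int matchesWithCopy(PairKey key, Iterator<Integer> vals) {
            tag.set(key.getSecondField().get());
            int matches = 0;
            while (vals.hasNext()) {
                vals.next();
                if (key.getSecondField().get() == tag.get()) matches++;
            }
            return matches;
        }
    }

    /** Builds a key already positioned on the first record of the group. */
    private static PairKey firstKey(int[] tags) {
        PairKey key = new PairKey();
        key.getSecondField().set(tags[0]);
        return key;
    }

    public static int runAlias(int[] tags) {
        PairKey key = firstKey(tags);
        return new ReducerSketch().matchesWithAlias(key, values(key, tags));
    }

    public static int runCopy(int[] tags) {
        PairKey key = firstKey(tags);
        return new ReducerSketch().matchesWithCopy(key, values(key, tags));
    }

    public static void main(String[] args) {
        int[] tags = {1, 2, 2}; // one group: first record tagged 1, then two tagged 2
        System.out.println(runAlias(tags)); // prints 3: the alias always equals the key
        System.out.println(runCopy(tags));  // prints 1: only the first record has tag 1
    }
}
```

With the alias, every comparison is the key against itself, which matches
the "strange" behaviour reported above. Copying the primitive into a holder
created once gives the stable first tag without allocating per call.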

Alberto Cordioli
Vinod Kumar Vavilapalli 2012-10-16, 18:44