Avro >> mail # user >> Map output records/reducer input records mismatch


Vyacheslav Zholudev 2011-08-16, 15:39
Scott Carey 2011-08-16, 20:22
Vyacheslav Zholudev 2011-08-16, 22:56
Scott Carey 2011-08-17, 01:56
Vyacheslav Zholudev 2011-08-17, 08:32
Scott Carey 2011-08-17, 17:06
Vyacheslav Zholudev 2011-08-17, 12:02
Scott Carey 2011-08-17, 17:18
Vyacheslav Zholudev 2011-08-17, 18:09
Re: Map output records/reducer input records mismatch
Hi Scott,

thanks for all the suggestions. I really appreciate your support.

Unfortunately, I could not solve the problem so far.

Here is what I have tried:

1. Switched to Utf8 everywhere, including changing the interface to <Utf8, SomeSpecificJavaClass>.
2. Always create new instances before collecting (new Utf8("fromString") for the key, a clone for the value).
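
A minimal sketch of the hazard that step 2 guards against, assuming a collector that keeps a reference to the key instead of serializing it immediately. Whether the Avro or Hadoop code path involved here actually does that is not established in this thread. `MutableText` and `KeyReuseDemo` are hypothetical stand-ins (not Avro classes), so the example runs without Avro on the classpath; it assumes Java 7 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a mutable, reusable UTF-8 holder such as Avro's Utf8. */
final class MutableText {
    private byte[] bytes = new byte[0];
    private int length;

    /** Overwrites the content in place, growing the backing array only when needed. */
    MutableText set(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        if (bytes.length < encoded.length) {
            bytes = new byte[encoded.length];
        }
        System.arraycopy(encoded, 0, bytes, 0, encoded.length);
        length = encoded.length;
        return this;
    }

    /** Returns an independent copy that later calls to set() cannot affect. */
    MutableText copy() {
        MutableText c = new MutableText();
        c.bytes = Arrays.copyOf(bytes, length);
        c.length = length;
        return c;
    }

    @Override
    public String toString() {
        return new String(bytes, 0, length, StandardCharsets.UTF_8);
    }
}

final class KeyReuseDemo {
    /** Buffers the same reused key object each time (assumed collector behaviour). */
    static List<String> collectByReference(String... inputs) {
        MutableText reused = new MutableText();
        List<MutableText> buffer = new ArrayList<>();
        for (String in : inputs) {
            buffer.add(reused.set(in)); // every entry aliases one object
        }
        return render(buffer);
    }

    /** Buffers a fresh copy per record, as in step 2 above. */
    static List<String> collectCopies(String... inputs) {
        MutableText reused = new MutableText();
        List<MutableText> buffer = new ArrayList<>();
        for (String in : inputs) {
            buffer.add(reused.set(in).copy());
        }
        return render(buffer);
    }

    private static List<String> render(List<MutableText> buffer) {
        List<String> out = new ArrayList<>();
        for (MutableText t : buffer) {
            out.add(t.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectByReference("a", "b", "c")); // prints [c, c, c]
        System.out.println(collectCopies("a", "b", "c"));      // prints [a, b, c]
    }
}
```

In the by-reference case the distinct keys collapse into one, which is the kind of effect that could make records appear to vanish. Since copying did not change the counters here, this particular mechanism is probably not the whole story.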

The problem persists: records still seem to get lost between the mapper and the reducer.

Interestingly, it is only reproducible with large datasets. With a relatively small set of 6 million input rows I see no difference, but with a 10-million-row input the counters diverge:
Map input records: 10,000,000
Map input bytes: 11,458,340,172
Map output bytes: 30,420,106,592
Map output records: 28,196,842
Reduce input records: 28,053,314
I'm trying to simplify the job further.
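
The size of the gap can be read off the counters above. A small sketch of the arithmetic follows; the class and method names are made up for illustration, and the expectation that the two counters should match assumes no combiner sits between map and reduce.

```java
final class CounterGap {
    /** Number of records that left the mappers but never reached a reducer. */
    static long missing(long mapOutputRecords, long reduceInputRecords) {
        return mapOutputRecords - reduceInputRecords;
    }

    /** The same gap as a percentage of the map output. */
    static double missingPercent(long mapOutputRecords, long reduceInputRecords) {
        return 100.0 * missing(mapOutputRecords, reduceInputRecords) / mapOutputRecords;
    }

    public static void main(String[] args) {
        long mapOut = 28_196_842L;   // "Map output records" from the job counters
        long reduceIn = 28_053_314L; // "Reduce input records" from the job counters

        System.out.println("missing records: " + missing(mapOut, reduceIn)); // 143528
        System.out.printf("missing share: %.3f%%%n", missingPercent(mapOut, reduceIn));
    }
}
```

That is 143,528 records, or about 0.5% of the map output.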

Do you have any further ideas?

Thanks,
Vyacheslav
On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:

> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]> wrote:
>
>> btw,
>>
>> I was thinking of trying it with Utf8 objects instead of Strings, and I wanted to reuse the same Utf8 object instead of creating a new one from a String on each map() call.
>> Why doesn't the Utf8 class have a method for setting its bytes from a String object?
>
>
> We could add that, but it won't help performance much in this case, since the benefit of reuse comes mostly from the underlying byte[] rather than from the Utf8 object.
> The expensive part of handling a String is the conversion from its underlying char[] to a byte[] (Utf8.getBytesFor()).  It would probably be faster to use String directly rather than wrap it in a Utf8 each time.
>
> Rather than have a static method like the one below, I would propose an instance method that does the same thing, something like
>
> public void setValue(String val) {
>    // gets bytes, replaces private byte array, replaces cached string; no System.arraycopy needed.
> }
>
> which would be much more efficient.
>
>
>>
>> I created the following code snippet:
>>
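>>     // Copies the UTF-8 bytes of strToReuse into an existing Utf8 instead of allocating a new one.
>>     public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>         byte[] strBytes = Utf8.getBytesFor(strToReuse);
>>         // Set the new length first; setByteLength is expected to grow the backing array if needed.
>>         container.setByteLength(strBytes.length);
>>         System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
>>         return container;
>>     }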
>>
>> Would that be useful if this code is encapsulated into the Utf8 class?
>>
>> Best,
>> Vyacheslav
>>
>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>>
>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hi, Scott,
>>>>
>>>> thanks for your reply.
>>>>
>>>>> What Avro version is this happening with? What JVM version?
>>>>
>>>> We are using Avro 1.5.1 and Sun JDK 6, but I will have to look up the
>>>> exact version.
>>>>
>>>>>
>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args
>>>>> if it is Sun and JRE 6u21 or later? (Some issues in loop predicates
>>>>> affect Java 6 too, just not as many as the recent news on Java 7.)
>>>>>
>>>>> Otherwise, it is likely the same thing as AVRO-782.  Any extra
>>>>> information related to that issue would be welcome.
>>>>
>>>> I will have to collect it. In the meantime, do you have any reasonable
>>>> explanation for the issue besides it being something like AVRO-782?
>>>
>>> What is your key type (map output schema, first type argument of Pair)?
>>> Is your key a Utf8 or a String?  I don't have a reasonable explanation at
>>> this point; I haven't looked into it in depth with a good reproducible
>>> case.  I have my suspicions about how recycling of the key works, since
>>> Utf8 is mutable and its backing byte[] can end up shared.
>>>
>>>
>>>
>>>>
>>>> Thanks a lot,
>>>> Vyacheslav
>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -Scott
>>>>>
>>>>>
>>>>>
>>>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev"
>>>>> <[EMAIL PROTECTED]>
Vyacheslav Zholudev 2011-08-17, 22:59
Scott Carey 2011-08-17, 23:47
Vyacheslav Zholudev 2011-08-18, 12:50
Vyacheslav Zholudev 2011-08-17, 15:49