Avro >> mail # user >> Map output records/reducer input records mismatch


Re: Map output records/reducer input records mismatch
Hi Scott,

thanks for all the suggestions. I really appreciate your support.

Unfortunately, I have not been able to solve the problem so far.

Here is what I have tried:

1. Switched to Utf8 everywhere, including changing the interface to <Utf8, SomeSpecificJavaClass>
2. Always generate new instances before collecting (new Utf8("fromString") for the key, a clone for the value)
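
To illustrate why step 2 is a plausible thing to try, here is a minimal sketch of the aliasing hazard that a reused mutable key can cause when something downstream holds on to the reference instead of serializing it immediately. It is not a reproduction of this bug and does not use Avro or Hadoop: MutableText is a hypothetical stand-in for a mutable key such as Utf8, and the "buffer" is just a list.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeyReuseSketch {

    /** Hypothetical stand-in for a mutable, reusable key (not the Avro Utf8 class). */
    static final class MutableText {
        private byte[] bytes = new byte[0];

        void set(String s) {
            bytes = s.getBytes(StandardCharsets.UTF_8);
        }

        MutableText copy() {
            MutableText t = new MutableText();
            t.bytes = bytes.clone();
            return t;
        }

        @Override
        public String toString() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    /**
     * Collects keys by reference (as a buffer that defers serialization would)
     * and returns how many distinct key values survive.
     */
    static int distinctKeys(String[] inputs, boolean copyBeforeCollect) {
        List<MutableText> buffer = new ArrayList<>();
        MutableText reused = new MutableText();
        for (String in : inputs) {
            reused.set(in);
            // Without the copy, every buffer slot points at the same object.
            buffer.add(copyBeforeCollect ? reused.copy() : reused);
        }
        Set<String> distinct = new HashSet<>();
        for (MutableText t : buffer) {
            distinct.add(t.toString());
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        String[] inputs = {"a", "b", "c"};
        System.out.println(distinctKeys(inputs, false)); // prints 1: all slots alias the last key
        System.out.println(distinctKeys(inputs, true));  // prints 3: each slot has its own copy
    }
}
```

If the collector serializes the key during the collect call, reuse is harmless, which would be consistent with copying not making a difference here.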

The problem persists - records seem to get lost between mapper and reducer.

Interestingly, it is only reproducible with large datasets. With a relatively small set of 6 million input rows I do not see any difference. On a 10 million row dataset, however, the difference shows up:
Map input records: 10,000,000
Map input bytes: 11,458,340,172
Map output bytes: 30,420,106,592
Map output records: 28,196,842
Reduce input records: 28,053,314
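
For reference, the size of the gap can be worked out from the two counters above. The sketch below only does that arithmetic. The class and method names are made up for illustration, and it assumes that no combiner is configured, since a combiner would normally make these two counters differ on its own.

```java
import java.util.Locale;

public class CounterGap {

    /** Records that left the mappers but were never counted as reducer input. */
    static long missing(long mapOutputRecords, long reduceInputRecords) {
        return mapOutputRecords - reduceInputRecords;
    }

    /** The same gap as a percentage of map output records. */
    static double missingPercent(long mapOutputRecords, long reduceInputRecords) {
        return 100.0 * missing(mapOutputRecords, reduceInputRecords) / mapOutputRecords;
    }

    public static void main(String[] args) {
        long mapOutputRecords = 28_196_842L;   // "Map output records" from the counters above
        long reduceInputRecords = 28_053_314L; // "Reduce input records" from the counters above

        System.out.println(missing(mapOutputRecords, reduceInputRecords)); // prints 143528
        System.out.println(String.format(Locale.ROOT, "%.2f%%",
                missingPercent(mapOutputRecords, reduceInputRecords)));    // prints 0.51%
    }
}
```

So roughly half a percent of the map output records (143,528) do not show up as reduce input.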

I'm trying to simplify the job further.

Do you have any further ideas?

Thanks,
Vyacheslav
On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:

> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]> wrote:
>
>> btw,
>>
>> I was thinking of trying it with Utf8 objects instead of Strings, and I wanted to reuse the same Utf8 object instead of creating a new one from a String on each map() call.
>> Why doesn't the Utf8 class have a method for setting its bytes from a String object?
>
>
> We could add that, but it won't help performance much in this case since the performance improvement from reuse has more to do with the underlying byte[] than the Utf8 object.
> The expensive part of String is the conversion from an underlying char[] to a byte[] (Utf8.getBytesFor()), so this would not help much.  It would probably be faster to use String directly rather than wrap it with Utf8 each time.
>
> Rather than have a static method like the one below, I would propose an instance method that does the same thing, something like
>
> public void setValue(String val) {
>    // gets bytes, replaces private byte array, replaces cached string; no System.arraycopy needed.
> }
>
> which would be much more efficient.  
>
>
>>
>> I created the following code snippet:
>>
>>     public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>         byte[] strBytes = Utf8.getBytesFor(strToReuse);
>>         container.setByteLength(strBytes.length);
>>         System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
>>         return container;
>>     }
>>
>> Would that be useful if this code is encapsulated into the Utf8 class?
>>
>> Best,
>> Vyacheslav
>>
>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>>
>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hi, Scott,
>>>>
>>>> thanks for your reply.
>>>>
>>>>> What Avro version is this happening with? What JVM version?
>>>>
>>>> We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will have
>>>> to look up.
>>>>
>>>>>
>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args
>>>>> if it is Sun and JRE 6u21 or later? (Some issues in loop predicates
>>>>> affect Java 6 too, just not as many as in the recent news on Java 7.)
>>>>>
>>>>> Otherwise, it is likely the same thing as AVRO-782.  Any extra
>>>>> information related to that issue would be welcome.
>>>>
>>>> I will have to collect it. In the meantime, do you have any reasonable
>>>> explanation for the issue besides it being something like AVRO-782?
>>>
>>> What is your key type (map output schema, first type argument of Pair)?
>>> Is your key a Utf8 or a String?  I don't have a reasonable explanation at
>>> this point; I haven't looked into it in depth with a good reproducible
>>> case.  I have my suspicions about how recycling of the key works, since
>>> Utf8 is mutable and its backing byte[] can end up shared.
>>>
>>>
>>>
>>>>
>>>> Thanks a lot,
>>>> Vyacheslav
>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -Scott
>>>>>
>>>>>
>>>>>
>>>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev"
>>>>> <[EMAIL PROTECTED]>