Avro >> mail # user >> Map output records/reducer input records mismatch


Thread:
- Vyacheslav Zholudev 2011-08-16, 15:39
- Scott Carey 2011-08-16, 20:22
- Vyacheslav Zholudev 2011-08-16, 22:56
- Scott Carey 2011-08-17, 01:56
- Vyacheslav Zholudev 2011-08-17, 08:32
- Scott Carey 2011-08-17, 17:06
- Vyacheslav Zholudev 2011-08-17, 12:02
- Scott Carey 2011-08-17, 17:18
- Vyacheslav Zholudev 2011-08-17, 18:09

Vyacheslav Zholudev 2011-08-17, 22:02
Re: Map output records/reducer input records mismatch
There is a possible reason:
It seems there is an upper limit of 10,001 records per reduce input group (or is there a setting that controls this?).
If I output one million rows with the same key, I get:
Map output records: 1,000,000
Reduce input groups: 1
Reduce input records: 10,001

If I output one million rows with 20 different keys, I get:
Map output records: 1,000,000
Reduce input groups: 20
Reduce input records: 200,020

If I output one million rows with unique keys, I get:
Map output records: 1,000,000
Reduce input groups: 1,000,000
Reduce input records: 1,000,000
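
All three counter sets above fit a flat per-group truncation. A quick arithmetic check (the 10,001 cap is inferred from these counters, not a documented Hadoop setting, and the helper assumes keys are spread evenly across groups):

```java
public class CounterCheck {
    // Hypothetical per-group cap inferred from the counters above.
    static final long CAP = 10_001;

    // Expected "Reduce input records" if each group were truncated at CAP.
    // Assumes map output records are spread evenly over the groups.
    static long expectedReduceInput(long mapOutputRecords, long groups) {
        long perGroup = mapOutputRecords / groups;
        return groups * Math.min(perGroup, CAP);
    }

    public static void main(String[] args) {
        System.out.println(expectedReduceInput(1_000_000, 1));         // 10001
        System.out.println(expectedReduceInput(1_000_000, 20));        // 200020
        System.out.println(expectedReduceInput(1_000_000, 1_000_000)); // 1000000
    }
}
```

The predicted values match the three observed "Reduce input records" counters exactly, which is what makes the per-group-cap hypothesis plausible.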
Btw, I am running on 5 nodes with a total map task capacity of 10 and a total reduce task capacity of 10.

Thanks,
Vyacheslav

On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:

> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]> wrote:
>
>> btw,
>>
>> I was thinking of trying it with Utf8 objects instead of Strings, and I wanted to reuse the same Utf8 object instead of creating a new one from a String on each map() call.
>> Why doesn't the Utf8 class have a method for setting its bytes via a String object?
>
>
> We could add that, but it won't help performance much in this case since the performance improvement from reuse has more to do with the underlying byte[] than the Utf8 object.
> The expensive part of String is the conversion from an underlying char[] to a byte[] (Utf8.getBytesFor()), so this would not help much.  It would probably be faster to use String directly rather than wrap it with Utf8 each time.
>
> Rather than have a static method like the below, I would propose that an instance method be made that does the same thing, something like
>
> public void setValue(String val) {
>    // gets bytes, replaces private byte array, replaces cached string — no system array copy.
> }
>
> which would be much more efficient.  
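
A minimal sketch of what such an instance method could look like (the class and field names here are hypothetical stand-ins; the real Utf8 internals differ). The point is that the freshly encoded byte[] replaces the backing array wholesale, so there is no second copy:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for Utf8, illustrating the proposed setValue(String):
// adopt the encoded array as the new backing array instead of copying into it.
class ReusableUtf8 {
    private byte[] bytes = new byte[0]; // hypothetical backing array
    private int length;
    private String cachedString;        // hypothetical cached toString value

    public void setValue(String val) {
        // One encoding pass, no System.arraycopy: the result of the
        // conversion becomes the backing array directly, and the original
        // String is kept as the cached string value.
        this.bytes = val.getBytes(StandardCharsets.UTF_8);
        this.length = this.bytes.length;
        this.cachedString = val;
    }

    public int getByteLength() { return length; }

    @Override
    public String toString() { return cachedString; }
}
```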
>
>
>>
>> I created the following code snippet:
>>
>>     public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>         byte[] strBytes = Utf8.getBytesFor(strToReuse);
>>         container.setByteLength(strBytes.length);
>>         System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
>>         return container;
>>     }
>>
>> Would that be useful if this code is encapsulated into the Utf8 class?
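
As a self-contained illustration of the copy-into-buffer pattern in the snippet above (a plain byte[] stand-in with no Avro dependency; the grow-on-demand step assumes setByteLength enlarges the backing array when needed):

```java
import java.nio.charset.StandardCharsets;

// Copy-based reuse, as in reuseUtf8Object: one long-lived buffer, with the
// new string's bytes copied in on every call, growing only when needed.
public class BufferReuse {
    private byte[] buf = new byte[0];
    private int len;

    public void set(String s) {
        byte[] strBytes = s.getBytes(StandardCharsets.UTF_8);
        if (buf.length < strBytes.length) {
            buf = new byte[strBytes.length]; // grow the backing array
        }
        System.arraycopy(strBytes, 0, buf, 0, strBytes.length);
        len = strBytes.length;
    }

    public String value() {
        return new String(buf, 0, len, StandardCharsets.UTF_8);
    }
}
```

Note the contrast with an adopt-the-array approach: here every set() pays for both the encoding and a System.arraycopy, which is the extra cost the reply above points out.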
>>
>> Best,
>> Vyacheslav
>>
>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>>
>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hi, Scott,
>>>>
>>>> thanks for your reply.
>>>>
>>>>> What Avro version is this happening with? What JVM version?
>>>>
>>>> We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will have
>>>> to look up.
>>>>
>>>>>
>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args
>>>>> if it is a Sun JVM, JRE 6u21 or later? (Some loop-predicate issues affect
>>>>> Java 6 too, just not as many as in the recent news about Java 7.)
>>>>>
>>>>> Otherwise, it may likely be the same thing as AVRO-782.  Any extra
>>>>> information related to that issue would be welcome.
>>>>
>>>> I will have to collect it. In the meantime, do you have any reasonable
>>>> explanations of the issue besides it being something like AVRO-782?
>>>
>>> What is your key type (map output schema, first type argument of Pair)?
>>> Is your key a Utf8 or a String?  I don't have a reasonable explanation at
>>> this point; I haven't looked into it in depth with a good reproducible
>>> case.  I have my suspicions about how recycling of the key works, since
>>> Utf8 is mutable and its backing byte[] can end up shared.
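
The shared-byte[] suspicion above can be illustrated with a plain-Java sketch (no Avro types; the aliasing mechanism is the same as with a recycled mutable key):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A mutable, reused buffer retained without copying: every retained entry
// aliases the same array, so earlier "keys" silently change.
public class SharedBufferHazard {
    // Buggy variant: stores references to the one reused buffer.
    static List<byte[]> collectAliased() {
        byte[] reused = new byte[1];
        List<byte[]> out = new ArrayList<>();
        for (byte b = 0; b < 3; b++) {
            reused[0] = b;   // mutate the shared buffer in place
            out.add(reused); // stores a reference, not a copy
        }
        return out;          // every entry now reads the last value, 2
    }

    // Fixed variant: defensively copies before retaining.
    static List<byte[]> collectCopied() {
        byte[] reused = new byte[1];
        List<byte[]> out = new ArrayList<>();
        for (byte b = 0; b < 3; b++) {
            reused[0] = b;
            out.add(Arrays.copyOf(reused, reused.length));
        }
        return out;          // entries read 0, 1, 2 as intended
    }

    public static void main(String[] args) {
        System.out.println(collectAliased().get(0)[0]); // prints 2
        System.out.println(collectCopied().get(0)[0]);  // prints 0
    }
}
```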
>>>
>>>
>>>
>>>>
>>>> Thanks a lot,
>>>> Vyacheslav
>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -Scott
>>>>>
>>>>>
>>>>>
>>>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev"
>>>>> <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have multiple Hadoop jobs that use the Avro mapred API.
>>>>>> Only in one of the jobs do I see a visible mismatch between the number of
>>>>>> map output records and reducer input records.

Best,
Vyacheslav
Later replies:
- Scott Carey 2011-08-17, 23:47
- Vyacheslav Zholudev 2011-08-18, 12:50
- Vyacheslav Zholudev 2011-08-17, 15:49