Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Avro >> mail # user >> Map output records/reducer input records mismatch


Copy link to this message
-
Re: Map output records/reducer input records mismatch
Hi Scott,

The problem is found. In the reduce job when there were too many values for some key, I stopped reading values from an iterator. So apparently the rest of the values were not counted. I thought, in case of sequence files unread values were counted in any case. That's why I didn't think about it from the very beginning.

Thanks for the support,
Vyacheslav
On Aug 18, 2011, at 1:47 AM, Scott Carey wrote:

> That is very interesting… I don't see how Avro could affect that.  
>
> Does anyone else have any ideas how Avro might cause the below?
>
> -Scott
>
> On 8/17/11 3:59 PM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]> wrote:
>
>> There is a possible reason:
>> It seems that there is an upper limit of 10,001 records per reduce input group. (or is there a setting?)
>>
>>
>> If I output one million rows with the same key, I get:
>> Map output records: 1,000,000
>> Reduce input groups: 1
>> Reduce input records: 10,001
>>
>> If I output one million rows with 20 different keys, I get:
>> Map output records: 1,000,000
>> Reduce input groups: 20
>> Reduce input records: 200,020
>>
>> If I output one million rows with unique keys, I get:
>> Map output records: 1,000,000
>> Reduce input groups: 1,000,000
>> Reduce input records: 1,000,000
>>
>>
>> Btw., I am running on 5 nodes with total map task capacity of 10 and total reduce task capacity of 10.
>>
>> Thanks,
>> Vyacheslav
>>
>> On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:
>>
>>> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]> wrote:
>>>
>>>> btw,
>>>>
>>>> I was thinking to try it with Utf8 objects instead of strings and I wanted to reuse the same Utf8 object instead of creating new from String upon each map() call.
>>>> Why does not the Utf8 class have a method for setting bytes via a String object?
>>>
>>>
>>> We could add that, but it won't help performance much in this case since the performance improvement from reuse has more to do with the underlying byte[] than the Utf8 object.
>>> The expensive part of String is the conversion from an underlying char[] to a byte[] (Utf8.getBytesFor()), so this would not help much.  It would probably be faster to use String directly rather than wrap it with Utf8 each time.
>>>
>>> Rather than have a static method like the below, I would propose that an instance method be made that does the same thing, something like
>>>
>>> public void setValue(String val) {
>>>    // gets bytes, replaces private byte array, replaces cached string — no system array copy.
>>> }
>>>
>>> which would be much more efficient.  
>>>
>>>
>>>>
>>>> I created the following code snippet:
>>>>
>>>>     public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>>>         byte[] strBytes = Utf8.getBytesFor(strToReuse);
>>>>         container.setByteLength(strBytes.length);
>>>>         System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
>>>>         return container;
>>>>     }
>>>>
>>>> Would that be useful if this code is encapsulated into the Utf8 class?
>>>>
>>>> Best,
>>>> Vyacheslav
>>>>
>>>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>>>>
>>>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>
>>>>>> Hi, Scott,
>>>>>>
>>>>>> thanks for your reply.
>>>>>>
>>>>>>> What Avro version is this happening with? What JVM version?
>>>>>>
>>>>>> We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will have
>>>>>> to look up.
>>>>>>
>>>>>>>
>>>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args
>>>>>>> if
>>>>>>> it is Sun and JRE 6u21 or later? (some issues in loop predicates affect
>>>>>>> Java 6 too, just not as many as the recent news on Java7).
>>>>>>>
>>>>>>> Otherwise, it may likely be the same thing as AVRO-782.  Any extra
>>>>>>> information related to that issue would be welcome.
>>>>>>
>>>>>> I will have to collect it. In the meanwhile, do you have any reasonable
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB