Avro >> mail # user >> Map output records/reducer input records mismatch


Thread:
Vyacheslav Zholudev 2011-08-16, 15:39
Scott Carey 2011-08-16, 20:22
Vyacheslav Zholudev 2011-08-16, 22:56
Scott Carey 2011-08-17, 01:56
Vyacheslav Zholudev 2011-08-17, 08:32
Scott Carey 2011-08-17, 17:06
Vyacheslav Zholudev 2011-08-17, 12:02
Scott Carey 2011-08-17, 17:18

Re: Map output records/reducer input records mismatch

On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:

> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]> wrote:
>
>> btw,
>>
>> I was thinking of trying it with Utf8 objects instead of Strings, and I wanted to reuse the same Utf8 object instead of creating a new one from a String on each map() call.
>> Why doesn't the Utf8 class have a method for setting bytes via a String object?
>
>
> We could add that, but it won't help performance much in this case: the improvement from reuse comes mostly from reusing the underlying byte[], not the Utf8 object itself.
> The expensive part of String is the conversion from the underlying char[] to a byte[] (Utf8.getBytesFor()), and reuse cannot avoid that.  It would probably be faster to use String directly rather than wrap it in a Utf8 each time.
>
> Rather than a static method like the one below, I would propose an instance method that does the same thing, something like
>
> public void setValue(String val) {
>    // gets bytes, replaces private byte array, replaces cached string — no system array copy.
> }
>
> which would be much more efficient.  

Thanks for the reply.

Yes, it's true that this code should be encapsulated in the Utf8 class; I just couldn't replace the private array from outside the class scope, obviously.
Vyacheslav
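
Scott's setValue proposal above is only a stub. A minimal sketch of what the body might look like, assuming Utf8's private internals are a backing byte[] named bytes, an int length, and a cached String named string (assumptions for illustration, not the actual 1.5.1 source):

    public void setValue(String val) {
        byte[] newBytes = getBytesFor(val); // the one unavoidable char[] -> byte[] conversion
        this.bytes = newBytes;              // swap in the new backing array; no System.arraycopy
        this.length = newBytes.length;      // assumed field tracking the valid byte count
        this.string = val;                  // assumed cached String, keeps toString() cheap
    }

This skips the extra copy in reuseUtf8Object entirely, which is why it would be more efficient.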

>
>
>>
>> I created the following code snippet:
>>
>>     public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>         // Encode the String into UTF-8 bytes (the expensive char[] -> byte[] conversion).
>>         byte[] strBytes = Utf8.getBytesFor(strToReuse);
>>         // Grow the container's backing array if needed, then copy the new bytes in.
>>         container.setByteLength(strBytes.length);
>>         System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
>>         return container;
>>     }
>>
>> Would that be useful if this code were encapsulated in the Utf8 class?
>>
>> Best,
>> Vyacheslav
>>
>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>>
>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hi, Scott,
>>>>
>>>> thanks for your reply.
>>>>
>>>>> What Avro version is this happening with? What JVM version?
>>>>
>>>> We are using Avro 1.5.1 and Sun JDK 6, but I will have to look up the
>>>> exact version.
>>>>
>>>>>
>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args,
>>>>> if it is Sun and JRE 6u21 or later? (Some loop-predicate issues affect
>>>>> Java 6 too, just not as many as in the recent news about Java 7.)
>>>>>
>>>>> Otherwise, it may well be the same thing as AVRO-782.  Any extra
>>>>> information related to that issue would be welcome.
>>>>
>>>> I will have to collect it. In the meantime, do you have any reasonable
>>>> explanations for the issue besides it being something like AVRO-782?
>>>
>>> What is your key type (map output schema, first type argument of Pair)?
>>> Is your key a Utf8 or String?  I don't have a reasonable explanation at
>>> this point; I haven't looked into it in depth with a good reproducible
>>> case.  I have my suspicions about how recycling of the key works, since
>>> Utf8 is mutable and its backing byte[] can end up shared.
>>>
>>>
>>>
>>>>
>>>> Thanks a lot,
>>>> Vyacheslav
>>>>
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -Scott
>>>>>
>>>>>
>>>>>
>>>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev"
>>>>> <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have multiple Hadoop jobs that use the Avro mapred API.
>>>>>> Only in one of the jobs do I see a visible mismatch between the
>>>>>> number of map output records and reducer input records.
>>>>>>
>>>>>> Has anybody encountered such behavior? Can anybody think of possible
>>>>>> explanations for this phenomenon?
>>>>>>
>>>>>> Any pointers/thoughts are highly appreciated!
>>>>>>
>>>>>> Best,
>>>>>> Vyacheslav
>>>>>
>>>>>
>>>>
>>>> Best,
>>>> Vyacheslav
>>>>
>>>>
>>>>
>>>
>>>
>>
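
To make the shared-byte[] suspicion above concrete, here is a small illustrative sketch (the class and helper names are made up for the example; it demonstrates the aliasing hazard, not a confirmed reproduction of the counter mismatch):

    import org.apache.avro.util.Utf8;

    public class Utf8AliasingDemo {
        // The reuse helper from earlier in this thread.
        static Utf8 reuse(Utf8 container, String strToReuse) {
            byte[] strBytes = Utf8.getBytesFor(strToReuse);
            container.setByteLength(strBytes.length);
            System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
            return container;
        }

        public static void main(String[] args) throws Exception {
            Utf8 key = new Utf8("alpha");
            byte[] escaped = key.getBytes(); // a consumer keeps the backing array without copying it
            reuse(key, "gamma");             // same byte length, so the same array is mutated in place
            // Prints "gamma": the key captured earlier has changed underneath the consumer.
            System.out.println(new String(escaped, "UTF-8"));
        }
    }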

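If the -XX:-UseLoopPredicate experiment suggested above is attempted, the flag must reach the Hadoop child task JVMs, not just the job client. A hedged sketch using the classic MapReduce property mapred.child.java.opts (the heap size shown is an arbitrary example value):

    import org.apache.hadoop.mapred.JobConf;

    public class JobSetup {
        public static JobConf withLoopPredicateWorkaround(JobConf conf) {
            // Applied to every map and reduce task JVM when the job runs.
            conf.set("mapred.child.java.opts", "-Xmx512m -XX:-UseLoopPredicate");
            return conf;
        }
    }
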
Thread (continued):
Vyacheslav Zholudev 2011-08-17, 22:02
Vyacheslav Zholudev 2011-08-17, 22:59
Scott Carey 2011-08-17, 23:47
Vyacheslav Zholudev 2011-08-18, 12:50
Vyacheslav Zholudev 2011-08-17, 15:49