Vyacheslav Zholudev 2011-08-16, 15:39
Scott Carey 2011-08-16, 20:22
Vyacheslav Zholudev 2011-08-16, 22:56
Scott Carey 2011-08-17, 01:56
Vyacheslav Zholudev 2011-08-17, 08:32
Scott Carey 2011-08-17, 17:06
Vyacheslav Zholudev 2011-08-17, 12:02
Scott Carey 2011-08-17, 17:18
Vyacheslav Zholudev 2011-08-17, 18:09
Vyacheslav Zholudev 2011-08-17, 22:02
Vyacheslav Zholudev 2011-08-17, 22:59
Scott Carey 2011-08-17, 23:47
Vyacheslav Zholudev 2011-08-18, 12:50
Vyacheslav Zholudev 2011-08-17, 15:49
-
Map output records/reducer input records mismatch
Vyacheslav Zholudev 2011-08-16, 15:39

Hi,

I have multiple Hadoop jobs that use the Avro mapred API.
In only one of these jobs I see a visible mismatch between the number of map output records and the number of reducer input records.

Has anybody encountered such behavior? Can anybody think of possible explanations for it?

Any pointers/thoughts are highly appreciated!

Best,
Vyacheslav
-
Re: Map output records/reducer input records mismatch
Scott Carey 2011-08-16, 20:22

We have had one other report of something similar happening:
https://issues.apache.org/jira/browse/AVRO-782

What Avro version is this happening with? What JVM version?

On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args, if it is a Sun JRE 6u21 or later? (Some issues in loop predicates affect Java 6 too, just not as many as in the recent news on Java 7.)

Otherwise, it may well be the same thing as AVRO-782. Any extra information related to that issue would be welcome.

Thanks!

-Scott
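One way to confirm that a JVM really picked up a flag such as -XX:-UseLoopPredicate is to print its startup arguments from inside the process. The following is a minimal sketch using only the JDK; the class and method names are made up for illustration, and how the flag is passed to the Hadoop task JVMs (cluster or job configuration) is not shown here.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

class JvmFlagCheck {
    /** Returns true if the given flag appears verbatim among the JVM arguments. */
    static boolean hasFlag(List<String> jvmArgs, String flag) {
        return jvmArgs.contains(flag);
    }

    public static void main(String[] args) {
        // Options the current JVM was started with (for example -Xmx... or -XX:... options).
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM args: " + jvmArgs);
        System.out.println("Loop predicate disabled: "
                + hasFlag(jvmArgs, "-XX:-UseLoopPredicate"));
    }
}
```

Calling the same check from a task's setup code and logging the result would show whether the option reached the task JVM and not just the client JVM.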
-
Re: Map output records/reducer input records mismatch
Vyacheslav Zholudev 2011-08-16, 22:56

Hi Scott,

thanks for your reply.

> What Avro version is this happening with? What JVM version?

We are using Avro 1.5.1 and Sun JDK 6, but I will have to look up the exact version.

> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args if
> it is Sun and JRE 6u21 or later? (some issues in loop predicates affect
> Java 6 too, just not as many as the recent news on Java7).
>
> Otherwise, it may likely be the same thing as AVRO-782. Any extra
> information related to that issue would be welcome.

I will have to collect it. In the meantime, do you have any plausible explanation of the issue other than it being something like AVRO-782?

Thanks a lot,
Vyacheslav
-
Re: Map output records/reducer input records mismatch
Scott Carey 2011-08-17, 01:56

On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]> wrote:

> I will have to collect it. In the meanwhile, do you have any reasonable
> explanations of the issue besides it being something like AVRO-782?

What is your key type (map output schema, first type argument of Pair)? Is your key a Utf8 or a String?

I don't have a reasonable explanation at this point; I haven't looked into it in depth with a good reproducible case. I have my suspicions about how recycling of the key works, since Utf8 is mutable and its backing byte[] can end up shared.
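The general hazard behind that suspicion can be shown without Avro or Hadoop. The sketch below uses a hypothetical MutableKey class as a stand-in for a mutable string type (it is not Avro's Utf8) and a plain list as a stand-in for whatever buffers the keys. It only illustrates why reusing one mutable object is risky when something holds on to the reference; it is not a diagnosis of the problem reported in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class MutableKeyAliasing {
    /** Hypothetical stand-in for a mutable string type; NOT Avro's Utf8. */
    static final class MutableKey {
        private byte[] bytes = new byte[0];

        MutableKey set(String s) {
            bytes = s.getBytes(StandardCharsets.UTF_8);
            return this;
        }

        MutableKey copy() {
            MutableKey c = new MutableKey();
            c.bytes = bytes.clone();
            return c;
        }

        @Override
        public String toString() {
            return new String(bytes, StandardCharsets.UTF_8);
        }
    }

    /** Buffers keys by reference (copy == false) or by value (copy == true). */
    static List<String> buffer(boolean copy, String... inputs) {
        MutableKey reused = new MutableKey();
        List<MutableKey> buffered = new ArrayList<>();
        for (String in : inputs) {
            reused.set(in);
            // Without a copy, every list entry points at the same object.
            buffered.add(copy ? reused.copy() : reused);
        }
        List<String> out = new ArrayList<>();
        for (MutableKey k : buffered) {
            out.add(k.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(buffer(false, "a", "b", "c")); // [c, c, c]
        System.out.println(buffer(true, "a", "b", "c"));  // [a, b, c]
    }
}
```

If a framework serializes each key immediately at collect time, reuse is safe; the problem only appears when a reference outlives the next mutation.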
-
Re: Map output records/reducer input records mismatch
Vyacheslav Zholudev 2011-08-17, 08:32

Hi Scott,

The pair type is Pair<CharSequence, SomeSpecificJavaClass>, but in practice I always pass a java.lang.String object when I call collect().

The reduce method is
    reduce(CharSequence key, Iterable<SomeSpecificJavaClass> values, .....)

Some more detailed info:
the jobtracker and namenode run with:
    java version "1.6.0_22"
    Java(TM) SE Runtime Environment (build 1.6.0_22-b04)
    Java HotSpot(TM) 64-Bit Server VM (build 17.1-b03, mixed mode)

the tasktrackers and datanodes run with:
    java version "1.6.0_24"
    Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
    Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

Hadoop version is:
    cdh3u1

Thanks for the suggestions,
Vyacheslav
-
Re: Map output records/reducer input records mismatch
Scott Carey 2011-08-17, 17:06

On 8/17/11 1:32 AM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]> wrote:

> The pair types are Pair<CharSequence, SomeSpecificJavaClass>, but in essence
> when I call "collect()" then I always provide a java.lang.String object.
>
> The reduce method is
> reduce(CharSequence key, Iterable<SomeSpecificJavaClass> values, .....)

What happens if you change it to Pair<String, SomeSpecificJavaClass> or Pair<Utf8, SomeSpecificJavaClass>? Does the problem persist?
-
Re: Map output records/reducer input records mismatch
Vyacheslav Zholudev 2011-08-17, 12:02

btw,

I was thinking of trying it with Utf8 objects instead of Strings, and I wanted to reuse the same Utf8 object instead of creating a new one from a String on each map() call.
Why doesn't the Utf8 class have a method for setting the bytes from a String object?

I created the following code snippet:

    public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
        // Encode the String as UTF-8 bytes.
        byte[] strBytes = Utf8.getBytesFor(strToReuse);
        // Resize the container, then copy the bytes into its backing array.
        container.setByteLength(strBytes.length);
        System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
        return container;
    }

Would it be useful to have this code encapsulated in the Utf8 class?

Best,
Vyacheslav
-
Re: Map output records/reducer input records mismatch
Scott Carey 2011-08-17, 17:18

On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]> wrote:

> I was thinking to try it with Utf8 objects instead of strings and I wanted to
> reuse the same Utf8 object instead of creating new from String upon each map()
> call.
> Why does not the Utf8 class have a method for setting bytes via a String
> object?

We could add that, but it won't help performance much in this case, since the performance improvement from reuse has more to do with the underlying byte[] than with the Utf8 object.
The expensive part of String is the conversion from an underlying char[] to a byte[] (Utf8.getBytesFor()), so this would not help much. It would probably be faster to use String directly rather than wrap it with Utf8 each time.

Rather than have a static method like the one below, I would propose an instance method that does the same thing, something like

    public void setValue(String val) {
        // gets bytes, replaces private byte array, replaces cached string; no system array copy.
    }

which would be much more efficient.

> I created the following code snippet:
>
> public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>     byte[] strBytes = Utf8.getBytesFor(strToReuse);
>     container.setByteLength(strBytes.length);
>     System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
>     return container;
> }
>
> Would that be useful if this code is encapsulated into the Utf8 class?
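The proposed instance method can be sketched on a small stand-alone class. ReusableText below is hypothetical and written only for illustration; it is not Avro's Utf8 and makes no claim about how Avro implements or names such a method. It shows the idea from the message above: encode once, swap the array reference instead of copying into the old array, and keep the original String as the cached text.

```java
import java.nio.charset.StandardCharsets;

class ReusableText {
    private byte[] bytes = new byte[0];
    private int length;
    private String cached;

    /** Replaces the content in one step: encode once, swap the array, keep the String. */
    void setValue(String val) {
        byte[] encoded = val.getBytes(StandardCharsets.UTF_8);
        this.bytes = encoded;         // swap the reference; no System.arraycopy
        this.length = encoded.length;
        this.cached = val;            // toString() can return this without decoding
    }

    int getByteLength() {
        return length;
    }

    @Override
    public String toString() {
        if (cached == null) {
            cached = new String(bytes, 0, length, StandardCharsets.UTF_8);
        }
        return cached;
    }

    public static void main(String[] args) {
        ReusableText text = new ReusableText();
        text.setValue("first");
        text.setValue("h\u00e9llo"); // the accented letter takes two bytes in UTF-8
        System.out.println(text.getByteLength()); // 6
    }
}
```

The encoding step still allocates a new byte[] on every call, which matches the point made above: reusing the wrapper object saves little when the expensive part is the char[] to byte[] conversion.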
-
Re: Map output records/reducer input records mismatch
Vyacheslav Zholudev 2011-08-17, 18:09

On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:

> Rather than have a static method like the below, I would propose that an
> instance method be made that does the same thing, something like
>
> public void setValue(String val) {
>     // gets bytes, replaces private byte array, replaces cached string; no system array copy.
> }
>
> which would be much more efficient.

Thanks for the reply. Yes, that is true if this code is encapsulated in the Utf8 class. I just couldn't replace the private array from outside the class, obviously.

Vyacheslav
-
Re: Map output records/reducer input records mismatch
Vyacheslav Zholudev 2011-08-17, 22:02

Hi Scott,

thanks for all the suggestions. I really appreciate your support.
Unfortunately, I have not been able to solve the problem so far. This is what I have tried:
1. Switched to Utf8 everywhere, including changing the interface to <Utf8, SomeSpecificJavaClass>
2. Always generate new instances before collecting (new Utf8("fromString") for the key, clone for the value)

The problem persists: records seem to get lost between mapper and reducer.
Interestingly, it is only reproducible with large datasets. If I run a relatively small set of 6 million input rows, I do not get any difference; on a 10 million row input dataset, however, the difference shows up:

Map input records: 10,000,000
Map input bytes: 11,458,340,172
Map output bytes: 30,420,106,592
Map output records: 28,196,842
Reduce input records: 28,053,314

I'm trying to simplify the job further. Do you have any further ideas?

Thanks,
Vyacheslav
-
Re: Map output records/reducer input records mismatch
Vyacheslav Zholudev 2011-08-17, 22:59

There is a possible reason:
It seems that there is an upper limit of 10,001 records per reduce input group (or is there a setting?).

If I output one million rows with the same key, I get:
Map output records: 1,000,000
Reduce input groups: 1
Reduce input records: 10,001

If I output one million rows with 20 different keys, I get:
Map output records: 1,000,000
Reduce input groups: 20
Reduce input records: 200,020

If I output one million rows with unique keys, I get:
Map output records: 1,000,000
Reduce input groups: 1,000,000
Reduce input records: 1,000,000

Btw., I am running on 5 nodes with a total map task capacity of 10 and a total reduce task capacity of 10.

Thanks,
Vyacheslav
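The three sets of counters above are consistent with a cap of 10,001 values counted per key. A small sketch of that arithmetic follows; it assumes the records are spread evenly over the keys, which the message does not state, and the class and method names are made up for illustration.

```java
class GroupCapArithmetic {
    /**
     * Expected "Reduce input records" if every key has the same number of
     * records and at most `cap` values are counted per key.
     */
    static long expectedReduceInputRecords(long mapOutputRecords, long distinctKeys, long cap) {
        long perKey = mapOutputRecords / distinctKeys; // assumption: even split across keys
        return distinctKeys * Math.min(perKey, cap);
    }

    public static void main(String[] args) {
        long cap = 10_001;
        System.out.println(expectedReduceInputRecords(1_000_000, 1, cap));         // 10001
        System.out.println(expectedReduceInputRecords(1_000_000, 20, cap));        // 200020
        System.out.println(expectedReduceInputRecords(1_000_000, 1_000_000, cap)); // 1000000
    }
}
```

With 20 keys, each key would hold 50,000 records under the even-split assumption, so 20 x 10,001 = 200,020 matches the reported counter.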
-
Re: Map output records/reducer input records mismatch
Scott Carey 2011-08-17, 23:47

That is very interesting... I don't see how Avro could affect that.

Does anyone else have any ideas how Avro might cause the below?

-Scott

On 8/17/11 3:59 PM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]> wrote:

> There is a possible reason:
> It seems that there is an upper limit of 10,001 records per reduce input
> group. (or is there a setting?)
>
> If I output one million rows with the same key, I get:
> Map output records: 1,000,000
> Reduce input groups: 1
> Reduce input records: 10,001
>
> If I output one million rows with 20 different keys, I get:
> Map output records: 1,000,000
> Reduce input groups: 20
> Reduce input records: 200,020
>
> If I output one million rows with unique keys, I get:
> Map output records: 1,000,000
> Reduce input groups: 1,000,000
> Reduce input records: 1,000,000
>
> Btw., I am running on 5 nodes with total map task capacity of 10 and total
> reduce task capacity of 10.
Re: Map output records/reducer input records mismatch
Vyacheslav Zholudev 2011-08-18, 12:50
Hi Scott,
The problem is found. In the reduce job when there were too many values for some key, I stopped reading values from an iterator. So apparently the rest of the values were not counted. I thought, in case of sequence files unread values were counted in any case. That's why I didn't think about it from the very beginning.

Thanks for the support,
Vyacheslav

On Aug 18, 2011, at 1:47 AM, Scott Carey wrote:

> That is very interesting… I don't see how Avro could affect that.
>
> Does anyone else have any ideas how Avro might cause the below?
>
> -Scott
>
> On 8/17/11 3:59 PM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]> wrote:
>
>> There is a possible reason:
>> It seems that there is an upper limit of 10,001 records per reduce input group. (or is there a setting?)
>>
>> If I output one million rows with the same key, I get:
>> Map output records: 1,000,000
>> Reduce input groups: 1
>> Reduce input records: 10,001
>>
>> If I output one million rows with 20 different keys, I get:
>> Map output records: 1,000,000
>> Reduce input groups: 20
>> Reduce input records: 200,020
>>
>> If I output one million rows with unique keys, I get:
>> Map output records: 1,000,000
>> Reduce input groups: 1,000,000
>> Reduce input records: 1,000,000
>>
>> Btw., I am running on 5 nodes with total map task capacity of 10 and total reduce task capacity of 10.
>>
>> Thanks,
>> Vyacheslav
>>
>> On Aug 17, 2011, at 7:18 PM, Scott Carey wrote:
>>
>>> On 8/17/11 5:02 AM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]> wrote:
>>>
>>>> btw,
>>>>
>>>> I was thinking to try it with Utf8 objects instead of strings and I wanted to reuse the same Utf8 object instead of creating new from String upon each map() call.
>>>> Why does not the Utf8 class have a method for setting bytes via a String object?
>>>
>>> We could add that, but it won't help performance much in this case since the performance improvement from reuse has more to do with the underlying byte[] than the Utf8 object.
>>> The expensive part of String is the conversion from an underlying char[] to a byte[] (Utf8.getBytesFor()), so this would not help much. It would probably be faster to use String directly rather than wrap it with Utf8 each time.
>>>
>>> Rather than have a static method like the below, I would propose that an instance method be made that does the same thing, something like
>>>
>>> public void setValue(String val) {
>>>   // gets bytes, replaces private byte array, replaces cached string,
>>>   // no system array copy.
>>> }
>>>
>>> which would be much more efficient.
>>>
>>>>
>>>> I created the following code snippet:
>>>>
>>>> public static Utf8 reuseUtf8Object(Utf8 container, String strToReuse) {
>>>>   byte[] strBytes = Utf8.getBytesFor(strToReuse);
>>>>   container.setByteLength(strBytes.length);
>>>>   System.arraycopy(strBytes, 0, container.getBytes(), 0, strBytes.length);
>>>>   return container;
>>>> }
>>>>
>>>> Would that be useful if this code is encapsulated into the Utf8 class?
>>>>
>>>> Best,
>>>> Vyacheslav
>>>>
>>>> On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:
>>>>
>>>>> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]>
>>>>> wrote:
>>>>>
>>>>>> Hi, Scott,
>>>>>>
>>>>>> thanks for your reply.
>>>>>>
>>>>>>> What Avro version is this happening with? What JVM version?
>>>>>>
>>>>>> We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will have
>>>>>> to look up.
>>>>>>
>>>>>>>
>>>>>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args
>>>>>>> if
>>>>>>> it is Sun and JRE 6u21 or later? (some issues in loop predicates affect
>>>>>>> Java 6 too, just not as many as the recent news on Java7).
>>>>>>>
>>>>>>> Otherwise, it may likely be the same thing as AVRO-782. Any extra
>>>>>>> information related to that issue would be welcome.
>>>>>>
>>>>>> I will have to collect it. In the meanwhile, do you have any reasonable
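The explanation in this message (a reducer that stops pulling values is only charged for the values it actually read) can be reproduced with a small plain-Java simulation. Nothing here is Hadoop or Avro code: `CountingIterator`, `reduceOneGroup` and the cap of 10,001 are hypothetical names and values chosen to mirror the counters quoted in the thread, on the assumption that the framework increments "Reduce input records" as the value iterator advances.

```java
import java.util.Iterator;
import java.util.stream.IntStream;

public class EarlyExitCounterDemo {

    /** Wraps an iterator and counts how many values were actually pulled. */
    static final class CountingIterator<T> implements Iterator<T> {
        private final Iterator<T> delegate;
        private long consumed = 0;

        CountingIterator(Iterator<T> delegate) {
            this.delegate = delegate;
        }

        @Override
        public boolean hasNext() {
            return delegate.hasNext();
        }

        @Override
        public T next() {
            consumed++; // stands in for the framework's per-record counter
            return delegate.next();
        }

        long consumed() {
            return consumed;
        }
    }

    /** Hypothetical reduce body that gives up after maxValuesPerKey values. */
    public static long reduceOneGroup(int valuesInGroup, int maxValuesPerKey) {
        CountingIterator<Integer> values =
            new CountingIterator<>(IntStream.range(0, valuesInGroup).iterator());
        int seen = 0;
        while (values.hasNext()) {
            values.next();
            if (++seen >= maxValuesPerKey) {
                break; // remaining values are never read, so never counted
            }
        }
        return values.consumed();
    }

    /** Sums the simulated counter over all key groups. */
    public static long totalConsumed(int groups, int valuesPerGroup, int maxValuesPerKey) {
        long total = 0;
        for (int g = 0; g < groups; g++) {
            total += reduceOneGroup(valuesPerGroup, maxValuesPerKey);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalConsumed(1, 1_000_000, 10_001));   // one key
        System.out.println(totalConsumed(20, 50_000, 10_001));     // 20 keys
        System.out.println(totalConsumed(1_000_000, 1, 10_001));   // unique keys
    }
}
```

With one million rows split over 1, 20 and 1,000,000 keys, the simulated totals are 10,001, 200,020 and 1,000,000, which matches the pattern reported earlier in the thread.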
Re: Map output records/reducer input records mismatch
Vyacheslav Zholudev 2011-08-17, 15:49
One more update:
running the job with the -XX:-UseLoopPredicate option gave the same results. The difference between mapper output records and reducer input records is persistent.

Thanks!
Vyacheslav

On Aug 17, 2011, at 3:56 AM, Scott Carey wrote:

> On 8/16/11 3:56 PM, "Vyacheslav Zholudev" <[EMAIL PROTECTED]>
> wrote:
>
>> Hi, Scott,
>>
>> thanks for your reply.
>>
>>> What Avro version is this happening with? What JVM version?
>>
>> We are using Avro 1.5.1 and Sun JDK 6, but the exact version I will have
>> to look up.
>>
>>>
>>> On a hunch, have you tried adding -XX:-UseLoopPredicate to the JVM args
>>> if
>>> it is Sun and JRE 6u21 or later? (some issues in loop predicates affect
>>> Java 6 too, just not as many as the recent news on Java7).
>>>
>>> Otherwise, it may likely be the same thing as AVRO-782. Any extra
>>> information related to that issue would be welcome.
>>
>> I will have to collect it. In the meanwhile, do you have any reasonable
>> explanations of the issue besides it being something like AVRO-782?
>
> What is your key type (map output schema, first type argument of Pair)?
> Is your key a Utf8 or String? I don't have a reasonable explanation at
> this point, I haven't looked into it in depth with a good reproducible
> case. I have my suspicions with how recycling of the key works since Utf8
> is mutable and its backing byte[] can end up shared.
>
>>
>> Thanks a lot,
>> Vyacheslav
>>
>>>
>>> Thanks!
>>>
>>> -Scott
>>>
>>> On 8/16/11 8:39 AM, "Vyacheslav Zholudev"
>>> <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm having multiple hadoop jobs that use the avro mapred API.
>>>> Only in one of the jobs I have a visible mismatch between a number of
>>>> map
>>>> output records and reducer input records.
>>>>
>>>> Does anybody encountered such a behavior? Can anybody think of possible
>>>> explanations of this phenomenon?
>>>>
>>>> Any pointers/thoughts are highly appreciated!
>>>>
>>>> Best,
>>>> Vyacheslav
>>
>> Best,
>> Vyacheslav