Avro >> mail # user >> Avro MR job problem with empty strings


Re: Avro MR job problem with empty strings
Some ideas:

A string is encoded as a long length, followed by that number of UTF-8 bytes.
An empty string is therefore encoded as the number 0L -- which is one byte, 0x00.
It appears that it is trying to skip a string or long, but it has hit the end of the byte[].
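A quick sketch of that encoding in plain Java (a hand-rolled illustration of Avro's zig-zag varint format for lengths -- not the Avro library itself; class and method names here are made up for the example):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class AvroStringEncoding {
    // Avro writes a long as a zig-zag varint: (n << 1) ^ (n >> 63), 7 bits per byte,
    // high bit set on all bytes except the last.
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63);          // zig-zag: small magnitudes stay small
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    // A string is its byte length (encoded as above) followed by the UTF-8 bytes.
    static byte[] encodeString(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] len = encodeLong(utf8.length);
        byte[] result = new byte[len.length + utf8.length];
        System.arraycopy(len, 0, result, 0, len.length);
        System.arraycopy(utf8, 0, result, len.length, utf8.length);
        return result;
    }

    public static void main(String[] args) {
        byte[] empty = encodeString("");
        System.out.println(empty.length);   // 1 -- the single byte 0x00
        byte[] hi = encodeString("hi");
        System.out.println(hi.length);      // 3 -- length byte 0x04 (zig-zag 2), then 'h', 'i'
    }
}
```

So an empty string still occupies exactly one byte on the wire; if that byte is missing, a reader that tries to skip the string has nothing to consume.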

So it is expecting a long or a string to skip, and there is nothing there.
Perhaps the empty string was not encoded as an empty string, but skipped
entirely; or perhaps a long count or some other number is missing.  (What
is the schema being compared?)

WordCount is often key = word, val = count, and so the comparator would
need to read the string word and skip the long count.  If either of these
is left out and not written, I would expect the sort of error below.
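A rough, hand-rolled illustration of that failure mode (readLong and skipString below mimic what BinaryDecoder.readLong and GenericDatumReader.skip do during a binary compare; this is not Avro code, and the two-field {word: string, count: long} key layout is an assumption for the example):

```java
import java.io.EOFException;
import java.nio.ByteBuffer;

public class SkipEofDemo {
    // Read one zig-zag varint long; throw EOFException when the buffer runs out
    // mid-read, mirroring the stack trace in the quoted mail.
    static long readLong(ByteBuffer buf) throws EOFException {
        long z = 0;
        int shift = 0;
        int b;
        do {
            if (!buf.hasRemaining()) throw new EOFException();
            b = buf.get() & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);    // undo zig-zag
    }

    // Skip a string: read its length, then advance past that many bytes.
    static void skipString(ByteBuffer buf) throws EOFException {
        long len = readLong(buf);
        if (buf.remaining() < len) throw new EOFException();
        buf.position(buf.position() + (int) len);
    }

    public static void main(String[] args) throws EOFException {
        // Key encoded as {word: "", count: 1}: 0x00 (empty string), 0x02 (zig-zag 1).
        ByteBuffer good = ByteBuffer.wrap(new byte[] {0x00, 0x02});
        skipString(good);
        System.out.println(readLong(good));   // fine: the count is there

        // Key where the count was never written: only the empty string remains.
        ByteBuffer bad = ByteBuffer.wrap(new byte[] {0x00});
        skipString(bad);                      // consumes the one 0x00 byte
        try {
            readLong(bad);                    // nothing left: EOF, as in the trace
        } catch (EOFException e) {
            System.out.println("EOFException reading the long");
        }
    }
}
```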

I hope that helps,

-Scott

On 9/1/11 5:42 AM, "Friso van Vollenhoven" <[EMAIL PROTECTED]> wrote:

>Hi All,
>
>I am working on a modified version of the Avro MapReduce support to make
>it play nice with the new Hadoop API (0.20.2). Most of the code is
>borrowed from the Avro mapred package, but I decided not to fully
>abstract away the Mapper and Reducer classes (like Avro does now using
>HadoopMapper and HadoopReducer classes). All else is much the same as the
>mapred implementation.
>
>When testing, I ran into an issue when emitting empty strings (empty
>Utf8) from the mapper as key. I get the following:
>org.apache.avro.AvroRuntimeException: java.io.EOFException
> at org.apache.avro.io.BinaryData.compare(BinaryData.java:74)
> at org.apache.avro.io.BinaryData.compare(BinaryData.java:60)
> at org.apache.avro.mapreduce.AvroKeyComparator.compare(AvroKeyComparator.java:45)   <== this is my own code
> at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:120)
> at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:175)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:572)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:414)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:256)
>Caused by: java.io.EOFException
> at org.apache.avro.io.BinaryDecoder.readLong(BinaryDecoder.java:182)
> at org.apache.avro.generic.GenericDatumReader.skip(GenericDatumReader.java:389)
> at org.apache.avro.io.BinaryData.compare(BinaryData.java:86)
> at org.apache.avro.io.BinaryData.compare(BinaryData.java:72)
> ... 8 more
>
>
>The root cause stack trace is as follows (taken from debugger, breakpoint
>on the throw new EOFException(); line):
>Thread [Thread-11] (Suspended (breakpoint at line 182 in BinaryDecoder))
> BinaryDecoder.readLong() line: 182
> GenericDatumReader<D>.skip(Schema, Decoder) line: 389
> BinaryData.compare(BinaryData$Decoders, Schema) line: 86
> BinaryData.compare(byte[], int, int, byte[], int, int, Schema) line: 72
> BinaryData.compare(byte[], int, byte[], int, Schema) line: 60
> AvroKeyComparator<T>.compare(byte[], int, int, byte[], int, int) line: 45
> Reducer$Context(ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).nextKeyValue() line: 120
> Reducer$Context(ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).nextKey() line: 92
> AvroMapReduceTest$WordCountingAvroReducer(Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).run(Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>.Context) line: 175
> ReduceTask.runNewReducer(JobConf, TaskUmbilicalProtocol, TaskReporter, RawKeyValueIterator, RawComparator<INKEY>, Class<INKEY>, Class<INVALUE>) line: 572
> ReduceTask.run(JobConf, TaskUmbilicalProtocol) line: 414
> LocalJobRunner$Job.run() line: 256
>
>I went through the decoding code to see where this comes from, but I
>can't immediately spot where it goes wrong. I am guessing the actual
>problem is earlier during execution, where something possibly increments
>pos too often.
>
>Has anyone experienced this? I can live without emitting empty keys from
>MR jobs, but I ran into this implementing a word count job on a text file
>with empty lines (counting those could be a valid use case). I am using
>Avro 1.5.2.