Re: Avro MapReduce (MR1): Prevent Key from being output by reducer when using Pair schema
Harsh J 2014-01-16, 13:05
Thanks Ed! Can you also file an improvement JIRA under
https://issues.apache.org/jira/browse/AVRO with a patch that changes
it to make more sense?
On Thu, Jan 16, 2014 at 5:14 PM, ed <[EMAIL PROTECTED]> wrote:
> Hi Harsh,
> Thank you for your response, which was invaluable in helping me figure out
> my issue. The Javadoc is in fact incorrect when it states that
> AvroJob.setOutputSchema cannot accept non-Pair schemas; it turns out it
> can. What was throwing me off is that if you use AvroJob.setOutputSchema to
> set a non-Pair schema, then you also need to call AvroJob.setMapOutputSchema
> (which does require a Pair schema). Otherwise, by default, the map output
> schema is set to whatever you passed to setOutputSchema, and if that is
> non-Pair you'll get an error at runtime.
> Maybe the JavaDoc should say something along the lines of:
>> Configure a job's output schema. If this is not a Pair schema, then you
>> must explicitly set the job's map output schema using setMapOutputSchema.
> Thank you!
> Best Regards,
> On Thu, Jan 16, 2014 at 6:47 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>> Hello Ed,
>> The AvroReducer per
>> has a simple spec of <K,V,OUT>, where OUT can be any record type and
>> not necessarily a Pair<KO,VO> type.
>> AvroJob.setOutputSchema(…) should accept non-pair configs. I think its
>> java-doc is incorrect though. I wrote a test case yesterday at
>> http://issues.apache.org/jira/browse/AVRO-1439, in which I set a
>> non-Pair schema via the same call without any trouble. We could get
>> the java-doc fixed, if it is indeed wrong.
>> On Thu, Jan 16, 2014 at 2:14 PM, ed <[EMAIL PROTECTED]> wrote:
>> > Hello,
>> > I am currently reading in lots of small avro files and then writing them
>> > out into one large avro file using Map Reduce MR1. I'm trying to do this
>> > using the AvroMapper and AvroReducer and it's almost working how I want.
>> > The problem right now is that it looks like I have to use
>> > "org.apache.avro.mapred.Pair" if I use "AvroJob.setOutputSchema". Is
>> > there a way to output a Pair schema from AvroReducer and have the "key" in
>> > that schema be ignored (i.e., not included in the output from the reducer)?
>> > Right now when I check the Reducer output there is an added field in
>> > each record called "key" which I'd like to not have there.
>> > Essentially I'm looking for something like NullWritable where the key
>> > will just be ignored in the final output.
>> > Thank you for any assistance or guidance you can provide!
>> > Best Regards,
>> > Ed
>> Harsh J
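
For readers landing on this thread, here is a minimal sketch of the setup Ed
describes: a Pair schema for the map output and a plain (non-Pair) record
schema for the job output, so the reducer writes records without a "key"
field. The record schema "Event", its fields, and the class names
MergeAvroFiles, MergeMapper and MergeReducer are hypothetical and only for
illustration. Input/output paths and job submission are left out. It assumes
the avro-mapred (MR1) and Hadoop jars are on the classpath.

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.AvroReducer;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;

class MergeAvroFiles {
  // Hypothetical record schema; substitute the schema of your own files.
  static final Schema RECORD = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
      + "{\"name\":\"id\",\"type\":\"string\"},"
      + "{\"name\":\"payload\",\"type\":\"string\"}]}");

  static final Schema KEY = Schema.create(Schema.Type.STRING);

  // The map output (shuffle) schema has to be a Pair schema.
  static final Schema MAP_OUT = Pair.getPairSchema(KEY, RECORD);

  // Emits Pair<id, record> so the shuffle has a key to sort and group on.
  public static class MergeMapper
      extends AvroMapper<GenericRecord, Pair<CharSequence, GenericRecord>> {
    @Override
    public void map(GenericRecord datum,
                    AvroCollector<Pair<CharSequence, GenericRecord>> collector,
                    Reporter reporter) throws IOException {
      CharSequence id = String.valueOf(datum.get("id"));
      collector.collect(
          new Pair<CharSequence, GenericRecord>(id, KEY, datum, RECORD));
    }
  }

  // OUT is the bare record, not a Pair, so no "key" field reaches the output.
  public static class MergeReducer
      extends AvroReducer<CharSequence, GenericRecord, GenericRecord> {
    @Override
    public void reduce(CharSequence key, Iterable<GenericRecord> values,
                       AvroCollector<GenericRecord> collector,
                       Reporter reporter) throws IOException {
      for (GenericRecord value : values) {
        collector.collect(value);
      }
    }
  }

  static JobConf configure() {
    JobConf conf = new JobConf();
    AvroJob.setInputSchema(conf, RECORD);
    // Set the map output schema explicitly; per the thread, it otherwise
    // defaults to the (non-Pair) output schema and the job fails at runtime.
    AvroJob.setMapOutputSchema(conf, MAP_OUT);
    AvroJob.setOutputSchema(conf, RECORD);
    AvroJob.setMapperClass(conf, MergeMapper.class);
    AvroJob.setReducerClass(conf, MergeReducer.class);
    return conf;
  }
}
```

The order of the setMapOutputSchema and setOutputSchema calls should not
matter, since each one only stores its schema in the JobConf; what matters is
that both are called whenever the output schema is not a Pair.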