-
Collecting union-ed Records in AvroReducer
Andrew Kenworthy 2011-12-08, 12:10
Hallo,
is it possible to write/collect a union-ed record from an avro reducer?
I have a reduce class (extending AvroReducer), and the output schema is a union schema of record type A and record type B. In the reduce logic I want to combine instances of A and B in the same datum, passing it to my AvroCollector. My code looks a bit like this:
Record unionRecord = new GenericData.Record(myUnionSchema); // not legal!
unionRecord.put("type A", recordA);
unionRecord.put("type B", recordB);
collector.collect(unionRecord);
but the GenericData.Record constructor expects a record schema. How can I write both records such that they appear in the same output datum?
Andrew
-
Re: Collecting union-ed Records in AvroReducer
Gaurav Nanda 2011-12-08, 14:32
You don't need to construct a record object. You can just write your RecordA/RecordB objects directly.
Sample Writer:
DatumWriter<Object> datum = new GenericDatumWriter<Object>(schema);
DataFileWriter<Object> writer = new DataFileWriter<Object>(datum);
FileOutputStream out = new FileOutputStream("h:\\TestFile.avro");
writer.create(schema, out);
writer.append(1050324); // You can write your recordA/recordB here.
writer.close();
Sample Reader:
File out = new File("h:\\TestFile.avro");
GenericDatumReader<Object> datum = new GenericDatumReader<Object>();
DataFileReader<Object> reader = new DataFileReader<Object>(out, datum);
while (reader.hasNext()) {
    System.out.println(reader.next());
}
reader.close();
Hope this helps.
Thanks, Gaurav Nanda
-
Fw: Collecting union-ed Records in AvroReducer
Andrew Kenworthy 2011-12-08, 15:03
----- Forwarded Message -----
From: Andrew Kenworthy <[EMAIL PROTECTED]>
To: Gaurav Nanda <[EMAIL PROTECTED]>
Sent: Thursday, December 8, 2011 3:47 PM
Subject: Re: Collecting union-ed Records in AvroReducer

Hallo Gaurav,

Thank you for your reply. My problem is that the writer is implemented by GenericDatumWriter, which is called via Hadoop, i.e. in my code I only have direct access to an AvroCollector object, which, several layers later, invokes a GenericDatumWriter. I don't really want to have to re-implement a lot of code that the avro-mapred package provides for me.

But I think I can get around this by defining my output schema as one with a nested record structure, and "embedding" my type B record within type A. That way I am emitting a single record, albeit one holding a composition of my two entities.

Andrew
-
Re: Collecting union-ed Records in AvroReducer
Doug Cutting 2011-12-08, 17:05
On 12/08/2011 04:10 AM, Andrew Kenworthy wrote:
> is it possible to write/collect a union-ed record from an avro reducer?
>
> I have a reduce class (extending AvroReducer), and the output schema is
> a union schema of record type A and record type B. In the reduce logic I
> want to combine instances of A and B in the same datum, passing it to my
> Avrocollector.
I think you mean you want to pass instances of either A or B to the collector, right? With a union of A and B, you should be able to just:
collector.collect(recordA);
or
collector.collect(recordB);
Does this not work for you?
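A minimal sketch of why this can work, assuming hypothetical schemas for A and B (the field names `id` and `label` and the class name are made up here). With a union schema, Avro picks the matching branch from the datum itself, so each record is handed over as-is. The sketch calls GenericData.resolveUnion directly instead of a real AvroCollector, which would need a running Hadoop job.

```java
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class UnionEitherSketch {
    // Hypothetical stand-ins for record types A and B.
    static final Schema RECORD_A = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"RecordA\","
      + "\"fields\":[{\"name\":\"id\",\"type\":\"int\"}]}");
    static final Schema RECORD_B = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"RecordB\","
      + "\"fields\":[{\"name\":\"label\",\"type\":\"string\"}]}");

    // The union ["RecordA", "RecordB"]: a datum is one branch or the other.
    static final Schema UNION =
        Schema.createUnion(Arrays.asList(RECORD_A, RECORD_B));

    // Index of the union branch that Avro selects for this datum.
    static int branchOf(Object datum) {
        return GenericData.get().resolveUnion(UNION, datum);
    }

    public static void main(String[] args) {
        GenericRecord a = new GenericData.Record(RECORD_A);
        a.put("id", 1);
        GenericRecord b = new GenericData.Record(RECORD_B);
        b.put("label", "x");

        // Each record matches a branch on its own; no wrapper record is needed.
        System.out.println("A resolves to branch " + branchOf(a));
        System.out.println("B resolves to branch " + branchOf(b));
    }
}
```

In a reducer the same records would simply be passed to collector.collect(...), one call per record.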
Doug
-
Re: Collecting union-ed Records in AvroReducer
Scott Carey 2011-12-08, 17:45
On 12/8/11 4:10 AM, "Andrew Kenworthy" <[EMAIL PROTECTED]> wrote:
> Hallo,
>
> is it possible to write/collect a union-ed record from an avro reducer?
>
> I have a reduce class (extending AvroReducer), and the output schema is a
> union schema of record type A and record type B. In the reduce logic I
> want to combine instances of A and B in the same datum, passing it to my
> Avrocollector. My code looks a bit like this:
If both records were created in the reducer, you can call collect twice, once with each record. Collect in general can be called as many times as you wish.
If you want to combine two records into a single datum rather than emit multiple datums, you do not want a union; you need a Record. A union datum holds exactly one of its branches at a time.
In short, do you want to emit both records individually or as a pair? If it is a pair, you need a Record; if it is multiple outputs or either/or, it is a Union.

> Record unionRecord = new GenericData.Record(myUnionSchema); // not legal!
> unionRecord.put("type A", recordA);
> unionRecord.put("type B", recordB);
>
> collector.collect(unionRecord);
>
> but GenericData.Record constructor expects a Record Schema. How can I
> write both records such that they appear in the same output datum?
If your output is either one type or another, see Doug's answer.
For multiple datums, the output schema is a union of the two records (each datum is either one or the other):
["RecordA", "RecordB"]
and the code is:
collector.collect(recordA);
collector.collect(recordB);

If you want a single datum that contains both a RecordA and a RecordB, your output schema needs to be a Record with two fields:
{"type":"record", "fields":[
  {"name":"recordA", "type":"RecordA"},
  {"name":"recordB", "type":"RecordB"}
]}
And you would use this record schema to create the GenericRecord, and then populate the fields with the inner records, then call collect once with the outer record.
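A minimal sketch of those steps, under a few assumptions: the outer record is given the hypothetical name `Pair` (Avro requires a name on every record schema), the inner records are defined inline with made-up fields `id` and `label`, and the final collect call is only shown as a comment because it needs a real reducer.

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class PairRecordSketch {
    // Outer record with one field per inner record. All names are illustrative.
    static final Schema PAIR = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"Pair\",\"fields\":["
      + "{\"name\":\"recordA\",\"type\":{\"type\":\"record\",\"name\":\"RecordA\","
      + "\"fields\":[{\"name\":\"id\",\"type\":\"int\"}]}},"
      + "{\"name\":\"recordB\",\"type\":{\"type\":\"record\",\"name\":\"RecordB\","
      + "\"fields\":[{\"name\":\"label\",\"type\":\"string\"}]}}"
      + "]}");

    static GenericRecord buildPair(int id, String label) {
        // Build each inner record from the schema of its field.
        GenericRecord a = new GenericData.Record(PAIR.getField("recordA").schema());
        a.put("id", id);
        GenericRecord b = new GenericData.Record(PAIR.getField("recordB").schema());
        b.put("label", label);

        // Populate the outer record with both inner records.
        GenericRecord pair = new GenericData.Record(PAIR);
        pair.put("recordA", a);
        pair.put("recordB", b);
        return pair;
    }

    public static void main(String[] args) {
        GenericRecord pair = buildPair(1, "x");
        // Check that the datum conforms to the outer record schema.
        System.out.println(GenericData.get().validate(PAIR, pair));
        // In the reducer this would be a single call: collector.collect(pair);
    }
}
```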
Another choice is to make the output an Avro array of the union type, which may hold any number of RecordA and RecordB instances in a single datum.
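A rough sketch of the array option, assuming hypothetical inline definitions of RecordA and RecordB (the fields `id` and `label` and the class name are invented for illustration). The output schema is an array whose items are the union, and a GenericData.Array serves as the single datum.

```java
import java.util.Arrays;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

class UnionArraySketch {
    // Hypothetical stand-ins for record types A and B.
    static final Schema RECORD_A = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"RecordA\","
      + "\"fields\":[{\"name\":\"id\",\"type\":\"int\"}]}");
    static final Schema RECORD_B = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"RecordB\","
      + "\"fields\":[{\"name\":\"label\",\"type\":\"string\"}]}");

    // Equivalent to {"type":"array","items":["RecordA","RecordB"]}.
    static final Schema ITEMS = Schema.createArray(
        Schema.createUnion(Arrays.asList(RECORD_A, RECORD_B)));

    static GenericData.Array<Object> buildDatum() {
        GenericRecord a = new GenericData.Record(RECORD_A);
        a.put("id", 1);
        GenericRecord b = new GenericData.Record(RECORD_B);
        b.put("label", "x");

        // One datum holding a mix of A and B records, in any order and number.
        GenericData.Array<Object> items = new GenericData.Array<Object>(2, ITEMS);
        items.add(a);
        items.add(b);
        return items;
    }

    public static void main(String[] args) {
        GenericData.Array<Object> datum = buildDatum();
        System.out.println(GenericData.get().validate(ITEMS, datum));
        // In the reducer this would be a single call: collector.collect(datum);
    }
}
```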
-
Re: Collecting union-ed Records in AvroReducer
Andrew Kenworthy 2011-12-13, 10:27
Thank you, Scott. That has cleared up some misunderstanding on my part. I want to emit both records as a Pair, and have now implemented that by using a Record schema holding two sub-records, one for type A and one for type B, so I can just write the relevant datum to the correct sub-record, which gives me exactly what I need.
Andrew