

Public Network Services 2013-02-05, 19:53
Doug Cutting 2013-02-05, 22:58
Public Network Services 2013-02-05, 23:30
Re: Generic data extraction from an Avro file
Yes, that should be possible. A given JsonEncoder instance only works
for a given schema, and every generic record conforms to a schema.

http://avro.apache.org/docs/current/api/java/org/apache/avro/io/EncoderFactory.html#jsonEncoder(org.apache.avro.Schema, java.io.OutputStream)
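
A minimal sketch of that approach, assuming the input is an Avro data
file whose header carries the writer schema. The class and method names
(AvroJsonDump, toJsonLines) are made up for illustration; only the Avro
calls themselves come from the library.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

// Hypothetical helper, not part of Avro.
final class AvroJsonDump {
    /** Returns the JsonEncoder form of every record in an Avro data file stream. */
    static List<String> toJsonLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (DataFileStream<GenericRecord> stream =
                 new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
            // The schema is taken from the file header, so the caller needs
            // no prior knowledge of the record structure.
            Schema schema = stream.getSchema();
            GenericDatumWriter<GenericRecord> writer = new GenericDatumWriter<>(schema);
            while (stream.hasNext()) {
                GenericRecord record = stream.next();
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                // One encoder per record keeps each JSON document separate.
                JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
                writer.write(record, encoder);
                encoder.flush();
                lines.add(out.toString("UTF-8"));
            }
        }
        return lines;
    }
}
```

Together with the schema, text produced this way should be readable
back through a JsonDecoder built for the same schema.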

Doug

On Tue, Feb 5, 2013 at 3:30 PM, Public Network Services
<[EMAIL PROTECTED]> wrote:
> Thanks for the clarification.
>
> Is there any way to use JsonEncoder in the scenario I mentioned, i.e. in
> totally schema-agnostic data extraction from either binary or JSON files?
>
>
> On Tue, Feb 5, 2013 at 2:58 PM, Doug Cutting <[EMAIL PROTECTED]> wrote:
>>
>> Yes, GenericData.Record#toString() should generate valid Json.  It
>> does lose some information, e.g.:
>>  - record names; and
>>  - the distinction between strings & enum symbols, ints & longs,
>> floats & doubles, and maps & records.
>>
>> JsonEncoder loses less information.  It saves enough information to,
>> with the schema, always reconstitute an equivalent object.
>>
>> Doug
>>
>>
>> On Tue, Feb 5, 2013 at 11:53 AM, Public Network Services
>> <[EMAIL PROTECTED]> wrote:
>> > Folks,
>> >
>> > Assuming an application that only needs to quickly examine the contents
>> > of a bunch of Avro data files (irrespective of binary or JSON encoding
>> > and without any prior schema or object structure knowledge), an approach
>> > could be to just extract the Avro records as text JSON records. To this
>> > effect, a simple approach could be:
>> >
>> >  - Create a DataFileStream<GenericRecord>(FileInputStream,
>> >    GenericDatumReader<GenericRecord>) from a FileInputStream to the file.
>> >    (If the file is not an Avro data file, an IOException is thrown.)
>> >  - Read GenericRecord records from the DataFileStream object while its
>> >    hasNext() method returns true.
>> >  - Convert each GenericRecord object read into a JSON string by calling
>> >    its toString() method.
>> >
>> > For the test datasets in the Avro 1.7.3 distribution, this actually
>> > works fine.
>> >
>> > My question is, does anyone see any potential problems for (binary or
>> > JSON encoded) Avro data files, given the above logic? For example, should
>> > the GenericRecord.toString() method always produce a valid JSON string?
>> >
>> > Thanks!
>> >
>
>
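
The three steps from the original message (open a DataFileStream, loop
while hasNext() is true, call toString() on each record) could be
sketched roughly as follows. The class and method names (AvroTextDump,
dump) are hypothetical, and the sketch assumes the input is an Avro
data file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

// Hypothetical helper, not part of Avro.
final class AvroTextDump {
    /** Returns the toString() form of every record in an Avro data file stream. */
    static List<String> dump(InputStream in) throws IOException {
        List<String> out = new ArrayList<>();
        // The constructor reads the file header and throws an IOException
        // if the input is not an Avro data file.
        try (DataFileStream<GenericRecord> stream =
                 new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
            while (stream.hasNext()) {
                // toString() is convenient, but as noted in the thread it drops
                // record names and the int/long, float/double, string/enum and
                // map/record distinctions.
                out.add(stream.next().toString());
            }
        }
        return out;
    }
}
```

This is enough for a quick look at file contents. When the text has to
be turned back into equivalent objects, the JsonEncoder route described
in the reply is the safer choice.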