Re: Generic data extraction from an Avro file
Yes, GenericData.Record#toString() should generate valid JSON.  It
does lose some information, e.g.:
 - record names; and
 - the distinction between strings & enum symbols, ints & longs,
floats & doubles, and maps & records.

JsonEncoder loses less information.  Together with the schema, its
output always contains enough to reconstitute an equivalent object.
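
For illustration, a minimal sketch of encoding a record with JsonEncoder
(the helper name and the in-memory output stream are assumptions for the
example, not from the original message):

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.io.JsonEncoder;

public class JsonEncodeSketch {
  // Encode a record as JSON that, together with the schema, can be decoded
  // back into an equivalent object (hypothetical helper, for illustration).
  static String toJson(GenericRecord record, Schema schema) throws Exception {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    GenericDatumWriter<GenericRecord> writer =
        new GenericDatumWriter<GenericRecord>(schema);
    JsonEncoder encoder = EncoderFactory.get().jsonEncoder(schema, out);
    writer.write(record, encoder);
    encoder.flush();
    return out.toString("UTF-8");
  }
}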

Doug
On Tue, Feb 5, 2013 at 11:53 AM, Public Network Services
<[EMAIL PROTECTED]> wrote:
> Folks,
>
> Assuming an application that only needs to quickly examine the contents of a
> bunch of Avro data files (regardless of binary or JSON encoding and without
> any prior knowledge of the schema or object structure), a simple approach
> could be to just extract the Avro records as JSON text:
>
> 1. Create a DataFileStream<GenericRecord>(FileInputStream,
>    GenericDatumReader<GenericRecord>) from a FileInputStream to the file.
>    (If the file is not an Avro data file, an IOException is thrown.)
> 2. Read GenericRecord records from the DataFileStream object while its
>    hasNext() method returns true.
> 3. Convert each GenericRecord object read into a JSON string by calling its
>    toString() method.
>
> For the test datasets in the Avro 1.7.3 distribution, this actually works
> fine.
>
> My question is, does anyone see any potential problems for (binary or JSON
> encoded) Avro data files, given the above logic? For example, should the
> GenericRecord.toString() method always produce a valid JSON string?
>
> Thanks!
>
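
For reference, a minimal sketch of the approach outlined in the quoted
message (the class name and the command-line argument are illustrative
assumptions, not from the original message):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class DumpAvroAsJson {
  public static void main(String[] args) throws IOException {
    // Throws IOException if the input is not an Avro data file.
    FileInputStream in = new FileInputStream(new File(args[0]));
    DataFileStream<GenericRecord> stream =
        new DataFileStream<GenericRecord>(in, new GenericDatumReader<GenericRecord>());
    try {
      while (stream.hasNext()) {
        GenericRecord record = stream.next();
        // toString() renders the record as JSON text.
        System.out.println(record.toString());
      }
    } finally {
      stream.close();
    }
  }
}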