Avro >> mail # user >> Generic data extraction from an Avro file


Generic data extraction from an Avro file
Folks,

Suppose an application only needs to quickly examine the contents of
a bunch of Avro data files (irrespective of binary or JSON encoding and
without any prior knowledge of the schema or object structure). One
approach would be to extract the Avro records as JSON text records. A
simple way to do this could be:

   1. Create a DataFileStream<GenericRecord> by passing a FileInputStream
   for the file and a GenericDatumReader<GenericRecord> to its constructor.
   (If the file is not an Avro data file, an IOException is thrown.)
   2. Read GenericRecord records from the DataFileStream object for as
   long as its hasNext() method returns true.
   3. Convert each GenericRecord object read into a JSON string by
   calling its toString() method.
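
The three steps above can be sketched roughly as follows. This is only a minimal illustration, assuming the Avro Java library is on the classpath; the class name AvroJsonDump and the helper recordsAsJson are made up for this example and are not part of Avro.

```java
import org.apache.avro.file.DataFileStream;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named for this example only.
public class AvroJsonDump {

    // Reads every record from an Avro data file stream and returns the
    // toString() form of each one. No schema is supplied up front: the
    // reader picks up the writer's schema from the file header.
    static List<String> recordsAsJson(InputStream in) throws IOException {
        List<String> out = new ArrayList<>();
        // Step 1: the constructor throws IOException if the input is not
        // an Avro data file.
        try (DataFileStream<GenericRecord> stream =
                 new DataFileStream<>(in, new GenericDatumReader<GenericRecord>())) {
            // Step 2: iterate for as long as hasNext() returns true.
            while (stream.hasNext()) {
                GenericRecord record = stream.next();
                // Step 3: use toString() as the text form of the record.
                out.add(record.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Each command-line argument is treated as a path to a data file.
        for (String path : args) {
            try (InputStream in = new FileInputStream(path)) {
                for (String json : recordsAsJson(in)) {
                    System.out.println(json);
                }
            }
        }
    }
}
```

Run with one or more file paths as arguments, this prints one line of text per record. Whether each line is strictly valid JSON depends entirely on what toString() emits, which is the open question below.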

For the test datasets in the Avro 1.7.3 distribution, this actually works
fine.

My question is: does anyone see any potential problems with the above
logic for (binary or JSON encoded) Avro data files? For example, is the
GenericRecord.toString() method guaranteed to always produce a valid JSON
string?

Thanks!
Doug Cutting 2013-02-05, 22:58
Public Network Services 2013-02-05, 23:30
Doug Cutting 2013-02-06, 00:00