Avro >> mail # user >> Re: Dynamic Schema


Re: Dynamic Schema
Hi Amit,

The Avro data file format requires the writer to know the schema from the start, because all records in the file are then written with the same schema. So there probably isn't an alternative to what you're doing: buffer as much as you can in memory, write it out to a file when the memory buffer is full, and then start a new file.
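
As a rough illustration of that buffer-then-roll approach, here is a minimal sketch in plain Python. It does not use the Avro library: the schema is built as an Avro-style JSON dict, and the actual file write is left to a callback. The names `infer_schema`, `RollingBuffer`, `write_file` and the type mapping are all made up for this example, and the inference only handles flat records of primitive values.

```python
# Hypothetical mapping from Python types to Avro primitive type names.
_AVRO_TYPES = {bool: "boolean", int: "long", float: "double",
               str: "string", bytes: "bytes"}


def infer_schema(records, name="Record"):
    """Build one Avro-style record schema (a dict) covering all buffered records."""
    field_types = {}  # field name -> set of Avro type names seen so far
    for rec in records:
        for key, value in rec.items():
            avro_type = "null" if value is None else _AVRO_TYPES[type(value)]
            field_types.setdefault(key, set()).add(avro_type)

    fields = []
    for key in sorted(field_types):
        types = set(field_types[key])
        # A field missing from some record has to be nullable.
        if any(key not in rec for rec in records):
            types.add("null")
        if "null" in types:
            branches = ["null"] + sorted(types - {"null"})
            avro_type = branches[0] if len(branches) == 1 else branches
            fields.append({"name": key, "type": avro_type, "default": None})
        else:
            branches = sorted(types)
            avro_type = branches[0] if len(branches) == 1 else branches
            fields.append({"name": key, "type": avro_type})
    return {"type": "record", "name": name, "fields": fields}


class RollingBuffer:
    """Buffer records in memory; on overflow, hand them to a new file."""

    def __init__(self, max_records, write_file):
        self.max_records = max_records
        # Assumed callback: write_file(schema, records, file_index).
        # In real use it would wrap an Avro data file writer.
        self.write_file = write_file
        self.records = []
        self.file_index = 0

    def append(self, record):
        self.records.append(record)
        if len(self.records) >= self.max_records:
            self.flush()

    def flush(self):
        if not self.records:
            return
        # The schema is fixed here, once per file, before anything is written.
        schema = infer_schema(self.records)
        self.write_file(schema, self.records, self.file_index)
        self.file_index += 1
        self.records = []
```

Each flush produces one file whose schema covers exactly the records in that buffer, so consecutive files can end up with different schemas.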

You can't change the schema of a data file once it has been written, but you can run a background process which merges several data files together, and writes the result to a new file. You can make the merged file's schema the union of all the input file schemas, or you can write some application-specific code which combines the schemas into one, and evolve all the records into that merged schema. This can be done by streaming through the files -- you don't need to keep all the data in memory.
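
To make the merge step more concrete, here is a small sketch of the second option (combine the schemas field by field, then evolve records into the result), again in plain Python on schemas represented as dicts rather than through the Avro library. `merge_record_schemas`, `evolve` and `merged_stream` are hypothetical helpers for this example; it assumes flat record schemas and that a field missing from some input may simply become nullable.

```python
def _branches(avro_type):
    """Return the union branches of a field type as a list."""
    return list(avro_type) if isinstance(avro_type, list) else [avro_type]


def merge_record_schemas(schemas, name=None):
    """Combine record schemas field by field into one record schema."""
    merged = {}  # field name -> type branches, in first-seen order
    for schema in schemas:
        for field in schema["fields"]:
            seen = merged.setdefault(field["name"], [])
            for branch in _branches(field["type"]):
                if branch not in seen:
                    seen.append(branch)

    fields = []
    for fname, branches in merged.items():
        in_every_schema = all(
            any(f["name"] == fname for f in s["fields"]) for s in schemas
        )
        # A field that some input lacks must accept null in the merged schema.
        if not in_every_schema and "null" not in branches:
            branches.append("null")
        if "null" in branches:
            # Put "null" first so that a null default is valid for the union.
            ordered = ["null"] + [b for b in branches if b != "null"]
            avro_type = ordered[0] if len(ordered) == 1 else ordered
            fields.append({"name": fname, "type": avro_type, "default": None})
        else:
            avro_type = branches[0] if len(branches) == 1 else branches
            fields.append({"name": fname, "type": avro_type})
    # Assumption: reuse the first input's record name unless one is given.
    return {"type": "record", "name": name or schemas[0]["name"],
            "fields": fields}


def evolve(records, merged_schema):
    """Lazily reshape records to the merged schema, one at a time."""
    for rec in records:
        out = {}
        for field in merged_schema["fields"]:
            fname = field["name"]
            if fname in rec:
                out[fname] = rec[fname]
            elif "default" in field:
                out[fname] = field["default"]
            else:
                raise KeyError("no value or default for field %r" % fname)
        yield out


def merged_stream(inputs, merged_schema):
    """Chain several record iterators without loading them into memory."""
    for records in inputs:
        yield from evolve(records, merged_schema)
```

Because `evolve` and `merged_stream` are generators, only one record is held at a time; a background job would read the schemas from the input files first, merge them, and then stream every record into the new file.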

Martin

On 1 Apr 2014, at 21:55, amit nanda <[EMAIL PROTECTED]> wrote:
 