You'd like the compile-time type-checking of specific, but the
run-time flexibility of generic, right?  Here's a way we might achieve
this.

Given the following schemas:

{"type":"enum", "name":"Color", "symbols":["RED", "GREEN", "BLUE"]}

{"type":"record", "name":"Shape", "fields":[
  {"name":"xPosition", "type":"int"},
  {"name":"yPosition", "type":"int"},
  {"name":"color", "type":"Color"}
  ]}

We might generate Java code like:

public class Shape extends GenericData.Record {
  public Shape(Schema schema) { super(schema); }
  public int getXPosition() { return ((Number)get("xPosition")).intValue(); }
  public int getYPosition() { return ((Number)get("yPosition")).intValue(); }
  public Color getColor() { return (Color)get("color"); }
}

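public class Color extends GenericData.EnumSymbol {
  // Assumed: the schema this class was generated from.
  private static final Schema SCHEMA = new Schema.Parser().parse(
    "{\"type\":\"enum\", \"name\":\"Color\", "
    + "\"symbols\":[\"RED\", \"GREEN\", \"BLUE\"]}");
  public Color(Schema schema, String label) {
    super(schema, label);
  }
  public static final Color RED = new Color(SCHEMA, "RED");
  public static final Color GREEN = new Color(SCHEMA, "GREEN");
  public static final Color BLUE = new Color(SCHEMA, "BLUE");
}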

If one reads data using the writer's schema into such classes, then
fields and enum symbols missing from the schema used for code
generation would be preserved in the generic representation.  For
example, you might have a filtering mapper that removes all red shapes:

public void map(Shape shape, ...) {
  if (!shape.getColor().equals(Color.RED)) {
    collect(shape);
  }
}

This would still function correctly without recompilation even if the
schema of the input data is very different, e.g., missing "xPosition"
and "yPosition", containing a new color, PURPLE, or a new field,
"region", etc.
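
To make the filtering example concrete, here is a minimal, self-contained
sketch.  It does not use Avro's classes: a plain Map stands in for
GenericData.Record, and FlexColor, FlexShape and FilterDemo are
hypothetical names.  It only illustrates why a filter compiled against
the original schema could keep working when the data has extra or
missing fields.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a generated enum class. Equality is by symbol
// name, so symbols unknown at compile time (e.g. PURPLE) still work.
final class FlexColor {
  static final FlexColor RED = new FlexColor("RED");
  private final String symbol;
  FlexColor(String symbol) { this.symbol = symbol; }
  @Override public boolean equals(Object o) {
    return o instanceof FlexColor && symbol.equals(((FlexColor) o).symbol);
  }
  @Override public int hashCode() { return symbol.hashCode(); }
  @Override public String toString() { return symbol; }
}

// Hypothetical stand-in for a generated record class: a typed view over
// a generic field map that keeps every field the writer supplied.
final class FlexShape {
  private final Map<String, Object> fields;
  FlexShape(Map<String, Object> fields) { this.fields = new LinkedHashMap<>(fields); }
  Object get(String name) { return fields.get(name); }
  FlexColor getColor() { return (FlexColor) get("color"); }
}

final class FilterDemo {
  // The filter as compiled against the original schema: drop red shapes.
  static List<FlexShape> removeRed(List<FlexShape> in) {
    List<FlexShape> out = new ArrayList<>();
    for (FlexShape s : in) {
      if (!FlexColor.RED.equals(s.getColor())) out.add(s);
    }
    return out;
  }

  // Data as a newer writer might produce it: no positions, a new
  // "region" field, and a color the compiled code has never seen.
  static List<FlexShape> sampleInput() {
    Map<String, Object> red = new LinkedHashMap<>();
    red.put("color", new FlexColor("RED"));
    red.put("region", "north");
    Map<String, Object> purple = new LinkedHashMap<>();
    purple.put("color", new FlexColor("PURPLE"));
    purple.put("region", "south");
    List<FlexShape> in = new ArrayList<>();
    in.add(new FlexShape(red));
    in.add(new FlexShape(purple));
    return in;
  }
}
```

Calling FilterDemo.removeRed(FilterDemo.sampleInput()) keeps only the
purple shape, and its "region" field is still reachable through get().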

I think Christophe Taton once requested something like this, to permit
one to preserve fields not in the schema used to generate the code
that's reading.  An interesting variation would read things using a
union of the writer's schema and the schema used for code generation,
so that missing fields are given default values.
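
The default-filling variation could look roughly like this.  Again this
is a sketch with plain maps rather than Avro schemas: DefaultsDemo,
withDefaults and the default values shown are assumptions made for
illustration, not anything Avro provides.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DefaultsDemo {
  // Combine the fields actually written with defaults taken from the schema
  // used for code generation. Written values win, writer-only fields are
  // kept, and fields the writer omitted fall back to their default.
  static Map<String, Object> withDefaults(Map<String, Object> written,
                                          Map<String, Object> codegenDefaults) {
    Map<String, Object> merged = new LinkedHashMap<>(codegenDefaults);
    merged.putAll(written);
    return merged;
  }

  // A writer that omitted "xPosition" and added "region".
  static Map<String, Object> example() {
    Map<String, Object> defaults = new LinkedHashMap<>();
    defaults.put("xPosition", 0);
    defaults.put("yPosition", 0);
    Map<String, Object> written = new LinkedHashMap<>();
    written.put("yPosition", 7);
    written.put("region", "north");
    return withDefaults(written, defaults);
  }
}
```

Here the merged record has three fields: "xPosition" from the default,
"yPosition" as written, and the extra "region" field preserved.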

The actual implementation should probably generate interfaces that
extend the GenericRecord and GenericEnumSymbol interfaces, with
private concrete implementations like the above, and a builder.  This
would permit greater flexibility and optimizations.  One could, e.g.,
when a builder is created, generate, compile and load optimized record
implementations so that little performance penalty is paid.
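
As a rough sketch of that shape (interface, hidden implementation,
builder), assuming nothing about Avro's real API: ShapeView, its Builder
and MapBackedShape are hypothetical names, and a Map stands in for the
generic record.  The implementation is package-private here, since Java
does not allow a private class nested in an interface.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical generated interface: compiled code depends only on this view.
interface ShapeView {
  int getXPosition();
  int getYPosition();
  Object get(String fieldName);  // generic access to any field, known or not

  static Builder newBuilder() { return new Builder(); }

  // The builder hides which concrete class is instantiated, so it could
  // later be swapped for an optimized, runtime-generated implementation.
  final class Builder {
    private final Map<String, Object> fields = new LinkedHashMap<>();
    public Builder set(String name, Object value) {
      fields.put(name, value);
      return this;
    }
    public ShapeView build() { return new MapBackedShape(fields); }
  }
}

// Concrete implementation that callers never name directly.
final class MapBackedShape implements ShapeView {
  private final Map<String, Object> fields;
  MapBackedShape(Map<String, Object> fields) { this.fields = new LinkedHashMap<>(fields); }
  @Override public Object get(String fieldName) { return fields.get(fieldName); }
  @Override public int getXPosition() { return ((Number) get("xPosition")).intValue(); }
  @Override public int getYPosition() { return ((Number) get("yPosition")).intValue(); }
}
```

Client code would then only ever write something like
ShapeView.newBuilder().set("xPosition", 3).set("yPosition", 4).build().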

The end result would be that compiled code would reference interfaces
that don't correspond exactly to the runtime data, but rather provide
a view on that data.  We might not alter specific, but instead add a
new FlexData, FlexDatumReader, etc., that builds on generic.

Thoughts?

Doug
On Sun, Jan 26, 2014 at 2:31 AM, Amihay Zer-Kavod <[EMAIL PROTECTED]> wrote:

 