Re: Efficient load for data with large number of columns
I did a STORE to see how Pig writes the schema in JSON, and then used that
output as a template to create a schema for the LOAD.

From my experiments with data that has three columns (int, chararray, float),
this appears to be the minimal schema:
{"fields":
  [
    {"name":"year","type":10},
    {"name":"name","type":55},
    {"name":"num","type":20}
  ]
}

Is there any literature on how to write proper JSON for schemas?

Thanks,
vkh

Sadly, there isn't. For a simple, flat schema it isn't hard: you just add one entry per field, with its name and the corresponding DataType constant:
http://pig.apache.org/docs/r0.10.0/api/constant-values.html#org.apache.pig.data.DataType.GENERIC_WRITABLECOMPARABLE
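
For instance, a quick untested sketch (the class name is just for illustration) using DataType.findTypeName to map the numeric codes from your schema back to their Pig type names:

import org.apache.pig.data.DataType;

public class TypeCodes {
    public static void main(String[] args) {
        // The codes used in the JSON schema above: 10, 55, 20.
        for (byte code : new byte[] {10, 55, 20}) {
            // findTypeName maps a DataType code back to its Pig name,
            // e.g. 10 -> "int", 55 -> "chararray", 20 -> "float".
            System.out.println(code + " -> " + DataType.findTypeName(code));
        }
    }
}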

For a more complex schema, it's easier to actually construct a ResourceSchema object and serialize it with Jackson:

http://pig.apache.org/docs/r0.10.0/api/index.html?org/apache/pig/ResourceSchema.html
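
A minimal sketch of that approach (untested; assumes Pig 0.10 on the classpath, which bundles Jackson 1.x under org.codehaus.jackson, and uses the schema-string helper in org.apache.pig.impl.util.Utils):

import org.apache.pig.ResourceSchema;
import org.apache.pig.impl.util.Utils;
import org.codehaus.jackson.map.ObjectMapper;

public class SchemaToJson {
    public static void main(String[] args) throws Exception {
        // Build the schema from Pig's schema-string syntax instead of
        // assembling field objects by hand...
        ResourceSchema schema = new ResourceSchema(
                Utils.getSchemaFromString("year:int, name:chararray, num:float"));

        // ...and let Jackson produce the JSON, so the field names and
        // numeric DataType codes come out consistent with what Pig expects.
        System.out.println(new ObjectMapper().writeValueAsString(schema));
    }
}

The same approach extends to nested schemas (bags, tuples, maps), which are painful to get right by hand in raw JSON.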

Regards,
Marcos