Pig, mail # user - Efficient load for data with large number of columns


Re: Efficient load for data with large number of columns
MARCOS MEDRADO RUBINELLI 2013-03-28, 10:50
I did a STORE to figure out how to write the schema in JSON, then used
that as a template to create a schema for load.

From my experiments, for data with three columns (int, chararray, float),
this is the minimal schema:
{"fields":
  [
    {"name":"year","type":10},
    {"name":"name","type":55},
    {"name":"num","type":20}
  ]
}

Is there any literature on how to write proper JSON for schemas?

Thanks,
vkh

Sadly, there isn't. For a simple, flat schema it isn't hard: you just add another field with its name and the corresponding DataType constant:
http://pig.apache.org/docs/r0.10.0/api/constant-values.html#org.apache.pig.data.DataType.GENERIC_WRITABLECOMPARABLE
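To illustrate the flat case, here is a small sketch that builds the same JSON schema programmatically. The codes 10 (INTEGER), 55 (CHARARRAY), and 20 (FLOAT) come from the example above; the other entries in the lookup table are assumptions taken from the DataType constant-values page linked here, so double-check them before relying on them:

```python
import json

# Numeric type codes from org.apache.pig.data.DataType.
# 10, 55, and 20 are confirmed by the schema in this thread;
# the remaining entries are assumptions from the constant-values page.
PIG_TYPE_CODES = {
    "int": 10,        # DataType.INTEGER (confirmed above)
    "float": 20,      # DataType.FLOAT (confirmed above)
    "chararray": 55,  # DataType.CHARARRAY (confirmed above)
    "long": 15,       # DataType.LONG (assumed)
    "double": 25,     # DataType.DOUBLE (assumed)
    "bytearray": 50,  # DataType.BYTEARRAY (assumed)
}

def flat_schema(columns):
    """Build a flat Pig JSON schema from (name, type-name) pairs."""
    return {
        "fields": [
            {"name": name, "type": PIG_TYPE_CODES[pig_type]}
            for name, pig_type in columns
        ]
    }

schema = flat_schema([("year", "int"), ("name", "chararray"), ("num", "float")])
print(json.dumps(schema, indent=2))
```

Adding another column is then just one more (name, type) pair in the list.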

For a more complex schema, it's easier to construct a ResourceSchema object and serialize it with Jackson:

http://pig.apache.org/docs/r0.10.0/api/index.html?org/apache/pig/ResourceSchema.html
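For a rough idea of why nested schemas get fiddly by hand, here is a guess at the serialized shape of a bag of tuples. The nested "schema" key and the codes 120 (BAG) and 110 (TUPLE) are assumptions taken from the DataType constant list, not verified output; serializing a real ResourceSchema as suggested above is the way to confirm the exact shape:

```python
import json

# Hypothetical serialized form of a nested schema: a bag named
# "records" containing (year:int, num:float) tuples. The "schema"
# nesting key and the codes 120 (DataType.BAG) and 110
# (DataType.TUPLE) are assumptions, not verified Pig output.
nested = {
    "fields": [
        {
            "name": "records",
            "type": 120,  # assumed DataType.BAG
            "schema": {
                "fields": [
                    {
                        "name": "t",
                        "type": 110,  # assumed DataType.TUPLE
                        "schema": {
                            "fields": [
                                {"name": "year", "type": 10},
                                {"name": "num", "type": 20},
                            ]
                        },
                    }
                ]
            },
        }
    ]
}
print(json.dumps(nested, indent=2))
```

Each level of nesting repeats the {"fields": [...]} structure, which is why letting ResourceSchema and Jackson produce it is less error-prone than writing it by hand.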

Regards,
Marcos