Pig, mail # user - possible to infer schema from TSV header?


Re: possible to infer schema from TSV header?
Bill Graham 2013-01-15, 23:17
Take a look at the org.apache.pig.builtin.PigStorage.getSchema(..) method.
You can subclass PigStorage and override that method to read the schema
from the first line of the file. Or you can just implement the LoadMetadata
interface in the loader you're using.
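
As a rough sketch of the header-parsing half of that idea, the plain-Java helper below turns a TSV header line into a Pig-style schema string, typing every column as chararray (Pig's string type). The class and method names (TsvHeaderSchema, columnNames, toPigSchema) are made up for illustration, and so are the name clean-up rules (non-alphanumeric characters become underscores, empty names become colN, names not starting with a letter get a c_ prefix). It deliberately has no Pig dependency, so wiring it into getSchema(..) is not shown.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Pig: derives field names from a TSV header line.
class TsvHeaderSchema {

    /** Splits a tab-separated header line into sanitized column names. */
    static List<String> columnNames(String headerLine) {
        List<String> names = new ArrayList<>();
        // limit -1 keeps trailing empty fields so positions line up with data rows
        String[] raw = headerLine.split("\t", -1);
        for (int i = 0; i < raw.length; i++) {
            // Assumed clean-up rule: anything outside [A-Za-z0-9_] becomes '_'
            String name = raw[i].trim().replaceAll("[^A-Za-z0-9_]", "_");
            if (name.isEmpty()) {
                name = "col" + i;          // placeholder for an unnamed column
            } else if (!Character.isLetter(name.charAt(0))) {
                name = "c_" + name;        // make the name start with a letter
            }
            names.add(name);
        }
        return names;
    }

    /** Builds "a:chararray, b:chararray, ..." from the header line. */
    static String toPigSchema(String headerLine) {
        StringBuilder sb = new StringBuilder();
        for (String name : columnNames(headerLine)) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(name).append(":chararray");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample header; prints the schema string derived from it.
        System.out.println(toPigSchema("id\tuser name\tscore"));
    }
}
```

Wrapped in parentheses, the resulting string has the shape the AS clause of a LOAD statement takes. Inside a PigStorage subclass you would still need to convert it into the schema object that getSchema(..) returns and skip the header row when reading tuples; I believe org.apache.pig.impl.util.Utils.getSchemaFromString(..) handles the conversion, but check that against your Pig version.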
On Tue, Jan 15, 2013 at 2:27 PM, Mason <[EMAIL PROTECTED]> wrote:

> Actually, I'll probably just end up computing positions to use, rather
> than pasting in a schema, but the general point is that I'd love to do
> it some other way, because little hacks like these make my data
> pipeline feel fragile.
>
> I'm willing to write some Java if anyone could point me in the right
> direction.
>
> -Mason
>
> On Tue, Jan 15, 2013 at 2:23 PM, Mason <[EMAIL PROTECTED]> wrote:
> > I have TSVs with a lot of columns, and I would like to address them by
> > name, as specified in the header line (first row), within Pig.
> >
> > The best I can come up with at the moment is to write a script that
> > strips the header line from the file and converts it to the form
> > (col1:chararray, col2:chararray, ...), then plug that schema string into
> > the AS portion of my LOAD statement. Then I'll project the columns I want
> > and manually typecast them.
> >
> > Is there a better, simple way?
> >
> > -Mason
>

--
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*