Pig, mail # user - Parsing variable schema


Prashant Kommireddi 2012-12-12, 07:47
Jonathan Coveney 2012-12-12, 18:07
Re: Parsing variable schema
Prashant Kommireddi 2012-12-13, 07:51
Let's say we have semi-structured logs in which the first column is always
the LogType (could be A, B, C, xyz...)

A,20120101,Ax8221Za,1233122
B,Ux231asd,20120101,

Each LogType has its own schema: note that the date appears at index 1
for LogType=A and at index 2 for LogType=B.

Is there a good way to handle a variable schema like this inside a
LoadFunc?
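
To make the idea concrete, here is a minimal sketch outside of Pig. It picks a field layout from the first column and maps the remaining columns by name. The class name and the field names field1, field2 and field3 are hypothetical; only the date positions come from the two sample lines above. In a LoadFunc the same lookup could in principle drive how getNext() builds each tuple, but that wiring is not shown here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class VariableSchemaParser {
    // Hypothetical per-LogType layouts; only the position of "date"
    // is taken from the sample log lines.
    private static final Map<String, List<String>> SCHEMAS = new LinkedHashMap<>();
    static {
        SCHEMAS.put("A", Arrays.asList("logType", "date", "field2", "field3"));
        SCHEMAS.put("B", Arrays.asList("logType", "field1", "date", "field3"));
    }

    /** Parses one comma-separated line using the schema selected by its first column. */
    public static Map<String, String> parse(String line) {
        // A limit of -1 keeps trailing empty fields, such as the empty last column of the B line.
        String[] parts = line.split(",", -1);
        List<String> schema = SCHEMAS.get(parts[0]);
        if (schema == null) {
            throw new IllegalArgumentException("No schema registered for LogType " + parts[0]);
        }
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < schema.size(); i++) {
            // Columns missing from a short line are stored as null.
            record.put(schema.get(i), i < parts.length ? parts[i] : null);
        }
        return record;
    }

    public static void main(String[] args) {
        System.out.println(parse("A,20120101,Ax8221Za,1233122").get("date"));
        System.out.println(parse("B,Ux231asd,20120101,").get("date"));
    }
}
```

Keeping the layouts in a map keyed by LogType means a new log type only needs a new entry, and an unknown type fails loudly instead of being parsed with the wrong schema.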

-Prashant
On Wed, Dec 12, 2012 at 11:37 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> I'm a little vague on what you want to do. Can you provide an example?
>
>
> 2012/12/11 Prashant Kommireddi <[EMAIL PROTECTED]>
>
> > Here is a snippet showing how the schema is applied to tuples:
> >
> > String serializedSchema = p.getProperty(signature + SCHEMA_FILE);
> > if (serializedSchema != null) {
> >     try {
> >         resourceSchema = new ResourceSchema(Utils.getSchemaFromString(serializedSchema));
> >     } catch (ParserException e) {
> >         mLog.error("Unable to parse serialized schema " + serializedSchema, e);
> >     }
> > }
> >
> >
> > Is there a good way to define multiple "serializedSchema" values that
> > could be applied to different types of tuples (different log lines)? I
> > am able to push this logic into a UDF that parses a record based on a
> > schema data structure I build within it. I am wondering whether this can
> > be done in the LoadFunc itself.
> >
> > Thanks,
> > Prashant
> >
>