Pig, mail # user - Declaring schema for unknown number of columns


Chan, Tim 2013-01-07, 22:19
Jinyuan Zhou 2013-01-07, 22:27
Chan, Tim 2013-01-08, 01:48
Re: Declaring schema for unknown number of columns
Jinyuan Zhou 2013-01-08, 02:48
Sorry, it looks like my suggestion won't help unless you are able to specify
the schema in the original LOAD statement. If the number of fields is ONLY
available at runtime, but every row has the same number of fields and you
know the position of the join key, then I have an ugly approach. First,
sample the first line to get the number of fields. Then write a UDF that
takes all fields of the data. Pass the number to the UDF and override the
method public Schema outputSchema(Schema input) to output a complete schema.
Your exec method would return a tuple with the same length as the input
tuple, converting each item to the data type you know. The resulting
relation should then have a valid schema. But I don't know how to pass the
number to the UDF efficiently. I hope someone has a better suggestion.
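
The sampling step above can be sketched outside Pig. This is a variation that
skips the UDF, not the approach described in the message itself: a small
Python helper counts the fields in the first line and builds the text of a
complete schema. The tab delimiter, the chararray type, and the function name
build_schema are assumptions for illustration; the column names follow the
ones used elsewhere in this thread.

```python
def build_schema(first_line, delimiter="\t",
                 key_fields=("ed_style_id", "sale_start_month"),
                 prefix="sale_month_", pig_type="chararray"):
    """Return a Pig schema string with one entry per field in first_line.

    Assumes every row has the same number of fields as the sampled line
    and that all fields share one type (pig_type is a placeholder).
    """
    # Count the fields in the sampled line.
    n_fields = len(first_line.rstrip("\n").split(delimiter))
    if n_fields < len(key_fields):
        raise ValueError("sampled line has fewer fields than key_fields")

    # The leading columns get fixed names, the rest are numbered months.
    n_months = n_fields - len(key_fields)
    names = list(key_fields) + [f"{prefix}{i}" for i in range(1, n_months + 1)]
    return ", ".join(f"{name}:{pig_type}" for name in names)


if __name__ == "__main__":
    sample = "101\t2012-07\t5\t8\t13\n"  # made-up sample row
    print(build_schema(sample))
```

The resulting string could then be handed to the script through Pig's
parameter substitution (for example with pig -param), with the LOAD statement
written as AS ($SCHEMA); the parameter name SCHEMA is hypothetical.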
Thanks,
On Mon, Jan 7, 2013 at 5:48 PM, Chan, Tim <[EMAIL PROTECTED]> wrote:

> Hi Jinyuan,
>
> Since I don't know how many columns I will have, I do something like this.
>
> six_month_and_variable_month_sales_2 = FOREACH
> six_month_and_variable_month_sales
>   GENERATE $0 AS ed_style_id,
>     $1 AS sale_start_month,
>     $2 AS sale_month_1,
>     $3 AS sale_month_2,
>     $4 AS sale_month_3,
>     $5 AS sale_month_4,
>     $6 AS sale_month_5,
>     $7 AS sale_month_6,
>     $8 ..;
>
> I still get the same error when I try to join on this relation.
>
>
>
>
> On Mon, Jan 7, 2013 at 2:27 PM, Jinyuan Zhou <[EMAIL PROTECTED]>
> wrote:
>
> > If you can load the data but the join operation needs the complete schema,
> > you can try a FOREACH ... GENERATE statement that projects your original
> > relation into one in which every field has a declared schema.
> >
> > On Mon, Jan 7, 2013 at 2:19 PM, Chan, Tim <[EMAIL PROTECTED]> wrote:
> >
> > > Is it possible to declare a schema when doing a LOAD for data in which
> > > you do not know the total number of columns?
> > >
> > > For instance. I know the data contains 6 or more columns. These columns
> > > are of the same data type.
> > >
> > > I basically want to join this data with another data set, but I was
> > > getting the following error:
> > >
> > > ERROR 1109: Input (six_month_and_variable_month_sales) on which outer
> > > join is desired should have a valid schema
> > >
> >
> >
> >
> > --
> > -- Jinyuan (Jack) Zhou
> >
>

--
-- Jinyuan (Jack) Zhou