Pig, mail # user - union


Keren Ouaknine 2013-07-25, 06:01
Mohammad Tariq 2013-07-25, 07:22
Mohammad Tariq 2013-07-25, 07:33
Re: union
Keren Ouaknine 2013-07-26, 18:46
Thanks, Tariq, for the explanations.
Once a single alias is associated with the union, I assume we can treat it as
one input.

Keren

On Thu, Jul 25, 2013 at 12:33 AM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> You could try something like this :
>
> A = load '/1.txt' using PigStorage(' ') as (x:int, y:chararray, z:chararray);
>
> B = load '/1_ext.txt' using PigStorage(' ') as (a:int, b:chararray, c:chararray);
>
> C = union A, B;
>
> D = group C by 1;
>
> E = foreach D generate flatten(C);
>
> store E into '/dir';
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Thu, Jul 25, 2013 at 12:52 PM, Mohammad Tariq <[EMAIL PROTECTED]>
> wrote:
>
> > Hello Keren,
> >
> > There is nothing wrong with this. One dataset in Hadoop is usually one
> > folder, not one file. Pig is doing what it is supposed to do and
> > performing a union of both files. You would have seen the contents of
> > both files together when running dump C.
> >
> > Since this is a map-only job and 2 mappers are generated, you get
> > 2 separate files, which together are actually one complete dataset. If you
> > want just one file, you need to force a reduce so that all the results
> > are collected in a single output file.
> >
> > HTH
> >
> > Warm Regards,
> > Tariq
> > cloudfront.blogspot.com
> >
> >
> > On Thu, Jul 25, 2013 at 11:31 AM, Keren Ouaknine <[EMAIL PROTECTED]> wrote:
> >
> >> Hi,
> >>
> >> According to Pig's documentation on union, two relations which have the same
> >> schema (the same length, and types that can be implicitly cast) can be
> >> concatenated (see http://pig.apache.org/docs/r0.11.1/basic.html#union).
> >>
> >> However, when I try with:
> >> A = load '1.txt' using PigStorage(' ') as (x:int, y:chararray, z:chararray);
> >> B = load '1_ext.txt' using PigStorage(' ') as (a:int, b:chararray, c:chararray);
> >> C = union A, B;
> >> describe C;
> >> DUMP C;
> >> store C into '/home/kereno/Documents/pig-0.11.1/workspace/res';
> >>
> >> with:
> >> ~/Documents/pig-0.11.1/workspace 130$ more 1.txt 1_ext.txt
> >> ::::::::::::::
> >> 1.txt
> >> ::::::::::::::
> >> 1 a aleph
> >> 2 b bet
> >> 3 g gimel
> >> ::::::::::::::
> >> 1_ext.txt
> >> ::::::::::::::
> >> 0 a alpha
> >> 0 b beta
> >> 0 g gimel
> >>
> >>
> >> The result I get is:
> >> ~/Documents/pig-0.11.1/workspace 0$ more res/part-m-0000*
> >> ::::::::::::::
> >> res/part-m-00000
> >> ::::::::::::::
> >> 0 a alpha
> >> 0 b beta
> >> 0 g gimel
> >> ::::::::::::::
> >> res/part-m-00001
> >> ::::::::::::::
> >> 1 a aleph
> >> 2 b bet
> >> 3 g gimel
> >>
> >> Whereas I was expecting something like
> >> 0 a alpha
> >> 0 b beta
> >> 0 g gimel
> >> 1 a aleph
> >> 2 b bet
> >> 3 g gimel
> >>
> >> [all together]
> >>
> >> I understand that two files would be generated for non-matching schemas,
> >> but why for a union with matching schemas?
> >>
> >> Thanks,
> >> Keren
> >>
> >> --
> >> Keren Ouaknine
> >> Web: www.kereno.com
> >>
> >
> >
>

--
Keren Ouaknine
Web: www.kereno.com
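
A follow-up sketch for anyone who only needs the result as one file: Keren's listing shows the output sitting in a local directory (res/), so the part files written by the original map-only script can be concatenated after the job finishes. This is a sketch under assumptions, not something posted in the thread: the two printf lines recreate the part files with tab-separated fields (PigStorage's default output delimiter), and merged.txt is a made-up output name.

```shell
# Assumed reconstruction of the two part files shown in the thread.
mkdir -p res
printf '0\ta\talpha\n0\tb\tbeta\n0\tg\tgimel\n' > res/part-m-00000
printf '1\ta\taleph\n2\tb\tbet\n3\tg\tgimel\n' > res/part-m-00001

# Concatenate every part file, in name order, into one file
# (merged.txt is a hypothetical name, kept outside res/).
cat res/part-m-* > merged.txt
```

For output stored on HDFS, `hadoop fs -getmerge <dir> <local file>` plays the same role. Neither approach changes how Pig sees the directory: loading res reads every part file in it as one dataset.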