Pig >> mail # user >> union


Thanks, Tariq, for the explanations.
Once there is a single alias associated with the union, I assume we can
treat it as one input.
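
For example (a minimal illustration, assuming the aliases A and B from the
scripts quoted below), everything downstream only ever refers to the single
alias C:

C = union A, B;
-- From here on, C is the one relation the rest of the script works with
G = group C all;
total = foreach G generate COUNT(C);
dump total;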

Keren

On Thu, Jul 25, 2013 at 12:33 AM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> You could try something like this:
>
> -- Load both inputs with matching schemas
> A = load '/1.txt' using PigStorage(' ') as (x:int, y:chararray, z:chararray);
>
> B = load '/1_ext.txt' using PigStorage(' ') as (a:int, b:chararray, c:chararray);
>
> C = union A, B;
>
> -- Grouping by a constant sends every record to a single reduce key
> D = group C by 1;
>
> -- Flatten the grouped bag back into individual records
> E = foreach D generate flatten(C);
>
> store E into '/dir';
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Thu, Jul 25, 2013 at 12:52 PM, Mohammad Tariq <[EMAIL PROTECTED]>
> wrote:
>
> > Hello Keren,
> >
> > There is nothing wrong here. One dataset in Hadoop is usually one
> > folder, not one file. Pig is doing what it is supposed to do and
> > performing a union on both files. You would have seen the content of
> > both files together when you did dump C.
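> >
> > As a quick check (a minimal sketch, assuming the res output directory
> > from your script below, written with the default tab-delimited
> > PigStorage), loading the output folder back picks up every part file
> > inside it:
> >
> > -- Loading a directory reads all the part-m-* files in it,
> > -- so the two files come back as one relation
> > whole = load 'res' as (x:int, y:chararray, z:chararray);
> > dump whole;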
> >
> > Since this is a map-only job and 2 mappers are generated, you get 2
> > separate files, which together are still one complete dataset. If you
> > want just one file, you need to force a reduce so that all the results
> > end up collectively in a single output file.
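> >
> > A minimal sketch of one way to force that reduce (assuming the aliases A
> > and B from your script below; GROUP ... ALL funnels every record through
> > a single reduce key, and 'res_single' is just a placeholder output path):
> >
> > C = union A, B;
> > -- GROUP ALL collects every record into one bag, which forces a reduce
> > D = group C all;
> > -- Flatten the bag back into individual records before storing
> > E = foreach D generate flatten(C);
> > store E into 'res_single';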
> >
> > HTH
> >
> > Warm Regards,
> > Tariq
> > cloudfront.blogspot.com
> >
> >
> > On Thu, Jul 25, 2013 at 11:31 AM, Keren Ouaknine <[EMAIL PROTECTED]>
> wrote:
> >
> >> Hi,
> >>
> >> According to Pig's documentation on union, two relations which have the
> >> same schema (the same length, and types that can be implicitly cast) can
> >> be concatenated (see http://pig.apache.org/docs/r0.11.1/basic.html#union).
> >>
> >> However, when I try with:
> >> A = load '1.txt' using PigStorage(' ') as (x:int, y:chararray, z:chararray);
> >> B = load '1_ext.txt' using PigStorage(' ') as (a:int, b:chararray, c:chararray);
> >> C = union A, B;
> >> describe C;
> >> DUMP C;
> >> store C into '/home/kereno/Documents/pig-0.11.1/workspace/res';
> >>
> >> with:
> >> ~/Documents/pig-0.11.1/workspace 130$ more 1.txt 1_ext.txt
> >> ::::::::::::::
> >> 1.txt
> >> ::::::::::::::
> >> 1 a aleph
> >> 2 b bet
> >> 3 g gimel
> >> ::::::::::::::
> >> 1_ext.txt
> >> ::::::::::::::
> >> 0 a alpha
> >> 0 b beta
> >> 0 g gimel
> >>
> >>
> >> I get as a result:
> >> ~/Documents/pig-0.11.1/workspace 0$ more res/part-m-0000*
> >> ::::::::::::::
> >> res/part-m-00000
> >> ::::::::::::::
> >> 0 a alpha
> >> 0 b beta
> >> 0 g gimel
> >> ::::::::::::::
> >> res/part-m-00001
> >> ::::::::::::::
> >> 1 a aleph
> >> 2 b bet
> >> 3 g gimel
> >>
> >> Whereas I was expecting something like
> >> 0 a alpha
> >> 0 b beta
> >> 0 g gimel
> >> 1 a aleph
> >> 2 b bet
> >> 3 g gimel
> >>
> >> [all together]
> >>
> >> I understand that two files would be generated for non-matching schemas,
> >> but why for a union with matching schemas?
> >>
> >> Thanks,
> >> Keren
> >>
> >> --
> >> Keren Ouaknine
> >> Web: www.kereno.com
> >>
> >
> >
>

--
Keren Ouaknine
Web: www.kereno.com