Pig >> mail # user >> union


Keren Ouaknine 2013-07-25, 06:01
Hello Keren,

There is nothing wrong here. A dataset in Hadoop is usually a folder rather
than a single file. Pig is doing what it is supposed to do and performing a
union of the two files. You would have seen the contents of both files
together when running DUMP C.

Since this is a map-only job and two mappers are generated, you get two
separate output files, which together make up one complete dataset. If you
want just one file, you need to force a reduce phase so that all the results
are collected in a single output file.
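
One way to force that reduce is sketched below. It is only a sketch under some assumptions: it reuses the two LOAD statements from your script, adds an ORDER BY (which needs a reduce phase) with PARALLEL 1 to ask for a single reducer, and stores into a placeholder output path 'res_single' that I made up for illustration. The field is referenced by position ($0) so the example does not depend on which field names survive the union.

```pig
-- Same inputs as in the original script
A = LOAD '1.txt'     USING PigStorage(' ') AS (x:int, y:chararray, z:chararray);
B = LOAD '1_ext.txt' USING PigStorage(' ') AS (a:int, b:chararray, c:chararray);

-- UNION on its own is map-only, so each mapper writes its own part file
C = UNION A, B;

-- ORDER BY requires a reduce phase; PARALLEL 1 requests one reducer,
-- so the sorted result should be written by a single reduce task
D = ORDER C BY $0 PARALLEL 1;

-- 'res_single' is a hypothetical output directory, not from the original script
STORE D INTO 'res_single';
```

The sort is there mainly to trigger the reduce; any operator that needs a reducer would do, and the output is still a folder, just with one part file in it.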

HTH

Warm Regards,
Tariq
cloudfront.blogspot.com
On Thu, Jul 25, 2013 at 11:31 AM, Keren Ouaknine <[EMAIL PROTECTED]> wrote:

> Hi,
>
> According to Pig's documentation on union, two relations which have the same
> schema (the same length, and types that can be implicitly cast) can be
> concatenated (see http://pig.apache.org/docs/r0.11.1/basic.html#union).
>
> However, when I try with:
> A = load '1.txt' using PigStorage(' ') as (x:int, y:chararray, z:chararray);
> B = load '1_ext.txt' using PigStorage(' ') as (a:int, b:chararray, c:chararray);
> C = union A, B;
> describe C;
> DUMP C;
> store C into '/home/kereno/Documents/pig-0.11.1/workspace/res';
>
> with:
> ~/Documents/pig-0.11.1/workspace 130$ more 1.txt 1_ext.txt
> ::::::::::::::
> 1.txt
> ::::::::::::::
> 1 a aleph
> 2 b bet
> 3 g gimel
> ::::::::::::::
> 1_ext.txt
> ::::::::::::::
> 0 a alpha
> 0 b beta
> 0 g gimel
>
>
> I get in result:
> ~/Documents/pig-0.11.1/workspace 0$ more res/part-m-0000*
> ::::::::::::::
> res/part-m-00000
> ::::::::::::::
> 0 a alpha
> 0 b beta
> 0 g gimel
> ::::::::::::::
> res/part-m-00001
> ::::::::::::::
> 1 a aleph
> 2 b bet
> 3 g gimel
>
> Whereas I was expecting something like
> 0 a alpha
> 0 b beta
> 0 g gimel
> 1 a aleph
> 2 b bet
> 3 g gimel
>
> [all together]
>
> I understand that two files would be generated for non-matching schemas, but
> why for a union with matching schemas?
>
> Thanks,
> Keren
>
> --
> Keren Ouaknine
> Web: www.kereno.com
>
+
Mohammad Tariq 2013-07-25, 07:33
+
Keren Ouaknine 2013-07-26, 18:46