
Pig >> mail # user >> union


You could try something like this:

A = load '/1.txt' using PigStorage(' ') as (x:int, y:chararray, z:chararray);
B = load '/1_ext.txt' using PigStorage(' ') as (a:int, b:chararray, c:chararray);
C = union A, B;

-- Grouping everything under a single constant key forces a reduce phase,
-- so the store below writes one part file instead of one per mapper.
D = group C by 1;
E = foreach D generate flatten(C);

store E into '/dir';
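If forcing a reduce is not convenient, the part files can also be merged after the job has run. A minimal sketch of the idea, using stand-in files and illustrative paths (substitute your actual output directory; for output on HDFS, `hadoop fs -getmerge /dir merged.txt` does the same thing in one step):

```shell
# Stand-ins for the part files a map-only job writes
# (these two files are illustrative, not produced by Pig here).
mkdir -p dir
printf '0 a alpha\n0 b beta\n0 g gimel\n' > dir/part-m-00000
printf '1 a aleph\n2 b bet\n3 g gimel\n' > dir/part-m-00001

# Concatenate every part file into a single output file.
cat dir/part-m-0000* > merged.txt
wc -l merged.txt   # 6 lines: both inputs together
```

Within Pig itself, another option is a total order with `order C by x parallel 1;` before the store, which likewise funnels everything through a single reducer (at the cost of a sort).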

Warm Regards,
Tariq
cloudfront.blogspot.com
On Thu, Jul 25, 2013 at 12:52 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> Hello Keren,
>
> There is nothing wrong with this. One dataset in Hadoop is usually one
> folder, not one file. Pig is doing what it is supposed to do and
> performing a union on both files. You would have seen the contents of
> both files together when you did dump C.
>
> Since this is a map-only job and 2 mappers are generated, you get 2
> separate files, which together form one complete dataset. If you want
> just one file, you need to force a reduce so that all the results land
> in a single output file.
>
> HTH
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Thu, Jul 25, 2013 at 11:31 AM, Keren Ouaknine <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> According to Pig's documentation on union, two relations that have the
>> same schema (the same length, and types that can be implicitly cast) can
>> be concatenated (see http://pig.apache.org/docs/r0.11.1/basic.html#union).
>>
>> However, when I try with:
>> A = load '1.txt' using PigStorage(' ') as (x:int, y:chararray, z:chararray);
>> B = load '1_ext.txt' using PigStorage(' ') as (a:int, b:chararray, c:chararray);
>> C = union A, B;
>> describe C;
>> DUMP C;
>> store C into '/home/kereno/Documents/pig-0.11.1/workspace/res';
>>
>> with:
>> ~/Documents/pig-0.11.1/workspace 130$ more 1.txt 1_ext.txt
>> ::::::::::::::
>> 1.txt
>> ::::::::::::::
>> 1 a aleph
>> 2 b bet
>> 3 g gimel
>> ::::::::::::::
>> 1_ext.txt
>> ::::::::::::::
>> 0 a alpha
>> 0 b beta
>> 0 g gimel
>>
>>
>> I get as a result:
>>
>> ~/Documents/pig-0.11.1/workspace 0$ more res/part-m-0000*
>> ::::::::::::::
>> res/part-m-00000
>> ::::::::::::::
>> 0 a alpha
>> 0 b beta
>> 0 g gimel
>> ::::::::::::::
>> res/part-m-00001
>> ::::::::::::::
>> 1 a aleph
>> 2 b bet
>> 3 g gimel
>>
>> Whereas I was expecting something like
>> 0 a alpha
>> 0 b beta
>> 0 g gimel
>> 1 a aleph
>> 2 b bet
>> 3 g gimel
>>
>> [all together]
>>
>> I understand that two files would be generated for non-matching schemas,
>> but why for a union with a matching schema?
>>
>> Thanks,
>> Keren
>>
>> --
>> Keren Ouaknine
>> Web: www.kereno.com
>>
>
>
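As noted in the replies above, the two part files together are the complete dataset; only the file boundaries differ from the expected single file. A quick local sanity check of that claim, recreating the thread's sample inputs under illustrative file names:

```shell
# Recreate the two sample inputs from the thread.
printf '1 a aleph\n2 b bet\n3 g gimel\n' > 1.txt
printf '0 a alpha\n0 b beta\n0 g gimel\n' > 1_ext.txt

# Stand-ins for the two map-output files Pig wrote.
cp 1_ext.txt part-m-00000
cp 1.txt     part-m-00001

# Sorted, the part files hold exactly the same rows as the two inputs.
cat part-m-* | sort > parts.sorted
cat 1.txt 1_ext.txt | sort > inputs.sorted
cmp -s parts.sorted inputs.sorted && echo "same dataset"
```

Only the row order and file layout differ; the union itself loses nothing.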