Pig, mail # user - union


Keren Ouaknine 2013-07-25, 06:01
Mohammad Tariq 2013-07-25, 07:22
Re: union
Mohammad Tariq 2013-07-25, 07:33
To force a reduce and get a single output file, you could try something like this:

A = load '/1.txt' using PigStorage(' ') as (x:int, y:chararray, z:chararray);

B = load '/1_ext.txt' using PigStorage(' ') as (a:int, b:chararray, c:chararray);

C = union A, B;

D = group C by 1;

E = foreach D generate flatten(C);

store E into '/dir';
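
If the order of the rows in the single file matters, sorting instead of grouping may be a useful variation. The following is an untested sketch rather than a definitive recipe. It assumes the same input files as the script above, refers to the first column positionally as $0 so that it does not depend on which field names the union keeps, and writes to '/dir_sorted', which is a made-up output path. ORDER BY needs a reduce phase and PARALLEL 1 requests a single reducer, so the store should leave one part file in the output directory.

```pig
-- Same inputs as the script above; paths and the space delimiter come from it.
A = load '/1.txt' using PigStorage(' ') as (x:int, y:chararray, z:chararray);
B = load '/1_ext.txt' using PigStorage(' ') as (a:int, b:chararray, c:chararray);

C = union A, B;

-- ORDER BY forces a reduce phase; PARALLEL 1 requests a single reducer,
-- so all rows land in one output file, sorted on the first column.
D = order C by $0 parallel 1;

-- '/dir_sorted' is a hypothetical output path, used only for illustration.
store D into '/dir_sorted' using PigStorage(' ');
```

Sending everything through one reducer is fine for small inputs like these, but it does not scale to large datasets. There it is usually better to keep the part files and treat the output directory as the dataset.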

Warm Regards,
Tariq
cloudfront.blogspot.com
On Thu, Jul 25, 2013 at 12:52 PM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:

> Hello Keren,
>
> There is nothing wrong with this. One dataset in Hadoop is usually one
> folder, not one file. Pig is doing what it is supposed to do and
> performing a union of both files. You would have seen the contents of
> both files together when doing dump C.
>
> Since this is a map-only job and 2 mappers are generated, you get
> 2 separate files, which together are one complete dataset. If you
> want just one file, you need to force a reduce so that all the results
> are collected in a single output file.
>
> HTH
>
> Warm Regards,
> Tariq
> cloudfront.blogspot.com
>
>
> On Thu, Jul 25, 2013 at 11:31 AM, Keren Ouaknine <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> According to Pig's documentation on union, two relations which have the same
>> schema (the same length, and types that can be implicitly cast) can be
>> concatenated (see http://pig.apache.org/docs/r0.11.1/basic.html#union).
>>
>> However, when I try with:
>> A = load '1.txt'          using PigStorage(' ')  as (x:int, y:chararray,
>> z:chararray);
>> B = load '1_ext.txt'  using PigStorage(' ')  as (a:int, b:chararray,
>> c:chararray);
>> C = union A, B;
>> describe C;
>> DUMP C;
>> store C into '/home/kereno/Documents/pig-0.11.1/workspace/res';
>>
>> with:
>> ~/Documents/pig-0.11.1/workspace 130$ more 1.txt 1_ext.txt
>> ::::::::::::::
>> 1.txt
>> ::::::::::::::
>> 1 a aleph
>> 2 b bet
>> 3 g gimel
>> ::::::::::::::
>> 1_ext.txt
>> ::::::::::::::
>> 0 a alpha
>> 0 b beta
>> 0 g gimel
>>
>>
>> As a result I get:
>> ~/Documents/pig-0.11.1/workspace 0$ more res/part-m-0000*
>> ::::::::::::::
>> res/part-m-00000
>> ::::::::::::::
>> 0 a alpha
>> 0 b beta
>> 0 g gimel
>> ::::::::::::::
>> res/part-m-00001
>> ::::::::::::::
>> 1 a aleph
>> 2 b bet
>> 3 g gimel
>>
>> Whereas I was expecting something like
>> 0 a alpha
>> 0 b beta
>> 0 g gimel
>> 1 a aleph
>> 2 b bet
>> 3 g gimel
>>
>> [all together]
>>
>> I understand that two files would be generated for non-matching schemas,
>> but why for a union with matching schemas?
>>
>> Thanks,
>> Keren
>>
>> --
>> Keren Ouaknine
>> Web: www.kereno.com
>>
>
>
Keren Ouaknine 2013-07-26, 18:46