Pig >> mail # user >> comparing two files using pig


Re: comparing two files using pig
I did not read your original post clearly enough. I didn't realize both
the d AND the q had to match. It's only slightly more complex, just add
the d column to the cogroup statement and sum the number of matches:

A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by (q,d), B by (q,d)) {
            num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
            generate
              flatten(group) as (q,d),
              num_matches    as num_matches;
          };

all_matches = foreach (group counts by q) generate group as q,
SUM(counts.num_matches) as total_matches;

dump all_matches;

(q1,2)
(q2,0)
(q3,0)
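
For anyone who wants to sanity-check the logic outside Pig, here is a minimal plain-Python sketch of the same idea. It is not Pig and not part of the script above; the function name `count_matches` and the in-memory lists are made up for illustration. It assumes the rows are already parsed into (q, d) tuples, and it uses file2.txt with the extra `q2 d4` row from Barclay's message:

```python
from collections import Counter

def count_matches(rows_a, rows_b):
    """For each q, sum min(count in A, count in B) over every (q, d) pair."""
    a_counts = Counter(rows_a)
    b_counts = Counter(rows_b)
    totals = {}
    # Taking the union of keys mimics the cogroup: a (q, d) pair that
    # appears in only one file still produces a group, with a count of 0
    # on the other side.
    for q, d in a_counts.keys() | b_counts.keys():
        totals[q] = totals.get(q, 0) + min(a_counts[(q, d)], b_counts[(q, d)])
    return totals

# Hypothetical in-memory copies of the two files from this thread.
file1 = [("q1", "d1"), ("q1", "d2"), ("q2", "d3"), ("q2", "d1")]
file2 = [("q1", "d1"), ("q1", "d2"), ("q3", "d3"), ("q2", "d4")]

result = count_matches(file1, file2)
for q in sorted(result):
    print((q, result[q]))
```

With this data the sketch gives 2 for q1 and 0 for both q2 and q3, the same totals as the dump above. The extra `q2 d4` row no longer counts as a match because it is compared on both columns.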

--jacob
@thedatachef

On 06/20/2013 02:06 PM, Barclay Dunn wrote:
> Jacob,
>
> If I run that code with an added row in file2.txt, e.g.,
>
>  $ cat file2.txt
> q1 d1
> q1 d2
> q3 d3
> q2 d4
>
> This gives me mistaken results, i.e.,
>
> (q1,2)
> (q2,1)
> (q3,0)
>
>
> I am new at this so I apologize for the ponderous pace of the
> following. It can no doubt be shortened. But it gets the correct
> results with either data set.
>
> set io.sort.mb 10;         -- avoid java.lang.OutOfMemoryError: Java heap space (execmode: -x local)
>
> A = LOAD '../../../input/file1.txt' using PigStorage(' ') as
> (aa:chararray, ab:chararray);
> B = LOAD '../../../input/file2.txt' using PigStorage(' ') as
> (ba:chararray, bb:chararray);
>
> C = UNION A, B;
> D = COGROUP C by ($0, $1);
>
> F = FOREACH D GENERATE FLATTEN($0), COUNT($1);
>
> G0 = FILTER F BY $2 > 1;   -- any that match
> G1 = FILTER F BY $2 < 2;   -- any that don't match
>
> H0 = GROUP G0 BY $0;
> H1 = GROUP G1 BY $0;
>
>
> J0 = FOREACH H0 GENERATE $0, COUNT($1);
> J1 = FOREACH H1 GENERATE $0, 0;
>
> K = UNION J0, J1;
>
> DUMP K;
> /*
> (q2,0)
> (q3,0)
> (q1,2)
> */
>
>
> Barclay Dunn
>
>
> On 6/20/13 10:07 AM, Jacob Perkins wrote:
>> Hi,
>>
>> This should just be a simple cogroup.
>>
>> A = load 'file1.txt' as (q:chararray, d:chararray);
>> B = load 'file2.txt' as (q:chararray, d:chararray);
>>
>> counts = foreach (cogroup A by q, B by q) {
>>                  num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
>>                  generate
>>                    group       as q,
>>                    num_matches as num_matches;
>>               };
>>
>> dump counts;
>>
>> (q1,2)
>> (q2,0)
>> (q3,0)
>>
>> --jacob
>> @thedatachef
>>
>> On Jun 20, 2013, at 4:00 AM, Siddhi Borkar wrote:
>>
>>> Hi,
>>>
>>> I have a problem in which I have to compare two files and get the count of matching attributes.
>>>
>>> For ex:
>>> File 1:  file1.txt
>>>
>>> q1           d1
>>> q1           d2
>>> q2           d3
>>> q2           d1
>>>
>>> File 2: file2.txt
>>> q1           d1
>>> q1           d2
>>> q3           d3
>>>
>>> Now what I need is, for each distinct q, the count of matching d's.
>>>
>>> For ex, the output should be
>>> q1           2 (q1 d1 and q1 d2 match in both files, hence the count is 2)
>>> q2           0 (has no d's matching)
>>> q3           0
>>>
>>> Any idea how this can be achieved?
>>>
>>> Thanks in advance
>>>
>>> -Sid