Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig, mail # user - comparing two files using pig


Copy link to this message
-
Re: comparing two files using pig
Jacob Perkins 2013-06-21, 13:38
Now here's where it gets fun :)

First, I do want to show you that (given sufficient coffee) there is a set theoretic approach to your first question that allows you to solve it with just one map-reduce job (a single cogroup) and not two (a cogroup followed by a group). Consider two sets, A and B where |A| is the number of elements in A and |B| is the number of elements in B.

Let |AUB| be the size of the set union of A and B. Note, Pig does not have a set union operator. The UNION operator in Pig is a misnomer. Plus, you cant use it in a nested projection which is frustrating...
Let |A^B| be the size of the set intersection of A and B. (The number of elements that are in BOTH A and B.

What you're technically after is |A^B|. However, since Pig does not have a set intersection operator, and I'm assuming writing a UDF is out of the question for you, we can be a bit more clever. As it turns out Pig has a DIFF operator. It takes two bags (basically sets although duplicate elements are allowed) and returns all the elements that are in either bag but NOT in both. Notice:

|AUB| = |A^B| + |DIFF(A,B)| and
|AUB| = |A| + |B| - |A^B| therefor

|A^B| = 1/2*( |A| + |B| - |DIFF(A,B)| )

All of which we can compute with native Pig :)

So:

A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by q, B by q) {
           a_size     = COUNT(A);         -- |A|
           b_size     = COUNT(B);         -- |B|
           diff_size  = COUNT(DIFF(A,B)); -- |DIFF(A,B)
           match_size = (a_size + b_size - diff_size)/2l; -- 1/2*(|A| + |B| - |DIFF(A,B)|) = |A intersect B|
           generate
             group as q,
             match_size;
         };

dump counts;

Alright, back to your other issue of adding the matching elements. Again, if you were up for it, you could simply write a set intersection udf and be done with it. Otherwise, here's what I came up with:
A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by (q,d), B by (q,d)) {
            num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
            generate
              flatten(group) as (q,d),
              num_matches    as num_matches;
          };

all_matches = foreach (group counts by q) {
                match_set = filter counts by num_matches > 0;
                match_set = match_set.d;
                generate
                  group as q,
                  SUM(counts.num_matches) as total_matches,
                  match_set as match_set;
              };
              

dump all_matches;

(q1,2,{(d1),(d2)})
(q2,0,{})
(q3,0,{})

The empty curly braces indicate bags that contain no tuples.

--jacob
@thedatachef

On Jun 21, 2013, at 6:14 AM, Siddhi Borkar wrote:

> Thanks a lot the solution worked fine. Is it possible also to display the comma separated matching d's?
>
> For ex
> (q1,2, {d1,d2})
> (q2,0)
> (q3,0)
>
> -----Original Message-----
> From: Chris Hokamp [mailto:[EMAIL PROTECTED]]
> Sent: Friday, June 21, 2013 1:52 AM
> To: [EMAIL PROTECTED]; Barclay Dunn
> Subject: Re: comparing two files using pig
>
> Z
>
>
> Sent from Samsung Mobile
>
> -------- Original message --------
> From: Jacob Perkins <[EMAIL PROTECTED]>
> Date: 20/06/2013  20:30  (GMT+00:00)
> To: Barclay Dunn <[EMAIL PROTECTED]>
> Cc: [EMAIL PROTECTED]
> Subject: Re: comparing two files using pig
>
> I did not read you original post clearly enough. I didn't realize both the d AND the q had to match. It's only slightly more complex, just add the d column to the cogroup statement and sum the number of matches:
>
> A = load 'file1.txt' as (q:chararray, d:chararray); B = load 'file2.txt' as (q:chararray, d:chararray);
>
> counts = foreach (cogroup A by (q,d), B by (q,d)) {
>             num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
>             generate
>               flatten(group) as (q,d),
>               num_matches    as num_matches;