Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
Pig >> mail # user >> Re: comparing two files using pig


+
Chris Hokamp 2013-06-20, 20:22
+
Siddhi Borkar 2013-06-21, 11:14
+
Jacob Perkins 2013-06-21, 13:38
Copy link to this message
-
Re: comparing two files using pig
The introductory theory of this is awesome. "A+++++ would read again" ;)

On 6/21/13 9:38 AM, Jacob Perkins wrote:
> Now here's where it gets fun :)
>
> First, I do want to show you that (given sufficient coffee) there is a set theoretic approach to your first question that allows you to solve it with just one map-reduce job (a single cogroup) and not two (a cogroup followed by a group). Consider two sets, A and B where |A| is the number of elements in A and |B| is the number of elements in B.
>
> Let |AUB| be the size of the set union of A and B. Note, Pig does not have a set union operator. The UNION operator in Pig is a misnomer. Plus, you cant use it in a nested projection which is frustrating...
> Let |A^B| be the size of the set intersection of A and B. (The number of elements that are in BOTH A and B.
>
> What you're technically after is |A^B|. However, since Pig does not have a set intersection operator, and I'm assuming writing a UDF is out of the question for you, we can be a bit more clever. As it turns out Pig has a DIFF operator. It takes two bags (basically sets although duplicate elements are allowed) and returns all the elements that are in either bag but NOT in both. Notice:
>
> |AUB| = |A^B| + |DIFF(A,B)| and
> |AUB| = |A| + |B| - |A^B| therefor
>
> |A^B| = 1/2*( |A| + |B| - |DIFF(A,B)| )
>
> All of which we can compute with native Pig :)
>
> So:
>
> A = load 'file1.txt' as (q:chararray, d:chararray);
> B = load 'file2.txt' as (q:chararray, d:chararray);
>
> counts = foreach (cogroup A by q, B by q) {
>             a_size     = COUNT(A);         -- |A|
>             b_size     = COUNT(B);         -- |B|
>             diff_size  = COUNT(DIFF(A,B)); -- |DIFF(A,B)
>             match_size = (a_size + b_size - diff_size)/2l; -- 1/2*(|A| + |B| - |DIFF(A,B)|) = |A intersect B|
>             generate
>               group as q,
>               match_size;
>           };
>
> dump counts;
>
>
>
> Alright, back to your other issue of adding the matching elements. Again, if you were up for it, you could simply write a set intersection udf and be done with it. Otherwise, here's what I came up with:
>
>
> A = load 'file1.txt' as (q:chararray, d:chararray);
> B = load 'file2.txt' as (q:chararray, d:chararray);
>
> counts = foreach (cogroup A by (q,d), B by (q,d)) {
>              num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
>              generate
>                flatten(group) as (q,d),
>                num_matches    as num_matches;
>            };
>
> all_matches = foreach (group counts by q) {
>                  match_set = filter counts by num_matches > 0;
>                  match_set = match_set.d;
>                  generate
>                    group as q,
>                    SUM(counts.num_matches) as total_matches,
>                    match_set as match_set;
>                };
>                
>
> dump all_matches;
>
> (q1,2,{(d1),(d2)})
> (q2,0,{})
> (q3,0,{})
>
> The empty curly braces indicate bags that contain no tuples.
>
> --jacob
> @thedatachef
>
> On Jun 21, 2013, at 6:14 AM, Siddhi Borkar wrote:
>
>> Thanks a lot the solution worked fine. Is it possible also to display the comma separated matching d's?
>>
>> For ex
>> (q1,2, {d1,d2})
>> (q2,0)
>> (q3,0)
>>
>> -----Original Message-----
>> From: Chris Hokamp [mailto:[EMAIL PROTECTED]]
>> Sent: Friday, June 21, 2013 1:52 AM
>> To: [EMAIL PROTECTED]; Barclay Dunn
>> Subject: Re: comparing two files using pig
>>
>> Z
>>
>>
>> Sent from Samsung Mobile
>>
>> -------- Original message --------
>> From: Jacob Perkins <[EMAIL PROTECTED]>
>> Date: 20/06/2013  20:30  (GMT+00:00)
>> To: Barclay Dunn <[EMAIL PROTECTED]>
>> Cc: [EMAIL PROTECTED]
>> Subject: Re: comparing two files using pig
>>
>> I did not read you original post clearly enough. I didn't realize both the d AND the q had to match. It's only slightly more complex, just add the d column to the cogroup statement and sum the number of matches:
>>
>> A = load 'file1.txt' as (q:chararray, d:chararray); B = load 'file2.txt' as (q:chararray, d:chararray);
+
Siddhi Borkar 2013-06-20, 09:00
+
Jacob Perkins 2013-06-20, 14:07
+
Barclay Dunn 2013-06-20, 19:06
+
Jacob Perkins 2013-06-20, 19:30