Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Pig >> mail # user >> comparing two files using pig


Copy link to this message
-
Re: comparing two files using pig
The introductory theory of this is awesome. "A+++++ would read again" ;)

On 6/21/13 9:38 AM, Jacob Perkins wrote:
> Now here's where it gets fun :)
>
> First, I do want to show you that (given sufficient coffee) there is a set theoretic approach to your first question that allows you to solve it with just one map-reduce job (a single cogroup) and not two (a cogroup followed by a group). Consider two sets, A and B where |A| is the number of elements in A and |B| is the number of elements in B.
>
> Let |AUB| be the size of the set union of A and B. Note, Pig does not have a set union operator. The UNION operator in Pig is a misnomer. Plus, you cant use it in a nested projection which is frustrating...
> Let |A^B| be the size of the set intersection of A and B. (The number of elements that are in BOTH A and B.
>
> What you're technically after is |A^B|. However, since Pig does not have a set intersection operator, and I'm assuming writing a UDF is out of the question for you, we can be a bit more clever. As it turns out Pig has a DIFF operator. It takes two bags (basically sets although duplicate elements are allowed) and returns all the elements that are in either bag but NOT in both. Notice:
>
> |AUB| = |A^B| + |DIFF(A,B)| and
> |AUB| = |A| + |B| - |A^B| therefor
>
> |A^B| = 1/2*( |A| + |B| - |DIFF(A,B)| )
>
> All of which we can compute with native Pig :)
>
> So:
>
> A = load 'file1.txt' as (q:chararray, d:chararray);
> B = load 'file2.txt' as (q:chararray, d:chararray);
>
> counts = foreach (cogroup A by q, B by q) {
>             a_size     = COUNT(A);         -- |A|
>             b_size     = COUNT(B);         -- |B|
>             diff_size  = COUNT(DIFF(A,B)); -- |DIFF(A,B)
>             match_size = (a_size + b_size - diff_size)/2l; -- 1/2*(|A| + |B| - |DIFF(A,B)|) = |A intersect B|
>             generate
>               group as q,
>               match_size;
>           };
>
> dump counts;
>
>
>
> Alright, back to your other issue of adding the matching elements. Again, if you were up for it, you could simply write a set intersection udf and be done with it. Otherwise, here's what I came up with:
>
>
> A = load 'file1.txt' as (q:chararray, d:chararray);
> B = load 'file2.txt' as (q:chararray, d:chararray);
>
> counts = foreach (cogroup A by (q,d), B by (q,d)) {
>              num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
>              generate
>                flatten(group) as (q,d),
>                num_matches    as num_matches;
>            };
>
> all_matches = foreach (group counts by q) {
>                  match_set = filter counts by num_matches > 0;
>                  match_set = match_set.d;
>                  generate
>                    group as q,
>                    SUM(counts.num_matches) as total_matches,
>                    match_set as match_set;
>                };
>                
>
> dump all_matches;
>
> (q1,2,{(d1),(d2)})
> (q2,0,{})
> (q3,0,{})
>
> The empty curly braces indicate bags that contain no tuples.
>
> --jacob
> @thedatachef
>
> On Jun 21, 2013, at 6:14 AM, Siddhi Borkar wrote:
>
>> Thanks a lot the solution worked fine. Is it possible also to display the comma separated matching d's?
>>
>> For ex
>> (q1,2, {d1,d2})
>> (q2,0)
>> (q3,0)
>>
>> -----Original Message-----
>> From: Chris Hokamp [mailto:[EMAIL PROTECTED]]
>> Sent: Friday, June 21, 2013 1:52 AM
>> To: [EMAIL PROTECTED]; Barclay Dunn
>> Subject: Re: comparing two files using pig
>>
>> Z
>>
>>
>> Sent from Samsung Mobile
>>
>> -------- Original message --------
>> From: Jacob Perkins <[EMAIL PROTECTED]>
>> Date: 20/06/2013  20:30  (GMT+00:00)
>> To: Barclay Dunn <[EMAIL PROTECTED]>
>> Cc: [EMAIL PROTECTED]
>> Subject: Re: comparing two files using pig
>>
>> I did not read you original post clearly enough. I didn't realize both the d AND the q had to match. It's only slightly more complex, just add the d column to the cogroup statement and sum the number of matches:
>>
>> A = load 'file1.txt' as (q:chararray, d:chararray); B = load 'file2.txt' as (q:chararray, d:chararray);
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB