Pig >> mail # user >> Re: comparing two files using pig


Chris Hokamp 2013-06-20, 20:22
Siddhi Borkar 2013-06-21, 11:14
Jacob Perkins 2013-06-21, 13:38
Barclay Dunn 2013-06-21, 13:44
Siddhi Borkar 2013-06-20, 09:00
Re: comparing two files using pig
Hi,

This should just be a simple cogroup.

-- Load both files; PigStorage splits fields on tabs by default.
A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

-- For each q, take the smaller of its row counts in A and B.
counts = foreach (cogroup A by q, B by q) {
           num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
           generate
             group       as q,
             num_matches as num_matches;
         };

dump counts;

(q1,2)
(q2,0)
(q3,0)
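
One caveat: the cogroup above keys on q alone, so num_matches is the smaller of the two per-file row counts for each q, and the d values themselves are never compared. That gives the expected result on the sample, but it would over-count if, say, q1 had different d's in each file. A minimal sketch that compares d as well, assuming tab-separated input (the relation names pairs and matched are only illustrative, and this has not been run against a cluster):

```pig
A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

-- One row per distinct (q, d) pair seen in either file;
-- matched is 1 only when the pair occurs in both A and B.
pairs = foreach (cogroup A by (q, d), B by (q, d)) generate
          flatten(group) as (q, d),
          ((IsEmpty(A) or IsEmpty(B)) ? 0 : 1) as matched;

-- Sum the flags per q, so a q with no matching d still shows up with 0.
counts = foreach (group pairs by q) generate
           group              as q,
           SUM(pairs.matched) as num_matches;

dump counts;
```

Because the cogroup yields one group per distinct (q, d) key, repeated rows within a file should not inflate the count; each matching pair contributes at most 1.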

--jacob
@thedatachef

On Jun 20, 2013, at 4:00 AM, Siddhi Borkar wrote:

> Hi,
>
> I need to compare two files and get the count of matching attributes.
>
> For example:
> File 1:  file1.txt
>
> q1           d1
> q1           d2
> q2           d3
> q2           d1
>
> File 2: file2.txt
> q1           d1
> q1           d2
> q3           d3
>
> What I need, for each distinct q, is the count of d values that match in both files.
>
> For example, the output should be
> q1           2  (q1 d1 and q1 d2 appear in both files, so the count is 2)
> q2           0  (no matching d's)
> q3           0
>
> Any idea how this can be achieved?
>
> Thanks in advance
>
> -Sid
Barclay Dunn 2013-06-20, 19:06
Jacob Perkins 2013-06-20, 19:30