Re: comparing two files using pig

This should just be a simple cogroup.

A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

counts = foreach (cogroup A by q, B by q) {
                -- smaller of the two per-key row counts
                num_matches = MIN(TOBAG(COUNT(A), COUNT(B)));
                generate
                  group       as q,
                  num_matches as num_matches;
}

dump counts;
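
Note that the cogroup above only compares how many rows each file has for a given q; it does not check that the d values themselves agree. If the d values need to match as well, one possible variant is to cogroup on the (q, d) pair and then sum a 0/1 flag per q. This is only a sketch: the aliases pairs, matched and pair_counts are made up for illustration, and it assumes tab-delimited input with no duplicate (q, d) rows inside either file.

```pig
A = load 'file1.txt' as (q:chararray, d:chararray);
B = load 'file2.txt' as (q:chararray, d:chararray);

-- One row per distinct (q, d) pair seen in either file; the flag is 1
-- only when the pair occurs in both files. group.$0 is the q part of
-- the (q, d) group key.
pairs = foreach (cogroup A by (q, d), B by (q, d)) generate
            group.$0 as q,
            ((IsEmpty(A) or IsEmpty(B)) ? 0 : 1) as matched;

-- Sum the flags per q, so a q with no matching d still shows up with 0.
pair_counts = foreach (group pairs by q) generate
            group as q,
            SUM(pairs.matched) as num_matches;

dump pair_counts;
```

Because every q from either file survives the first cogroup, keys such as q2 and q3 in the sample data should still appear in the output with a count of 0.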



On Jun 20, 2013, at 4:00 AM, Siddhi Borkar wrote:

> Hi,
> I have a problem where I have to compare two files and get the count of matching attributes.
> For example:
> File 1:  file1.txt
> q1           d1
> q1           d2
> q2           d3
> q2           d1
> File 2: file2.txt
> q1           d1
> q1           d2
> q3           d3
> Now what I need, for each distinct q, is the count of matching d's.
> For example, the output should be:
> q1           2  (q1 d1 and q1 d2 match in both files, hence the count is 2)
> q2           0  (has no matching d's)
> q3           0
> Any idea how this can be achieved?
> Thanks in advance
> -Sid