Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce, mail # user - Dataset comparison and ranking - views


Copy link to this message
-
Re: Dataset comparison and ranking - views
Marcos Ortiz 2011-03-08, 15:56
On Tue, 2011-03-08 at 10:51 +0530, Sonal Goyal wrote:
> Hi Marcos,
>
> Thanks for replying. I think I was not very clear in my last post. Let
> me describe my use case in detail.
>
> I have two datasets coming from different sources, lets call them
> dataset1 and dataset2. Both of them contain records for entities, say
> Person. A single record looks like:
>
> First Name Last Name,  Street, City, State,Zip
>
> We want to compare each record of dataset1 with each record of
> dataset2, in effect a cross join.
>
> We know that the way data is collected, names will not match exactly,
> but we want to find close enoughs. So we have a rule which says create
> bigrams and find the matching bigrams. If 0 to 5 match, give a score
> of 10, if 5-15 match, give a score of 20 and so on.
Well, a approach for this problem has a solution given by Milind
Bhandarkar, on his presentation called "Practical Problem Solving with
Hadoop and Pig".
He talk about a solution for Bigrams giving a example with word
matching.
Bigrams
=======
Input: A large text corpus
• Output: List(word , Top (word ))
• Two Stages:
• Generate all possible bigrams
• Find most frequent K bigrams for each word

Bigrams: Stage 1
Map
==• Generate all possible Bigrams
• Map Input: Large text corpus
• Map computation
• In each sentence, or each “word word ”
• Output (word , word ), (word , word )
• Partition & Sort by (word , word )

pairs.pl
--------
while(<STDIN>) {
chomp;
$_ =~ s/[^a-zA-Z]+/ /g ;
$_ =~ s/^\s+//g ;
$_ =~ s/\s+$//g ;
$_ =~ tr/A-Z/a-z/;
my @words = split(/\s+/, $_);
for (my $i = 0; $i < $#words - 1; ++$i) {
print "$words[$i]:$words[$i+1]\n";
print "$words[$i+1]:$words[$i]\n";
}
}

Bigrams: Stage 1
Reduce
=====• Input: List(word , word ) sorted and partitioned
• Output: List(word , [freq, word ])
• Counting similar to Unigrams example

count.pl
--------

$_ = <STDIN>; chomp;
my ($pw1, $pw2) = split(/:/, $_);
$count = 1;
while(<STDIN>) {
chomp;
my ($w1, $w2) = split(/:/, $_);
if ($w1 eq $pw1 && $w2 eq $pw2) {
$count++;
} else {
print "$pw1:$count:$pw2\n";
$pw1 = $w1;
$pw2 = $w2;
$count = 1;
}
}
print "$pw1:$count:$pw2\n";

Bigrams: Stage 2
Map
==• Input: List(word , [freq,word ])
• Output: List(word , [freq, word ])
• Identity Mapper (/bin/cat)
• Partition by word
• Sort descending by (word , freq)

Bigrams: Stage 2
Reduce
=====• Input: List(word , [freq,word ])
• partitioned by word
• sorted descending by (word , freq)
• Output: Top (List(word , [freq, word ]))
• For each word, throw away after K records

firstN.pl
$N = 5;
$_ = <STDIN>; chomp;
my ($pw1, $count, $pw2) = split(/:/, $_);
$idx = 1;
$out = "$pw1\t$pw2,$count;";
while(<STDIN>) {
chomp;
my ($w1, $c, $w2) = split(/:/, $_);
if ($w1 eq $pw1) {
if ($idx < $N) {
$out .= "$w2,$c;";
$idx++;
}
} else {
print "$out\n";
$pw1 = $w1;
$idx = 1;
$out = "$pw1\t$w2,$c;";
}
}
print "$out\n";
You can translate this approach to your especific problem.
I recommend you that you discuss this with him because he has a vast
experience with all this, much more than me.

Regards

> For Zip, we have our rule saying exact match or within 5 kms of each
> other(through a lookup), give a score of 50 and so on.
>
> Once we have each person of dataset1 compared with that of dataset2,
> we find the overall rank. Which is a weighted average of scores of
> name, address etc comparison.
>
> One approach is to use the DistributedCache for the smaller dataset
> and do a nested loop join in the mapper. The second approach is to use
> multiple  MR flows, and compare the fields and reduce/collate the
> results.
>
> I am curious to know if people have other approaches they have
> implemented, what are the efficiencies they have built up etc.
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies
>
>
>
>
>
>
>
> On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz <[EMAIL PROTECTED]> wrote:
>        
>         On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
>         > Hi,
>         >
>         > I am working on a problem to compare two different datasets,

 Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas
 http://uncubanitolinuxero.blogspot.com
 http://www.linkedin.com/in/marcosluis2186