Re: Dataset comparison and ranking - views
On Tue, 2011-03-08 at 10:51 +0530, Sonal Goyal wrote:
> Hi Marcos,
>
> Thanks for replying. I think I was not very clear in my last post. Let
> me describe my use case in detail.
>
> I have two datasets coming from different sources; let's call them
> dataset1 and dataset2. Both of them contain records for entities, say
> Person. A single record looks like:
>
> First Name Last Name, Street, City, State, Zip
>
> We want to compare each record of dataset1 with each record of
> dataset2, in effect a cross join.
>
> We know that, given the way the data is collected, names will not
> match exactly, but we want to find close-enough matches. So we have a
> rule which says: create bigrams and find the matching bigrams. If 0
> to 5 match, give a score of 10; if 5 to 15 match, give a score of 20;
> and so on.
Well, one approach to this problem is given by Milind Bhandarkar in
his presentation "Practical Problem Solving with Hadoop and Pig".
He talks about a solution for bigrams, giving an example with word
matching.
Bigrams
=======
• Input: A large text corpus
• Output: List(word1, Top_K(word2))
• Two Stages:
  • Generate all possible bigrams
  • Find the most frequent K bigrams for each word

Bigrams: Stage 1
Map
===
• Generate all possible bigrams
• Map Input: Large text corpus
• Map computation:
  • In each sentence, for each adjacent pair “word1 word2”:
  • Output (word1, word2) and (word2, word1)
• Partition & Sort by (word1, word2)

pairs.pl
--------
#!/usr/bin/perl
# Map script: emit every adjacent word pair, in both orders.
while (<STDIN>) {
    chomp;
    $_ =~ s/[^a-zA-Z]+/ /g;    # replace non-letters with spaces
    $_ =~ s/^\s+//g;           # trim leading whitespace
    $_ =~ s/\s+$//g;           # trim trailing whitespace
    $_ =~ tr/A-Z/a-z/;         # lowercase
    my @words = split(/\s+/, $_);
    # $#words is the last index, so $i < $#words visits every adjacent
    # pair, including the final one
    for (my $i = 0; $i < $#words; ++$i) {
        print "$words[$i]:$words[$i+1]\n";
        print "$words[$i+1]:$words[$i]\n";
    }
}
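
For example, running the mapper locally (assuming the script is saved
as pairs.pl and perl is on the PATH):

echo "the quick brown fox" | perl pairs.pl

prints each adjacent pair in both orders:

the:quick
quick:the
quick:brown
brown:quick
brown:fox
fox:brown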

Bigrams: Stage 1
Reduce
======
• Input: List(word1, word2), sorted and partitioned
• Output: List(word1, [freq, word2])
• Counting, similar to the Unigrams example

count.pl
--------

#!/usr/bin/perl
# Reduce script: count occurrences of each (word1, word2) pair.
# Assumes the input is sorted, so identical pairs arrive adjacent.
defined($_ = <STDIN>) or exit;    # handle an empty input stream
chomp;
my ($pw1, $pw2) = split(/:/, $_);
my $count = 1;
while (<STDIN>) {
    chomp;
    my ($w1, $w2) = split(/:/, $_);
    if ($w1 eq $pw1 && $w2 eq $pw2) {
        $count++;
    } else {
        # the pair changed: emit the previous pair's count and reset
        print "$pw1:$count:$pw2\n";
        ($pw1, $pw2) = ($w1, $w2);
        $count = 1;
    }
}
print "$pw1:$count:$pw2\n";    # flush the final pair

Bigrams: Stage 2
Map
===
• Input: List(word1, [freq, word2])
• Output: List(word1, [freq, word2])
• Identity mapper (/bin/cat)
• Partition by word1
• Sort descending by (word1, freq)

Bigrams: Stage 2
Reduce
======
• Input: List(word1, [freq, word2]),
  • partitioned by word1
  • sorted descending by (word1, freq)
• Output: Top_K(List(word1, [freq, word2]))
• For each word, throw away everything after the first K records

firstN.pl
---------
#!/usr/bin/perl
# Reduce script: keep only the first $N records for each word.
# Assumes input lines of the form "word1:count:word2", partitioned by
# word1 and sorted descending by count, so the first $N records of
# each group are the top N bigrams for that word.
my $N = 5;
defined($_ = <STDIN>) or exit;    # handle an empty input stream
chomp;
my ($pw1, $count, $pw2) = split(/:/, $_);
my $idx = 1;
my $out = "$pw1\t$pw2,$count;";
while (<STDIN>) {
    chomp;
    my ($w1, $c, $w2) = split(/:/, $_);
    if ($w1 eq $pw1) {
        if ($idx < $N) {          # still room in this word's top N
            $out .= "$w2,$c;";
            $idx++;
        }
    } else {                      # a new word: emit the finished one
        print "$out\n";
        $pw1 = $w1;
        $idx = 1;
        $out = "$pw1\t$w2,$c;";
    }
}
print "$out\n";    # flush the last word

You can translate this approach to your specific problem. I recommend
that you discuss it with him, because he has vast experience with all
of this, much more than I do.
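
To make the translation concrete, here is a minimal sketch (mine, not
from the presentation) of the name-comparison rule from your mail,
using character bigrams. The helper names, the use of character
rather than word bigrams, and the top score of 30 are illustrative
assumptions:

#!/usr/bin/perl
# Sketch only: score a pair of names by counting shared character
# bigrams, using the thresholds from Sonal's mail (0-5 matches -> 10,
# 5-15 -> 20); the 30 for higher counts is a placeholder.
use strict;
use warnings;

sub bigrams {
    my ($s) = @_;
    $s = lc $s;
    $s =~ s/[^a-z]//g;    # keep letters only, as in pairs.pl
    my %b;
    $b{ substr($s, $_, 2) }++ for 0 .. length($s) - 2;
    return \%b;
}

sub name_score {
    my ($x, $y) = @_;
    my ($bx, $by) = (bigrams($x), bigrams($y));
    my $matches = 0;
    for my $bg (keys %$bx) {
        next unless exists $by->{$bg};
        # count a shared bigram as often as both names contain it
        $matches += $bx->{$bg} < $by->{$bg} ? $bx->{$bg} : $by->{$bg};
    }
    return $matches <= 5 ? 10 : $matches <= 15 ? 20 : 30;
}

print name_score("Jon Smith", "John Smith"), "\n";    # prints 20

In a MapReduce job this scoring would run wherever the cross join
pairs up candidate records: in the mapper for the DistributedCache
approach you mention, or in a reducer for the multi-flow approach.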

Regards

> For Zip, we have a rule saying: on an exact match, or if the two are
> within 5 km of each other (through a lookup), give a score of 50;
> and so on.
>
> Once we have compared each person of dataset1 with those of
> dataset2, we find the overall rank, which is a weighted average of
> the scores of the name, address, etc. comparisons.
>
> One approach is to use the DistributedCache for the smaller dataset
> and do a nested-loop join in the mapper. The second approach is to
> use multiple MR flows, comparing the fields and reducing/collating
> the results.
>
> I am curious to know whether people have implemented other
> approaches, and what efficiencies they have built up.
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies
>
> On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz <[EMAIL PROTECTED]> wrote:
>        
>         On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
>         > Hi,
>         >
>         > I am working on a problem to compare two different datasets,
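
Regarding the DistributedCache approach mentioned in the quoted mail:
a minimal streaming-mapper sketch, assuming the smaller dataset has
been shipped to the task's working directory (e.g. with the streaming
-files option) as dataset2.txt; the file name and record handling are
illustrative:

#!/usr/bin/perl
# Sketch only: map-side cross join for Hadoop Streaming. Loads the
# smaller dataset (shipped via DistributedCache) into memory, then
# pairs every dataset1 record on stdin with every dataset2 record.
use strict;
use warnings;

open my $fh, '<', 'dataset2.txt' or die "dataset2.txt: $!";
my @small = <$fh>;
chomp @small;
close $fh;

while (my $rec1 = <STDIN>) {
    chomp $rec1;
    for my $rec2 (@small) {
        # the per-field scoring (names, zip, ...) would happen here
        print "$rec1\t$rec2\n";
    }
}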

 Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas
 http://uncubanitolinuxero.blogspot.com
 http://www.linkedin.com/in/marcosluis2186