MapReduce >> mail # user >> Dataset comparison and ranking - views


Re: Dataset comparison and ranking - views
On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
> Hi,
>
> I am working on a problem to compare two different datasets, and rank
> each record of the first with respect to the other, in terms of how
> similar they are. The records are dimensional, but do not have a lot
> of dimensions. Some of the fields will be compared for exact matches,
> some for similar sound, some with closest match etc. One of the
> datasets is large, and the other is much smaller.  The final goal is
> to compute a rank between each record of first dataset with each
> record of the second. The rank is based on weighted scores of each
> dimension comparison.
>
> I was wondering if people in the community have any advice/suggested
> patterns/thoughts about cross joining two datasets in map reduce. Do
> let me know if you have any suggestions.  
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies

Hi Sonal. Can you give us more information about the basic workflow you
have in mind?

Some questions:
- How do you know that two records are identical? By id?
- Can you give an example of the ranking that you want to achieve for
  each of these cases?
  - two records that are identical
  - two records that are similar
  - two records with the closest match

For designing MapReduce algorithms, I recommend this excellent post from
Ricky Ho:
http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

To join the two datasets, you can use Pig. Here is a basic Pig example
from the talk "Practical Problem Solving with Hadoop and Pig" by Milind
Bhandarkar ([EMAIL PROTECTED]):
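-- Load users and keep only those aged 18 to 25.
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
-- Load page visits and join them to the filtered users by user name.
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
-- Count clicks per URL.
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
            COUNT(Joined) as clicks;
-- Keep the five most-clicked URLs and store them.
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';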
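
The Pig example is an equi-join, whereas the original question asks for every record of the large dataset to be scored against every record of the small one. Below is a minimal sketch of that idea in Python. It assumes the small dataset fits in memory on each mapper (a replicated, map-side cross join). The field names, the weights, and the use of difflib as a stand-in for "similar sound" or "closest match" comparisons are all hypothetical and only for illustration:

```python
from difflib import SequenceMatcher

# Hypothetical per-field weights; a real job would tune these.
WEIGHTS = {"zip": 0.6, "name": 0.4}

# Fields compared for exact equality; all others use fuzzy similarity.
EXACT_FIELDS = {"zip"}


def field_score(field, a, b):
    """Return a similarity in [0, 1] for one field of two records."""
    if field in EXACT_FIELDS:
        return 1.0 if a == b else 0.0
    # Stand-in for a phonetic or closest-match comparison.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def score(rec_a, rec_b):
    """Weighted sum of the per-field similarities."""
    return sum(w * field_score(f, rec_a[f], rec_b[f])
               for f, w in WEIGHTS.items())


def map_record(big_record, small_dataset):
    """Mapper body: pair one record of the large dataset with every
    record of the in-memory small dataset and emit a score for each."""
    for small_record in small_dataset:
        yield big_record["id"], small_record["id"], score(big_record, small_record)


if __name__ == "__main__":
    small = [
        {"id": "s1", "name": "John Smith", "zip": "10001"},
        {"id": "s2", "name": "Jane Doe", "zip": "94105"},
    ]
    big_record = {"id": "b1", "name": "Jon Smith", "zip": "10001"}
    # Sort the candidate matches for this record, best score first.
    for row in sorted(map_record(big_record, small), key=lambda t: -t[2]):
        print(row)
```

In a real Hadoop job the small dataset would be shipped to every mapper (for example through the distributed cache), so no reduce-side join is needed. A reducer keyed on the large record's id could then sort or trim the scored pairs.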
--
 Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas
 http://uncubanitolinuxero.blogspot.com
 http://www.linkedin.com/in/marcosluis2186