MapReduce, mail # user - Dataset comparison and ranking - views


Re: Dataset comparison and ranking - views
Marcos Ortiz 2011-03-07, 19:25
On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
> Hi,
>
> I am working on a problem to compare two different datasets, and rank
> each record of the first with respect to the other, in terms of how
> similar they are. The records are dimensional, but do not have a lot
> of dimensions. Some of the fields will be compared for exact matches,
> some for similar sound, some with closest match etc. One of the
> datasets is large, and the other is much smaller.  The final goal is
> to compute a rank between each record of first dataset with each
> record of the second. The rank is based on weighted scores of each
> dimension comparison.
>
> I was wondering if people in the community have any advice/suggested
> patterns/thoughts about cross joining two datasets in map reduce. Do
> let me know if you have any suggestions.  
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies

Hi Sonal. Can you give us more information about the basic workflow
you have in mind?

Some questions:
- How do you know that two records are identical? By id?
- Can you give an example of the ranking that you want to achieve for
  each of these cases?
  - two records that are identical
  - two records that are similar
  - two records with the closest match

For designing MapReduce algorithms, I recommend this excellent post from
Ricky Ho:
http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

To join the two datasets, you can use Pig. Here is a basic Pig example
from Milind Bhandarkar's talk "Practical Problem Solving with Hadoop
and Pig":
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
            COUNT(Joined) as clicks;
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
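
Note that the script above is an equi-join on a key, while you describe
scoring every record of the large dataset against every record of the
small one, which is a cross join (Pig has a CROSS operator for that).
Since one dataset is much smaller, one possible pattern is to hold it in
memory in each mapper and score each incoming large-dataset record
against all of it. Below is a minimal sketch of that idea in plain
Python, not Hadoop API code. The field names, weights, and comparison
functions are all hypothetical, chosen only for illustration, and the
string similarity is a stand-in for a real phonetic comparison.

```python
from difflib import SequenceMatcher

# Hypothetical per-field weights (they sum to 1.0); not from the thread.
WEIGHTS = {"name": 0.5, "city": 0.3, "age": 0.2}


def exact(a, b):
    """Return 1.0 on an exact match, otherwise 0.0."""
    return 1.0 if a == b else 0.0


def fuzzy(a, b):
    """String similarity in [0, 1]; a stand-in for a 'sounds like' check."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def closeness(a, b, scale=10.0):
    """Numeric closeness in [0, 1]; drops to 0.0 once the gap reaches scale."""
    return max(0.0, 1.0 - abs(a - b) / scale)


# Hypothetical choice of which comparison applies to which field.
COMPARATORS = {"name": fuzzy, "city": exact, "age": closeness}


def score(left, right):
    """Weighted sum of the per-field comparison scores."""
    return sum(WEIGHTS[f] * COMPARATORS[f](left[f], right[f]) for f in WEIGHTS)


def map_record(large_record, small_dataset):
    """Mapper body: pair one large-dataset record with every small record."""
    for small_record in small_dataset:
        yield (large_record["id"], small_record["id"],
               score(large_record, small_record))


def rank(large_dataset, small_dataset):
    """For each large record, emit its pairs ordered by descending score."""
    for record in large_dataset:
        pairs = sorted(map_record(record, small_dataset),
                       key=lambda pair: pair[2], reverse=True)
        yield from pairs


if __name__ == "__main__":
    small = [
        {"id": "s1", "name": "Jon Smith", "city": "Pune", "age": 30},
        {"id": "s2", "name": "Alice Rao", "city": "Delhi", "age": 52},
    ]
    large = [{"id": "l1", "name": "John Smith", "city": "Pune", "age": 31}]
    for large_id, small_id, value in rank(large, small):
        print(large_id, small_id, round(value, 3))
```

In a real job the small dataset would have to be shipped to every
mapper, and the output size is the product of the two dataset sizes, so
it may be worth keeping only the top few matches per record.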
--
 Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas
 http://uncubanitolinuxero.blogspot.com
 http://www.linkedin.com/in/marcosluis2186