Re: Dataset comparison and ranking - views
On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
> Hi,
> I am working on a problem to compare two different datasets, and rank
> each record of the first with respect to the other, in terms of how
> similar they are. The records are dimensional, but do not have a lot
> of dimensions. Some of the fields will be compared for exact matches,
> some for similar sound, some with closest match etc. One of the
> datasets is large, and the other is much smaller.  The final goal is
> to compute a rank between each record of first dataset with each
> record of the second. The rank is based on weighted scores of each
> dimension comparison.
> I was wondering if people in the community have any advice/suggested
> patterns/thoughts about cross joining two datasets in map reduce. Do
> let me know if you have any suggestions.  
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies

Hello, Sonal. Can you give us more information about the basic workflow
you have in mind?

Some questions:
- How do you know that two records are identical? By id?
- Can you give an example of the ranking that you want to achieve for
  each of these cases:
  - two records that are identical
  - two records that are similar
  - two records with the closest match

For MapReduce algorithm design, I recommend this excellent resource from
Ricky Ho:

To join the two datasets, you can use Pig. Here is a basic Pig example
from the talk "Practical Problem Solving with Hadoop and Pig" by
Milind Bhandarkar ([EMAIL PROTECTED]):
-- Load users and keep those aged 18 to 25
Users = load 'users' as (name, age);
Filtered = filter Users by age >= 18 and age <= 25;
-- Load page visits and join them to the filtered users
Pages = load 'pages' as (user, url);
Joined = join Filtered by name, Pages by user;
-- Count clicks per URL
Grouped = group Joined by url;
Summed = foreach Grouped generate group,
            COUNT(Joined) as clicks;
-- Keep the five most-clicked URLs
Sorted = order Summed by clicks desc;
Top5 = limit Sorted 5;
store Top5 into 'top5sites';
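
That example is an equi-join. For the cross join you asked about, Pig
also has a CROSS operator. What follows is only a minimal sketch and is
not taken from the talk: the input names 'large' and 'small', the fields
(id, name, city), and the weights 0.7 and 0.3 are all assumptions made
for illustration. Only exact-match comparisons are shown; similar-sound
or closest-match scoring would probably need a UDF.

```pig
-- Hypothetical inputs: both datasets share the schema (id, name, city)
Large = load 'large' as (id:chararray, name:chararray, city:chararray);
Small = load 'small' as (id:chararray, name:chararray, city:chararray);

-- Pair every record of the large dataset with every record of the small one
Pairs = cross Large, Small;

-- Weighted score per pair: 1.0 for an exact match on a field, else 0.0
-- (the weights 0.7 and 0.3 are illustrative assumptions)
Scored = foreach Pairs generate
            Large::id as large_id,
            Small::id as small_id,
            0.7 * (Large::name == Small::name ? 1.0 : 0.0) +
            0.3 * (Large::city == Small::city ? 1.0 : 0.0) as score;

-- For each large record, list the small records by descending score
Ranked = order Scored by large_id, score desc;
store Ranked into 'ranked_pairs';
```

Keep in mind that CROSS emits one row for every pair, so the output has
(records in large) x (records in small) rows. With a much smaller second
dataset that may be acceptable, but it is worth estimating the size
before running it.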
 Marcos Luís Ortíz Valmaseda
 Software Engineer
 Centro de Tecnologías de Gestión de Datos (DATEC)
 Universidad de las Ciencias Informáticas