Re: Dataset comparison and ranking
Marcos Ortiz 2011-03-07, 19:25
On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
> Hi,
>
> I am working on a problem to compare two different datasets, and rank
> each record of the first with respect to the other, in terms of how
> similar they are. The records are dimensional, but do not have a lot
> of dimensions. Some of the fields will be compared for exact matches,
> some for similar sound, some with closest match etc. One of the
> datasets is large, and the other is much smaller. The final goal is
> to compute a rank between each record of first dataset with each
> record of the second. The rank is based on weighted scores of each
> dimension comparison.
>
> I was wondering if people in the community have any advice/suggested
> patterns/thoughts about cross joining two datasets in map reduce. Do
> let me know if you have any suggestions.
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies

Regards, Sonal.

Can you give us more information about the basic workflow of your idea?
Some questions:

- How do you know that two records are identical? By id?
- Can you give an example of the ranking that you want to achieve for
  each of these cases:
  - two records that are identical
  - two records that are similar
  - two records with the closest match

For designing MapReduce algorithms, I recommend this excellent post from
Ricky Ho:
http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

For the join of the two datasets, you can use Pig.
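To make the questions above concrete, here is a minimal Python sketch of how I read your description: one exact-match field, one "similar sound" field and one closest-match field, combined with weights. Everything specific in it is an assumption for illustration only: the field names (country, surname, city), the weights, the simplified Soundex code and the use of difflib for the closest match. Please correct whatever does not fit your data.

```python
from difflib import SequenceMatcher

# Hypothetical weights per dimension (assumption, not taken from your data).
WEIGHTS = {"country": 0.5, "surname": 0.3, "city": 0.2}

# Letter -> digit table for a simplified Soundex.
_CODES = {
    ch: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
        "4": "l", "5": "mn", "6": "r",
    }.items()
    for ch in letters
}


def soundex(word):
    """Simplified Soundex: first letter plus up to three consonant codes."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    out = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (out + "000")[:4]


def score(a, b):
    """Weighted similarity of two records, between 0.0 and 1.0."""
    exact = 1.0 if a["country"] == b["country"] else 0.0
    sound = 1.0 if soundex(a["surname"]) == soundex(b["surname"]) else 0.0
    close = SequenceMatcher(None, a["city"].lower(), b["city"].lower()).ratio()
    return (WEIGHTS["country"] * exact
            + WEIGHTS["surname"] * sound
            + WEIGHTS["city"] * close)


def rank(large, small):
    """For every record of the large dataset, list (score, id) pairs of the
    small dataset, best match first."""
    ranked = {}
    for rec in large:
        scored = [(score(rec, ref), ref["id"]) for ref in small]
        ranked[rec["id"]] = sorted(scored, key=lambda pair: -pair[0])
    return ranked
```

Since one of your datasets is much smaller, the inner loop over `small` is the part that could run inside each mapper with the small dataset held in memory, while the large dataset is streamed as the map input. That is only one possible layout, not something I have tested on your data.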
Here is a basic Pig example from Milind Bhandarkar ([EMAIL PROTECTED])'s
talk "Practical Problem Solving with Hadoop and Pig":

    Users = load 'users' as (name, age);
    Filtered = filter Users by age >= 18 and age <= 25;
    Pages = load 'pages' as (user, url);
    Joined = join Filtered by name, Pages by user;
    Grouped = group Joined by url;
    Summed = foreach Grouped generate group, COUNT(Joined) as clicks;
    Sorted = order Summed by clicks desc;
    Top5 = limit Sorted 5;
    store Top5 into 'top5sites';

--
Marcos Luís Ortíz Valmaseda
Software Engineer
Centro de Tecnologías de Gestión de Datos (DATEC)
Universidad de las Ciencias Informáticas
http://uncubanitolinuxero.blogspot.com
http://www.linkedin.com/in/marcosluis2186