Re: Dataset comparison and ranking
Lance Norskog 2011-03-10, 06:38
The Mahout project has several tools for this class of problem.
http://mahout.apache.org

On Tue, Mar 8, 2011 at 9:07 AM, Chase Bradford <[EMAIL PROTECTED]> wrote:
> How much smaller is the smaller dataset? If you can use the DC and
> precompute bigrams, locations, etc, and hold all the results in memory
> during setup before mapping on the large dataset, then I would suggest that
> approach.
>
> Another trick I've seen for similar problems, where the final score is a
> product of feature scores, is to cluster in a way that eliminates obvious
> 0s. For example, if distance > 50km is a zero, then choose enough anchor
> coordinates to canvas the map with circles with radius 25km and overlap.
> Then, your mapper would emit (coord, record) pairs for every anchor region
> the record is in. That way, only records known to be similar in some way are
> considered.
>
> On Mar 7, 2011, at 9:21 PM, Sonal Goyal <[EMAIL PROTECTED]> wrote:
>
> Hi Marcos,
>
> Thanks for replying. I think I was not very clear in my last post. Let me
> describe my use case in detail.
>
> I have two datasets coming from different sources, let's call them dataset1
> and dataset2. Both of them contain records for entities, say Person. A
> single record looks like:
>
> First Name Last Name, Street, City, State, Zip
>
> We want to compare each record of dataset1 with each record of dataset2, in
> effect a cross join.
>
> We know that the way data is collected, names will not match exactly, but we
> want to find close enough matches. So we have a rule which says create bigrams
> and find the matching bigrams. If 0 to 5 match, give a score of 10, if 5-15
> match, give a score of 20 and so on.
> For Zip, we have our rule saying exact match or within 5 km of each
> other (through a lookup), give a score of 50 and so on.
>
> Once we have each person of dataset1 compared with that of dataset2, we find
> the overall rank, which is a weighted average of the scores of the name,
> address etc. comparisons.
>
> One approach is to use the DistributedCache for the smaller dataset and do a
> nested loop join in the mapper. The second approach is to use multiple MR
> flows, and compare the fields and reduce/collate the results.
>
> I am curious to know if people have other approaches they have implemented,
> what efficiencies they have built up etc.
>
> Thanks and Regards,
> Sonal
> Hadoop ETL and Data Integration
> Nube Technologies
>
> On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz <[EMAIL PROTECTED]> wrote:
>>
>> On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
>> > Hi,
>> >
>> > I am working on a problem to compare two different datasets, and rank
>> > each record of the first with respect to the other, in terms of how
>> > similar they are. The records are dimensional, but do not have a lot
>> > of dimensions. Some of the fields will be compared for exact matches,
>> > some for similar sound, some with closest match etc. One of the
>> > datasets is large, and the other is much smaller. The final goal is
>> > to compute a rank between each record of first dataset with each
>> > record of the second. The rank is based on weighted scores of each
>> > dimension comparison.
>> >
>> > I was wondering if people in the community have any advice/suggested
>> > patterns/thoughts about cross joining two datasets in map reduce. Do
>> > let me know if you have any suggestions.
>> >
>> > Thanks and Regards,
>> > Sonal
>> > Hadoop ETL and Data Integration
>> > Nube Technologies
>>
>> Regards, Sonal. Can you give us more information about a basic workflow
>> of your idea?
>>
>> Some questions:
>> - How do you know that two records are identical? By id?
>> - Can you give an example of the ranking that you want to achieve with a
>> match of each case:
>> - two records that are identical
>> - two records that are similar
>> - two records with the closest match
>>
>> For designing MapReduce algorithms, I recommend this excellent post from
>> Ricky Ho:
>>
>> http://horicky.blogspot.com/2010/08/designing-algorithmis-for-map-reduce.html

Lance Norskog
[EMAIL PROTECTED]
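
For readers following the thread, the scoring rule Sonal describes (bigram match counts for names, an exact-or-nearby rule for zip, then a weighted average) combined with the nested loop over the smaller dataset could be sketched in plain Python roughly as below. This is only an illustration, not anyone's actual implementation. The record layout, the tie-break at exactly 5 matches, the tier above 15 matches, the score of 0 for distant zips, the weights, and the `distance_km` lookup table are all assumptions, since the messages leave them open ("and so on").

```python
def bigrams(text):
    """Character bigrams of a string, lower-cased, with whitespace removed."""
    s = "".join(text.lower().split())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def name_score(name_a, name_b):
    # Tiers from the thread: 0 to 5 matching bigrams -> 10, 5-15 -> 20.
    # Assumptions: exactly 5 falls in the lower tier; above 15 scores 30.
    matches = len(bigrams(name_a) & bigrams(name_b))
    if matches <= 5:
        return 10
    if matches <= 15:
        return 20
    return 30


def zip_score(zip_a, zip_b, distance_km):
    # distance_km is a hypothetical lookup {(zip, zip): kilometres}.
    if zip_a == zip_b:
        return 50
    d = distance_km.get((zip_a, zip_b), distance_km.get((zip_b, zip_a)))
    if d is not None and d <= 5:
        return 50
    return 0  # assumption: the thread does not give the other tiers


# Hypothetical weights for the weighted average.
WEIGHTS = {"name": 3, "zip": 2}


def overall_score(rec_a, rec_b, distance_km):
    scores = {
        "name": name_score(rec_a["name"], rec_b["name"]),
        "zip": zip_score(rec_a["zip"], rec_b["zip"], distance_km),
    }
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS) / total_weight


def rank_all(large, small, distance_km):
    """Nested loop join: every large record against the in-memory small set."""
    for big in large:
        for little in small:
            yield big["id"], little["id"], overall_score(big, little, distance_km)


if __name__ == "__main__":
    small = [{"id": "s1", "name": "John Smith", "zip": "94107"}]
    large = [
        {"id": "l1", "name": "Jon Smith", "zip": "94107"},
        {"id": "l2", "name": "Mary Jones", "zip": "10001"},
    ]
    for row in rank_all(large, small, distance_km={}):
        print(row)
```

In a MapReduce job, `small` would play the role of the dataset loaded from the DistributedCache during mapper setup, and the outer loop would be the stream of records from the large dataset.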