The Mahout project has several tools for this class of problem.
On Tue, Mar 8, 2011 at 9:07 AM, Chase Bradford <[EMAIL PROTECTED]> wrote:
> How much smaller is the smaller dataset? If you can use the DistributedCache
> and precompute bigrams, locations, etc., and hold all the results in memory
> during setup before mapping over the large dataset, then I would suggest that
> approach.
> Another trick I've seen for similar problems, where the final score is a
> product of feature scores, is to cluster in a way that eliminates obvious
> 0s. For example, if distance > 50km is a zero, then choose enough anchor
> coordinates to canvass the map with overlapping circles of radius 25km.
> Then, your mapper would emit (coord, record) pairs for every anchor region
> the record is in. That way, only records known to be similar in some way are
> compared.
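The anchor-region trick above can be sketched in plain Java. The grid layout, the flat-earth distance approximation, and the rough km-per-degree conversion below are illustrative assumptions, not part of the original suggestion; in a real job, `regionsFor` would drive the mapper's `(anchorId, record)` emissions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: cover the map with overlapping 25km circles and key each
// record by every anchor circle it falls inside, so only records that
// share an anchor are ever compared.
public class AnchorRegions {

    static final double RADIUS_KM = 25.0;
    static final double KM_PER_DEG = 111.0;  // rough conversion near the equator

    // Anchors on a grid whose spacing (< radius) guarantees overlap.
    static List<double[]> anchors(double latMin, double latMax,
                                  double lonMin, double lonMax) {
        double step = RADIUS_KM / KM_PER_DEG;
        List<double[]> out = new ArrayList<>();
        for (double lat = latMin; lat <= latMax; lat += step)
            for (double lon = lonMin; lon <= lonMax; lon += step)
                out.add(new double[]{lat, lon});
        return out;
    }

    // Indices of every anchor whose circle contains the point; the real
    // mapper would emit one (anchorId, record) pair per index.
    static List<Integer> regionsFor(double lat, double lon, List<double[]> anchors) {
        List<Integer> out = new ArrayList<>();
        for (int i = 0; i < anchors.size(); i++) {
            double dLat = (lat - anchors.get(i)[0]) * KM_PER_DEG;
            double dLon = (lon - anchors.get(i)[1]) * KM_PER_DEG;
            if (Math.sqrt(dLat * dLat + dLon * dLon) <= RADIUS_KM) out.add(i);
        }
        return out;
    }
}
```

Two nearby records then share at least one anchor index (and so meet in the same reduce group), while distant records share none.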
> On Mar 7, 2011, at 9:21 PM, Sonal Goyal <[EMAIL PROTECTED]> wrote:
> Hi Marcos,
> Thanks for replying. I think I was not very clear in my last post. Let me
> describe my use case in detail.
> I have two datasets coming from different sources; let's call them dataset1
> and dataset2. Both of them contain records for entities, say Person. A
> single record looks like:
> First Name Last Name, Street, City, State, Zip
> We want to compare each record of dataset1 with each record of dataset2, in
> effect a cross join.
> We know that, the way the data is collected, names will not match exactly, but we
> want to find close-enough matches. So we have a rule which says: create bigrams and
> find the matching bigrams. If 0 to 5 match, give a score of 10; if 5 to 15
> match, give a score of 20; and so on.
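A minimal plain-Java sketch of the bigram rule described above. The band above 15 matches is an assumed extension of the stated "and so on", and the lowercasing/whitespace-stripping normalization is my addition:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the banded bigram-overlap score for names.
public class BigramScore {

    // Set of character bigrams of a name, normalized (assumption).
    static Set<String> bigrams(String s) {
        Set<String> out = new HashSet<>();
        String t = s.toLowerCase().replaceAll("\\s+", "");
        for (int i = 0; i + 2 <= t.length(); i++)
            out.add(t.substring(i, i + 2));
        return out;
    }

    // Count shared bigrams and map the count onto the stated bands:
    // 0-5 matches -> 10, 5-15 -> 20, above 15 -> 30 (assumed band).
    static int score(String a, String b) {
        Set<String> common = bigrams(a);
        common.retainAll(bigrams(b));
        int matches = common.size();
        if (matches <= 5) return 10;
        if (matches <= 15) return 20;
        return 30;
    }
}
```

For example, "John Smith" and "Jon Smith" share 6 bigrams, landing in the 20 band.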
> For Zip, we have a rule saying: on an exact match, or if the zips are within 5 km
> of each other (through a lookup), give a score of 50; and so on.
> Once we have each person of dataset1 compared with those of dataset2, we find
> the overall rank, which is a weighted average of the scores for name, address, etc.
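As a worked example of the weighted-average rank described above (the field weights here are invented for illustration; the real ones come from the matching rules):

```java
// Sketch of the final rank: a weighted average of per-field scores,
// i.e. sum(w_i * s_i) / sum(w_i).
public class WeightedRank {

    static double rank(double[] scores, double[] weights) {
        double num = 0, den = 0;
        for (int i = 0; i < scores.length; i++) {
            num += scores[i] * weights[i];
            den += weights[i];
        }
        return num / den;
    }
}
```

For instance, a name score of 20 and a zip score of 50, weighted 3:1, give (3*20 + 1*50) / 4 = 27.5.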
> One approach is to use the DistributedCache for the smaller dataset and do a
> nested loop join in the mapper. The second approach is to use multiple MR
> flows, and compare the fields and reduce/collate the results.
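The first approach can be sketched without the Hadoop API: in a real job, `smallDataset` would be filled from the DistributedCache file in `Mapper.setup()`, and `map()` below would be the mapper's map method. The `similarity()` placeholder (a shared-bigram count) and the tab-separated output format are my assumptions:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of a map-side nested-loop join against an in-memory copy of
// the smaller dataset.
public class NestedLoopJoin {

    // Loaded once before mapping (from the DistributedCache in a real job).
    static List<String> smallDataset = new ArrayList<>();

    // Placeholder similarity: number of shared character bigrams. A real
    // job would apply the name/zip/address rules here instead.
    static int similarity(String a, String b) {
        Set<String> sa = new HashSet<>(), sb = new HashSet<>();
        for (int i = 0; i + 2 <= a.length(); i++) sa.add(a.substring(i, i + 2));
        for (int i = 0; i + 2 <= b.length(); i++) sb.add(b.substring(i, i + 2));
        sa.retainAll(sb);
        return sa.size();
    }

    // The "map" step: score one large-dataset record against every
    // cached record, emitting tab-separated (large, small, score) lines.
    static List<String> map(String largeRecord) {
        List<String> out = new ArrayList<>();
        for (String small : smallDataset)
            out.add(largeRecord + "\t" + small + "\t" + similarity(largeRecord, small));
        return out;
    }
}
```

This does |large| x |small| comparisons, which is why it only works when the small dataset fits comfortably in each mapper's memory.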
> I am curious to know if people have implemented other approaches, and what
> efficiencies they have built up.
> Thanks and Regards,
> Sonal Goyal
> Hadoop ETL and Data Integration
> Nube Technologies
> On Tue, Mar 8, 2011 at 12:55 AM, Marcos Ortiz <[EMAIL PROTECTED]> wrote:
>> On Tue, 2011-03-08 at 00:36 +0530, Sonal Goyal wrote:
>> > Hi,
>> > I am working on a problem to compare two different datasets, and rank
>> > each record of the first with respect to the other, in terms of how
>> > similar they are. The records are dimensional, but do not have a lot
>> > of dimensions. Some of the fields will be compared for exact matches,
>> > some for similar sound, some for closest match, and so on. One of the
>> > datasets is large, and the other is much smaller. The final goal is
>> > to compute a rank between each record of first dataset with each
>> > record of the second. The rank is based on weighted scores of each
>> > dimension comparison.
>> > I was wondering if people in the community have any advice/suggested
>> > patterns/thoughts about cross joining two datasets in map reduce. Do
>> > let me know if you have any suggestions.
>> > Thanks and Regards,
>> > Sonal
>> > Hadoop ETL and Data Integration
>> > Nube Technologies
>> Regards, Sonal. Can you give us more information about a basic workflow
>> of your idea?
>> Some questions:
>> - How do you know that two records are identical? By ID?
>> - Can you give an example of the ranking that you want to achieve with a
>> match in each case:
>> - two records that are identical
>> - two records that are similar
>> - two records with the closest match
>> For MapReduce design algorithms, I recommend this excellent resource from
>> Ricky Ho: