HDFS, mail # user - how to design the mapper and reducer for the below problem


Re: How to design the mapper and reducer for the following problem
Sanjay Subramanian 2013-06-14, 16:15
Hi

My quick-and-dirty, non-optimized solution would be as follows:

MAPPER
======
OUTPUT from Mapper
    <Key = Sorted List {HASH1,HASH2,HASH3,HASH4} >      <Value = DOCID1~HASH1 HASH2 HASH3 HASH4>
    <Key = Sorted List {HASH1,HASH2,HASH3,HASH4} >      <Value = DOCID2~HASH5 HASH3 HASH1 HASH4>

REDUCER
=======
Iterate over keys
For a key = (say) {HASH1,HASH2,HASH3,HASH4}
     Format the collection of values into some StringBuilder kind of class

Output
KEY = {DOCID1 DOCID2}  value = null
KEY = {DOCID3 DOCID5} value = null
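
To make the data flow above concrete, here is a minimal local simulation in Python. It is only a sketch, not Hadoop API code: the function names (map_line, reduce_group, run_job) are hypothetical, a dictionary stands in for the shuffle phase, and it assumes that each document's key is its own sorted hash list. Read that way, two documents reach the same reducer call only when their hash sets are identical, so this handles exact matches and not the threshold case.

```python
from collections import defaultdict

def map_line(line):
    """Emit (sorted hash tuple, 'DOCID~HASH ...') for one input line."""
    doc_id, *hashes = line.split()
    return tuple(sorted(hashes)), doc_id + "~" + " ".join(hashes)

def reduce_group(values):
    """Join the DOCIDs of all values that arrived under one key."""
    return " ".join(sorted(v.split("~", 1)[0] for v in values))

def run_job(lines):
    groups = defaultdict(list)          # stands in for the shuffle phase
    for line in lines:
        if line.strip():
            key, value = map_line(line)
            groups[key].append(value)
    # keep only keys that collected more than one document
    return sorted(reduce_group(vs) for vs in groups.values() if len(vs) > 1)

print(run_job([
    "DOCID1 HASH1 HASH2 HASH3 HASH4",
    "DOCID2 HASH4 HASH3 HASH2 HASH1",   # same hashes, different order
    "DOCID3 HASH5 HASH3 HASH1 HASH4",   # differs by one hash, so no match
]))
```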

I hope I have understood your problem correctly. If not, sorry about that.

sanjay

From: parnab kumar <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Date: Friday, June 14, 2013 7:06 AM
To: [EMAIL PROTECTED]
Subject: How to design the mapper and reducer for the following problem

I have an input file where each line corresponds to a document. Each document is identified by some fingerprints. For example, a line in the input file
is of the following form:

input:
---------------------
DOCID1   HASH1 HASH2 HASH3 HASH4
DOCID2   HASH5 HASH3 HASH1 HASH4

The output of the MapReduce job should be the pairs of DOCIDs that share at least a threshold number of hashes in common.

output:
--------------------------
DOCID1 DOCID2
DOCID3 DOCID5
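
One possible way to meet the threshold requirement, sketched locally in Python rather than as an actual Hadoop job. This is not the approach proposed in the thread: it keys the map output on each individual hash, forms DOCID pairs per hash, and counts how many hashes each pair shares. The function name similar_pairs and the threshold value of 3 are assumptions made for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def similar_pairs(lines, threshold):
    """Return sorted DOCID pairs sharing at least `threshold` hashes."""
    docs_by_hash = defaultdict(set)     # map phase: key = hash, value = DOCID
    for line in lines:
        if not line.strip():
            continue
        doc_id, *hashes = line.split()
        for h in set(hashes):           # ignore repeated hashes within a doc
            docs_by_hash[h].add(doc_id)

    shared = Counter()                  # reduce phase: count co-occurrences
    for doc_ids in docs_by_hash.values():
        for pair in combinations(sorted(doc_ids), 2):
            shared[pair] += 1

    return sorted(pair for pair, n in shared.items() if n >= threshold)

lines = [
    "DOCID1   HASH1 HASH2 HASH3 HASH4",
    "DOCID2   HASH5 HASH3 HASH1 HASH4",
]
print(similar_pairs(lines, threshold=3))
```

In a real job the per-hash grouping and the pair counting would most likely be two chained MapReduce passes, and a hash shared by very many documents produces a quadratic number of pairs, so some pruning would be needed at scale.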
