Hadoop, mail # user - how to implements the 'diff' cmd in hadoop

Re: how to implements the 'diff' cmd in hadoop
Bejoy Ks 2012-03-20, 10:09
Hi Lin
        In your mapper, emit the line number as the key and the line contents
as the value. In your reducer, check whether the values for each key match:
if you are comparing two files, there will be two values for each line
number. If they do not match, increment a counter to track the number of
mismatches and write the mismatched lines to the output file. If the values
match for a key, do nothing; there is no need to write anything to the
output dir.
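
A rough sketch of that idea, simulated in memory in plain Python rather than
written against the Hadoop API. The function names and the "a"/"b" file tags
are made up for illustration. It also assumes each record already carries its
line number; Hadoop's default TextInputFormat hands the mapper a byte offset
as the key, not a line number, so a real job would need to get line numbers
some other way.

```python
from collections import defaultdict


def map_lines(source, lines):
    """Mapper step: emit (line_number, (source_tag, text)) for each line."""
    for line_no, text in enumerate(lines, start=1):
        yield line_no, (source, text)


def shuffle(pairs):
    """Stand-in for the shuffle phase: group emitted values by line number."""
    grouped = defaultdict(dict)
    for line_no, (source, text) in pairs:
        grouped[line_no][source] = text
    return grouped


def reduce_diff(grouped):
    """Reducer step: keep only line numbers whose two values differ.

    A line present in just one file shows up with None on the other side.
    Returns the mismatches plus a count, playing the role of the counter.
    """
    mismatches = []
    for line_no in sorted(grouped):
        left = grouped[line_no].get("a")
        right = grouped[line_no].get("b")
        if left != right:
            mismatches.append((line_no, left, right))
    return mismatches, len(mismatches)


def line_diff(lines_a, lines_b):
    """Run the three steps over two in-memory files (lists of lines)."""
    pairs = list(map_lines("a", lines_a)) + list(map_lines("b", lines_b))
    return reduce_diff(shuffle(pairs))
```

Note that this compares the files position by position, so it is not a full
replacement for diff: one inserted line shifts every later line and all of
them get reported as mismatches.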

Bejoy KS

On Tue, Mar 20, 2012 at 2:01 PM, botma lin <[EMAIL PROTECTED]> wrote:

> Hi, all
>      I'm a newbie to Hadoop.
>      I'm trying to compare two large files and get the differences between
> them, like the diff cmd in Linux.
>  However, the mapred API can only get one record at a time, so how can I
> get the corresponding records in the two files and compare them using the
> mapred API?
>     Thanks!