Re: how to implements the 'diff' cmd in hadoop
Hi Lin
        In your mapper, make the line number the key and the line contents
the value. In your reducer, check whether the two values for a key match,
i.e. if you are comparing two files there will be two values for each line
number. If they do not match, increment a counter to track the number of
mismatched lines and write those lines to the output file. If the values
match for a key, do nothing; there is no need to write anything to the
output dir.
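
The flow above can be sketched locally in plain Python, without the Hadoop
API, just to show the keying and the comparison. Everything here is an
assumption for illustration: the function names are hypothetical, the "a"/"b"
source tags are added so the reducer can tell the two files apart, and the
shuffle step only imitates the grouping the framework does between map and
reduce. Note that in a real job the line number has to be supplied somehow,
since Hadoop's default text input keys each record by its byte offset rather
than its line number.

```python
from collections import defaultdict


def map_lines(source, lines):
    """Mapper: emit (line_number, (source, text)) for each line of one file."""
    for line_no, text in enumerate(lines, start=1):
        yield line_no, (source, text)


def shuffle(pairs):
    """Group mapper output by key, imitating the framework's shuffle phase."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_diff(line_no, values):
    """Reducer: return a mismatch record, or None when both files agree."""
    by_source = dict(values)       # {"a": text, "b": text}; a side may be absent
    left = by_source.get("a")      # None if file "a" has no such line
    right = by_source.get("b")     # None if file "b" has no such line
    if left == right:
        return None                # matching lines produce no output
    return (line_no, left, right)


def diff_by_line_number(lines_a, lines_b):
    """Run map, shuffle and reduce; return (mismatch_count, mismatches)."""
    pairs = list(map_lines("a", lines_a)) + list(map_lines("b", lines_b))
    grouped = shuffle(pairs)
    mismatches = []
    for line_no in sorted(grouped):
        record = reduce_diff(line_no, grouped[line_no])
        if record is not None:
            mismatches.append(record)
    # The length of the list plays the role of the job counter.
    return len(mismatches), mismatches


if __name__ == "__main__":
    count, rows = diff_by_line_number(["x", "y", "z"], ["x", "Y", "z", "w"])
    print(count)   # 2
    print(rows)    # [(2, 'y', 'Y'), (4, None, 'w')]
```

This compares the files position by position, so it is not the same as the
Linux diff: one inserted line in the second file makes every later line
number report a mismatch.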

Bejoy KS

On Tue, Mar 20, 2012 at 2:01 PM, botma lin <[EMAIL PROTECTED]> wrote:

> Hi, all
>      I'm a newbie to Hadoop.
>      I'm trying to compare two large files and get the difference between
> them, like the diff cmd in Linux.
>  However, the mapred API can only get one record at a time, so how can I
> get the related records in the two files and compare them using the mapred API?
>     Thanks!