Re: how to implement the 'diff' cmd in hadoop
Bejoy Ks 2012-03-20, 11:13
Yes, if you have more than two files to compare, the mapper needs to emit
the file name/id. If it is just two files and you only want to know which
lines do not match, the line number alone is enough. But if you want more
granular info, such as exactly which files contain which changes, then
prefix the value emitted by the mapper with an identifier such as the
file name.
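
A rough local sketch of that idea, in plain Python rather than the Hadoop
API, just to show the data flow. The function names (map_file, shuffle,
reduce_line, diff_files) are made up for illustration. It assumes a line
number is available for every record; with Hadoop's default TextInputFormat
the mapper key is the byte offset of the line, not a line number, so in a
real job the line numbers would have to come from somewhere else.

```python
from collections import defaultdict


def map_file(file_id, lines):
    """Mapper: emit (line_no, (file_id, text)) so the value carries its origin."""
    for line_no, text in enumerate(lines, start=1):
        yield line_no, (file_id, text.rstrip("\n"))


def shuffle(pairs):
    """Group mapper output by key, standing in for the framework's shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_line(tagged_values, num_files):
    """Reducer: return the tagged values if the line differs, else None."""
    texts = {text for _, text in tagged_values}
    # A line matches only if every file supplied it and all copies are equal.
    if len(tagged_values) == num_files and len(texts) == 1:
        return None
    return sorted(tagged_values)


def diff_files(files):
    """Run map, shuffle and reduce over {file_id: lines}; return mismatches."""
    pairs = []
    for file_id, lines in files.items():
        pairs.extend(map_file(file_id, lines))
    mismatches = {}
    for line_no, values in sorted(shuffle(pairs).items()):
        result = reduce_line(values, len(files))
        if result is not None:
            mismatches[line_no] = result
    return mismatches
```

For example, diff_files({"a.txt": ["x", "y"], "b.txt": ["x", "Y"]}) reports
only line 2, with each version tagged by its file. The number of entries in
the result plays the role of the non-matching counter.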
2012/3/20 botma lin <[EMAIL PROTECTED]>
> Thanks Bejoy, that makes sense.
> If I want to know which file a differing record came from, I need to
> put an extra file id into the mapper's output value, then read it in the
> reducer.
> Do you have any other ideas?
> On Tue, Mar 20, 2012 at 6:09 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote:
> > Hi Lin
> > In your mapper, emit the line number as the key and the line contents as
> > the value. In your reducer, check whether the values for a key match;
> > i.e. if you are comparing two files, there will be two values for each
> > line number. If they do not match, increment a counter to track the
> > number of non-matching lines and write those lines to the output file.
> > If the values match for a key, do nothing; there is no need to write
> > anything to the output dir.
> > Regards
> > Bejoy KS
> > On Tue, Mar 20, 2012 at 2:01 PM, botma lin <[EMAIL PROTECTED]> wrote:
> > > Hi, all
> > >
> > > I'm a newbie to Hadoop.
> > >
> > > I'm trying to compare two large files and get the differences between
> > > them, like the diff cmd in Linux.
> > > However, the mapred API can only get one record at a time, so how can I
> > > get the related records in the two files and compare them using the
> > > mapred API?
> > >
> > > Thanks!
> > >