-
how to implements the 'diff' cmd in hadoop
botma lin 2012-03-20, 08:31
Hi all,
I'm a newbie to Hadoop.
I'm trying to compare two large files and get the differences between them, like the diff cmd in Linux. However, the mapred API only hands me one record at a time. How can I get the corresponding records from the two files and compare them using the mapred API?
Thanks!
-
Re: how to implements the 'diff' cmd in hadoop
Bejoy Ks 2012-03-20, 10:09
Hi Lin,
In your mapper, make the line number the key and the line contents the value. In your reducer, check whether the two values for a key match; if you are comparing two files, there will be two values for each line number. If they do not match, increment a counter to track the number of mismatches and write those lines to the output file. If the values for a key match, do nothing; there is no need even to write to the output dir.
Regards Bejoy KS
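A minimal sketch of the approach described above, written as a plain Python simulation of the map and reduce steps rather than a real Hadoop job. The function names `map_lines` and `reduce_by_line_no` and the sample data are hypothetical. It assumes every record already carries its line number; Hadoop's default TextInputFormat passes the byte offset of each line as the key, not a line number, so a real job would have to obtain line numbers some other way.

```python
from collections import defaultdict

def map_lines(lines):
    """Mapper: emit (line_number, line_content) for every line of one file."""
    for line_no, content in enumerate(lines, start=1):
        yield line_no, content

def reduce_by_line_no(pairs):
    """Shuffle + reduce: group values by line number, keep keys whose values differ."""
    grouped = defaultdict(list)
    for line_no, content in pairs:
        grouped[line_no].append(content)
    mismatches = {}
    for line_no in sorted(grouped):
        values = grouped[line_no]
        # A key with a single value means the line exists in only one file.
        if len(values) != 2 or values[0] != values[1]:
            mismatches[line_no] = values
    return mismatches

# Hypothetical sample inputs standing in for the two large files.
file_a = ["alpha", "beta", "gamma"]
file_b = ["alpha", "BETA", "gamma", "delta"]

pairs = list(map_lines(file_a)) + list(map_lines(file_b))
mismatches = reduce_by_line_no(pairs)
mismatch_count = len(mismatches)  # plays the role of the Hadoop counter
print(mismatch_count, mismatches)
```

Matching lines produce no output at all, which mirrors the "do nothing" branch in the reducer.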
-
Re: how to implements the 'diff' cmd in hadoop
botma lin 2012-03-20, 11:06
Thanks Bejoy, that makes sense.
If I want to know which file a differing record came from, I need to put an extra file id into the mapper's output value and then read it in the reducer.
Do you have any other ideas?
Thanks!
-
Re: how to implements the 'diff' cmd in hadoop
Bejoy Ks 2012-03-20, 11:13
Yes, if you have more than two files to compare, the file name/id is required from the mapper. If it is just two files and you only want to know which lines differ, the line number alone is enough. But if you want more granular information, such as exactly which file holds which change, the value from the mapper could be prefixed with something like the file name.
Regards Bejoy KS
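A rough sketch of the file-id idea, again as a plain Python simulation and not Hadoop API code. The names `map_tagged` and `reduce_tagged`, the `file_ids` parameter, and the file names are all hypothetical. Each mapper output value is tagged with the file it came from, so the reducer can say which file holds which version of a line.

```python
from collections import defaultdict

def map_tagged(file_id, lines):
    """Mapper: emit (line_number, (file_id, content)) so the origin survives the shuffle."""
    for line_no, content in enumerate(lines, start=1):
        yield line_no, (file_id, content)

def reduce_tagged(pairs, file_ids):
    """Reducer: report a line number when a file lacks it or the contents disagree."""
    grouped = defaultdict(dict)
    for line_no, (file_id, content) in pairs:
        grouped[line_no][file_id] = content
    report = {}
    for line_no in sorted(grouped):
        by_file = grouped[line_no]
        missing_somewhere = len(by_file) < len(file_ids)
        contents_differ = len(set(by_file.values())) > 1
        if missing_somewhere or contents_differ:
            report[line_no] = by_file
    return report

# Hypothetical sample inputs: three files compared line by line.
files = {
    "a.txt": ["x", "y", "z"],
    "b.txt": ["x", "Y", "z"],
    "c.txt": ["x", "y"],
}
pairs = [p for fid, lines in files.items() for p in map_tagged(fid, lines)]
report = reduce_tagged(pairs, file_ids=list(files))
print(report)
```

With only two files the tag is optional, as noted above; it becomes necessary once the reducer has to attribute each value to a specific file.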
-
Re: how to implements the 'diff' cmd in hadoop
botma lin 2012-03-20, 11:22
Thanks a lot!
-
Re: how to implements the 'diff' cmd in hadoop
Dieter Plaetinck 2012-03-20, 11:33
The "diff command on Linux" (i.e. GNU diffutils) is way more involved than this. It can compare sections on different line numbers. For example, if you copy a text file to another, then delete or add some lines in arbitrary places and compare the two, it will detect just that, whereas this crude logic will give a lot of false positives. The diff logic is hard to map onto (and hence IMHO doesn't fit) the M/R paradigm.
But what's the bigger picture here? Usually you would run diff on files created by humans (source code, notes, etc.), i.e. files that can easily be diff'ed on a single machine. If your files are that huge, they are probably generated by software, which means you can do more appropriate things than diffing output files.
Dieter
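To illustrate the false-positive point, here is a small sketch comparing the line-number pairing with a sequence-aware comparison. It uses Python's standard `difflib` module as a stand-in for GNU diff; the two do not use the same algorithm, and the sample data is made up for illustration.

```python
import difflib

original = ["line %d" % i for i in range(1, 6)]
# Copy the file and insert one new line near the top.
modified = original[:1] + ["inserted"] + original[1:]

# Positional comparison: pair up lines by line number, as in the keyed approach.
# Lines present in only one file also count as mismatches.
positional_mismatches = sum(
    1 for a, b in zip(original, modified) if a != b
) + abs(len(original) - len(modified))

# Sequence-aware comparison: keep only the added ("+ ") and removed ("- ") lines.
sequence_changes = [
    d for d in difflib.ndiff(original, modified) if d.startswith(("+ ", "- "))
]

print(positional_mismatches)
print(sequence_changes)
```

A single inserted line shifts every later line by one, so the positional pairing flags almost the whole file, while the sequence-aware comparison reports only the insertion.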
-
Re: how to implements the 'diff' cmd in hadoop
botma lin 2012-03-21, 02:53
You are right, Dieter. The "Linux diff" regards a file as a list, but I only want to treat it as a set. Sorry I didn't make that clear at the beginning.
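Treating each file as a set suggests a different keying: a sketch, assuming the record itself becomes the key and the file id the value, so that grouping by key brings identical records together regardless of position. This is a plain Python simulation; the names `map_record` and `reduce_set_diff` and the sample data are hypothetical.

```python
from collections import defaultdict

def map_record(file_id, lines):
    """Mapper: the record itself is the key; the value says which file it came from."""
    for content in lines:
        yield content, file_id

def reduce_set_diff(pairs):
    """Reducer: keep records that were seen in exactly one file."""
    seen_in = defaultdict(set)
    for content, file_id in pairs:
        seen_in[content].add(file_id)
    only_in = defaultdict(list)
    for content in sorted(seen_in):
        sources = seen_in[content]
        if len(sources) == 1:
            only_in[next(iter(sources))].append(content)
    return dict(only_in)

# Hypothetical sample inputs; order and duplicates are ignored by design.
file_a = ["apple", "banana", "cherry", "banana"]
file_b = ["cherry", "apple", "date"]

pairs = list(map_record("A", file_a)) + list(map_record("B", file_b))
result = reduce_set_diff(pairs)
print(result)
```

Because line order and repeated records do not matter here, no line numbers are needed and the shuffle's group-by-key does most of the work.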