Thank you, that is the exact solution to my problem as I have formulated it.
That's valid and it stands, but I should have added that the two logs each
have time stamps and that we are looking for missing records with time
stamps in reasonable proximity.
I have come up with a solution where I use the rounded time as the key,
then in the reducer sort all records that fall within that rounded time;
after that I am free to find the missing ones or anything else I want about
them.
What do you think?
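A minimal sketch of the rounded-time idea, in plain Java with no Hadoop dependency. The class name TimeBucketDiff, the LogRecord shape, the "log1"/"log2" source labels, the millisecond timestamps, and rounding down to a fixed bucket width are all assumptions made for illustration; the grouping loop only stands in for what the shuffle would do between mapper and reducer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class TimeBucketDiff {
    // Hypothetical record shape: source log name, record id, timestamp in ms.
    static final class LogRecord {
        final String source;
        final String recordId;
        final long timestampMs;

        LogRecord(String source, String recordId, long timestampMs) {
            this.source = source;
            this.recordId = recordId;
            this.timestampMs = timestampMs;
        }
    }

    // "Map" step: the key is the timestamp rounded down to the bucket width.
    static long bucketKey(long timestampMs, long bucketMs) {
        return Math.floorDiv(timestampMs, bucketMs) * bucketMs;
    }

    // "Reduce" step for one bucket: sort by time, then list the log1 ids
    // that have no log2 record in the same bucket.
    static List<String> missingFromLog2(List<LogRecord> bucket) {
        List<LogRecord> sorted = new ArrayList<>(bucket);
        sorted.sort(Comparator.comparingLong((LogRecord r) -> r.timestampMs));
        Set<String> seenInLog2 = new HashSet<>();
        for (LogRecord r : sorted) {
            if (r.source.equals("log2")) seenInLog2.add(r.recordId);
        }
        List<String> missing = new ArrayList<>();
        for (LogRecord r : sorted) {
            if (r.source.equals("log1") && !seenInLog2.contains(r.recordId)) {
                missing.add(r.recordId);
            }
        }
        return missing;
    }

    // Stand-in for the shuffle: group records by bucket key, reduce each group.
    static Map<Long, List<String>> findMissing(List<LogRecord> records, long bucketMs) {
        Map<Long, List<LogRecord>> groups = new TreeMap<>();
        for (LogRecord r : records) {
            groups.computeIfAbsent(bucketKey(r.timestampMs, bucketMs),
                                   k -> new ArrayList<>()).add(r);
        }
        Map<Long, List<String>> result = new TreeMap<>();
        for (Map.Entry<Long, List<LogRecord>> e : groups.entrySet()) {
            List<String> missing = missingFromLog2(e.getValue());
            if (!missing.isEmpty()) result.put(e.getKey(), missing);
        }
        return result;
    }

    public static void main(String[] args) {
        List<LogRecord> records = Arrays.asList(
            new LogRecord("log1", "a", 1000L),
            new LogRecord("log2", "a", 1400L),
            new LogRecord("log1", "b", 2500L));
        // One-minute buckets; prints {0=[b]}
        System.out.println(findMissing(records, 60000L));
    }
}
```

One caveat with this kind of bucketing: two matching records that sit just on either side of a bucket boundary land in different buckets and would be reported as missing, so boundary cases may need a second pass or overlapping buckets.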
On Sun, Jun 26, 2011 at 12:34 AM, Kumar Kandasami <
[EMAIL PROTECTED]> wrote:
> Mark -
> A thought on accomplishing this as a MapReduce job: if you add the
> datasource information in the mapper phase, with the record id as the
> key, then in the reducer phase you can look for record ids with a missing
> datasource and print them.
> Driver Code:
> MultipleInputs.addInputPath(conf, log1path, InputFormat,
> MultipleInputs.addInputPath(conf, log2path, InputFormat,
> Mapper Phase -
> Output - Key - Record Id, Value contains the datasource in
> addition to other values.
> Logic - add the datasource information to the record.
> Reduce Phase -
> Output - Print the Record Id that does not have a log2 or log1
> datasource value.
> Logic - add to the output only records that are missing the log1 or
> log2 datasource.
> Kumar _/|\_
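The mapper and reducer logic described above can be sketched in plain Java, without Hadoop, roughly as follows. The class name DatasourceDiff, the "log1"/"log2" labels, and the use of bare record ids as input are assumptions for illustration; in a real job the grouping by record id would be done by the framework, and each log would get its own mapper wired in the driver.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class DatasourceDiff {
    // "Map" step: emit (record id, datasource) and group by record id,
    // which is what the shuffle would do between mapper and reducer.
    static Map<String, Set<String>> tagBySource(List<String> log1Ids, List<String> log2Ids) {
        Map<String, Set<String>> sourcesById = new TreeMap<>();
        for (String id : log1Ids) {
            sourcesById.computeIfAbsent(id, k -> new TreeSet<>()).add("log1");
        }
        for (String id : log2Ids) {
            sourcesById.computeIfAbsent(id, k -> new TreeSet<>()).add("log2");
        }
        return sourcesById;
    }

    // "Reduce" step: keep only ids that lack one datasource, mapped to the
    // name of the log they are missing from.
    static Map<String, String> findMissing(List<String> log1Ids, List<String> log2Ids) {
        Map<String, String> missing = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : tagBySource(log1Ids, log2Ids).entrySet()) {
            Set<String> sources = e.getValue();
            if (!sources.contains("log1")) {
                missing.put(e.getKey(), "log1");
            } else if (!sources.contains("log2")) {
                missing.put(e.getKey(), "log2");
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // prints {r2=log2}
        System.out.println(findMissing(Arrays.asList("r1", "r2", "r3"),
                                       Arrays.asList("r1", "r3")));
    }
}
```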
> On Sat, Jun 25, 2011 at 11:39 PM, Mark Kerzner <[EMAIL PROTECTED]
> > Hi,
> > I have two logs which should have all the records for the same record_id,
> > in
> > other words, if this record_id is found in the first log, it should also be
> > found in the second one. However, I suspect that the second log is
> > out, and I need to find the missing records. Anything is allowed:
> > job, Hive, Pig, and even a NoSQL database.
> > Thank you.
> > It is also a good time to express my thanks to all the members of the
> > who are always very helpful.
> > Sincerely,
> > Mark