Hadoop, mail # user - Comparing two logs, finding missing records


Re: Comparing two logs, finding missing records
Mark Kerzner 2011-06-26, 05:53
Kumar,

thank you, that is the exact solution to my problem as I have formulated it.
That's valid and it stands, but I should have added that the two logs each
have time stamps and that we are looking for missing records with time
stamps in reasonable proximity.

I have come up with a solution where I use the rounded timestamp as the key,
and then in the reducer sort all records that fall within the same rounded
time; after that I am free to find the missing ones or anything else I want
to know about them.
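
A minimal sketch of the rounded-time idea, in plain Java rather than as an
actual Hadoop job. The class, field, and method names (TimeBucketDiff, Rec,
bucketOf, missingFromLog2) and the "log1"/"log2" source tags are assumptions
made for illustration, as is the choice of flooring to a fixed bucket width;
none of them come from the thread.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class TimeBucketDiff {

    /** One log record; the field names are illustrative. */
    static final class Rec {
        final String source;     // "log1" or "log2" (assumed tags)
        final String recordId;
        final long timestampMs;

        Rec(String source, String recordId, long timestampMs) {
            this.source = source;
            this.recordId = recordId;
            this.timestampMs = timestampMs;
        }
    }

    /** "Map" step: the rounded (floored) time is the key. */
    static long bucketOf(long timestampMs, long bucketMs) {
        return Math.floorDiv(timestampMs, bucketMs);
    }

    /** Returns, per time bucket, the record ids seen in log1 but not in log2. */
    static Map<Long, List<String>> missingFromLog2(List<Rec> records, long bucketMs) {
        // Stand-in for the shuffle: group records by their bucket key.
        Map<Long, List<Rec>> byBucket = new TreeMap<>();
        for (Rec r : records) {
            byBucket.computeIfAbsent(bucketOf(r.timestampMs, bucketMs),
                                     k -> new ArrayList<>()).add(r);
        }

        Map<Long, List<String>> missing = new TreeMap<>();
        for (Map.Entry<Long, List<Rec>> e : byBucket.entrySet()) {
            // "Reduce" step: sort the bucket by time, then compare the two sources.
            List<Rec> bucket = e.getValue();
            bucket.sort(Comparator.comparingLong((Rec r) -> r.timestampMs));

            Set<String> inLog2 = new HashSet<>();
            for (Rec r : bucket) {
                if ("log2".equals(r.source)) inLog2.add(r.recordId);
            }
            List<String> ids = new ArrayList<>();
            for (Rec r : bucket) {
                if ("log1".equals(r.source) && !inLog2.contains(r.recordId)) {
                    ids.add(r.recordId);
                }
            }
            if (!ids.isEmpty()) missing.put(e.getKey(), ids);
        }
        return missing;
    }

    public static void main(String[] args) {
        List<Rec> records = new ArrayList<>();
        records.add(new Rec("log1", "a", 1_000L));
        records.add(new Rec("log2", "a", 1_200L));
        records.add(new Rec("log1", "b", 61_000L));
        System.out.println(missingFromLog2(records, 60_000L)); // prints {1=[b]}
    }
}
```

One caveat with this kind of bucketing: two matching records whose time
stamps fall on either side of a bucket boundary end up under different keys,
so a real job would probably also need to look at the neighbouring bucket.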

What do you think?

Sincerely,
Mark

On Sun, Jun 26, 2011 at 12:34 AM, Kumar Kandasami <[EMAIL PROTECTED]> wrote:

> Mark -
>
>  A thought on accomplishing this as a MapReduce job: if you add the
> datasource information in the mapper phase, with the record id as the
> key, then in the reducer phase you can look for record ids with a missing
> datasource and print them.
>
> Driver Code:
>
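>          // Assuming plain-text logs; substitute the InputFormat class your logs need.
>          MultipleInputs.addInputPath(conf, log1path, TextInputFormat.class, Log1Mapper.class);
>          MultipleInputs.addInputPath(conf, log2path, TextInputFormat.class, Log2Mapper.class);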
>
> Mapper Phase -
>
>          Output - Key - Record Id, Value contains the datasource in
> addition to other values.
>          Logic - add the datasource information to the record.
>
> Reduce Phase -
>
>          Output - Print the Record Id that does not have a log1 or log2
> datasource value.
>          Logic - add to the output only records that do not have a log1 or
> log2 datasource.
>
>
> Kumar    _/|\_
>
>
> On Sat, Jun 25, 2011 at 11:39 PM, Mark Kerzner <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I have two logs which should have all the records for the same record_id;
> > in other words, if a record_id is found in the first log, it should also
> > be found in the second one. However, I suspect that the second log has
> > been filtered, and I need to find the missing records. Anything is
> > allowed: a MapReduce job, Hive, Pig, or even a NoSQL database.
> >
> > Thank you.
> >
> > It is also a good time to express my thanks to all the members of the
> > group, who are always very helpful.
> >
> > Sincerely,
> > Mark
> >
>
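
The approach quoted above (tag every record with its datasource, key on the
record id, and report the ids that lack one of the two tags) can be sketched
in plain Java as follows. This is only a stand-in for the mapper and reducer
logic, not Hadoop code: the class and method names (MissingRecordJoin, tag,
findMissing) and the "log1"/"log2" tags are assumptions made for
illustration, and an in-memory map plays the role of the shuffle.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class MissingRecordJoin {

    /** "Map" phase: emit (record id, datasource tag) for every id in one log. */
    static void tag(List<String> recordIds, String source,
                    Map<String, Set<String>> shuffled) {
        for (String id : recordIds) {
            shuffled.computeIfAbsent(id, k -> new HashSet<>()).add(source);
        }
    }

    /** "Reduce" phase: keep only the ids whose tag set lacks log1 or log2. */
    static Map<String, String> findMissing(List<String> log1Ids, List<String> log2Ids) {
        // Stand-in for the shuffle: record id -> set of datasources that contain it.
        Map<String, Set<String>> shuffled = new TreeMap<>();
        tag(log1Ids, "log1", shuffled);
        tag(log2Ids, "log2", shuffled);

        // Result: record id -> name of the log the record is absent from.
        Map<String, String> missing = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : shuffled.entrySet()) {
            if (!e.getValue().contains("log1")) {
                missing.put(e.getKey(), "log1");
            } else if (!e.getValue().contains("log2")) {
                missing.put(e.getKey(), "log2");
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        System.out.println(findMissing(
                Arrays.asList("r1", "r2", "r3"),
                Arrays.asList("r1", "r3", "r4")));
        // prints {r2=log2, r4=log1}
    }
}
```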