Re: Comparing two logs, finding missing records
Kumar,

thank you, that is the exact solution to my problem as I have formulated it.
That's valid and it stands, but I should have added that the two logs each
have time stamps and that we are looking for missing records with time
stamps in reasonable proximity.

I have come up with a solution where I use the rounded time as the key, and
then in the reducer I sort all records that fall within that rounded time.
After that I am free to find the missing ones, or anything else I want to
know about them.
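
The rounded-time idea above can be sketched outside Hadoop as plain Java, just to show the shape of the map key and the reducer logic. Everything named here is an assumption made for illustration: the LogRecord shape, the "log1"/"log2" source labels, the 60-second bucket width, and the in-memory grouping that stands in for the shuffle.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

class TimeBucketSketch {
    // Hypothetical record shape: source log, record id, timestamp in epoch seconds.
    static final class LogRecord {
        final String source;
        final String recordId;
        final long timestamp;

        LogRecord(String source, String recordId, long timestamp) {
            this.source = source;
            this.recordId = recordId;
            this.timestamp = timestamp;
        }
    }

    // Assumed bucket width; choose whatever counts as "reasonable proximity".
    static final long BUCKET_SECONDS = 60;

    // Map step: the key is the timestamp rounded down to the start of its bucket.
    static long bucketKey(long timestamp) {
        return Math.floorDiv(timestamp, BUCKET_SECONDS) * BUCKET_SECONDS;
    }

    // Stand-in for the shuffle: group records from both logs by bucket key.
    static SortedMap<Long, List<LogRecord>> groupByBucket(List<LogRecord> records) {
        SortedMap<Long, List<LogRecord>> buckets = new TreeMap<>();
        for (LogRecord r : records) {
            buckets.computeIfAbsent(bucketKey(r.timestamp), k -> new ArrayList<>()).add(r);
        }
        return buckets;
    }

    // Reduce step: sort one bucket by time, then list the ids that appear
    // in log1 but have no log2 counterpart within the same bucket.
    static List<String> missingFromLog2(List<LogRecord> bucket) {
        List<LogRecord> sorted = new ArrayList<>(bucket);
        sorted.sort(Comparator.comparingLong((LogRecord r) -> r.timestamp));

        Set<String> seenInLog2 = new HashSet<>();
        for (LogRecord r : sorted) {
            if (r.source.equals("log2")) {
                seenInLog2.add(r.recordId);
            }
        }
        Set<String> missing = new LinkedHashSet<>(); // keeps time order, drops repeats
        for (LogRecord r : sorted) {
            if (r.source.equals("log1") && !seenInLog2.contains(r.recordId)) {
                missing.add(r.recordId);
            }
        }
        return new ArrayList<>(missing);
    }
}
```

One thing to watch with this design: two matching records that straddle a bucket boundary (say 59s in one log and 61s in the other) land in different buckets and would be reported as missing. A possible mitigation is to also emit each record under the neighbouring bucket key.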

What do you think?

Sincerely,
Mark

On Sun, Jun 26, 2011 at 12:34 AM, Kumar Kandasami <[EMAIL PROTECTED]> wrote:

> Mark -
>
>  A thought on accomplishing this as a MapReduce job: if you add the
> datasource information in the mapper phase, with the record id as the
> key, then in the reducer phase you can look for record ids with a missing
> datasource and print the record id.
>
> Driver Code:
>
>          MultipleInputs.addInputPath(conf, log1path, InputFormat, Log1Mapper);
>          MultipleInputs.addInputPath(conf, log2path, InputFormat, Log2Mapper);
>
> Mapper Phase -
>
>          Output - Key: the record id. Value: the datasource, in addition
>          to the other values.
>          Logic - add the datasource information to the record.
>
> Reduce Phase -
>
>          Output - Print each record id that lacks the log1 or the log2
>          datasource value.
>          Logic - add to the output only records that do not have both the
>          log1 and log2 datasources.
>
>
> Kumar    _/|\_
>
>
> On Sat, Jun 25, 2011 at 11:39 PM, Mark Kerzner <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I have two logs which should have all the records for the same record_id,
> > in other words, if this record_id is found in the first log, it should also
> > be found in the second one. However, I suspect that the second log is
> > filtered out, and I need to find the missing records. Anything is allowed:
> > MapReduce job, Hive, Pig, and even a NoSQL database.
> >
> > Thank you.
> >
> > It is also a good time to express my thanks to all the members of the
> > group who are always very helpful.
> >
> > Sincerely,
> > Mark
> >
>
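
Kumar's suggestion quoted above (tag each record with its datasource in the mapper, then check per record id in the reducer) can be sketched as plain Java without the Hadoop classes. This is only an illustration under assumptions: the "log1"/"log2" labels, the {recordId, source} pair layout, and the in-memory grouping that stands in for the shuffle are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.TreeSet;

class MissingSourceSketch {
    // Assumed datasource labels that the two mappers would attach.
    static final Set<String> EXPECTED_SOURCES =
            new TreeSet<>(Arrays.asList("log1", "log2"));

    // Map + shuffle stand-in: each input pair is {recordId, source};
    // the result collects every source seen for each record id.
    static SortedMap<String, Set<String>> groupSourcesById(List<String[]> tagged) {
        SortedMap<String, Set<String>> sourcesById = new TreeMap<>();
        for (String[] pair : tagged) {
            sourcesById.computeIfAbsent(pair[0], k -> new TreeSet<>()).add(pair[1]);
        }
        return sourcesById;
    }

    // Reduce step: keep only the record ids that lack at least one
    // expected source, mapped to the sources they are missing.
    static SortedMap<String, Set<String>> findIncomplete(
            SortedMap<String, Set<String>> sourcesById) {
        SortedMap<String, Set<String>> incomplete = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : sourcesById.entrySet()) {
            Set<String> missing = new TreeSet<>(EXPECTED_SOURCES);
            missing.removeAll(e.getValue());
            if (!missing.isEmpty()) {
                incomplete.put(e.getKey(), missing);
            }
        }
        return incomplete;
    }
}
```

Feeding it the pairs (r1, log1), (r1, log2), (r2, log1), (r3, log2) reports r2 as missing from log2 and r3 as missing from log1, while r1 is left out because both sources are present.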