-
Comparing two logs, finding missing records
Mark Kerzner 2011-06-26, 04:39
Hi,
I have two logs that should contain the same set of record_ids: if a record_id is found in the first log, it should also be found in the second. However, I suspect that the second log has been filtered, and I need to find the missing records. Anything is allowed: a MapReduce job, Hive, Pig, even a NoSQL database.

Thank you.

This is also a good time to thank all the members of the group, who are always very helpful.

Sincerely,
Mark
-
Re: Comparing two logs, finding missing records
Kumar Kandasami 2011-06-26, 05:34
Mark -
One way to do this as a MapReduce job: in the mapper phase, tag each record with its datasource and emit the record id as the key. In the reducer phase, look for record ids that are missing a datasource and print them.

Driver code:

    MultipleInputs.addInputPath(conf, log1path, InputFormat, Log1Mapper);
    MultipleInputs.addInputPath(conf, log2path, InputFormat, Log2Mapper);

Mapper phase:
    Output: the key is the record id; the value contains the datasource in addition to the other values.
    Logic: add the datasource information to the record.

Reduce phase:
    Output: the record ids that do not have a log1 or a log2 datasource value.
    Logic: add to the output only the records that lack either the log1 or the log2 datasource.

Kumar _/|\_
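The tagging idea above can be sketched outside Hadoop. This is a minimal plain-Python simulation, not Hadoop code: the function names `tag_records` and `find_missing`, the source labels, and the sample ids are all hypothetical and chosen only for illustration. It assumes each log has already been reduced to a list of record ids.

```python
from collections import defaultdict


def tag_records(record_ids, source):
    """Mapper step (simulated): emit (record_id, source) for each record of one log."""
    for record_id in record_ids:
        yield record_id, source


def find_missing(log1_ids, log2_ids):
    """Reducer step (simulated): group sources by record id and keep
    only the ids that were seen in one log but not the other."""
    sources_by_id = defaultdict(set)
    for record_id, source in tag_records(log1_ids, "log1"):
        sources_by_id[record_id].add(source)
    for record_id, source in tag_records(log2_ids, "log2"):
        sources_by_id[record_id].add(source)
    return {rid: srcs for rid, srcs in sources_by_id.items() if len(srcs) < 2}


# Hypothetical sample data: "b" is only in log1, "d" is only in log2.
missing = find_missing(["a", "b", "c"], ["a", "c", "d"])
```

In a real job the grouping by key is done by the shuffle, so the reducer only has to inspect the set of datasource tags it receives for one record id.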
-
Re: Comparing two logs, finding missing records
Mark Kerzner 2011-06-26, 05:53
Kumar,
thank you, that is the exact solution to my problem as I formulated it. It is valid and it stands, but I should have added that both logs carry timestamps, and that we are looking for missing records whose timestamps are in reasonable proximity.

I have come up with a solution where I use the rounded time as the key. In the reducer I sort all records that fall within that rounded time, and after that I am free to find the missing ones, or anything else I want to know about them. What do you think?

Sincerely,
Mark
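The rounded-time idea can be illustrated with a small plain-Python sketch. Everything here is an assumption made for illustration: the 60-second bucket width, the `(timestamp_seconds, record_id)` record shape, and the names `bucket_key` and `missing_by_bucket` do not come from the thread.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed bucket width; the thread does not specify one


def bucket_key(timestamp):
    """Round a timestamp (in seconds) down to the start of its bucket."""
    return timestamp - (timestamp % BUCKET_SECONDS)


def missing_by_bucket(log1, log2):
    """log1 and log2 are iterables of (timestamp_seconds, record_id).
    Returns {bucket_start: [record ids in log1 with no match in log2
    inside the same bucket]}."""
    buckets = defaultdict(list)
    for source, log in (("log1", log1), ("log2", log2)):
        for ts, rid in log:
            buckets[bucket_key(ts)].append((ts, rid, source))

    result = {}
    for key in sorted(buckets):
        records = sorted(buckets[key])  # sort by time within the bucket
        seen_in_log2 = {rid for _, rid, src in records if src == "log2"}
        missing = [rid for _, rid, src in records
                   if src == "log1" and rid not in seen_in_log2]
        if missing:
            result[key] = missing
    return result


# Hypothetical sample data: "b" has no counterpart in log2.
report = missing_by_bucket([(5, "a"), (30, "b"), (70, "c")],
                           [(8, "a"), (75, "c")])
```

One thing to watch with this design: two matching records that straddle a bucket boundary (for example at seconds 59 and 61) land in different buckets and are reported as missing, so neighbouring buckets may need to be checked as well.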
-
Re: Comparing two logs, finding missing records
Bharath Mundlapudi 2011-06-26, 22:12
If you have a SerDe or a Pig loader for your log format, Pig or Hive with a join will probably be a quicker solution.

-Bharath
-
Re: Comparing two logs, finding missing records
Mark Kerzner 2011-06-26, 22:24
Interesting, Bharath, I will look at these.
Mark
-
Re: Comparing two logs, finding missing records
Mark Kerzner 2011-06-27, 00:50
Bharath,
what would a Pig query look like?

Thank you,
Mark
-
Re: Comparing two logs, finding missing records
Bharath Mundlapudi 2011-06-27, 01:04
SQL:
    SELECT * FROM LOG1 LEFT OUTER JOIN LOG2 ON LOG1.recordid = LOG2.recordid;

PIG:
    data = JOIN LOG1 BY recordid LEFT OUTER, LOG2 BY recordid;
    DUMP data;

If you need more Pig help, please post to the Pig mailing list.

-Bharath
-
Re: Comparing two logs, finding missing records
Mark Kerzner 2011-06-27, 02:20
Thank you, Bharath. Tomorrow I will get a reaction to my solution from the person who actually posed the problem to me, and then we will see what details I might have missed.

Mark
-
Re: Comparing two logs, finding missing records
Rajesh Balamohan 2011-06-27, 07:59
I believe you meant:

    SELECT * FROM LOG1 LEFT OUTER JOIN LOG2 ON LOG1.recordid = LOG2.recordid WHERE LOG2.recordid IS NULL;

This produces the set of records that are in LOG1 but not present in LOG2. In Pig, we have to add an additional filter with an "is null" condition.

~Rajesh.B
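The effect of the extra "is null" filter can be checked on a small scale. The sketch below runs the same left outer join in an in-memory SQLite database through Python's standard `sqlite3` module. SQLite stands in for Hive here purely for illustration, and the single-column table layout and sample ids are assumptions.

```python
import sqlite3

# In-memory database with two tiny, hypothetical log tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE LOG1 (recordid TEXT);
    CREATE TABLE LOG2 (recordid TEXT);
    INSERT INTO LOG1 VALUES ('a'), ('b'), ('c');
    INSERT INTO LOG2 VALUES ('a'), ('c');
""")

# Left outer join keeps every LOG1 row; rows with no LOG2 match get NULLs,
# so filtering on "LOG2.recordid IS NULL" leaves only the missing records.
rows = conn.execute("""
    SELECT LOG1.recordid
    FROM LOG1 LEFT OUTER JOIN LOG2 ON LOG1.recordid = LOG2.recordid
    WHERE LOG2.recordid IS NULL
""").fetchall()

missing_ids = [row[0] for row in rows]
conn.close()
```

Without the WHERE clause the join returns all three LOG1 rows, matched or not, which is why the filter is needed to isolate the missing ones.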