Matt Davies
2010-10-22, 03:22
Eric Yang
2010-10-22, 15:48
Ariel Rabkin
2010-10-22, 16:03
Matt Davies
2010-10-22, 16:21
Eric Yang
2010-10-22, 16:46
Matt Davies
2010-10-22, 19:23
Corbin Hoenes
2010-10-23, 00:21
Bill Graham
2010-10-23, 19:59
Corbin Hoenes
2010-10-23, 20:27
Eric Yang
2010-10-24, 03:22
Eric Yang
2010-10-24, 03:33
Eric Yang
2010-10-24, 03:34
Bill Graham
2010-10-25, 23:03
Eric Yang
2010-10-26, 00:28
-
Seeing duplicate entries
Matt Davies 2010-10-22, 03:22
Hey everyone,

I have a situation where I'm seeing duplicated data downstream, before the demux process. It appears to happen during high system loads, and we are still using the 0.3.0 series.

We have validated that there is a single, unique entry in our source file, which then shows up a random number of times before we see it in demux. So it appears that the duplication is happening somewhere between the agent and the collector.

Has anyone else seen this? Any ideas as to why we see this during high system loads, but not during lower loads?

TIA,
Matt
-
Re: Seeing duplicate entries
Eric Yang 2010-10-22, 15:48
Hi Matt,

This is expected in Chukwa archives. When the agent is unable to post to the collector, it retries posting the same data to another collector, or retries with the same collector when no other collector is available. Under high load, the collector may have written the data without properly acknowledging it back to the agent. The Chukwa philosophy is to retry until an acknowledgement is received. Duplicate filtering is handled after the data has been received.

The duplication filtering in Chukwa 0.3.0 depends on loading the data into MySQL. Records with the same primary key update the same row, which removes duplicates. It is possible to build a duplicate detection process prior to demux that filters data based on sequence id + data type + csource (host), but this hasn't been implemented because the primary key update method works well for my use case.

In Chukwa 0.5, we treat duplication the same way as in Chukwa 0.3: any duplicated row in HBase is replaced based on timestamp + HBase row key.

regards,
Eric
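The retry-until-acknowledged behaviour and the sequence id + data type + source filter described above can be sketched roughly as follows. This is a minimal illustration, not Chukwa code: the `Chunk` type, its field names, and the `dedup` function are hypothetical, and the in-memory `seen` set is an assumption made to keep the example short.

```python
from typing import Iterable, Iterator, NamedTuple


class Chunk(NamedTuple):
    """Hypothetical stand-in for a Chukwa chunk; field names are illustrative."""
    source: str      # host that produced the data (csource)
    data_type: str   # e.g. "HADOOP"
    seq_id: int      # sequence id assigned on the agent side
    body: bytes


def dedup(chunks: Iterable[Chunk]) -> Iterator[Chunk]:
    """Yield each chunk once, keyed on (source, data_type, seq_id)."""
    seen = set()
    for chunk in chunks:
        key = (chunk.source, chunk.data_type, chunk.seq_id)
        if key in seen:
            continue  # a retried post of data that was already received
        seen.add(key)
        yield chunk


# The same chunk arrives three times because its acknowledgement was lost twice.
received = [
    Chunk("host1", "HADOOP", 100, b"line A"),
    Chunk("host1", "HADOOP", 100, b"line A"),
    Chunk("host1", "HADOOP", 200, b"line B"),
    Chunk("host1", "HADOOP", 100, b"line A"),
]
unique = list(dedup(received))
print([c.seq_id for c in unique])  # [100, 200]
```

The set of seen keys grows without bound in this sketch; a real filter would presumably need to expire old keys or bound the window it remembers.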
-
Re: Seeing duplicate entries
Ariel Rabkin 2010-10-22, 16:03
On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <[EMAIL PROTECTED]> wrote:
> The duplication filtering in Chukwa 0.3.0 depends on data loading to
> mysql. The same primary key will update to the same row to remove
> duplicates. It is possible to build a duplication detection process
> prior to demux which filter data based on sequence id + data type +
> csource (host), but this hasn't been implemented because primary key
> update method works well for my use case.

This isn't quite right. There is support in 0.3 and later versions for doing de-duplication at the collector, in the manner Eric describes. It works as a filter in the writer pipeline.

You need the following in your configuration:

<property>
  <name>chukwaCollector.writerClass</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
</property>

<property>
  <name>chukwaCollector.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
</property>

See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for background.

--Ari

--
Ari Rabkin [EMAIL PROTECTED]
UC Berkeley Computer Science Department
-
Re: Seeing duplicate entries
Matt Davies 2010-10-22, 16:21
Thank you for the insight.
-
Re: Seeing duplicate entries
Eric Yang 2010-10-22, 16:46
Note that the Dedup filter is only effective with a single collector. If you use multiple collectors, it will not help.

Regards,
Eric
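One way to picture this limitation, as a rough sketch: assume each collector keeps its own private record of what it has already seen (an assumption for illustration; the `Collector` class below is hypothetical and is not the Chukwa implementation). A retry that lands on a different collector is then invisible to the first one.

```python
class Collector:
    """Hypothetical collector holding its own, unshared dedup state."""

    def __init__(self, name):
        self.name = name
        self._seen = set()
        self.written = []

    def receive(self, key, body):
        if key in self._seen:
            return  # duplicate caught by this collector's local filter
        self._seen.add(key)
        self.written.append((key, body))


c1, c2 = Collector("c1"), Collector("c2")
key = ("host1", "HADOOP", 100)

c1.receive(key, b"line A")  # written, but the ack back to the agent is lost
c2.receive(key, b"line A")  # agent fails over and retries on another collector
c2.receive(key, b"line A")  # a further retry on c2 is filtered locally

total = len(c1.written) + len(c2.written)
print(total)  # 2
```

The same chunk ends up written twice overall, once per collector, even though each collector filtered its own repeats.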
-
Re: Seeing duplicate entries
Matt Davies 2010-10-22, 19:23
Eric,

I've been playing out several ideas on where to put the correction in our system. Upon investigation, it seems that two separate demux operations see the duplicate record, so doing some sort of distinct in demux seems unreliable given our use.

It appears you are putting data into a database and using the db to enforce the uniqueness constraint. Do you see any way we could do a dedup operation after demux (within the Chukwa environment) if we write our data straight into HDFS? I could see writing a simple MR job to go and figure this out for me, but it seems very inelegant and introduces more delay before I can use the data.

Any other thoughts?
-
Re: Seeing duplicate entries
Corbin Hoenes 2010-10-23, 00:21
Eric, in Chukwa 0.5 is HBase the final store instead of HDFS? What format will the HBase data be in (e.g. a ChukwaRecord object? Something user configurable?)
-
Re: Seeing duplicate entries
Bill Graham 2010-10-23, 19:59
Eric, I'm also curious about how the HBase integration works. Do you have time to write something up on it? I'm interested in the possibility of extending what's there to write my own custom data into HBase from a collector, while said data also continues through to HDFS as it does currently.
-
Re: Seeing duplicate entries
Corbin Hoenes 2010-10-23, 20:27
+1

I imagine it is just another pipelinable class loaded into the collector? If so, Bill's scenario would work.
-
Re: Seeing duplicate entries
Eric Yang 2010-10-24, 03:22
HBase only supports bytes. What to store in each cell is decided by the demux parser. Chukwa data is currently stored as byte strings by the parsers that I implemented. The user has full control over the data type stored in each HBase column by customizing the demux parser.

regards,
Eric
-
Re: Seeing duplicate entries
Eric Yang 2010-10-24, 03:33
Hi Bill,

I started the documentation on this wiki page:

http://wiki.apache.org/hadoop/Chukwa_Quick_Start

There is an architecture diagram that describes the new setup. Your existing parser should work with Chukwa 0.5, and by adding Chukwa annotations to the parser, it will stream data into the HBase table. I recommend taking a look at the SystemMetrics demux parser; it's a good example to follow for updating your existing parser to work with HBase.

In the default chukwa-collector-conf.xml.template, there is a section for HBase configuration. Uncomment it, and comment out the default SeqFileWriter. Restart the collector, and data should appear in HBase.

regards,
Eric
-
Re: Seeing duplicate entries
Eric Yang 2010-10-24, 03:34
Yes, you are right. It should work automatically after the annotation is added to his demux parser.

regards,
Eric

On Sat, Oct 23, 2010 at 1:27 PM, Corbin Hoenes <[EMAIL PROTECTED]> wrote:
> I imagine it is jst another pipelinable class loaded into the collector? If
> so bill's scenario would work.
-
Re: Seeing duplicate entries
Bill Graham 2010-10-25, 23:03
Thanks Eric, this is helpful. I dug around in the following files and I think I have a handle on what's happening, but I could use some clarifications:

oahc.datacollection.adaptor.SyslogAdaptor
oahc.extraction.demux.processor.mapper.SysLog
oahc.datacollection.writer.hbase.OutputCollector
conf/hbase.schema
conf/chukwa-collector-conf.xml.template

To make sure I'm clear, let me know if this is accurate:

1. SyslogAdaptor sends syslog message byte arrays as the chunk body bound to the dataType for that facility.

2. In the collector configs, this config says to write data to HBase only:

<property>
  <name>chukwaCollector.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
</property>

If I also wanted to write data to HDFS, would I just need to add ",org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter" as a third item in the chain?

3. In the collector configs, all packages beneath the package configured in hbase.demux.package would be checked for the annotated classes (it would be useful to have this also take a comma-separated list at some point for extensibility). What about the data being sent indicates that the SysLog processor should be used?

4. The collector via HBaseWriter writes the data to the SystemMetrics/SysLog table/family in HBase per the annotations. Looking at OutputCollector it appears the following data is set:

- key is taken as the '[source]-[ts]' from the ChukwaRecordKey
- column family seems to be taken as the reduceType (i.e. dataType), but I thought that was set by the annotation in SysLog. Which is it?
- column name/value is every field name and value in the ChukwaRecord.

This last part is throwing me off though, since I can't see where field names and values are set on your ChukwaRecord. Can you clarify? It seems like the record was just the entire byte array payload of the syslog message.

Btw, the documentation is a big help, thanks, but one bit of feedback is that the "Configure Log4j syslog appender" section is confusing w.r.t. which nodes you're speaking of. I assume you're talking about the Hadoop nodes being monitored, but is there anything about this approach that limits this to monitoring Hadoop nodes only? Either way, which nodes are being discussed and which Hadoop cluster needs to be rebooted should be clarified.

thanks,
Bill
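The layout Bill describes in item 4 can be pictured with a small sketch. It assumes his reading is right (row key '[source]-[ts]', column family taken from the reduce type, one column per record field) and that values are stored as byte strings. The `to_cells` function and the dict standing in for a table are hypothetical; neither the HBase client API nor Chukwa's OutputCollector is used here.

```python
def to_cells(source, ts, reduce_type, fields):
    """Map one record to cells addressed by (row key, family, column, timestamp)."""
    row_key = f"{source}-{ts}".encode("utf-8")
    family = reduce_type.encode("utf-8")
    return {
        (row_key, family, name.encode("utf-8"), ts): value.encode("utf-8")
        for name, value in fields.items()
    }


table = {}  # hypothetical in-memory stand-in for an HBase table
record = ("host1", 1287700000000, "SysLog", {"body": "This is a log entry"})

table.update(to_cells(*record))
table.update(to_cells(*record))  # a duplicate delivery lands on the same cell

print(len(table))  # 1
```

Because a duplicate record produces the same row key, column, and timestamp, writing it again overwrites the existing cell rather than adding a new one, which matches the replace-on-duplicate behaviour described earlier in the thread.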
-
Re: Seeing duplicate entries
Eric Yang 2010-10-26, 00:28
Hi Bill,

On 10/25/10 4:03 PM, "Bill Graham" <[EMAIL PROTECTED]> wrote:
> 1. SyslogAdaptor sends syslog message byte arrays as the chunk body
> bound to the dataType for that facility.

Yes. A syslog message looks like this:

<142>This is a log entry

The facility is derived from the leading number (up to 3 digits), which encodes facility number * 8 + severity. The SyslogAdaptor manually maps the 24 existing syslog facilities to data types that make sense to Chukwa. For example, when a message with facility LOCAL1 arrives at a SyslogAdaptor running on port 9095, the adaptor looks up the mapping for that port and facility. Here LOCAL1 maps to HADOOP, so the chunk data is stamped as HADOOP for demux. This mapping is added in chukwa-agent-conf.xml, like this:

<property>
  <name>syslog.adaptor.port.9095.facility.LOCAL1</name>
  <value>HADOOP</value>
</property>

> 2. In the collector configs, this config says to write data to HBase only:
> <property>
>   <name>chukwaCollector.pipeline</name>
>   <value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
> </property>
>
> If I also wanted to write data to HDFS, would I just need to add
> ",org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter" as a
> third item in the chain?

Yes. Make sure writerClass is configured to use PipelineStageWriter:

<property>
  <name>chukwaCollector.writerClass</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
</property>

> 3. In the collector configs, all packages beneath the package
> configured in hbase.demux.package would be checked for the annotated
> classes (it would be useful to have this also take a comma-separated
> list at some point for extensibility). What about the data being sent
> indicates that the SysLog processor should be used?

HBaseWriter reads chukwa-demux-conf.xml if it is available in the collector's conf directory. Hence, the mapping of data type to parser is the same as for demux on HDFS.

> 4. The collector via HBaseWriter writes the data to the
> SystemMetrics/SysLog table/family in HBase per the annotations.
> Looking at OutputCollector it appears the following data is set:
>
> - key is taken as the '[source]-[ts]' from the ChukwaRecordKey
> - column family seems to be taken as the reduceType (i.e. dataType),
> but I thought that was set by the annotation in SysLog. Which is it?
> - column name/value is every field name and value in the ChukwaRecord.
>
> This last part is throwing me off though, since I can't see where
> field names and values are set on your ChukwaRecord. Can you clarify?
> It seems like the record was just the entire byte array payload of the
> syslog message.

The column family is currently set to reduceType. The annotation for the column does nothing at this moment. In the future, it would be nice to have the reduce type map to the annotation. That would make the demux process look more like ORM entity bean code, and I am not sure that is something we want Chukwa to do. It would be nicer to have Apache Gora handle ORM for HBase, so that Chukwa doesn't detour from its original objective.

SystemMetrics writes to the SystemMetrics table. Hadoop logs streamed through SyslogAdaptor are mapped to HADOOP. I have not tested the HADOOP parser to see if Hadoop log processing is working. This is on my TODO list. In theory, it should work. ;)

The annotation in SyslogAdaptor only defines which data type it is; it does not define which parser processes the data. That is done by the demux configuration. I think the default behavior of mapping data type to demux parser probably threw you off into assuming the data is processed by oahc.extraction.demux.processor.mapper.SysLog. Instead, you need to make sure there is configuration in the agent that maps the facility name to the data type of your choice, and configure demux to invoke the proper parser.

Let's say you are sending /var/log/messages with SyslogAdaptor, mapping the facility name to SysLog, and the demux configuration is set to use the SysLog parser. Logs will then appear in the HBase table SystemMetrics, in the SysLog column family, with a column called "body" which contains all your log entries. buildGenericRecord creates a default record with a body field.

There is some cleanup work to do to decouple the entity bean from our parser; then demux will look nice and neat. We should change the serialization of ChukwaRecord to Avro; then it will make a lot of sense and be easier to annotate columns. For now, I have only got the bare minimum working.

Any log file written by SyslogAppender can be streamed over to SyslogAdaptor. The only two required pieces are to write a demux parser which can process your log file, and to map the facility name to the demux parser.

For Hadoop, the modification to log4j.properties should apply to all nodes (namenode, jobtracker, datanode, tasktracker, secondary namenode), so that all logs can be streamed over and processed. However, there is a lot of data, and the current Chukwa parsers are not written to pick up all the details. When log4j.properties is changed, you will need to restart the cluster for the changes to take effect.

Hope this helps.

Regards,
Eric
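As a worked example of the facility lookup: in standard syslog the number in angle brackets is facility * 8 + severity, so `<142>` decodes to facility 17 (LOCAL1) and severity 6. The sketch below decodes that prefix and then looks up a data type using the property name shown in the message. The `decode` function, the `AGENT_CONF` dict, and returning `None` for an unmapped facility are assumptions for illustration, not SyslogAdaptor's actual code.

```python
import re

# Syslog facility names indexed by facility number (0-23); naming of 12-15 varies.
FACILITIES = [
    "KERN", "USER", "MAIL", "DAEMON", "AUTH", "SYSLOG", "LPR", "NEWS",
    "UUCP", "CRON", "AUTHPRIV", "FTP", "NTP", "AUDIT", "ALERT", "CLOCK",
    "LOCAL0", "LOCAL1", "LOCAL2", "LOCAL3", "LOCAL4", "LOCAL5", "LOCAL6", "LOCAL7",
]

# Mirrors the chukwa-agent-conf.xml property from the message; the dict is illustrative.
AGENT_CONF = {"syslog.adaptor.port.9095.facility.LOCAL1": "HADOOP"}

PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def decode(message, port):
    """Return (facility name, severity, mapped data type or None, message text)."""
    match = PRI_RE.match(message)
    if match is None:
        raise ValueError("missing <PRI> prefix")
    facility, severity = divmod(int(match.group(1)), 8)
    if facility >= len(FACILITIES):
        raise ValueError("facility number out of range")
    name = FACILITIES[facility]
    data_type = AGENT_CONF.get(f"syslog.adaptor.port.{port}.facility.{name}")
    return name, severity, data_type, match.group(2)


print(decode("<142>This is a log entry", 9095))
# ('LOCAL1', 6, 'HADOOP', 'This is a log entry')
```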