Re: Seeing duplicate entries
Thanks Eric, this is helpful. I dug around in the following files and
I think I have a handle on what's happening, but I could use some
clarification:

oahc.datacollection.adaptor.SyslogAdaptor
oahc.extraction.demux.processor.mapper.SysLog
oahc.datacollection.writer.hbase.OutputCollector
conf/hbase.schema
conf/chukwa-collector-conf.xml.template
To make sure I'm clear, let me know if this is accurate:

1. SyslogAdaptor sends syslog message byte arrays as the chunk body
bound to the dataType for that facility.
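
For reference, here's my mental model of the facility extraction, as a
minimal sketch based on RFC 3164 PRI parsing rather than the actual
SyslogAdaptor source (the helper name and the mapping step are my own):

// Hypothetical helper, not the real SyslogAdaptor code. Per RFC 3164 a
// message starts with "<PRI>", where facility = PRI / 8 and
// severity = PRI % 8; the facility would then be mapped to a dataType.
static int extractFacility(String msg) {
  int close = msg.indexOf('>');
  int pri = Integer.parseInt(msg.substring(1, close));
  return pri / 8; // e.g. "<134>..." -> facility 16 (LOCAL0)
}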

2. In the collector config, this property says to write data to HBase only:
<property>
<name>chukwaCollector.pipeline</name>
<value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter</value>
</property>

If I also wanted to write data to HDFS, would I just need to add
",org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter" as a
third item in the chain?
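
That is, something like this (my assumption, untested):

<property>
<name>chukwaCollector.pipeline</name>
<value>org.apache.hadoop.chukwa.datacollection.writer.SocketTeeWriter,org.apache.hadoop.chukwa.datacollection.writer.hbase.HBaseWriter,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
</property>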

3. In the collector configs, all packages beneath the package
configured in hbase.demux.package would be checked for the annotated
classes (it would be useful to have this also take a comma-separated
list at some point for extensibility). What about the data being sent
indicates that the SysLog processor should be used?
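
If I'm reading it right, the table binding at least comes from the
class-level annotation, something like this paraphrase of the SysLog
processor (details from memory, so the annotation names and packages
may be off):

// Paraphrased from my reading of the SysLog processor: the annotation
// declares which HBase table and column family the demux output lands
// in. My remaining question is whether the chunk's dataType is what
// selects this processor class by name.
@Tables(annotations = {
    @Table(name = "SystemMetrics", columnFamily = "SysLog")
})
public class SysLog extends AbstractProcessor {
    // parse() builds ChukwaRecords from each syslog chunk
}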

4. The collector via HBaseWriter writes the data to the
SystemMetrics/SysLog table/family in HBase per the annotations.
Looking at OutputCollector it appears the following data is set:

 - key is taken as the '[source]-[ts]' from the ChukwaRecordKey
 - column family seems to be taken as the reduceType (i.e. dataType),
but I thought that was set by the annotation in SysLog. Which is it?
 - column name/value is every field name and value in the ChukwaRecord.

This last part is throwing me off though, since I can't see where
field names and values are set on your ChukwaRecord. Can you clarify?
It seems like the record was just the entire byte array payload of the
syslog message.
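
To illustrate, this is the kind of thing I expected to find in the
processor but didn't (purely hypothetical code, just to show where I'd
expect fields to be set):

// Hypothetical illustration only: each add() would become an HBase
// column name/value pair in the row written by HBaseWriter.
ChukwaRecord record = new ChukwaRecord();
record.add("host", hostname);
record.add("facility", "local0");
record.add("body", logText);
output.collect(key, record);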
Btw, the documentation is a big help, thanks, but one bit of feedback
is that the "Configure Log4j syslog appender" section is confusing
w.r.t. which nodes you're speaking of. I assume you're talking about
the Hadoop nodes being monitored, but is there anything about this
approach that limits it to monitoring Hadoop nodes only? Either way,
which nodes are being discussed and which Hadoop cluster needs to be
restarted should be clarified.
thanks,
Bill
On Sat, Oct 23, 2010 at 8:34 PM, Eric Yang <[EMAIL PROTECTED]> wrote:
> Yes, you are right.  It should work automatically after annotation is
> added to his demux parser.
>
> regards,
> Eric
>
> On Sat, Oct 23, 2010 at 1:27 PM, Corbin Hoenes <[EMAIL PROTECTED]> wrote:
>> +1
>>
>> I imagine it is just another pipelinable class loaded into the collector?  If
>> so, Bill's scenario would work.
>>
>> Sent from my iPhone
>>
>> On Oct 23, 2010, at 12:59 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>>
>>> Eric, I'm also curious about how the HBase integration works. Do you
>>> have time to write something up on it? I'm interested in the
>>> possibility of extending what's there to write my own custom data into
>>> HBase from a collector, while said data also continues through to HDFS
>>> as it does currently.
>>>
>>>
>>> On Fri, Oct 22, 2010 at 5:21 PM, Corbin Hoenes <[EMAIL PROTECTED]>
>>> wrote:
>>>>
>>>> Eric, in Chukwa 0.5 is HBase the final store instead of HDFS?  What format
>>>> will the HBase data be in? (e.g. a ChukwaRecord object? Something user
>>>> configurable?)
>>>>
>>>> Sent from my iPhone
>>>>
>>>> On Oct 22, 2010, at 8:48 AM, Eric Yang <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hi Matt,
>>>>>
>>>>> This is expected in Chukwa archives.  When an agent is unable to post to
>>>>> the collector, it will retry posting the same data to another
>>>>> collector, or retry with the same collector when no other collector is
>>>>> available.  A collector may also have data written without a proper
>>>>> acknowledgement back to the agent in high-load situations.  The Chukwa
>>>>> philosophy is to retry