Re: Seeing duplicate entries
On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <[EMAIL PROTECTED]> wrote:
> Hi Matt,
>
> The duplicate filtering in Chukwa 0.3.0 depends on the data being loaded
> into mysql.  Rows that share the same primary key update the same row,
> which removes duplicates.  It would be possible to build a duplicate
> detection process prior to demux that filters data based on sequence id +
> data type + csource (host), but this hasn't been implemented because the
> primary-key update method works well for my use case.

That last point isn't quite right: there is support in 0.3 and later
versions for doing de-duplication at the collector, in the manner Eric
describes. It works as a filter in the writer pipeline.

You need the following in your configuration:

<property>
  <name>chukwaCollector.writerClass</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
</property>

<property>
  <name>chukwaCollector.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
</property>
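
With both properties set, PipelineStageWriter runs the classes listed in
chukwaCollector.pipeline in order: the Dedup stage filters out duplicate
chunks, and SeqFileWriter then writes the remaining data out as it normally
would.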
 

See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for background
--Ari

--
Ari Rabkin [EMAIL PROTECTED]
UC Berkeley Computer Science Department