Chukwa >> mail # user >> Seeing duplicate entries


Matt Davies 2010-10-22, 03:22
Eric Yang 2010-10-22, 15:48
Re: Seeing duplicate entries
On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <[EMAIL PROTECTED]> wrote:
> Hi Matt,
>
> The duplicate filtering in Chukwa 0.3.0 depends on loading data into
> MySQL.  A record with the same primary key updates the same row, which
> removes duplicates.  It is possible to build a duplicate detection
> process prior to demux that filters data based on sequence id + data
> type + csource (host), but this hasn't been implemented because the
> primary key update method works well for my use case.

This isn't quite right. There is support in 0.3 and later versions for
doing de-duplication at the collector, in the manner Eric describes.
It works as a filter in the writer pipeline.

You need the following in your configuration:

<property>
  <name>chukwaCollector.writerClass</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
</property>

<property>
  <name>chukwaCollector.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
</property>
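
To make the idea of a de-duplication filter concrete, here is a minimal sketch of a stage that drops any chunk whose (source host, data type, sequence id) key has already been seen. It is an illustration in Python only, not Chukwa's actual Dedup code: the `Chunk` and `DedupStage` names, the field names, and the `max_keys` bound are all hypothetical.

```python
from collections import OrderedDict
from typing import Iterable, List, NamedTuple, Tuple


class Chunk(NamedTuple):
    """Hypothetical stand-in for a collected chunk (not Chukwa's Chunk class)."""
    source: str     # originating host
    data_type: str  # e.g. "SysLog"
    seq_id: int     # sequence id assigned by the sender
    data: bytes


class DedupStage:
    """Passes a chunk downstream only the first time its key is seen.

    Memory is bounded: only the `max_keys` most recently seen keys are kept
    (an assumption of this sketch, not a documented Chukwa setting).
    """

    def __init__(self, max_keys: int = 1000) -> None:
        self._seen: "OrderedDict[Tuple[str, str, int], None]" = OrderedDict()
        self._max_keys = max_keys

    def is_new(self, chunk: Chunk) -> bool:
        key = (chunk.source, chunk.data_type, chunk.seq_id)
        if key in self._seen:
            # Duplicate: refresh its recency and tell the caller to drop it.
            self._seen.move_to_end(key)
            return False
        self._seen[key] = None
        if len(self._seen) > self._max_keys:
            # Evict the least recently seen key to keep memory bounded.
            self._seen.popitem(last=False)
        return True

    def filter(self, chunks: Iterable[Chunk]) -> List[Chunk]:
        """Return only the chunks that should go on to the next stage."""
        return [c for c in chunks if self.is_new(c)]
```

With a bounded key set like this, a duplicate that arrives after its key has been evicted is passed through again, so the bound trades memory for how far apart duplicates can be and still be caught.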
 

See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for background
--Ari

--
Ari Rabkin [EMAIL PROTECTED]
UC Berkeley Computer Science Department
Matt Davies 2010-10-22, 16:21
Eric Yang 2010-10-22, 16:46
Matt Davies 2010-10-22, 19:23
Corbin Hoenes 2010-10-23, 00:21
Bill Graham 2010-10-23, 19:59
Corbin Hoenes 2010-10-23, 20:27
Eric Yang 2010-10-24, 03:34
Bill Graham 2010-10-25, 23:03
Eric Yang 2010-10-26, 00:28
Eric Yang 2010-10-24, 03:33
Eric Yang 2010-10-24, 03:22