Chukwa, mail # user - Seeing duplicate entries


Re: Seeing duplicate entries
Ariel Rabkin 2010-10-22, 16:03
On Fri, Oct 22, 2010 at 8:48 AM, Eric Yang <[EMAIL PROTECTED]> wrote:
> Hi Matt,
>
> The duplication filtering in Chukwa 0.3.0 depends on data loading to
> mysql.  The same primary key will update to the same row to remove
> duplicates.  It is possible to build a duplication detection process
> prior to demux which filter data based on sequence id + data type +
> csource (host), but this hasn't been implemented because primary key
> update method works well for my use case.

This isn't quite right. There is support in 0.3 and later versions for
doing de-duplication at the collector, in the manner Eric describes.
It works as a filter in the writer pipeline.

You need the following in your configuration:

<property>
  <name>chukwaCollector.writerClass</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.PipelineStageWriter</value>
</property>

<property>
  <name>chukwaCollector.pipeline</name>
  <value>org.apache.hadoop.chukwa.datacollection.writer.Dedup,org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter</value>
</property>
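
To make the idea of a de-duplicating pipeline stage concrete, here is a rough sketch in Python. It is not Chukwa's code (the real Dedup stage is a Java class) and everything in it is an assumption for illustration: the Chunk record, the DedupStage class, the window parameter, and the choice to remember only a bounded number of recent keys. It only shows the general technique of dropping chunks whose source + data type + sequence id were seen recently, then handing the rest to the next writer.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a collected chunk; not Chukwa's Chunk class."""
    source: str       # host the data came from
    data_type: str
    seq_id: int
    data: bytes = b""


class DedupStage:
    """Drops chunks whose (source, data_type, seq_id) key was seen recently."""

    def __init__(self, next_stage, window=5000):
        self.next_stage = next_stage   # downstream writer: any callable taking a list
        self.window = window           # how many recent keys to remember (assumed)
        self.order = deque()           # keys in arrival order, oldest first
        self.seen = set()              # same keys, for fast membership tests

    def add(self, chunks):
        """Forward unseen chunks downstream; return how many were forwarded."""
        fresh = []
        for c in chunks:
            key = (c.source, c.data_type, c.seq_id)
            if key in self.seen:
                continue               # duplicate within the window: drop it
            self.seen.add(key)
            self.order.append(key)
            if len(self.order) > self.window:
                # Forget the oldest key so memory use stays bounded.
                self.seen.discard(self.order.popleft())
            fresh.append(c)
        if fresh:
            self.next_stage(fresh)
        return len(fresh)


# Usage: a plain list stands in for the final writer in the pipeline.
written = []
stage = DedupStage(written.extend, window=2)
stage.add([Chunk("host1", "SysLog", 100), Chunk("host1", "SysLog", 100)])
stage.add([Chunk("host1", "SysLog", 200), Chunk("host2", "SysLog", 100)])
```

Because the sketch only remembers a bounded window of keys, a duplicate that arrives after its key has aged out would pass through. That trade-off is a property of this sketch, not a statement about how Chukwa behaves.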
 

See http://incubator.apache.org/chukwa/docs/r0.3.0/collector.html for background
--Ari

--
Ari Rabkin [EMAIL PROTECTED]
UC Berkeley Computer Science Department