Re: Simple Archiver , Demux and PostProcessManager about the Raw Data sink file
On Thu, Mar 15, 2012 at 12:36 AM, IvyTang <[EMAIL PROTECTED]> wrote:
> As the wiki says, data in the sink may include duplicate and omitted
> chunks, so we need to demux and archive the raw data sink file.
>
> start-data-processors.sh runs three processes: ChukwaArchiveManager,
> PostProcessorManager, and DemuxManager.
>
> This page, http://incubator.apache.org/chukwa/docs/r0.4.0/dataflow.html,
> explains the data workflow.
>
> First, DemuxManager moves the raw *.done files to
> dataSinkArchives/[yyyyMMdd]/*/*.done.
>
> Then, every half hour or so, ChukwaArchiveManager uses M/R to aggregate
> the data in dataSinkArchives/[yyyyMMdd]/*/*.done into finalArchives/ and
> removes it from dataSinkArchives.
>
> The complete log flow is logs/*.done
> ==> dataSinkArchives/[yyyyMMdd]/*/*.done ==> finalArchives
>
> 1. Here I have a question. According to
> http://incubator.apache.org/chukwa/docs/r0.4.0/programming.html#Using+MapReduce
> (Simple Archiver & Demux), the simple archiver removes the duplicates.
> Does the simple archiver refer to ChukwaArchiveManager?

No, these are separate pieces. Back in the day, I found that
ChukwaArchiveManager was too complicated for my needs, and that I
wanted a simple command that would just archive whatever was in the
sink. And that's the simple archiver. It's found in
org.apache.hadoop.chukwa.extraction.archive.SinkArchiver.
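
To illustrate what "removing duplicates" from the sink means, here is a minimal Python sketch. It is not how SinkArchiver is implemented. The `Chunk` type and the choice of (source, stream, sequence ID) as the identity of a chunk are assumptions made for illustration only.

```python
from typing import Iterable, Iterator, NamedTuple


class Chunk(NamedTuple):
    """Hypothetical stand-in for a sink chunk; the field names are assumptions."""
    source: str   # host that produced the data
    stream: str   # name of the stream (for example, a log file path)
    seq_id: int   # position of the chunk within its stream
    data: bytes


def drop_duplicates(chunks: Iterable[Chunk]) -> Iterator[Chunk]:
    """Yield each chunk once, keeping the first copy seen for each identity key."""
    seen = set()
    for chunk in chunks:
        key = (chunk.source, chunk.stream, chunk.seq_id)  # assumed identity
        if key in seen:
            continue  # a copy of this chunk was already emitted
        seen.add(key)
        yield chunk
```

Omitted chunks cannot be recovered this way; the sketch only discards repeated copies.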

> 3. Can I just run DemuxManager & ChukwaArchiveManager? I found that I
> just need these two components.

Yes, you should be fine with just those if they meet your needs.
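
The first hop of the log flow quoted above (logs/*.done ==> dataSinkArchives/[yyyyMMdd]/*/*.done) can be sketched as a path mapping. This is a hedged illustration of the directory layout only, not Chukwa code: the function name and the `subdir` parameter are hypothetical, and `subdir` stands in for the middle `*` component, whose real naming the thread does not specify.

```python
from datetime import date
from pathlib import PurePosixPath


def staging_path(done_file: str, day: date, subdir: str) -> PurePosixPath:
    """Map logs/<name>.done to dataSinkArchives/<yyyyMMdd>/<subdir>/<name>.done.

    `subdir` is a placeholder for the middle '*' directory (an assumption).
    """
    name = PurePosixPath(done_file).name
    if not name.endswith(".done"):
        raise ValueError(f"not a closed sink file: {done_file}")
    return PurePosixPath("dataSinkArchives") / day.strftime("%Y%m%d") / subdir / name
```

For example, `staging_path("logs/a.done", date(2012, 3, 15), "job1")` gives `dataSinkArchives/20120315/job1/a.done`.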

--
Ari Rabkin [EMAIL PROTECTED]
UC Berkeley Computer Science Department