Chukwa, mail # user - Sink file has omitted chunks?

Re: Sink file has omitted chunks?
Ariel Rabkin 2010-11-23, 07:47
"Omitted chunks" is an error in the docs. By definition, if chunks are omitted
they won't be there. Duplicates and other peculiarities will happen in
the event of failures. As you say, it's a consequence of the
distributed environment.

SimpleArchiver should do the cleanup you want.
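
To make the duplicate case concrete, here is a rough sketch of the kind of cleanup involved. It is not Chukwa code: the Chunk class, its fields, and drop_duplicates are hypothetical names for illustration, and it assumes a chunk can be identified by its source host, stream name, and sequence ID.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a sink-file record; not Chukwa's own class."""
    source: str   # host that produced the data
    stream: str   # name of the log stream, e.g. a file path
    seq_id: int   # position of this chunk within its stream
    data: bytes


def drop_duplicates(chunks: Iterable[Chunk]) -> Iterator[Chunk]:
    """Yield each chunk once, keyed on (source, stream, seq_id)."""
    seen = set()
    for chunk in chunks:
        key = (chunk.source, chunk.stream, chunk.seq_id)
        if key in seen:
            continue  # assumed to be a retransmission after a failure
        seen.add(key)
        yield chunk


# Example: the second record repeats the first, as could happen after a retry.
sink = [
    Chunk("host1", "/var/log/app.log", 100, b"first"),
    Chunk("host1", "/var/log/app.log", 100, b"first"),
    Chunk("host1", "/var/log/app.log", 200, b"second"),
]
unique = list(drop_duplicates(sink))
```

This keeps the first copy of each chunk and preserves input order. It holds every key in memory, so it only illustrates the idea; it says nothing about how SimpleArchiver does the job at scale.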


On Mon, Nov 22, 2010 at 11:39 PM, Ying Tang <[EMAIL PROTECTED]> wrote:
> Hi all,
>     After reading the Chukwa docs, my understanding is that the log data flow
> is:
>     adaptor-->agent-->collector-->sink file--->....
>     The docs say, "Data in the sink may include duplicate and omitted
> chunks."
>     They also do not recommend writing MapReduce jobs that directly examine
> the data sink, "because jobs will likely discard most of their input".
>     Here are my questions:
>     1. Why does data in the sink file include duplicate and omitted chunks? Is it
> because of the distributed environment?
>     2. How can the problem above be solved? The Simple Archiver generates the
> archive file, with duplicates removed. So the Simple Archiver can
> only solve the duplicate data problem, right?
> --
> Best regards,
> Ivy Tang

UC Berkeley Computer Science Department