"Omitted chunks" is an error. By definition, if chunks are omitted
they won't be there. Duplicates and other peculiarities will happen in
the event of failures. As you say, it's a consequence of the
distributed environment. The SimpleArchiver should do the cleanup you want.
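Concretely, the de-duplication amounts to keying each chunk by something
stable and keeping one copy per key. Here's a minimal Java sketch of that
idea; the Chunk record and its fields are simplified stand-ins for
illustration, not the real Chukwa API:

    import java.util.*;

    // Hypothetical stand-in for a Chukwa chunk: the real interface carries
    // more fields, but a (stream name, sequence ID) pair is enough to
    // illustrate keyed de-duplication.
    record Chunk(String streamName, long seqId, String data) {}

    public class DedupSketch {
        // Keep the first chunk seen for each (stream, seqId) key and drop
        // later copies, so chunks re-sent after a failure collapse to one.
        static List<Chunk> dedup(List<Chunk> sink) {
            Map<String, Chunk> byKey = new LinkedHashMap<>();
            for (Chunk c : sink) {
                byKey.putIfAbsent(c.streamName() + "/" + c.seqId(), c);
            }
            return new ArrayList<>(byKey.values());
        }

        public static void main(String[] args) {
            List<Chunk> sink = List.of(
                new Chunk("app.log", 100, "line A"),
                new Chunk("app.log", 100, "line A"), // duplicate after a retry
                new Chunk("app.log", 200, "line B"));
            System.out.println(dedup(sink));         // two chunks survive
        }
    }

Note there's no fix for omitted chunks at this layer: if a chunk never
reached the sink, no amount of post-processing can recover it.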
On Mon, Nov 22, 2010 at 11:39 PM, Ying Tang <[EMAIL PROTECTED]> wrote:
> Hi all,
> After reading the Chukwa docs, my understanding is that the log data flow
> is:
> adaptor --> agent --> collector --> sink file --> ....
> The doc says, "Data in the sink may include duplicate and omitted
> chunks." And it is not recommended to write MapReduce jobs that directly
> examine the data sink, "because jobs will likely discard most of their input".
> Here are my questions:
> 1. Why does the data in the sink file include duplicate and omitted
> chunks? Because of the distributed environment?
> 2. How can the problem above be solved? The Simple Archiver generates the
> archive file, and duplicates have been removed. So the Simple Archiver can
> only solve the duplicate data problem, right?
> Best regards,
> Ivy Tang
Ari Rabkin [EMAIL PROTECTED]
UC Berkeley Computer Science Department