I went through the source code of the Hadoop consumer in contrib. It doesn't
seem to use the previous offset at all, neither in the DataGenerator nor in
the map-reduce stage.
Before I go into the implementation, I can think of two approaches:
1. A ConsumerConnector receiving all the messages continuously and writing
them to HDFS (in this case S3). The problem is that auto-commit is handled
internally, and there is no callback on offset commit that could be used to
upload the file.
2. Wake up every minute, pull all the available data with a SimpleConsumer
into a local file, and upload it to HDFS.
So, which is the better approach?
- Listening continuously vs. pulling in batches
- Using ConsumerConnector (where auto-commit/offsets are handled internally)
vs. SimpleConsumer (which does not use ZooKeeper, so I would need to connect
to each broker myself)
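To make approach 2 concrete, here is a minimal, stdlib-only sketch of the batching and offset-tracking logic. All names here (BatchUploader, the offset file layout, the batch file naming) are hypothetical, and the actual Kafka fetch and S3 upload are left as stubs, since those depend on which consumer API ends up being used:

```python
import json
import os
import time


class BatchUploader:
    """Sketch of approach 2: buffer messages, flush to a local file every
    flush_interval_secs, and persist the last consumed offset so a restart
    resumes where the previous run left off instead of re-reading from 0.
    (Hypothetical helper; Kafka fetch and S3 upload are stubbed out.)"""

    def __init__(self, out_dir, flush_interval_secs=60):
        self.out_dir = out_dir
        self.flush_interval_secs = flush_interval_secs
        self.buffer = []  # (offset, message) pairs waiting to be flushed
        self.last_flush = time.time()
        self.offset_path = os.path.join(out_dir, "offset")

    def load_offset(self):
        # Read the offset left by the previous run; start at 0 if none exists.
        if os.path.exists(self.offset_path):
            with open(self.offset_path) as f:
                return int(f.read().strip())
        return 0

    def add(self, offset, message):
        # Called for each message pulled from Kafka by the (stubbed) fetch loop.
        self.buffer.append((offset, message))
        if time.time() - self.last_flush >= self.flush_interval_secs:
            self.flush()

    def flush(self):
        if not self.buffer:
            return None
        path = os.path.join(self.out_dir, "batch-%d.json" % int(time.time()))
        with open(path, "w") as f:
            json.dump([m for _, m in self.buffer], f)
        # Persist the next offset to fetch only AFTER the data is safely on
        # disk, so a crash between fetch and flush re-reads rather than drops.
        with open(self.offset_path, "w") as f:
            f.write(str(self.buffer[-1][0] + 1))
        # In the real consumer this is where the file would be pushed to S3
        # (e.g. via Hadoop's S3 FileSystem) and the local copy deleted.
        self.buffer = []
        self.last_flush = time.time()
        return path
```

The design choice worth noting is the ordering in flush(): writing the data file before updating the offset file gives at-least-once delivery, which is usually the right trade-off for an archival consumer like this.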
On Thu, Dec 27, 2012 at 8:38 PM, David Arthur <[EMAIL PROTECTED]> wrote:
> I don't think anything like this exists in Kafka (or contrib), but it
> would be a useful addition! Personally, I have written this exact thing at
> previous jobs.
> As for the Hadoop consumer, since there is a FileSystem implementation for
> S3 in Hadoop, it should be possible. The Hadoop consumer works by writing
> out data files containing the Kafka messages alongside offset files which
> contain the last offset read for each partition. If it is re-consuming from
> zero each time you run it, it means it's not finding the offset files from
> the previous run.
> Having used it a bit, the Hadoop consumer is certainly an area that could
> use improvement.
> On 12/27/12 4:41 AM, Pratyush Chandra wrote:
>> I am looking for an S3-based consumer which can write all the received
>> events to an S3 bucket (say, every minute), something similar to Flume.
>> I have tried evaluating the hadoop-consumer in the contrib folder, but it
>> seems to be more for offline processing: it fetches everything from offset
>> 0 at once and replaces it in the S3 bucket.
>> Any help would be appreciated.