I went through the source code of the Hadoop consumer in contrib. It doesn't
seem to be using the previous offset at all, neither in the DataGenerator nor
in the map-reduce stage.
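To make concrete what "using the previous offset" would look like, here is a minimal sketch (my own illustration, not code from contrib): checkpoint the last consumed offset to a small file so a restarted job resumes where it left off instead of refetching everything. The class and file names are assumptions.

```java
import java.io.IOException;
import java.nio.file.*;

// Hypothetical sketch: persist the last consumed offset so a restarted
// consumer job can resume from it. Not the contrib consumer's actual code.
public class OffsetCheckpoint {
    private final Path path;

    public OffsetCheckpoint(Path path) { this.path = path; }

    // Returns the saved offset, or 0 if no checkpoint exists yet.
    public long load() throws IOException {
        if (!Files.exists(path)) return 0L;
        return Long.parseLong(Files.readString(path).trim());
    }

    // Write to a temp file first, then replace, so a crash mid-write
    // never leaves a corrupt checkpoint behind.
    public void save(long offset) throws IOException {
        Path tmp = path.resolveSibling(path.getFileName() + ".tmp");
        Files.writeString(tmp, Long.toString(offset));
        Files.move(tmp, path, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempDirectory("ckpt").resolve("topic-0.offset");
        OffsetCheckpoint ckpt = new OffsetCheckpoint(p);
        System.out.println("start offset: " + ckpt.load());
        ckpt.save(4096L);
        System.out.println("after save: " + ckpt.load());
    }
}
```

On the first run load() returns 0; after save(4096L) a restarted job would pick up at 4096.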

Before I go into the implementation, I can think of 2 ways:
1. A ConsumerConnector receiving all the messages continuously and then
writing them to HDFS (in this case S3). The problem is that autocommit is
handled internally, and there is no callback hook on offset commit that
could be used to upload the file.
2. Wake up every minute, pull all the data with a SimpleConsumer into a
local file, and put it to HDFS.
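The second approach could be sketched roughly like this (all names here are my own assumptions, not the Kafka API — fetching and uploading are stubbed so only the flow is shown): each interval, fetch whatever accumulated since the last offset, spool it to a local file, upload the file, and only then advance the offset.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.List;

// Hypothetical sketch of the batch approach. Fetcher stands in for a
// SimpleConsumer fetch and Uploader for the HDFS/S3 put; both are stubs.
public class BatchUploader {
    interface Fetcher { List<String> fetch(long fromOffset); }
    interface Uploader { void upload(Path localFile); }

    private long offset = 0;

    public void runOnce(Fetcher fetcher, Uploader uploader) throws IOException {
        List<String> batch = fetcher.fetch(offset);
        if (batch.isEmpty()) return;               // nothing new this interval
        Path local = Files.createTempFile("batch-", ".log");
        Files.write(local, batch);                 // spool the batch locally first
        uploader.upload(local);                    // then push the file out
        offset += batch.size();                    // advance only after the upload
    }

    public static void main(String[] args) throws IOException {
        BatchUploader job = new BatchUploader();
        job.runOnce(from -> List.of("msg-" + from, "msg-" + (from + 1)),
                    file -> System.out.println("uploaded batch file"));
        System.out.println("next offset: " + job.offset);
    }
}
```

In a real job, runOnce would be driven by something like a ScheduledExecutorService at one-minute intervals; advancing the offset only after a successful upload is what keeps a crash from silently dropping a batch.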

So, which is the better approach?
- Listening continuously vs. pulling in batches
- Using a ConsumerConnector (where auto-commit/offsets are handled
internally) vs. a SimpleConsumer (which does not use ZooKeeper, so I need to
connect to each broker individually)


On Thu, Dec 27, 2012 at 8:38 PM, David Arthur <[EMAIL PROTECTED]> wrote:

Pratyush Chandra
