So the hadoop consumer does use the latest offset: it reads it from the
'input' directory in the record reader.

We have a heavily modified version of the hadoop consumer that reads and
writes offsets to zookeeper [much like the scala consumers], and this works.

FWIW we also use the hadoop consumer to write to S3 without any issues,
much like any ordinary mapreduce job, and it's pretty solid. We run our job
every 10-30 minutes.

Maybe also interesting: we used to use Flume [0.9], and we find the kafka
method of consuming to be much better during S3 networking issues. With
flume, if you 'push' to S3 and something goes wrong, it can fall over and
you can fairly easily lose data. With the hadoop kafka consumer the mapper
just fails and is retried, which is a little wasteful (you're reading the
records twice), but generally great.
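
To make the retry behaviour concrete, here is a rough Python sketch of the
idea. It is not the actual hadoop consumer code: OffsetStore, run_task and
upload are made-up names, the "log" is just a list, and the store stands in
for wherever the offset really lives (the 'input' directory or zookeeper).
The only point is that the offset is written after a successful upload, so
a failed attempt re-reads the same records instead of losing them.

```python
class OffsetStore:
    """Hypothetical stand-in for the 'input' directory or a zookeeper node."""

    def __init__(self):
        self._offsets = {}

    def read(self, partition):
        # A partition with no stored offset starts from 0.
        return self._offsets.get(partition, 0)

    def write(self, partition, offset):
        self._offsets[partition] = offset


def run_task(log, partition, store, upload, max_attempts=3):
    """Upload everything after the stored offset; commit only on success.

    Returns the number of attempts used. `upload` is any callable that
    raises OSError when the push (e.g. to S3) fails.
    """
    start = store.read(partition)
    for attempt in range(1, max_attempts + 1):
        batch = log[start:]  # every attempt re-reads from the same offset
        try:
            upload(batch)
        except OSError:
            continue  # offset untouched, so nothing is lost; just retry
        store.write(partition, start + len(batch))
        return attempt
    raise RuntimeError("upload failed after %d attempts" % max_attempts)
```

If upload fails once and then succeeds, the batch is read twice and the
offset only moves forward after the second attempt, which is the "a little
wasteful but safe" trade-off described above.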
On Fri, Dec 28, 2012 at 1:56 PM, Pratyush Chandra <
[EMAIL PROTECTED]> wrote:
Foursquare | Software Engineer | Server Engineering Team
[EMAIL PROTECTED] | @rathboma <http://twitter.com/rathboma>