-Re: S3 Consumer
Russell Jurney 2012-12-28, 23:46
Would you please contribute this to open source? What you've written
has been asked for many times. FWIW, I would immediately incorporate
it into my book, Agile Data.
Russell Jurney http://datasyndrome.com
On Dec 28, 2012, at 8:06 AM, Liam Stewart <[EMAIL PROTECTED]> wrote:
> We have a tool that reads data continuously from brokers and then writes
> files to S3. A MR job didn't make sense for us given our current size and
> volume. We have one instance running right now and could add more by if
> needed, adjusting which instance reads from which brokers/topics/...
> Unfortunately, part of the implementation is tied to our internals so I
> can't open source it at this point. The idea is roughly:
> - one simple consumer per broker / topic / partition; reads data in batches
> and writes to local temp files; reads are done in parallel
> - files are finalized after a given size or age (to better handle low
> volume topics) and then written to S3. all uploads are done in a separate
> thread pool (using the aws transfer manager) and don't block reads from
> kafka unless too many files get backed up due to either problems with S3 or
> upload speed
> - after a file is written, the next offset to read is written to zookeeper
> For me, part of the implementation was as an exercise to get experience
> with zookeeper so that was one reason for using the lower-level API and
> handling offset tracking etc myself.
> On Fri, Dec 28, 2012 at 6:56 AM, Pratyush Chandra <
> [EMAIL PROTECTED]> wrote:
>> I went through the source code of Hadoop consumer in contrib. It doesn't
>> seem to be using previous offset at all. Neither in Data Generator or in
>> Map reduce stage.
>> Before I go into the implementation, I can think of 2 ways :
>> 1. A consumerconnector receiving all the messages continuously, and then
>> writing it to HDFS (in this case S3). Problem is autocommit is handled
>> internally, and there is no handler function while committing offset, which
>> can be used to upload file.
>> 2. Wake up every one minute, pull all the data using simple consumer into a
>> local file and put to HDFS.
>> So, what is better approach ?
>> - Listen continuously vs in batch
>> - Use consumerconnector (where auto commit/offsets are handled internally)
>> vs simple consumer (which doesnot use zk, so I need to connect to each
>> broker individually)
>> On Thu, Dec 27, 2012 at 8:38 PM, David Arthur <[EMAIL PROTECTED]> wrote:
>>> I don't think anything exists like this in Kafka (or contrib), but it
>>> would be a useful addition! Personally, I have written this exact thing
>>> previous jobs.
>>> As for the Hadoop consumer, since there is a FileSystem implementation
>>> S3 in Hadoop, it should be possible. The Hadoop consumer works by writing
>>> out data files containing the Kafka messages along side offset files
>>> contain the last offset read for each partition. If it is re-consuming
>>> zero each time you run it, it means it's not finding the offset files
>>> the previous run.
>>> Having used it a bit, the Hadoop consumer is certainly an area that could
>>> use improvement.
>>> On 12/27/12 4:41 AM, Pratyush Chandra wrote:
>>>> I am looking for a S3 based consumer, which can write all the received
>>>> events to S3 bucket (say every minute). Something similar to Flume
>>>> I have tried evaluating hadoop-consumer in contrib folder. But it seems
>>>> be more for offline processing, which will fetch everything from offset
>>>> at once and replace it in S3 bucket.
>>>> Any help would be appreciated ?
>> Pratyush Chandra
> Liam Stewart :: [EMAIL PROTECTED]