Kafka user mailing list: S3 Consumer


Thread:
  Pratyush Chandra    2012-12-27, 09:42
  David Arthur        2012-12-27, 15:08
  Pratyush Chandra    2012-12-28, 13:57
  Matthew Rathbone    2012-12-28, 14:32
  Pratyush Chandra    2012-12-28, 15:22
  Liam Stewart        2012-12-28, 16:06
Re: S3 Consumer
Would you please contribute this to open source? What you've written
has been asked for many times. FWIW, I would immediately incorporate
it into my book, Agile Data.

Russell Jurney http://datasyndrome.com

On Dec 28, 2012, at 8:06 AM, Liam Stewart <[EMAIL PROTECTED]> wrote:

> We have a tool that reads data continuously from brokers and then writes
> files to S3. An MR job didn't make sense for us given our current size and
> volume. We have one instance running right now and could add more if
> needed, adjusting which instance reads from which brokers/topics/...
> Unfortunately, part of the implementation is tied to our internals so I
> can't open source it at this point. The idea is roughly:
>
> - one simple consumer per broker / topic / partition; reads data in batches
> and writes to local temp files; reads are done in parallel
> - files are finalized after a given size or age (to better handle low
> volume topics) and then written to S3. All uploads are done in a separate
> thread pool (using the AWS transfer manager) and don't block reads from
> Kafka unless too many files get backed up due to either problems with S3
> or upload speed
> - after a file is written, the next offset to read is written to zookeeper
>
> For me, part of the implementation was an exercise to get experience
> with ZooKeeper, so that was one reason for using the lower-level API and
> handling offset tracking etc. myself.
>
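A minimal sketch of the roll-and-upload step described above, assuming the AWS SDK for Java's TransferManager and the I0Itec ZkClient; the class name, thresholds, and ZooKeeper path are hypothetical:

    import java.io.File;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import com.amazonaws.services.s3.transfer.TransferManager;
    import com.amazonaws.services.s3.transfer.Upload;
    import org.I0Itec.zkclient.ZkClient;

    public class S3FileRoller {
        private static final long MAX_BYTES  = 64L * 1024 * 1024; // roll at 64 MB...
        private static final long MAX_AGE_MS = 60_000;            // ...or after one minute

        private final TransferManager tx = new TransferManager();
        private final ExecutorService uploaders = Executors.newFixedThreadPool(4);
        private final ZkClient zk = new ZkClient("localhost:2181");

        /** Finalize the temp file once it is big or old enough, then upload off-thread. */
        public void maybeRoll(File tempFile, long createdAtMs, long nextOffset,
                              String bucket, String key, String offsetPath) {
            boolean bigEnough = tempFile.length() >= MAX_BYTES;
            boolean oldEnough = System.currentTimeMillis() - createdAtMs >= MAX_AGE_MS;
            if (!bigEnough && !oldEnough) return;

            uploaders.submit(() -> {
                try {
                    Upload upload = tx.upload(bucket, key, tempFile);
                    upload.waitForCompletion(); // blocks this worker, not the Kafka reads
                    // Only after the file is safely in S3, record the next offset to read
                    // (the ZooKeeper node is assumed to exist already).
                    zk.writeData(offsetPath, Long.toString(nextOffset));
                    tempFile.delete();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
    }

A bounded executor queue (rather than the unbounded one newFixedThreadPool uses) would give the back-pressure on Kafka reads mentioned above.
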
> On Fri, Dec 28, 2012 at 6:56 AM, Pratyush Chandra <
> [EMAIL PROTECTED]> wrote:
>
>> I went through the source code of the Hadoop consumer in contrib. It
>> doesn't seem to use the previous offset at all, neither in the
>> DataGenerator nor in the map-reduce stage.
>>
>> Before I go into the implementation, I can think of 2 ways:
>> 1. A ConsumerConnector receiving all the messages continuously and then
>> writing them to HDFS (in this case S3). The problem is that autocommit is
>> handled internally, and there is no handler function invoked when an
>> offset is committed that could be used to upload the file.
>> 2. Wake up every minute, pull all the data using a SimpleConsumer into a
>> local file, and put it to HDFS.
>>
>> So, which is the better approach?
>> - Listen continuously vs. in batches
>> - Use a ConsumerConnector (where autocommit/offsets are handled
>> internally) vs. a SimpleConsumer (which does not use ZooKeeper, so I need
>> to connect to each broker individually)
>>
>> Pratyush
>>
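A minimal sketch of approach 1 above, assuming the 0.8-style high-level consumer API in which autocommit can be disabled and offsets committed manually; the topic, group, and batch size are made up:

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;

    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;
    import kafka.message.MessageAndMetadata;

    public class S3ConsumerLoop {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("zookeeper.connect", "localhost:2181");
            props.put("group.id", "s3-consumer");
            props.put("auto.commit.enable", "false"); // commit only after a successful upload

            ConsumerConnector connector =
                Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
            KafkaStream<byte[], byte[]> stream = connector
                .createMessageStreams(Collections.singletonMap("events", 1))
                .get("events").get(0);

            List<byte[]> batch = new ArrayList<byte[]>();
            for (MessageAndMetadata<byte[], byte[]> record : stream) {
                batch.add(record.message());
                if (batch.size() >= 10000) {
                    uploadBatchToS3(batch);     // hypothetical: write to a file, put to S3
                    connector.commitOffsets();  // offsets advance only after the upload
                    batch.clear();
                }
            }
        }

        private static void uploadBatchToS3(List<byte[]> batch) { /* omitted */ }
    }

With auto.commit.enable=false, the explicit commitOffsets() call effectively becomes the missing handler: offsets only advance once the file upload has succeeded.
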
>> On Thu, Dec 27, 2012 at 8:38 PM, David Arthur <[EMAIL PROTECTED]> wrote:
>>
>>> I don't think anything exists like this in Kafka (or contrib), but it
>>> would be a useful addition! Personally, I have written this exact thing
>>> at previous jobs.
>>>
>>> As for the Hadoop consumer, since there is a FileSystem implementation
>>> for S3 in Hadoop, it should be possible. The Hadoop consumer works by
>>> writing out data files containing the Kafka messages alongside offset
>>> files which contain the last offset read for each partition. If it is
>>> re-consuming from zero each time you run it, that means it's not finding
>>> the offset files from the previous run.
>>>
>>> Having used it a bit, the Hadoop consumer is certainly an area that could
>>> use improvement.
>>>
>>> HTH,
>>> David
>>>
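The offset-file pattern described above might look roughly like the following, assuming Hadoop's FileSystem API (which also resolves s3n:// paths); the directory layout is invented for illustration and is not the contrib consumer's actual format:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class OffsetFiles {
        // Where the last-read offset for one partition is kept (layout is made up).
        private static final Path OFFSET = new Path("s3n://bucket/kafka/offsets/events-0");

        /** Resume point: the saved offset, or 0 if no previous run left a file. */
        public static long readLastOffset(Configuration conf) throws IOException {
            FileSystem fs = OFFSET.getFileSystem(conf);
            if (!fs.exists(OFFSET)) return 0L; // no offset file: consume from the beginning
            FSDataInputStream in = fs.open(OFFSET);
            try {
                return in.readLong();
            } finally {
                in.close();
            }
        }

        /** Written next to the data files after each successful run. */
        public static void writeLastOffset(Configuration conf, long offset) throws IOException {
            FileSystem fs = OFFSET.getFileSystem(conf);
            FSDataOutputStream out = fs.create(OFFSET, true); // overwrite the old offset file
            try {
                out.writeLong(offset);
            } finally {
                out.close();
            }
        }
    }

A job that reconsumes from zero on every run is, in these terms, always hitting the missing-file branch in readLastOffset, which matches the diagnosis above.
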
>>>
>>> On 12/27/12 4:41 AM, Pratyush Chandra wrote:
>>>
>>>> Hi,
>>>>
>>>> I am looking for an S3-based consumer which can write all the received
>>>> events to an S3 bucket (say every minute), something similar to the
>>>> Flume HDFS sink:
>>>> http://flume.apache.org/FlumeUserGuide.html#hdfs-sink
>>>> I have tried evaluating the hadoop-consumer in the contrib folder, but
>>>> it seems to be more for offline processing, fetching everything from
>>>> offset 0 at once and replacing it in the S3 bucket.
>>>> Any help would be appreciated.
>>>>
>>>>
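For comparison, the Flume HDFS sink referenced above is configured along these lines (a sketch: agent, channel, and bucket names are made up, and rollInterval = 60 gives the once-a-minute behavior):

    agent.sinks.toS3.type = hdfs
    agent.sinks.toS3.channel = memCh
    # An s3n:// path works here because Hadoop ships a FileSystem implementation for it
    agent.sinks.toS3.hdfs.path = s3n://bucket/events
    agent.sinks.toS3.hdfs.rollInterval = 60
    agent.sinks.toS3.hdfs.rollSize = 0
    agent.sinks.toS3.hdfs.rollCount = 0

Setting rollSize and rollCount to 0 disables size- and count-based rolling, so files roll purely on the 60-second interval.
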
>>>
>>
>>
>> --
>> Pratyush Chandra
>>
>
>
>
> --
> Liam Stewart :: [EMAIL PROTECTED]

 
  Chetan Conikee      2012-12-29, 06:15