Kafka user mailing list: S3 Consumer


Pratyush Chandra 2012-12-27, 09:42
David Arthur 2012-12-27, 15:08
Pratyush Chandra 2012-12-28, 13:57
Matthew Rathbone 2012-12-28, 14:32
Hi Matthew,

I may be doing something wrong.

I cloned the code at
https://github.com/apache/kafka/tree/trunk/contrib/hadoop-consumer

I am running the following:
- ./run-class.sh kafka.etl.impl.DataGenerator test/test.properties, which
generates a /tmp/kafka/data/1.dat file containing:
Dump tcp://localhost:9092       atlas-topic1    0       -1 to /tmp/kafka/data/1.dat
- ./run-class.sh kafka.etl.impl.SimpleKafkaETLJob test/test.properties
It says:
Using offset range [0, 42649]
Connected to node tcp://localhost:9092 beginning reading at offset 0 latest offset=42649
Then I run it again:
- ./run-class.sh kafka.etl.impl.SimpleKafkaETLJob test/test.properties
It says:
Using offset range [0, 42759]
Connected to node tcp://localhost:9092 beginning reading at offset 0 latest offset=42759

My test.properties uses the local file system for input/output, for testing:

kafka.etl.topic=topic1
hdfs.default.classpath.dir=/tmp/kafka/lib
event.count=1000
hadoop.job.ugi=kafka,hadoop
kafka.server.uri=tcp://localhost:9092
input=/tmp/kafka/data
output=/tmp/kafka/output
kafka.request.limit=-1
client.buffer.size=1048576
client.so.timeout=60000
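
Note that the record reader takes its starting offset from whatever is under input=/tmp/kafka/data (as Matthew points out below), and here that is just the 1.dat written by DataGenerator, which would explain why both runs report a range starting at 0. A minimal sketch for double-checking what the job will see there, using the Hadoop FileSystem API (the path comes from the properties above; the class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints what the ETL job will find under the configured input directory.
public class ListKafkaEtlInput {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus status : fs.listStatus(new Path("/tmp/kafka/data"))) {
            System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
        }
    }
}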

Pratyush
On Fri, Dec 28, 2012 at 8:02 PM, Matthew Rathbone <[EMAIL PROTECTED]> wrote:

> So the hadoop consumer does use the latest offset; it reads it from the
> 'input' directory in the record reader.
>
> We have a heavily modified version of the hadoop consumer that
> reads/writes offsets to zookeeper [much like the scala consumers], and
> this works great.
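
For illustration, a minimal sketch of keeping per-partition offsets in ZooKeeper with the plain ZooKeeper client. The znode layout and class name are assumptions made for the example, not what Matthew's modified consumer actually does:

import java.nio.charset.StandardCharsets;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Stores the last consumed offset per topic/partition under a chosen root znode.
public class ZkOffsetStore {
    private final ZooKeeper zk;
    private final String root; // e.g. "/hadoop-consumer/offsets" (illustrative)

    public ZkOffsetStore(ZooKeeper zk, String root) {
        this.zk = zk;
        this.root = root;
    }

    private String path(String topic, int partition) {
        return root + "/" + topic + "/" + partition;
    }

    // Returns the stored offset, or defaultOffset when nothing has been written yet.
    public long read(String topic, int partition, long defaultOffset)
            throws KeeperException, InterruptedException {
        String p = path(topic, partition);
        if (zk.exists(p, false) == null) {
            return defaultOffset;
        }
        return Long.parseLong(new String(zk.getData(p, false, null), StandardCharsets.UTF_8));
    }

    // Writes the offset, creating the node and its parents on first use.
    public void write(String topic, int partition, long offset)
            throws KeeperException, InterruptedException {
        byte[] data = Long.toString(offset).getBytes(StandardCharsets.UTF_8);
        StringBuilder node = new StringBuilder();
        String[] parts = path(topic, partition).substring(1).split("/");
        for (int i = 0; i < parts.length; i++) {
            node.append('/').append(parts[i]);
            if (zk.exists(node.toString(), false) == null) {
                zk.create(node.toString(), i == parts.length - 1 ? data : new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
            } else if (i == parts.length - 1) {
                zk.setData(node.toString(), data, -1); // -1 matches any version
            }
        }
    }
}

A mapper or record reader would call read() before fetching and write() only after a batch has been safely written out to HDFS/S3.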
>
> FWIW we also use the hadoop consumer to write to S3 without any issues,
> much like any ordinary mapreduce job, and it's pretty solid. We run our job
> every 10-30 minutes.
>
> Maybe also interesting: we used to use Flume [0.9], and we find the kafka
> method of consuming to be much better during s3 networking issues. With
> flume, if you 'push' to s3 but something goes wrong, it can fall over and
> you can fairly easily lose data; with the hadoop kafka consumer the
> mapper just fails over and tries again, which is a little wasteful
> (you're reading the records twice) but generally great.
>
>
>
> On Fri, Dec 28, 2012 at 1:56 PM, Pratyush Chandra <
> [EMAIL PROTECTED]> wrote:
>
> > I went through the source code of the Hadoop consumer in contrib. It
> > doesn't seem to be using the previous offset at all, neither in the
> > DataGenerator nor in the map-reduce stage.
> >
> > Before I go into the implementation, I can think of 2 ways:
> > 1. A ConsumerConnector receiving all the messages continuously and then
> > writing them to HDFS (in this case S3). The problem is that autocommit
> > is handled internally, and there is no handler function invoked when
> > the offset is committed that could be used to upload the file.
> > 2. Wake up every one minute, pull all the data using SimpleConsumer
> > into a local file, and put it to HDFS (sketched below).
> >
> > So, what is the better approach?
> > - Listen continuously vs. in batches
> > - Use ConsumerConnector (where auto commit/offsets are handled
> > internally) vs. SimpleConsumer (which does not use zk, so I need to
> > connect to each broker individually)
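
For what it's worth, a rough sketch of the second approach: a periodic pull with SimpleConsumer, writing each batch to a Hadoop FileSystem (HDFS or an s3n:// path). The broker, topic, and directories are taken from earlier in the thread; the 0.7-era javaapi calls are written from memory and offset persistence is omitted, so treat those details as assumptions:

import java.io.OutputStream;
import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import kafka.api.FetchRequest;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.javaapi.message.ByteBufferMessageSet;
import kafka.message.MessageAndOffset;

// Wakes up every minute, fetches one batch from a single broker/partition,
// and writes the message payloads to a new file on the configured FileSystem.
public class PeriodicPuller {
    public static void main(String[] args) throws Exception {
        SimpleConsumer consumer = new SimpleConsumer("localhost", 9092, 60000, 1048576);
        FileSystem fs = FileSystem.get(new Configuration());
        long offset = 0L; // a real job would restore this, e.g. from ZooKeeper

        while (true) {
            ByteBufferMessageSet batch =
                consumer.fetch(new FetchRequest("topic1", 0, offset, 1048576));
            Path out = new Path("/tmp/kafka/output/batch-" + System.currentTimeMillis());
            OutputStream os = fs.create(out);
            for (MessageAndOffset mo : batch) {
                ByteBuffer payload = mo.message().payload();
                byte[] bytes = new byte[payload.remaining()];
                payload.get(bytes);
                os.write(bytes);
                os.write('\n');
                offset = mo.offset(); // offset to resume from on the next fetch
            }
            os.close();
            Thread.sleep(60 * 1000); // "wake up every one minute"
        }
    }
}

The trade-off against option 1 is exactly the one raised above: with SimpleConsumer you manage offsets and broker connections yourself, but you control when an offset is considered committed relative to the upload.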
> >
> > Pratyush
> >
> > On Thu, Dec 27, 2012 at 8:38 PM, David Arthur <[EMAIL PROTECTED]> wrote:
> >
> > > I don't think anything exists like this in Kafka (or contrib), but it
> > > would be a useful addition! Personally, I have written this exact
> > > thing at previous jobs.
> > >
> > > As for the Hadoop consumer, since there is a FileSystem implementation
> > > for S3 in Hadoop, it should be possible. The Hadoop consumer works by
> > > writing out data files containing the Kafka messages alongside offset
> > > files which contain the last offset read for each partition. If it is
> > > re-consuming from zero each time you run it, it means it's not finding
> > > the offset files from the previous run.
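
If the updated offset files end up under the output directory, one way to let the next run resume is to carry them back into the input directory between runs. A sketch with the Hadoop FileSystem API; the 'offset' file-name filter and the assumption about where the job writes these files are guesses, so check the job's actual output first:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Copies anything that looks like an offset file from the last run's output
// back into the input directory, so the next run does not start from 0.
public class CarryOffsetsForward {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path output = new Path("/tmp/kafka/output");
        Path input = new Path("/tmp/kafka/data");

        for (FileStatus status : fs.listStatus(output)) {
            String name = status.getPath().getName();
            if (name.contains("offset")) { // naming convention assumed, not verified
                FileUtil.copy(fs, status.getPath(), fs, new Path(input, name),
                              false /* deleteSource */, conf);
            }
        }
    }
}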
> > >
> > > Having used it a bit, the Hadoop consumer is certainly an area that
> > > could use improvement.
> > >
> > > HTH,
> > > David
> > >
> > >
> > > On 12/27/12 4:41 AM, Pratyush Chandra wrote:

Pratyush Chandra

 
Liam Stewart 2012-12-28, 16:06
Russell Jurney 2012-12-28, 23:46
Chetan Conikee 2012-12-29, 06:15