Re: Hadoop Consumer
Answer inlined...

--
Felix

On Fri, Jun 29, 2012 at 9:24 PM, Murtaza Doctor
<[EMAIL PROTECTED]> wrote:

> Had a few questions around the Hadoop Consumer.
>
> - We have event data under the topic "foo" written to the kafka
> Server/Broker in avro format and want to write those events to HDFS. Does
> the Hadoop consumer expect the data written to HDFS already?
No, it doesn't expect the data to already be written into HDFS... there
wouldn't be much point to it otherwise, no? ;)
> Based on the
> doc looks like the DataGenerator is pulling events from the broker and
> writing to HDFS. In our case we only wanted to utilize the
> SimpleKafkaETLJob to write to HDFS.
That's what it does: it spawns a (map-only) MapReduce job that pulls in
parallel from the broker(s) and writes that data into HDFS.
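
Just to make the "map-only" part concrete, here's a rough, generic sketch of
that pattern in vanilla Hadoop. This is not the actual SimpleKafkaETLJob
source, and the class and path names are made up; in the real job the input
format is what pulls messages from Kafka rather than reading files.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyToHdfs {

  // Pass-through mapper: writes out whatever the input format hands it.
  // In the real job, the input format is the piece that reads from Kafka.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      context.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only-to-hdfs");
    job.setJarByClass(MapOnlyToHdfs.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0); // map-only: mappers write straight to the output dir
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // hypothetical input
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
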
> I am surely missing something here?
>

Maybe...? I don't know. Do tell if anything is still unclear...!
> - Is there a version of consumer which appends to an existing file on HDFS
> until it reaches a specific size?
>

No, there isn't, as far as I know. Potential solutions to this would be:

   1. Leave the data in the broker long enough for it to reach the size you
   want. Running the SimpleKafkaETLJob at those intervals would give you the
   file size you want. This is the simplest thing to do, but the drawback is
   that your data in HDFS will be less real-time.
   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll up
   / compact your small files into one bigger file. You would need to come up
   with the Hadoop job that does the roll up, or find one somewhere; there's
   a rough sketch of that idea right after this list.
   3. Don't use the SimpleKafkaETLJob at all and write a new job that makes
   use of Hadoop append instead...
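
To give an idea of what option 2's roll-up could look like, here's a minimal
sketch using the plain HDFS FileSystem API. The paths are made up, and note
that raw byte concatenation only makes sense for formats that tolerate it
(e.g. newline-delimited text); since your data is Avro, you'd want to merge
at the record level instead (avro-tools has a concat command for that, if I
remember correctly).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RollUpSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path smallFilesDir = new Path(args[0]); // e.g. the ETL job's output directory
    Path mergedFile = new Path(args[1]);    // e.g. /data/foo/merged-2012-06-29

    try (FSDataOutputStream out = fs.create(mergedFile, false)) {
      for (FileStatus status : fs.listStatus(smallFilesDir)) {
        String name = status.getPath().getName();
        if (!status.isFile() || name.startsWith("_") || name.startsWith(".")) {
          continue; // skip subdirectories and marker files like _SUCCESS
        }
        try (FSDataInputStream in = fs.open(status.getPath())) {
          // Copy this small file's bytes onto the end of the merged file;
          // 'false' keeps both streams under try-with-resources control.
          IOUtils.copyBytes(in, out, conf, false);
        }
      }
    }
  }
}

If you'd rather go the append route of option 3, FileSystem.append() gives you
an output stream on an existing file, but depending on your Hadoop version,
append may need to be enabled and has historically been less reliable.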

Also, you may be interested in taking a look at these scripts I posted a
while ago: http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/
If you follow the links in that post, you can get more details about how the
scripts work and why it was necessary to do the things they do... or you can
just use them without reading. They should work pretty much out of the box...

>
> Thanks,
> murtaza
>
>