

Reading Kafka directly from Pig?
I've thrown together a Pig LoadFunc to read data from Kafka, so you
could load data like:

QUERY_LOGS = load 'kafka://localhost:9092/logs.query#8' using
com.mycompany.pig.KafkaAvroLoader('com.mycompany.Query');

The path part of the URI is the Kafka topic, and the fragment is the
number of partitions. The implementation I have creates one input
split per partition. Offsets aren't really handled at this point -
it's a rough prototype.
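
For illustration, here's a stripped-down sketch of just the URI-handling
part (the class and names below are placeholders, not the actual
LoadFunc code): it parses a location of the form
kafka://host:port/topic#numPartitions and builds one split descriptor
per partition.

import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class KafkaLocationParser {

    // Simple value object standing in for one input split per partition.
    public static final class PartitionSplit {
        final String broker;
        final String topic;
        final int partition;

        PartitionSplit(String broker, String topic, int partition) {
            this.broker = broker;
            this.topic = topic;
            this.partition = partition;
        }

        @Override
        public String toString() {
            return broker + "/" + topic + "[" + partition + "]";
        }
    }

    public static List<PartitionSplit> parse(String location) {
        URI uri = URI.create(location);
        if (!"kafka".equals(uri.getScheme())) {
            throw new IllegalArgumentException("Expected a kafka:// URI, got: " + location);
        }
        String broker = uri.getHost() + ":" + uri.getPort();     // e.g. localhost:9092
        String topic = uri.getPath().replaceFirst("^/", "");      // path part = topic
        int numPartitions = Integer.parseInt(uri.getFragment());  // fragment = partition count

        List<PartitionSplit> splits = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            splits.add(new PartitionSplit(broker, topic, p));
        }
        return splits;
    }

    public static void main(String[] args) {
        // Prints 8 split descriptors for the example location above.
        for (PartitionSplit s : parse("kafka://localhost:9092/logs.query#8")) {
            System.out.println(s);
        }
    }
}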

Anyone have thoughts on whether or not this is a good idea? I know the
usual pattern is kafka -> hdfs -> mapreduce. If I'm only reading this
data from Kafka once, is there any reason I can't skip writing to HDFS?

Thanks!
-David
