I've thrown together a Pig LoadFunc to read data from Kafka, so you could load data like:
QUERY_LOGS = load 'kafka://localhost:9092/logs.query#8' using com.mycompany.pig.KafkaAvroLoader('com.mycompany.Query');
The path part of the URI is the Kafka topic, and the fragment is the number of partitions. In my implementation, it makes one input split per partition. Offsets aren't really handled at this point - it's a rough prototype.
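As a minimal sketch of that URI convention (path as topic, fragment as partition count) - the class and method names here are hypothetical, not from the actual prototype:

```java
import java.net.URI;

// Illustrative parser for locations like "kafka://localhost:9092/logs.query#8".
// KafkaUriInfo is an invented name for this sketch.
public class KafkaUriInfo {
    public final String host;
    public final int port;
    public final String topic;      // URI path, minus the leading slash
    public final int partitions;    // URI fragment; one input split per partition

    private KafkaUriInfo(String host, int port, String topic, int partitions) {
        this.host = host;
        this.port = port;
        this.topic = topic;
        this.partitions = partitions;
    }

    public static KafkaUriInfo parse(String location) {
        URI uri = URI.create(location);
        String topic = uri.getPath().replaceFirst("^/", "");
        int partitions = Integer.parseInt(uri.getFragment());
        return new KafkaUriInfo(uri.getHost(), uri.getPort(), topic, partitions);
    }
}
```

A LoadFunc's setLocation/getInputFormat could hand these values to the InputFormat, which would then emit one split per partition.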
Anyone have thoughts on whether or not this is a good idea? I know the usual pattern is: kafka -> hdfs -> mapreduce. If I'm only reading this data from Kafka once, is there any reason why I can't skip writing to HDFS?
Right now it only terminates if SimpleConsumer hits its timeout, so in theory it can run forever. To bound the InputFormat, I would probably add a max time or a max number of messages to consume (in addition to the timeout).
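One way that bounding could look - a small helper the RecordReader checks after each message, in addition to SimpleConsumer's own timeout. The names are invented for this sketch, not taken from the prototype:

```java
// Hypothetical stop condition: terminate after either a maximum number
// of messages or a maximum elapsed time, whichever comes first.
public class ConsumeBound {
    private final long maxTimeMs;
    private final long maxMessages;

    public ConsumeBound(long maxTimeMs, long maxMessages) {
        this.maxTimeMs = maxTimeMs;
        this.maxMessages = maxMessages;
    }

    // The reader's nextKeyValue() would call this after each fetched
    // message and return false once the bound is reached.
    public boolean reached(long messagesSoFar, long elapsedMs) {
        return messagesSoFar >= maxMessages || elapsedMs >= maxTimeMs;
    }
}
```

Keeping the timeout as a third condition still matters: with a quiet topic, neither bound above would trip, and the consumer would otherwise block indefinitely.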
I started by looking at the Camus code, but it was easier to whip up a simple InputFormat for testing. If this becomes a real thing, I would probably figure out how to utilize Camus.
I'd be happy to, if and when it becomes a real thing. It's still very alpha quality right now.
On 8/7/13 10:58 AM, Russell Jurney wrote: