Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Kafka >> mail # user >> Reading Kafka directly from Pig?


Copy link to this message
-
Re: Reading Kafka directly from Pig?
Right now it only terminates if SimpleConsumer hits the timeout. So, in
theory it can forever. To bound the InputFormat, I would probably add a
max time or max number of messages to consume (in addition to the timeout).

I started by looking at the Camus code, but it was easier to whip up a
simple InputFormat for testing. If this becomes a real thing, I would
probably figure out how to utilize Camus.

-David

On 8/7/13 10:49 AM, Jun Rao wrote:
> David,
>
> That's interesting. Kafka provides an infinite stream of data whereas Pig
> works on a finite amount of data. How did you solve the mismatch?
>
> Thanks,
>
> Jun
>
>
> On Wed, Aug 7, 2013 at 7:41 AM, David Arthur <[EMAIL PROTECTED]> wrote:
>
>> I've thrown together a Pig LoadFunc to read data from Kafka, so you could
>> load data like:
>>
>> QUERY_LOGS = load 'kafka://localhost:9092/logs.**query#8' using
>> com.mycompany.pig.**KafkaAvroLoader('com.**mycompany.Query');
>>
>> The path part of the uri is the Kafka topic, and the fragment is the
>> number of partitions. In the implementation I have, it makes one input
>> split per partition. Offsets are not really dealt with at this point - it's
>> a rough prototype.
>>
>> Anyone have thoughts on whether or not this is a good idea? I know usually
>> the pattern is: kafka -> hdfs -> mapreduce. If I'm only reading from this
>> data from Kafka once, is there any reason why I can't skip writing to HDFS?
>>
>> Thanks!
>> -David
>>
 
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB