Kafka >> mail # user >> ETL with Kafka


Guy Doulberg 2013-01-06, 07:49
David Arthur 2013-01-06, 22:29
Russell Jurney 2013-01-07, 07:00
Guy Doulberg 2013-01-07, 07:12
Ken Krugler 2013-01-07, 17:57
Russell Jurney 2013-01-07, 20:48
Ken Krugler 2013-01-07, 21:51
Russell Jurney 2013-01-07, 22:06
Re: ETL with Kafka

On Jan 7, 2013, at 2:05pm, Russell Jurney wrote:

> I previously posted a link to contrib in this thread.

Thanks, I missed that - all I saw was the long URL to the Talend integration doc on Hortonworks.

> No, it's not a Cascading Tap. It's a complete job: one to read Kafka events
> to HDFS, one to generate Kafka events from HDFS. ETL can happen in between.

Some Cascading integration notes, just for posterity:

Having a Kafka Tap/Scheme would make integration easy. I see there are KafkaInputFormat and KafkaOutputFormat classes in the contrib, which is great - though these would have to be back-ported to the older Hadoop APIs in order to work with Cascading. Also, Cascading sends all data around as the key (the value is always NullWritable), whereas the Kafka input/output formats do the opposite.
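
To make the key/value mismatch concrete, here is a rough sketch of the adapter step a Tap/Scheme would need. It deliberately uses no Hadoop, Cascading, or Kafka classes: Record and Empty are hypothetical stand-ins (Empty plays the role of NullWritable), and the assumption that the Kafka formats use a placeholder key is mine, based only on "do the opposite" above. Treat it as an illustration of the swap, not as the contrib code's behavior.

```java
import java.util.ArrayList;
import java.util.List;

public class KeyValueLayoutSketch {

    /** Hypothetical stand-in for one Hadoop key/value record. */
    static final class Record<K, V> {
        final K key;
        final V value;

        Record(K key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Hypothetical stand-in for NullWritable: one shared instance, no data. */
    static final class Empty {
        static final Empty INSTANCE = new Empty();

        private Empty() {}

        @Override
        public String toString() {
            return "(null)";
        }
    }

    /** Payload-in-value layout (assumed for the Kafka formats) to payload-in-key (Cascading). */
    static <T> Record<T, Empty> toKeyLayout(Record<Empty, T> in) {
        return new Record<T, Empty>(in.value, Empty.INSTANCE);
    }

    /** The reverse direction, for writing Cascading output back through a value-oriented format. */
    static <T> Record<Empty, T> toValueLayout(Record<T, Empty> in) {
        return new Record<Empty, T>(Empty.INSTANCE, in.key);
    }

    public static void main(String[] args) {
        // Two made-up events as they might arrive from a value-oriented input format.
        List<Record<Empty, String>> fromKafka = new ArrayList<Record<Empty, String>>();
        fromKafka.add(new Record<Empty, String>(Empty.INSTANCE, "event-1"));
        fromKafka.add(new Record<Empty, String>(Empty.INSTANCE, "event-2"));

        for (Record<Empty, String> r : fromKafka) {
            Record<String, Empty> forCascading = toKeyLayout(r);
            System.out.println(forCascading.key + " -> " + forCascading.value);
        }
    }
}
```

In a real integration this swap would live inside the Scheme's source and sink methods rather than in a separate pass over the data.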

-- Ken

> On Jan 7, 2013 1:51 PM, "Ken Krugler" <[EMAIL PROTECTED]> wrote:
>
>> Hi Russell,
>>
>> On Jan 7, 2013, at 12:48pm, Russell Jurney wrote:
>>
>>> Just to be clear - a Kafka 'Tap' of sorts exists in contrib: it scans
>>> Hadoop records, which may be ETL'd first, and emits new Kafka events.
>>
>> Can you point me at the code?
>>
>> And just to confirm, you're talking about a Cascading Tap, right?
>>
>> -- Ken
>>
>>> On Mon, Jan 7, 2013 at 9:57 AM, Ken Krugler <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi Guy,
>>>>
>>>> On Jan 6, 2013, at 11:11pm, Guy Doulberg wrote:
>>>>
>>>>> Hi,
>>>>> Thanks David,
>>>>>
>>>>> I am looking for a product (open source or not), something like Talend
>>>>> or Pentaho, in which I can design the ETL (from and to Kafka) and run
>>>>> the ETL in Storm/IronCount, or maybe even in Hadoop Map/Reduce.
>>>>
>>>> Interesting - we build ETLs on top of Hadoop using Cascading (open source
>>>> workflow API), which has a lot of what it calls "Taps" for connecting to
>>>> data sources and sinks.
>>>>
>>>> But I haven't heard of a Kafka Tap. Should be possible to implement,
>>>> though.
>>>>
>>>> One issue is that Hadoop is batch oriented, so there's a bit of an
>>>> impedance mismatch when you've got a streaming data source, but from
>>>> experience it's possible to get that to work.
>>>>
>>>> -- Ken
>>>>
>>>>> The product should be complete and support many connections to many
>>>>> data sources and targets. In that sense, if you know of a connection to
>>>>> Talend or Pentaho, that would be great.
>>>>>
>>>>> Thanks again.
>>>>>
>>>>>
>>>>> On 01/07/2013 12:28 AM, David Arthur wrote:
>>>>>> Storm has support for Kafka, if that's the sort of thing you're looking
>>>>>> for. Maybe you could describe your use case a bit more?
>>>>>>
>>>>>> On Sunday, January 6, 2013, Guy Doulberg wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> I am looking for an ETL tool that can connect to Kafka, as a consumer
>>>>>>> and as a producer.
>>>>>>>
>>>>>>> Have you heard of such a tool?
>>>>>>>
>>>>>>> Thanks
>>>>>>> Guy
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>> --------------------------
>>>> Ken Krugler
>>>> +1 530-210-6378
>>>> http://www.scaleunlimited.com
>>>> custom big data solutions & training
>>>> Hadoop, Cascading, Cassandra & Solr
>>>
>>>
>>> --
>>> Russell Jurney twitter.com/rjurney [EMAIL PROTECTED] datasyndrome.com
>>
>> --------------------------------------------
>> http://about.me/kkrugler
>> +1 530-210-6378
>>

 