Kafka, mail # dev - LinkedIn's Kafka->Hadoop ETL pipeline is open source


LinkedIn's Kafka->Hadoop ETL pipeline is open source
Jay Kreps 2013-01-07, 23:01
Hey All,

There has been interest in getting something a little more sophisticated
than the Input- and OutputFormat we include in contrib for reading Kafka
data into HDFS.

Internally at LinkedIn we have had a pretty sophisticated system that we
use for Kafka ETL. It automatically discovers topics, does date
partitioning, balances load for many topics, etc. We have wanted to open
source this for a while but haven't really had time to spend on it. This
code is now open source:
  https://github.com/linkedin/camus
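
To give a rough idea of what "date partitioning" and "balancing load" mean
here, below is a minimal sketch. It is not taken from the Camus code: the
directory layout, the function names (partition_path, balance) and the greedy
strategy are all assumptions made for illustration only.

```python
from datetime import datetime, timezone
import heapq


def partition_path(base_dir, topic, timestamp_ms):
    """Map a message timestamp to an hourly output directory.

    The base/topic/YYYY/MM/DD/HH layout is a hypothetical example,
    not necessarily the layout Camus writes.
    """
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{base_dir}/{topic}/{ts:%Y/%m/%d/%H}"


def balance(partition_sizes, num_tasks):
    """Greedy assignment: give the largest remaining partition to the
    task that currently has the least data (an assumed strategy)."""
    heap = [(0, i) for i in range(num_tasks)]  # (current load, task index)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_tasks)]
    ordered = sorted(partition_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, size in ordered:
        load, task = heapq.heappop(heap)
        assignment[task].append(name)
        heapq.heappush(heap, (load + size, task))
    return assignment
```

For example, balance({"a-0": 50, "b-0": 30, "c-0": 20, "d-0": 10}, 2) puts
a-0 and d-0 on one task and b-0 and c-0 on the other, so the two tasks end up
with 60 and 50 units of data respectively.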

Ken Goodhope is the lead for this system. If you have any questions there
is a mailing list here:
  [EMAIL PROTECTED]

We haven't done much packaging work on this yet, so documentation is thin
and it takes a bit of work to get set up. So it is probably most
appropriate for people who would be taking a "white box" approach to the
code. We have had interest from a few groups in contributing and we are
definitely interested in recruiting this kind of help. All our own
development going forward will be done off the public github repo, as usual
with LinkedIn open source projects.

Until we get better docs up, you can get a pretty good high-level overview
of our setup from this paper:
  http://sites.computer.org/debull/A12june/pipeline.pdf

-Jay