I haven't been reading the list for the past couple of weeks, as I've been
quite busy, but I searched and didn't find any discussions related to my
current issue, so I thought I'd ask while I'm still investigating on my own.
We've been running a Kafka 0.7.0 cluster without problem for a while now.
I played around <http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/> with
importing data from our Kafka cluster into Hadoop a while ago, using
the simple Kafka consumer located in the contrib directory of the Kafka
source, and that worked properly. At the time, the Hadoop cluster I was
running was CDH3u3, IIRC.
I'm now revisiting that project with a brand new CDH4.1.2 Hadoop cluster
(using MR1, not YARN), and I'm having difficulty getting it to work.
At first, the run-class.sh script in kafka/contrib/hadoop-consumer wasn't
using the proper hadoop jars to connect to my cluster, so I tweaked it so
that it includes the output of the `hadoop classpath` command in its
classpath. It's now able to connect to my hadoop cluster, but it's telling
me that the versions don't match:
    Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC version 7 cannot communicate with client version 3
        at $Proxy0.getProtocolVersion(Unknown Source)
... (I can post the whole stack trace if you want, but I didn't think
it was really relevant...)
So anyway, I've messed around with the
kafka/project/build/KafkaProject.scala file so that it uses the
"2.0.0-mr1-cdh4.1.2" version of hadoop-core, and fetches it from the
cloudera repo. I've added the cloudera repo by adding this line at the
beginning of the HadoopConsumerProject class section:
val clouderaRepo = "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
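For reference, the change amounts to roughly the sketch below, inside the HadoopConsumerProject class (sbt 0.7-style project definition). The `hadoop` val name is an assumption about how the existing hadoop-core dependency is declared in that file, so adjust it to match whatever is actually there:

```scala
// Inside class HadoopConsumerProject in kafka/project/build/KafkaProject.scala

// Extra resolver so that sbt can find the CDH artifacts
val clouderaRepo = "Cloudera" at "https://repository.cloudera.com/artifactory/cloudera-repos/"

// hadoop-core pinned to the CDH4.1.2 MR1 build
// (the val name `hadoop` is assumed; this replaces the existing hadoop-core line)
val hadoop = "org.apache.hadoop" % "hadoop-core" % "2.0.0-mr1-cdh4.1.2"
```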
When I run ./sbt update, it fetches the new jars correctly, but then, when
I run ./sbt package, it's not able to find a bunch of Hadoop-related
classes and packages in the hadoop-consumer code, which I guess means that
a few APIs have changed between the two versions of CDH.
I've tried this on the 0.7.0 branch of Kafka (from the Apache git repo) as
well as on the 0.7.2 branch, and I get the same result on both (I can't
successfully run ./sbt package). The easiest option for me would be to get it
to work on Kafka 0.7.0, but I guess I could persuade my people to upgrade to
0.7.2 if it's necessary (I'd like us to upgrade, but I guess you all know
how it is... getting a working system to change is a political hassle). I
don't think we'd be willing to move to Kafka 0.8 just yet, so hopefully
that won't be necessary.
*TL;DR: Is anyone pumping data from Kafka 0.7.x to CDH4.x? And if so, how?
Using the example consumer from Kafka's contrib, or another one?* Perhaps this
(I'll probably give
it a try soon, BTW, so I'll keep you guys posted...). I may also try
porting the hadoop-consumer contrib to CDH4.
Finally, I haven't seen anything mentioned about the LinkedIn
kafka/avro/hadoop ETL stuff we've been hearing about for a while. I saw the
new LinkedIn DataFu stuff but it seems unrelated. Are there any updates
about whether or when the ETL code would get open sourced? As far as we're
concerned, we're using avro quite a bit, so in our case, the avro coupling
would definitely not be an issue. I don't know what version(s) of hadoop
LinkedIn is running, though, so perhaps their stuff wouldn't work out of
the box with CDH4 either...
Any advice would be appreciated!
Thanks :) !