Kafka >> mail # user >> How to use the hadoop consumer in distributed mode?

Re: How to use the hadoop consumer in distributed mode?

I wanted to give a little update on this topic.

I was able to make hadoop-consumer work with a kafka cluster.

What I did is:

   1. I generated a .properties file for one of the kafka brokers I wanted
   to connect to.
   2. I ran the DataGenerator program, passing the .properties file as an
   argument.
   3. I moved the 1.dat offset file generated in HDFS so that it has another
   name (so that it's not overwritten the next time I run the DataGenerator).
   4. I changed the broker's address in the .properties file to the next
   server I wanted to connect to.
   5. I repeated steps 2 through 4 for every kafka server in the cluster.
   6. I then ran SimpleKafkaETLJob, and it was able to spawn one map task per
   broker and pull all the data from each.
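The repetitive steps above could be scripted roughly like this. Broker
addresses, the topic name, the HDFS path, and the property keys are my
assumptions (modeled on the sample test.properties in contrib/hadoop-consumer),
and the DataGenerator / hadoop / ETL job invocations are only echoed as a
sketch rather than executed:

```shell
#!/bin/sh
# Sketch of the per-broker routine: generate a .properties file per
# broker, run the DataGenerator against it, then rename the resulting
# offset file so the next run does not overwrite it.
BROKERS="broker1:9092 broker2:9092 broker3:9092"  # hypothetical hosts
OFFSET_DIR="/tmp/kafka/input"                     # hypothetical HDFS dir
i=1
for broker in $BROKERS; do
  # Steps 1 and 4: write a .properties file pointing at this broker.
  cat > "broker-$i.properties" <<EOF
kafka.etl.topic=test
kafka.server.uri=tcp://$broker
input=$OFFSET_DIR
EOF
  # Step 2: generate the offset file (1.dat) for this broker in HDFS.
  echo "run: java kafka.etl.impl.DataGenerator broker-$i.properties"
  # Step 3: rename 1.dat so the next DataGenerator run can't clobber it.
  echo "run: hadoop fs -mv $OFFSET_DIR/1.dat $OFFSET_DIR/broker-$i.dat"
  i=$((i + 1))
done
# Step 6: one ETL job maps over all the offset files and pulls from
# every broker at once.
echo "run: java kafka.etl.impl.SimpleKafkaETLJob broker-1.properties"
```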

This is almost exactly what I was trying before, except that before, I had
manually modified the .dat offset files instead of generating each one with
the DataGenerator. Those offset files are binary SequenceFiles, so I suspect
editing them in vim corrupted them... I don't know.

Anyhow, what I'm doing now is a little convoluted but at least it works... I
will create a script that does all this repetitive stuff for me. Ideally, I
would also like to pull the brokers list from ZK, like you guys do.
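On pulling the broker list from ZK: brokers in this Kafka version register
themselves under /brokers/ids in ZooKeeper, so a script could list that znode
and drive the loop from it. Here is only a sketch under those assumptions; the
zkCli.sh call is commented out and replaced with canned output so it is
self-contained, and the znode payload format should be double-checked against
your cluster:

```shell
#!/bin/sh
# Sketch: discover brokers from ZooKeeper instead of hard-coding them.
ZK="zk1:2181"   # hypothetical ZooKeeper connect string
# A real run would ask ZooKeeper for the registered broker ids, e.g.:
#   ids_raw=$(zkCli.sh -server "$ZK" ls /brokers/ids | tail -1)
ids_raw="[0, 1, 2]"   # canned zkCli-style output for the sketch
# zkCli prints the children as "[0, 1, 2]"; strip brackets and commas.
ids=$(echo "$ids_raw" | tr -d '[],')
for id in $ids; do
  # Each /brokers/ids/<id> znode holds the broker's host and port;
  # a real script would parse that and emit one .properties file per id.
  echo "run: zkCli.sh -server $ZK get /brokers/ids/$id"
done
```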

The Kafka/Hadoop ETL tools you mentioned are no doubt more mature and
complete than the stuff I will create, so it would be really nice if you
could release them.

I think releasing those tools would help drive the adoption of Kafka,
because in its current state, Kafka is not really plug-and-play. That
is, it works (which is already better than a lot of open source projects out
there ;) !), but a rather important part seems to be missing.


On Tue, Oct 18, 2011 at 7:31 PM, Hisham Mardam-Bey <[EMAIL PROTECTED]> wrote:

> Hi folks, I've been following this thread. Felix and I are working
> together on this project; we really like Kafka and are moving it into
> production very soon.
> Jay, quick question: would you guys consider releasing the code in a "not
> so clean" state and having the community (we would like to help) shore it
> up so it becomes usable by the masses, or are there other issues (legal?)
> you have to sort out first?
> Thanks!
> hisham.
> On Tue, Oct 18, 2011 at 6:28 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> > I would actually love for us to release the full ETL system we have for
> > Kafka/Hadoop; it is just a matter of finding the time to get this code
> > into that shape.
> >
> > The hadoop team that maintains that code is pretty busy right now, but I
> > am hoping we can find a way.
> >
> > -Jay
> >
> > On Tue, Oct 18, 2011 at 3:18 PM, Felix Giguere Villegas <
> > [EMAIL PROTECTED]> wrote:
> >
> >> Thanks for your replies guys :)
> >>
> >> @Jay: I thought about the Hadoop version mismatch too, because I've had
> >> the same problem before. I'll double check again to make sure I have the
> >> same versions of hadoop everywhere, as the Kafka distributed cluster I
> >> was testing on is a new setup and I might have forgotten to put the
> >> hadoop jars we use in it... I'm working part-time for now, so I probably
> >> won't touch this again until next week, but I'll keep you guys posted
> >> ASAP :)
> >>
> >> @Richard: Thanks a lot for your description. That clears up the
> >> inaccuracies in my understanding. Is there any chance you guys might
> >> release the code you use to query ZK and create appropriate offset
> >> files for each broker/partition pair? The hadoop consumer provided in
> >> the source works with the setup we get from the quickstart guide, but
> >> the process you describe seems more appropriate for production use.
> >>
> >> Thanks again :)
> >>
> >> --
> >> Felix
> >>
> >>
> >>
> >> On Tue, Oct 18, 2011 at 5:52 PM, Richard Park <[EMAIL PROTECTED]>
> >> wrote:
> >>
> >> > Does the version in contrib contain the fixes for KAFKA-131? The
> >> > offsets were incorrectly computed prior to this patch.
> >> >
> >> > At LinkedIn, this is what we do in a nutshell.