Re: Incremental Hadoop + SimpleKafkaETLJob
Felix GV 2012-01-25, 16:36
Yeah those shell scripts are basically the continuation of what I was doing
in my last blog posts. I planned to make new blog posts about them but I
just never got around to it. Then I saw your message and it gave me the
little kick in the arse I needed to at least gist those things :) ...
Hopefully, it can save you some time :) !
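In case it helps picture the flow discussed in the quoted messages below, here is a rough Python sketch of the file shuffling between two runs: data files go to an archive, and the freshly written offsets become the input of the next run. It is not what the gist does line for line. The function name `rotate_run`, the `offsets` file-name prefix, and the use of local directories in place of HDFS paths are all assumptions made for illustration.

```python
import shutil
from pathlib import Path


def rotate_run(output_dir, archive_dir, next_input_dir, offsets_prefix="offsets"):
    """Archive a run's part-* files and stage its offset files for the next run.

    Assumption: offset files in the job output start with `offsets_prefix`.
    Local directories stand in for HDFS paths in this sketch.
    """
    output_dir = Path(output_dir)
    archive_dir = Path(archive_dir)
    next_input_dir = Path(next_input_dir)

    offsets = sorted(p for p in output_dir.iterdir()
                     if p.name.startswith(offsets_prefix))
    if not offsets:
        # A failed run leaves no offsets; keep the previous input untouched.
        raise RuntimeError(f"no offset files found in {output_dir}")

    # Move the consumed data out of the job output directory.
    archive_dir.mkdir(parents=True, exist_ok=True)
    parts = sorted(output_dir.glob("part-*"))
    for part in parts:
        shutil.move(str(part), str(archive_dir / part.name))

    # Replace the old offsets only now that the new ones are known to exist.
    if next_input_dir.exists():
        shutil.rmtree(next_input_dir)
    next_input_dir.mkdir(parents=True)
    for offset_file in offsets:
        shutil.move(str(offset_file), str(next_input_dir / offset_file.name))

    return len(parts), len(offsets)
```

Passing a different `archive_dir` for each run (for example one named after the run's timestamp) keeps identically named part files from overwriting each other.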
On Wed, Jan 25, 2012 at 3:30 AM, Paul Ingles <[EMAIL PROTECTED]> wrote:
> Thanks Felix- I found your blog posts before and it really helped me
> figure out how to get things working so I'll definitely give the shell
> scripts a run.
> On 24 Jan 2012, at 19:05, Felix GV wrote:
> > Hello :)
> > For question 1:
> > The hadoop consumer in the contrib directory has almost everything it needs
> > to do distributed incremental imports out of the box, but it requires a lot
> > of hand holding.
> > I've created two scripts to automate the process. One of them generates
> > initial offset files, and the other does incremental hadoop consumption.
> > I personally use a cron job to periodically call the incremental consumer
> > script with specific parameters (for topic and HDFS path output).
> > You can find all of the required files in this gist:
> > https://gist.github.com/1671887
> > The LinkedIn guys promised to release their full Hadoop/Kafka ETL code
> > eventually but I think they didn't have time to get around to it yet. When
> > they do release it, it's probably going to be better than my scripts, but
> > for now, I think those scripts are the only publicly available way to do
> > this stuff without writing it yourself.
> > I don't know about questions 2 and 3.
> > I hope this helps :) !
> > --
> > Felix
> > On Tue, Jan 24, 2012 at 3:24 AM, Paul Ingles <[EMAIL PROTECTED]> wrote:
> >> Hi,
> >> I'm investigating using Kafka and would really appreciate getting some
> >> more experienced opinion on the way things work together.
> >> Our application instances are creating Protocol Buffer serialized messages
> >> and pushing them to topics in Kafka:
> >> * Web log requests
> >> * Product details viewed
> >> * Search performed
> >> * Email registered
> >> etc...
> >> I would like to be able to perform incremental loads from these topics
> >> into HDFS and then into the rest of the batch processing. I guess I had a few
> >> broad questions:
> >> 1) How do people trigger the batch loads? Do you just point your
> >> SimpleKafkaETLJob input to the previous run's outputted offset file? Do you
> >> move files between runs of the SimpleKafkaETLJob- move the part-* files to
> >> one place and move the offsets into an input directory ready for the next
> >> run?
> >> 2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper
> >> outputs Long/Text writables and is marked as deprecated (this is in the
> >> source). Is there an alternative class that should be used instead, or is
> >> the hadoop-consumer being deprecated overall?
> >> 3) Given the SimpleKafkaETLMapper reads bytes in but outputs Text lines,
> >> are most people using Kafka for passing text messages around or using binary
> >> data etc.?
> >> Thanks,
> >> Paul