Re: Incremental Hadoop + SimpleKafkaETLJob
Yeah, those shell scripts are basically the continuation of what I was doing
in my last blog posts. I'd planned to write new blog posts about them, but I
just never got around to it. Then I saw your message, and it gave me the
little kick in the arse I needed to at least gist those things :) ...

Hopefully, it can save you some time :) !

--
Felix

On Wed, Jan 25, 2012 at 3:30 AM, Paul Ingles <[EMAIL PROTECTED]> wrote:

> Thanks, Felix - I found your blog posts before and they really helped me
> figure out how to get things working, so I'll definitely give the shell
> scripts a run.
>
>
>
> On 24 Jan 2012, at 19:05, Felix GV wrote:
>
> > Hello :)
> >
> > For question 1:
> >
> > The hadoop consumer in the contrib directory has almost everything it
> > needs to do distributed incremental imports out of the box, but it
> > requires a bit of hand-holding.
> >
> > I've created two scripts to automate the process. One of them generates
> > initial offset files, and the other does the incremental Hadoop
> > consumption.
> >
> > I personally use a cron job to periodically call the incremental consumer
> > script with specific parameters (topic and HDFS output path).
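> >
> > For example, the cron entry looks roughly like this (the script name and
> > paths here are illustrative placeholders, not the actual filenames from
> > the gist):
> >
> >   # every 10 minutes, pull any new messages for one topic into HDFS
> >   */10 * * * * /opt/kafka-etl/incremental-consumer.sh web-logs /data/kafka/web-logs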
> >
> > You can find all of the required files in this gist:
> > https://gist.github.com/1671887
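> >
> > If you want to see what the scripts are wrapping, the contrib jobs
> > themselves are invoked along these lines (going from memory of the
> > contrib/hadoop-consumer README, so double-check the class names and the
> > sample properties file there):
> >
> >   # generate the initial offset file(s) for a topic
> >   ./run-class.sh kafka.etl.impl.DataGenerator test/test.properties
> >   # run the actual Kafka -> HDFS ETL job
> >   ./run-class.sh kafka.etl.impl.SimpleKafkaETLJob test/test.properties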
> >
> > The LinkedIn guys promised to eventually release their full Hadoop/Kafka
> > ETL code, but I don't think they've had time to get around to it yet.
> > When they do release it, it will probably be better than my scripts,
> > but for now, I think those scripts are the only publicly available way
> > to do this stuff without writing it yourself.
> >
> > I don't know about questions 2 and 3.
> >
> > I hope this helps :) !
> >
> > --
> > Felix
> >
> >
> >
> > On Tue, Jan 24, 2012 at 3:24 AM, Paul Ingles <[EMAIL PROTECTED]> wrote:
> >
> >> Hi,
> >>
> >> I'm investigating using Kafka and would really appreciate getting some
> >> more experienced opinions on how things work together.
> >>
> >> Our application instances are creating Protocol Buffer-serialized
> >> messages and pushing them to topics in Kafka:
> >>
> >> * Web log requests
> >> * Product details viewed
> >> * Search performed
> >> * Email registered
> >> etc...
> >>
> >> I would like to be able to perform incremental loads from these topics
> >> into HDFS and then into the rest of the batch processing. I guess I
> >> have 3 broad questions:
> >>
> >> 1) How do people trigger the batch loads? Do you just point your
> >> SimpleKafkaETLJob input to the previous run's output offset file? Do
> >> you move files between runs of the SimpleKafkaETLJob - moving the
> >> part-* files into one place and the offsets into an input directory
> >> ready for the next run?
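> >>
> >> Something like this between runs, I mean (paths entirely made up for
> >> illustration):
> >>
> >>   # archive the data files written by the last run
> >>   hadoop fs -mv /etl/web-logs/output/part-* /data/web-logs/
> >>   # promote the offset files to be the next run's input
> >>   hadoop fs -mv /etl/web-logs/output/offsets/* /etl/web-logs/input/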
> >>
> >> 2) Yesterday I noticed that the hadoop-consumer's SimpleKafkaETLMapper
> >> outputs Long/Text writables and is marked as deprecated (this is in
> >> the 0.7 source). Is there an alternative class that should be used
> >> instead, or is the hadoop-consumer being deprecated overall?
> >>
> >> 3) Given that SimpleKafkaETLMapper reads bytes in but outputs Text
> >> lines, are most people using Kafka for passing text messages around,
> >> or using JSON data, etc.?
> >>
> >> Thanks,
> >> Paul
>
>