Re: Hadoop Consumer
Thanks for the info, that's interesting :) ...

And thanks for the link, Min :) Having a Hadoop consumer that manages the
offsets with ZK is cool :) ...

--
Felix

On Wed, Jul 4, 2012 at 9:04 AM, Sybrandy, Casey <
[EMAIL PROTECTED]> wrote:

> We're using CDH3 update 2 or 3.  I don't know how much the version
> matters, so it may work on plain-old Hadoop.
> _____________________
> From: Murtaza Doctor [[EMAIL PROTECTED]]
> Sent: Tuesday, July 03, 2012 1:56 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Hadoop Consumer
>
> +1 This surely sounds interesting.
>
> On 7/3/12 10:05 AM, "Felix GV" <[EMAIL PROTECTED]> wrote:
>
> >Hmm, that's surprising. I didn't know about that...!
> >
> >I wonder if it's a new feature... Judging from your email, I assume you're
> >using CDH? What version?
> >
> >Interesting :) ...
> >
> >--
> >Felix
> >
> >
> >
> >On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
> >[EMAIL PROTECTED]> wrote:
> >
> >> >> - Is there a version of consumer which appends to an existing file
> >> >> on HDFS until it reaches a specific size?
> >> >>
> >> >
> >> > No there isn't, as far as I know. Potential solutions to this would be:
> >> >
> >> >   1. Leave the data in the broker long enough for it to reach the size
> >> >   you want. Running the SimpleKafkaETLJob at those intervals would give
> >> >   you the file size you want. This is the simplest thing to do, but the
> >> >   drawback is that your data in HDFS will be less real-time.
> >> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
> >> >   up / compact your small files into one bigger file. You would need to
> >> >   come up with the Hadoop job that does the roll-up, or find one
> >> >   somewhere.
> >> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
> >> >   makes use of Hadoop append instead...
> >> >
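
A minimal sketch of the roll-up idea in option 2 above: concatenate every file
in an HDFS input directory into one larger output file using the Hadoop
FileSystem API. The class name, the paths, and the assumption of plain
(non-Avro, non-sequence) files and a Hadoop 2.x-era client are illustrative
only; this is not part of the SimpleKafkaETLJob.

    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class SmallFileRollup {
        public static void main(String[] args) throws Exception {
            Path inputDir = new Path(args[0]);   // e.g. the directory the ETL job wrote to
            Path outputFile = new Path(args[1]); // the single compacted file

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Append the contents of every small file to one large file.
            try (OutputStream out = fs.create(outputFile)) {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    if (!status.isFile()) {
                        continue; // skip subdirectories and markers like _SUCCESS
                    }
                    try (InputStream in = fs.open(status.getPath())) {
                        IOUtils.copyBytes(in, out, conf, false); // false: keep 'out' open
                    }
                }
            }
        }
    }

Running something like this at whatever interval the SimpleKafkaETLJob runs at
keeps the small-file count down, at the cost of holding two copies of the data
until the small files are deleted.
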
> >> > Also, you may be interested to take a look at these scripts
> >> > <http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
> >> > I posted a while ago. If you follow the links in this post, you can get
> >> > more details about how the scripts work and why it was necessary to do
> >> > the things they do... or you can just use them without reading. They
> >> > should work pretty much out of the box...
> >>
> >> Where I work, we discovered that you can keep a file in HDFS open and
> >> still run MapReduce jobs against the data in that file. What you do is
> >> flush the data periodically (after every record, in our case) without
> >> closing the file right away. This lets us have data files that contain
> >> 24 hours' worth of data without having to close the file before running
> >> the jobs, or schedule the jobs for after the file is closed. You can
> >> also check the file size periodically and rotate the files based on
> >> size. We use Avro files, but sequence files should work too, according
> >> to Cloudera.
> >>
> >> It's a great compromise for when you want the latest and greatest data,
> >> but don't want to have to wait until all of the files are closed to get
> >> it.
> >>
> >> Casey
>
>
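
A hedged sketch of the pattern Casey describes: keep an HDFS file open, flush
after every record so the bytes become visible to readers and MapReduce jobs,
and rotate to a new file once it reaches a size threshold. It writes plain
text rather than the Avro files mentioned above, the paths, threshold, and
record source are placeholders, and it assumes a Hadoop 2.x-era client where
the flush call is hflush() (the CDH3-era equivalent was sync()).

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FlushingHdfsWriter {
        private static final long ROTATE_BYTES = 1L << 30; // rotate at ~1 GB (arbitrary)

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            FSDataOutputStream out = fs.create(nextPath());
            for (String record : fetchRecords()) {   // stand-in for the real consumer loop
                out.write((record + "\n").getBytes("UTF-8"));
                out.hflush();                        // make the bytes visible to readers now
                if (out.getPos() >= ROTATE_BYTES) {  // rotate by size instead of closing daily
                    out.close();
                    out = fs.create(nextPath());
                }
            }
            out.close();
        }

        private static Path nextPath() {
            return new Path("/data/events/" + System.currentTimeMillis() + ".log"); // placeholder
        }

        private static Iterable<String> fetchRecords() {
            return java.util.Collections.emptyList(); // placeholder for the record source
        }
    }

A job pointed at the output directory sees everything flushed so far, which is
the "latest and greatest data" compromise described in the thread.
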