Kafka, mail # user - Hadoop Consumer


Re: Hadoop Consumer
Felix GV 2012-07-03, 17:05
Hmm, that's surprising. I didn't know about that...!

I wonder if it's a new feature... Judging from your email, I assume you're
using CDH? What version?

Interesting :) ...

--
Felix

On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <[EMAIL PROTECTED]> wrote:

> >> - Is there a version of consumer which appends to an existing file on
> >> HDFS until it reaches a specific size?
> >
> > No, there isn't, as far as I know. Potential solutions to this would be:
> >
> >   1. Leave the data in the broker long enough for it to reach the size
> >   you want. Running the SimpleKafkaETLJob at those intervals would give
> >   you the file size you want. This is the simplest thing to do, but the
> >   drawback is that your data in HDFS will be less real-time.
> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
> >   up / compact your small files into one bigger file (see the sketch just
> >   after this list). You would need to come up with the Hadoop job that
> >   does the roll-up, or find one somewhere.
> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
> >   makes use of Hadoop append instead...
> >
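
For option 2, the roll-up can be a trivial pass-through MapReduce job with a single reducer. The sketch below is a hypothetical, minimal example: the class name and the assumption of plain-text records are mine, not part of the SimpleKafkaETLJob or any existing job, and it uses the newer org.apache.hadoop.mapreduce API (Job.getInstance), so it assumes a reasonably recent Hadoop client.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompactSmallFiles {

    // Emit every line under a single NullWritable key so that one reducer
    // rewrites all records into a single output file. TextOutputFormat skips
    // NullWritable keys, so the output contains only the original lines.
    public static class PassThroughMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(NullWritable.get(), line);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "compact-small-files");
        job.setJarByClass(CompactSmallFiles.class);
        job.setMapperClass(PassThroughMapper.class);
        // The default Reducer is the identity reducer; one reduce task means
        // one compacted output file.
        job.setNumReduceTasks(1);
        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // directory of small files
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // compacted output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Note that record order across the small files is not preserved, so this only makes sense if downstream jobs don't care about ordering.
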
> > Also, you may be interested in taking a look at these scripts
> > <http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
> > that I posted a while ago. If you follow the links in this post, you can
> > get more details about how the scripts work and why it was necessary to
> > do the things they do... or you can just use them without reading. They
> > should work pretty much out of the box...
>
> Where I work, we discovered that you can keep a file in HDFS open and
> still run MapReduce jobs against the data in that file. What you do is
> flush the data periodically (after every record, in our case), but you
> don't close the file right away. This allows us to have data files that
> contain 24 hours' worth of data without having to close the file before
> running the jobs, or having to schedule the jobs for after the file is
> closed. You can also check the file size periodically and rotate the
> files based on size. We use Avro files, but according to Cloudera,
> sequence files should work too.
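
A rough sketch of that flush-and-rotate pattern, assuming Avro's DataFileWriter on top of a recent Hadoop client with hflush() support (on older 1.x-era clients the equivalent call was sync()). The path naming scheme and size threshold below are made-up examples, not Casey's actual setup:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingAvroHdfsWriter {

    private static final long MAX_BYTES = 1L << 30; // example rotation threshold (~1 GB)

    private final FileSystem fs;
    private final Schema schema;
    private FSDataOutputStream out;
    private DataFileWriter<GenericRecord> writer;

    public RollingAvroHdfsWriter(Configuration conf, Schema schema) throws IOException {
        this.fs = FileSystem.get(conf);
        this.schema = schema;
        openNewFile();
    }

    private void openNewFile() throws IOException {
        // Hypothetical naming scheme: one file per rotation, timestamped.
        Path path = new Path("/data/events/events-" + System.currentTimeMillis() + ".avro");
        out = fs.create(path);
        writer = new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, out);
    }

    public void write(GenericRecord record) throws IOException {
        writer.append(record);
        writer.flush();  // push Avro's buffer down to the HDFS stream
        out.hflush();    // make the flushed bytes visible to readers / MapReduce jobs
        if (out.getPos() >= MAX_BYTES) {
            rotate();    // the file only gets closed once it reaches the size limit
        }
    }

    private void rotate() throws IOException {
        writer.close();  // also closes the underlying HDFS stream
        openNewFile();
    }
}

Flushing after every record keeps the data as fresh as possible; flushing less often trades a bit of freshness for less per-record overhead.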
>
> It's a great compromise for when you want the latest and greatest data,
> but don't want to have to wait until all of the files are closed to get it.
>
> Casey