Kafka user mailing list: Hadoop Consumer


Re: Hadoop Consumer
Hmm that's surprising. I didn't know about that...!

I wonder if it's a new feature... Judging from your email, I assume you're
using CDH? What version?

Interesting :) ...

--
Felix

On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
[EMAIL PROTECTED]> wrote:

> >> - Is there a version of consumer which appends to an existing file on
> >> HDFS until it reaches a specific size?
> >>
> >
> > No, there isn't, as far as I know. Potential solutions to this would be:
> >
> >   1. Leave the data in the broker long enough for it to reach the size
> >   you want. Running the SimpleKafkaETLJob at those intervals would give
> >   you the file size you want. This is the simplest thing to do, but the
> >   drawback is that your data in HDFS will be less real-time.
> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
> >   up / compact your small files into one bigger file. You would need to
> >   come up with the Hadoop job that does the roll-up, or find one
> >   somewhere.
> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
> >   makes use of Hadoop append instead...
> >
> > Also, you may be interested to take a look at these scripts
> > <http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
> > I posted a while ago. If you follow the links in that post, you can get
> > more details about how the scripts work and why it was necessary to do
> > the things they do... or you can just use them without reading. They
> > should work pretty much out of the box...
>
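For option 2 above, a minimal sketch of the roll-up step, assuming Hadoop 1.x/2.x (FileUtil.copyMerge was removed in 3.x) and plain text ETL output; the class name and the example paths are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Roll up the small per-run output files under one directory into a single
// larger file. Byte-level concatenation like this is only safe for plain
// text / line-oriented data; Avro or SequenceFile output would need a real
// merge job (e.g. an identity MapReduce job with a single reducer).
public class RollUpSmallFiles {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path smallFilesDir = new Path(args[0]);   // e.g. /kafka/etl/2012-07-03/
        Path mergedFile    = new Path(args[1]);   // e.g. /kafka/merged/2012-07-03.log

        // copyMerge concatenates every file under smallFilesDir into mergedFile.
        boolean deleteSource = true;              // drop the small files afterwards
        FileUtil.copyMerge(fs, smallFilesDir, fs, mergedFile, deleteSource, conf, null);
    }
}

You would run something like this with hadoop jar against the directory the ETL job writes to, then point downstream jobs at the merged file.
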
> Where I work, we discovered that you can keep a file in HDFS open and
> still run MapReduce jobs against the data in that file. We flush the data
> periodically (after every record, in our case) but don't close the file
> right away. This lets us keep data files containing 24 hours' worth of
> data without having to close a file before running jobs against it, or
> schedule jobs for after the file is closed. You can also check the file
> size periodically and rotate the files based on size. We use Avro files,
> but sequence files should work too, according to Cloudera.
>
> It's a great compromise for when you want the latest and greatest data
> but don't want to wait until all of the files are closed to get it.
>
> Casey
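
A minimal sketch of the keep-the-file-open pattern Casey describes, assuming Hadoop 2.x (where hflush() is available) and line-oriented records rather than their Avro files; the class name, file naming, and rotation threshold are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Keep an HDFS file open, flush after every record so readers (and MapReduce
// jobs) can see the data without waiting for close(), and rotate to a new
// file once it grows past a size threshold.
public class OpenFileWriter {
    private static final long ROTATE_BYTES = 512L * 1024 * 1024; // arbitrary ~512 MB threshold

    private final FileSystem fs;
    private final Path dir;
    private FSDataOutputStream out;
    private int fileIndex = 0;

    public OpenFileWriter(Configuration conf, Path dir) throws IOException {
        this.fs = FileSystem.get(conf);
        this.dir = dir;
        this.out = fs.create(new Path(dir, "part-" + fileIndex));
    }

    public void append(byte[] record) throws IOException {
        out.write(record);
        out.write('\n');
        // hflush() (Hadoop 2.x) pushes the data to the datanodes without
        // closing the file.
        out.hflush();

        // Size-based rotation, as mentioned above: check the bytes written
        // so far and roll over to a fresh file once the threshold is hit.
        if (out.getPos() >= ROTATE_BYTES) {
            out.close();
            fileIndex++;
            out = fs.create(new Path(dir, "part-" + fileIndex));
        }
    }

    public void close() throws IOException {
        out.close();
    }
}

The per-record hflush() is what should make the data visible to jobs reading the still-open file; checking getPos() gives the size-based rollover Casey mentions.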