Re: Hadoop Consumer
Thanks for the info; that's interesting :) ...

And thanks for the link, Min :) Having a Hadoop consumer that manages the
offsets with ZK is cool :) ...

--
Felix

On Wed, Jul 4, 2012 at 9:04 AM, Sybrandy, Casey <
[EMAIL PROTECTED]> wrote:

> We're using CDH3 update 2 or 3.  I don't know how much the version
> matters, so it may work on plain-old Hadoop.
> _____________________
> From: Murtaza Doctor [[EMAIL PROTECTED]]
> Sent: Tuesday, July 03, 2012 1:56 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Hadoop Consumer
>
> +1 This surely sounds interesting.
>
> On 7/3/12 10:05 AM, "Felix GV" <[EMAIL PROTECTED]> wrote:
>
> >Hmm, that's surprising. I didn't know about that...!
> >
> >I wonder if it's a new feature... Judging from your email, I assume you're
> >using CDH? What version?
> >
> >Interesting :) ...
> >
> >--
> >Felix
> >
> >
> >
> >On Tue, Jul 3, 2012 at 12:34 PM, Sybrandy, Casey <
> >[EMAIL PROTECTED]> wrote:
> >
> >> >> - Is there a version of consumer which appends to an existing file on
> >> >> HDFS until it reaches a specific size?
> >> >>
> >> >
> >> >No there isn't, as far as I know. Potential solutions to this would be:
> >> >
> >> >   1. Leave the data in the broker long enough for it to reach the size
> >> >   you want. Running the SimpleKafkaETLJob at those intervals would give
> >> >   you the file size you want. This is the simplest thing to do, but the
> >> >   drawback is that your data in HDFS will be less real-time.
> >> >   2. Run the SimpleKafkaETLJob as frequently as you want, and then roll
> >> >   up / compact your small files into one bigger file (a sketch follows
> >> >   below). You would need to come up with the hadoop job that does the
> >> >   roll up, or find one somewhere.
> >> >   3. Don't use the SimpleKafkaETLJob at all and write a new job that
> >> >   makes use of hadoop append instead...
> >> >
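For option 2, a minimal roll-up sketch could be as simple as Hadoop's
FileUtil.copyMerge, which concatenates every file in a source directory into
one destination file. The paths below are made up, and byte-level
concatenation only suits plain newline-delimited records, not Avro or
sequence file containers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class SmallFileRollup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical paths: the directory the SimpleKafkaETLJob wrote its
        // small output files into, and the single compacted file to produce.
        Path smallFilesDir = new Path("/kafka/etl/output/2012-07-03");
        Path compactedFile = new Path("/kafka/compacted/2012-07-03.dat");

        // Concatenate every file under smallFilesDir into compactedFile.
        // deleteSource=false keeps the small files; addString=null inserts
        // nothing between them (fine for newline-terminated records).
        FileUtil.copyMerge(fs, smallFilesDir, fs, compactedFile, false, conf, null);
    }
}

For Avro or sequence files, a small identity MapReduce job with a single
reducer would be needed instead, since those container formats can't simply
be concatenated byte-for-byte.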
> >> >Also, you may be interested to take a look at these scripts
> >> ><http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/>
> >> >I posted a while ago. If you follow the links in this post, you can get
> >> >more details about how the scripts work and why it was necessary to do
> >> >the things they do... or you can just use them without reading. They
> >> >should work pretty much out of the box...
> >>
> >> Where I work, we discovered that you can keep a file in HDFS open and
> >> still run MapReduce jobs against the data in that file.  What you do is
> >> you flush the data periodically (every record for us), but you don't
> >> close the file right away.  This allows us to have data files that
> >> contain 24 hours worth of data, but not have to close the file to run
> >> the jobs or to schedule the jobs for after the file is closed.  You can
> >> also check the file size periodically and rotate the files based on
> >> size.  We use Avro files, but sequence files should work too according
> >> to Cloudera.
> >>
> >> It's a great compromise for when you want the latest and greatest data,
> >> but don't want to have to wait until all of the files are closed to get
> >> it.
> >>
> >> Casey
>
>
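A minimal sketch of the open-file approach Casey describes, writer-side only
and writing raw byte records rather than Avro, with a made-up path scheme and
size threshold. On CDH3-era Hadoop the flush call was sync() rather than
hflush():

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FlushingHdfsWriter {
    // Hypothetical rotation threshold: 256 MB per file.
    private static final long ROTATE_BYTES = 256L * 1024 * 1024;

    private final FileSystem fs;
    private FSDataOutputStream out;
    private int fileIndex = 0;

    public FlushingHdfsWriter(Configuration conf) throws IOException {
        this.fs = FileSystem.get(conf);
        this.out = openNext();
    }

    private FSDataOutputStream openNext() throws IOException {
        // Hypothetical naming scheme for the rotated files.
        return fs.create(new Path("/data/events-" + (fileIndex++) + ".log"));
    }

    public void write(byte[] record) throws IOException {
        out.write(record);
        // Flush after every record so MapReduce jobs can read the data
        // without waiting for the file to be closed.
        out.hflush();

        // Rotate based on bytes written instead of waiting for a time window.
        if (out.getPos() >= ROTATE_BYTES) {
            out.close();
            out = openNext();
        }
    }

    public void close() throws IOException {
        out.close();
    }
}

Note that readers are only guaranteed to see data up to the last successful
hflush(), so jobs run against an open file should treat its tail as
best-effort.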