|
|
-
Hadoop Consumer - Murtaza Doctor - 2012-06-30, 01:24
Had a few questions around the Hadoop Consumer.
- We have event data under the topic "foo" written to the Kafka server/broker in Avro format and want to write those events to HDFS. Does the Hadoop consumer expect the data to be written to HDFS already? Based on the doc, it looks like the DataGenerator is pulling events from the broker and writing to HDFS. In our case we only wanted to use the SimpleKafkaETLJob to write to HDFS. I am surely missing something here?
- Is there a version of the consumer which appends to an existing file on HDFS until it reaches a specific size?

Thanks,
murtaza
-
Re: Hadoop Consumer - Felix GV - 2012-07-03, 16:15
Answer inlined...
-- Felix

On Fri, Jun 29, 2012 at 9:24 PM, Murtaza Doctor <[EMAIL PROTECTED]> wrote:
> Had a few questions around the Hadoop Consumer.
>
> - We have event data under the topic "foo" written to the kafka
> Server/Broker in avro format and want to write those events to HDFS. Does
> the Hadoop consumer expect the data written to HDFS already?

No it doesn't expect the data to be written into HDFS already... There wouldn't be much point to it, otherwise, no ;) ?

> Based on the
> doc looks like the DataGenerator is pulling events from the broker and
> writing to HDFS. In our case we only wanted to utilize the
> SimpleKafkaETLJob to write to HDFS.

That's what it does. It spawns a (map only) Map Reduce job that pulls in parallel from the broker(s) and writes that data into HDFS.

> I am surely missing something here?

Maybe...? I don't know. Do tell if anything is not clear still...!

> - Is there a version of consumer which appends to an existing file on HDFS
> until it reaches a specific size?

No there isn't, as far as I know. Potential solutions to this would be:

1. Leave the data in the broker long enough for it to reach the size you want. Running the SimpleKafkaETLJob at those intervals would give you the file size you want. This is the simplest thing to do, but the drawback is that your data in HDFS will be less real-time.
2. Run the SimpleKafkaETLJob as frequently as you want, and then roll up / compact your small files into one bigger file. You would need to come up with the hadoop job that does the roll up, or find one somewhere.
3. Don't use the SimpleKafkaETLJob at all and write a new job that makes use of hadoop append instead...

Also, you may be interested to take a look at these scripts <http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/> I posted a while ago. If you follow the links in this post, you can get more details about how the scripts work and why it was necessary to do the things it does... or you can just use them without reading. They should work pretty much out of the box...

> Thanks,
> murtaza
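The roll-up idea in option 2 can be sketched as follows. This is only a minimal illustration under loud assumptions: it uses the local filesystem as a stand-in for HDFS, runs as a single process rather than a Hadoop job, and the `roll_up` function and the `part-NNNNN` file names are hypothetical. Plain byte concatenation is only meaningful for formats such as newline-delimited text; container formats like Avro files or SequenceFiles carry their own headers and would need a format-aware merge instead.

```python
import os
import tempfile


def roll_up(src_dir, dest_path, chunk_size=1 << 20):
    """Concatenate every regular file in src_dir into dest_path, in name order.

    Returns the number of bytes written. The small files are left in place,
    so they can be deleted after the combined file has been checked.
    dest_path is assumed to live outside src_dir.
    """
    total = 0
    with open(dest_path, "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if not os.path.isfile(path):
                continue  # skip sub-directories
            with open(path, "rb") as src:
                while True:
                    chunk = src.read(chunk_size)
                    if not chunk:
                        break
                    out.write(chunk)
                    total += len(chunk)
    return total


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        small = os.path.join(work, "small")
        os.mkdir(small)
        # Three tiny files, as if three frequent job runs had each left one.
        for i in range(3):
            with open(os.path.join(small, "part-%05d" % i), "wb") as f:
                f.write(b"event %d\n" % i)
        print(roll_up(small, os.path.join(work, "combined")))
```

Keeping the small files until the combined file is verified makes the roll-up safe to re-run if it fails halfway.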
-
RE: Hadoop Consumer - Sybrandy, Casey - 2012-07-03, 16:34
>> - Is there a version of consumer which appends to an existing file on HDFS
>> until it reaches a specific size?
>
>No there isn't, as far as I know.

Where I work, we discovered that you can keep a file in HDFS open and still run MapReduce jobs against the data in that file. What you do is you flush the data periodically (every record for us), but you don't close the file right away. This allows us to have data files that contain 24 hours worth of data, but not have to close the file to run the jobs or to schedule the jobs for after the file is closed. You can also check the file size periodically and rotate the files based on size. We use Avro files, but sequence files should work too according to Cloudera.

It's a great compromise for when you want the latest and greatest data, but don't want to have to wait until all of the files are closed to get it.

Casey
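The flush-without-close and size-based rotation pattern described above can be sketched roughly like this. It is an illustration only, not Casey's code: it writes to local files instead of HDFS, and the `RotatingWriter` class, the `events-NNNNN.log` naming, and the 64 MB default are all hypothetical. On HDFS the equivalent of `flush()` would be whatever sync/hflush call the Hadoop version in use provides on its output stream; that mapping is an assumption, since the thread does not name the call.

```python
import os
import tempfile


class RotatingWriter:
    """Append records to a file, flush after each one, rotate by size."""

    def __init__(self, directory, prefix="events", max_bytes=64 * 1024 * 1024):
        self.directory = directory
        self.prefix = prefix
        self.max_bytes = max_bytes
        self.index = 0
        self.file = None
        self._open_next()

    def _open_next(self):
        name = "%s-%05d.log" % (self.prefix, self.index)
        self.file = open(os.path.join(self.directory, name), "ab")
        self.index += 1

    def write(self, record):
        self.file.write(record + b"\n")
        # Flush so readers can see the record while the file stays open.
        self.file.flush()
        # Size check after every write; rotate once the threshold is reached.
        if self.file.tell() >= self.max_bytes:
            self.file.close()
            self._open_next()

    def close(self):
        self.file.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        writer = RotatingWriter(d, max_bytes=10)
        for _ in range(3):
            writer.write(b"12345")
        writer.close()
        print(sorted(os.listdir(d)))
```

Checking the size after every write keeps the sketch simple; a periodic check, as mentioned in the message, would trade a little overshoot in file size for fewer calls.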
-
Re: Hadoop Consumer - Felix GV - 2012-07-03, 17:05
Hmm that's surprising. I didn't know about that...!
I wonder if it's a new feature... Judging from your email, I assume you're using CDH? What version?

Interesting :) ...

-- Felix
-
Re: Hadoop Consumer - Murtaza Doctor - 2012-07-03, 17:56
+1 This surely sounds interesting.
-
Re: Hadoop Consumer - Murtaza Doctor - 2012-07-03, 17:56
>> - We have event data under the topic "foo" written to the kafka
>> Server/Broker in avro format and want to write those events to HDFS.
>> Does the Hadoop consumer expect the data written to HDFS already?
>
>No it doesn't expect the data to be written into HDFS already... There
>wouldn't be much point to it, otherwise, no ;) ?

Sorry, my note was unclear. What I meant was: does the SimpleKafkaETLJob require a sequence file with an offset to be written to HDFS first, which it then uses as a bookmark to pull the data from the broker? This file has a checksum, and I was trying to modify the topic in it, which of course breaks the checksum. I already have events generated on my Kafka server, and all I want to do is run SimpleKafkaETLJob to pull out the data and write it to HDFS. I was trying to fulfill the sequence-file prerequisite, and that does not seem to work for me.

>> Based on the
>> doc looks like the DataGenerator is pulling events from the broker and
>> writing to HDFS. In our case we only wanted to utilize the
>> SimpleKafkaETLJob to write to HDFS.
>
>That's what it does. It spawns a (map only) Map Reduce job that pulls in
>parallel from the broker(s) and writes that data into HDFS.
>
>> I am surely missing something here?
>
>Maybe...? I don't know. Do tell if anything is not clear still...!

Thanks for confirming, I just wanted to make sure I got it right.

>> - Is there a version of consumer which appends to an existing file on HDFS
>> until it reaches a specific size?
>
>No there isn't, as far as I know. Potential solutions to this would be:
>
> 1. Leave the data in the broker long enough for it to reach the size you
> want. Running the SimpleKafkaETLJob at those intervals would give you the
> file size you want. This is the simplest thing to do, but the drawback is
> that your data in HDFS will be less real-time.
> 2. Run the SimpleKafkaETLJob as frequently as you want, and then roll up
> / compact your small files into one bigger file. You would need to come up
> with the hadoop job that does the roll up, or find one somewhere.
> 3. Don't use the SimpleKafkaETLJob at all and write a new job that makes
> use of hadoop append instead...

These options are very useful. I like option 3 the most :)

>Also, you may be interested to take a look at these
>scripts <http://felixgv.com/post/88/kafka-distributed-incremental-hadoop-consumer/> I
>posted a while ago. If you follow the links in this post, you can get
>more details about how the scripts work and why it was necessary to do the
>things it does... or you can just use them without reading. They should
>work pretty much out of the box...

Will surely give them a spin. Thanks!
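The "offset as a bookmark" idea raised earlier in this message can be illustrated with a small sketch. It is emphatically not the SimpleKafkaETLJob's real offset file format (the thread describes that as a sequence file with a checksum): here the bookmark is a plain JSON file, an in-memory list stands in for the broker's log, and list indices stand in for Kafka offsets. The function names `load_bookmark`, `save_bookmark`, and `run_once` are hypothetical.

```python
import json
import os
import tempfile


def load_bookmark(path, topic):
    """Return the saved offset for topic, or 0 when no bookmark exists yet."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f).get(topic, 0)


def save_bookmark(path, topic, offset):
    """Record the next offset to read for topic, replacing the file atomically."""
    data = {}
    if os.path.exists(path):
        with open(path) as f:
            data = json.load(f)
    data[topic] = offset
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(data, f)
    os.replace(tmp, path)


def run_once(log, topic, bookmark_path, batch_size):
    """Pull up to batch_size messages starting at the bookmark, then advance it."""
    start = load_bookmark(bookmark_path, topic)
    batch = log[start:start + batch_size]
    save_bookmark(bookmark_path, topic, start + len(batch))
    return batch


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        bookmark = os.path.join(d, "offsets.json")
        log = ["e0", "e1", "e2", "e3", "e4"]  # stand-in for topic "foo"
        print(run_once(log, "foo", bookmark, 2))
        print(run_once(log, "foo", bookmark, 2))
```

Starting from offset 0 when no bookmark exists is a choice made for this sketch; it says nothing about how the real job behaves when its offset file is missing.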
-
Re: Hadoop Consumer - Min - 2012-07-04, 01:29
I've created another hadoop consumer which uses zookeeper.
https://github.com/miniway/kafka-hadoop-consumer

With a hadoop OutputFormatter, I could add new files to the existing target directory.
Hope this would help.

Thanks
Min
-
RE: Hadoop Consumer - Sybrandy, Casey - 2012-07-04, 13:04
We're using CDH3 update 2 or 3. I don't know how much the version matters, so it may work on plain-old Hadoop.
-
Re: Hadoop Consumer - Felix GV - 2012-07-04, 16:18
Thanks for the info, that's interesting :) ...
And thanks for the link Min :) Having a hadoop consumer that manages the offsets with ZK is cool :) ...

-- Felix
-
RE: Hadoop Consumer - Grégoire Seux - 2012-07-04, 16:25
Thanks a lot Min, this is indeed very useful.
-- Greg
-
Re: Hadoop Consumer - Murtaza Doctor - 2012-07-12, 19:57
Hello Min,
In your github project source code are you missing the ConsumerConfig class? I was trying to download and play with the source code.

Thanks,
murtaza
-
Re: Hadoop Consumer - Min - 2012-07-16, 03:23
ConsumerConfig is in Kafka's main trunk.
As I used the same package namespace, kafka.consumer (which I admit is not a good approach), I didn't have to import it explicitly. The Kafka jar is not in the Maven repository, so you might have to install it into your local Maven repository:

    mvn install:install-file -Dfile=kafka-0.7.0.jar -DgroupId=kafka -DartifactId=kafka -Dversion=0.7.0 -Dpackaging=jar

Thanks
Min
|