Re: jobtracker / hadoop consumer
That part may not be implemented in the simple example.

You may need to manually move that offset file to an empty input directory
and have your map-reduce job use that as its input path. That offset file is
in a similar format to the 1.dat file created by the DataGenerator and should
be compatible as input for subsequent runs.
If you look inside the file, you may see that the only difference between
the output file and 1.dat is the offset value (it'll be greater than -1).
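
If you want to peek inside those files, a rough sketch along these lines
should do it (assuming the files are uncompressed SequenceFiles as described
further down in this thread, and that the hadoop-consumer classes are on the
classpath so the key/value types can be loaded):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;

// Sketch only: dump each record in an offset file (or 1.dat) so the two can
// be compared by eye.
public class DumpOffsetFile {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path path = new Path(args[0]); // e.g. the job's offset output, or input/1.dat
    FileSystem fs = path.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      while (reader.next(key, value)) {
        // toString() on the key/value types should show the server uri, topic,
        // partition and offset fields mentioned below.
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}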

This is essentially what I did for a fully functional version of this kafka
hadoop consumer.

1. Read previous offset files if they exist.
2. Write/copy the offset files to a new directory and rename if necessary.
3. Run the job against the new directory. I usually have the kafka data output
to the same directory as the input. The reason is that I can just delete the
directory and have the hadoop job re-pull the data from the previous hadoop job
without having to do complex offset management.
4. Repeat from step 1.
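
A bare-bones sketch of steps 1-3 (the directory layout and paths here are
placeholders for whatever your setup uses, and the actual job submission is
left out):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Sketch: roll the offset files from the last run into a fresh input
// directory for the next run.
public class OffsetRollover {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path previousOffsets = new Path("/kafka-etl/last-run/offsets"); // offsets written by the last job
    Path nextInput = new Path("/kafka-etl/next-run/input");         // empty input dir for the next job

    fs.mkdirs(nextInput);
    if (fs.exists(previousOffsets)) {
      // Steps 1 and 2: copy each previous offset file into the new input directory.
      for (FileStatus status : fs.listStatus(previousOffsets)) {
        FileUtil.copy(fs, status.getPath(), fs,
                      new Path(nextInput, status.getPath().getName()), false, conf);
      }
    }
    // Step 3: launch the hadoop job with nextInput as its input path, then repeat.
  }
}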

Hope this clarifies things.
-Richard

On Wed, Aug 31, 2011 at 4:40 PM, Sam William <[EMAIL PROTECTED]> wrote:

> Richard,
>     I read somewhere that the mappers write out the offset to the output
> dir, so that further attempts (after a task failure) can start from the
> right offset.
> I see that the offset is generated. But where is the logic to read this
> and adjust the offset for the next read? I wasn't able to find it.
>
> Sam
>
> On Aug 31, 2011, at 12:48 PM, Richard Park wrote:
>
> > It really looks like your mapper tasks may be failing to connect to your
> > kafka server.
> >
> > Here's a brief overview of what that demo job is doing so you can
> > understand where the example may have gone wrong.
> > DataGenerator:
> >
> >   1. When DataGenerator is run, it needs the properties 'kafka.etl.topic'
> >   and 'kafka.server.uri' set in the properties file. When you run
> >   ./run-class.sh kafka.etl.impl.DataGenerator test/test.properties, you can
> >   tell that they're properly set by the output 'topic=<blah>' and
> >   'server uri=<kafka server url>'.
> >   2. The DataGenerator will create a bunch of dummy messages and pump them
> >   to that kafka server. Afterwards, it will write a file to HDFS at the
> >   'input' path, which you also set in the properties file. The file that is
> >   created will be named something like 1.dat.
> >   3. 1.dat is a sequence file, so if it isn't compressed, you should be
> >   able to see its contents in plain text. The contents will essentially
> >   list the kafka server url, the partition number and the topic as well as
> >   the offset.
> >   4. In a real scenario, you'll probably create several of these files, one
> >   for each broker and possibly partition, but for this example you only
> >   need one file. Each file will spawn a mapper during the mapred step.
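
For reference, the properties mentioned in step 1 end up in something like the
following test.properties (the property names are from the description above;
the values are placeholders for your own setup):

kafka.etl.topic=SimpleTestEvent
kafka.server.uri=tcp://localhost:9092
input=/tmp/kafka/data/input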
> >
> > CopyJars:
> >
> >   1. This should copy the necessary jars for kafka hadoop, and push them
> >   into HDFS for the distributed cache. If the jars are copied locally
> >   instead of to a remote cluster, most likely HADOOP_CONF_DIR hasn't been
> >   set up correctly. The environment should probably be set by the script,
> >   so someone can change that.
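
A quick way to sanity-check that is to see which default filesystem the
client-side Configuration resolves to on the box running these scripts; if it
comes back as the local filesystem, HADOOP_CONF_DIR isn't pointing at your
cluster's conf directory (just a sketch):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Sketch: print the default filesystem the local Hadoop configuration points
// at. A file:/// result means the jars would be "copied" to local disk rather
// than into HDFS on the cluster.
public class WhereDoFilesGo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    System.out.println("default fs uri  = " + FileSystem.get(conf).getUri());
  }
}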
> >
> > SimpleKafkaETLJob:
> >
> >   1. This job will then set up the distributed classpath, and the input
> >   path should be the directory that 1.dat was written to.
> >   2. Internally, the mappers will then load 1.dat and use the connection
> >   data contained in it to connect to kafka. If it's trying to connect to
> >   anything but your kafka server, then this file was incorrectly written.
> >   3. The RecordReader wraps all of this and hides all the connection
> >   stuff, so your Mapper should see a stream of Kafka messages rather than
> >   the contents of 1.dat.
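
To make point 3 concrete, a mapper in this setup never has to touch offsets or
connections itself. A hedged sketch (the input key/value types here are
assumptions, not the actual contrib classes, so check SimpleKafkaETLJob for the
real signatures):

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: each call to map() receives one Kafka message payload (delivered by
// the RecordReader), not a record out of 1.dat. This one just emits the size
// of every message it sees.
public class MessageSizeMapper extends MapReduceBase
    implements Mapper<LongWritable, BytesWritable, Text, LongWritable> {

  public void map(LongWritable key, BytesWritable message,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    out.collect(new Text("message.bytes"), new LongWritable(message.getLength()));
  }
}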
> >
> > So please see if you can figure out what is wrong with your example, and
> > feel free to beef up the README instructions to take your pitfalls into
> > account.
> >
> > Thanks,
> > -Richard
> >
> >
> >
> > On Wed, Aug 31, 2011 at 12:02 PM, Ben Ciceron <[EMAIL PROTECTED]> wrote: