MapReduce >> mail # user >> Java jars and MapReduce


Re: Java jars and MapReduce
A few basic questions:
A) Is the rate-limiting step the Java processing or the storage in Accumulo?
Hadoop may not be able to speed up a database that is not designed to work
in a distributed manner.

B) Can ObjectD, or any intermediate objects, be serialized (possibly to XML)
and efficiently deserialized? If so, they can be passed outside the database,
and you might consider passing the serialized form through several jobs
run in series.

C) Alternatively, if the object can be identified by a single identifier in
the database, the intermediate steps are stored in the database, and the
processing rather than the database is the rate-limiting step, then every
step could be a separate job with a custom InputFormat passing ids to the
mapper.
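
The two options in B and C can be sketched in plain Java, without any Hadoop
classes, to see what each would ask of your objects. This is only a rough
illustration under assumptions: IntermediateRecord, RecordCodec and IdSplitter
are hypothetical names, and the codec uses Java's built-in binary
serialization plus Base64 rather than XML. In a real job the splitting logic
would live in a custom InputFormat and the encoded line would be the value
written by one job and read by the next.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

// Hypothetical stand-in for an intermediate object (not a class from this thread).
final class IntermediateRecord implements Serializable {
    private static final long serialVersionUID = 1L;
    final String id;
    final String enrichedField;

    IntermediateRecord(String id, String enrichedField) {
        this.id = id;
        this.enrichedField = enrichedField;
    }
}

// Option B: turn a record into a single line of text and back again,
// so it can be handed from one job to the next outside the database.
final class RecordCodec {
    static String encode(IntermediateRecord record) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(record);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The basic Base64 encoder emits no line breaks, so this is one line.
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    static IntermediateRecord decode(String line) {
        byte[] raw = Base64.getDecoder().decode(line);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
            return (IntermediateRecord) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}

// Option C: pass only ids around; each chunk of ids is the work for one mapper,
// which then looks the objects up in the database itself.
final class IdSplitter {
    static List<List<String>> split(List<String> ids, int idsPerSplit) {
        if (idsPerSplit <= 0) {
            throw new IllegalArgumentException("idsPerSplit must be positive");
        }
        List<List<String>> splits = new ArrayList<>();
        for (int start = 0; start < ids.size(); start += idsPerSplit) {
            int end = Math.min(start + idsPerSplit, ids.size());
            splits.add(new ArrayList<>(ids.subList(start, end)));
        }
        return splits;
    }
}
```

If the round trip through RecordCodec is slow or the encoded form is large for
your real objects, that is an argument for option C; if every lookup by id is
slow, that points back at the database.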

You need to spend a lot of time thinking about which steps in the
processing are rate limiting and how much of the performance bottleneck
is in the database.
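
One way to get a first answer is to time the stages inside the existing
processors before changing anything. The sketch below is a minimal example,
assuming you can wrap the processing and the database calls separately;
StageTimer and the stage names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Accumulates wall-clock time per named stage (hypothetical helper).
final class StageTimer {
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    // Runs the work, adds its elapsed time to the stage total, and
    // returns the work's result unchanged.
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    // Name of the stage with the largest accumulated time, or null if
    // nothing has been timed yet.
    String slowestStage() {
        String slowest = null;
        long max = -1L;
        for (Map.Entry<String, Long> entry : nanosByStage.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                slowest = entry.getKey();
            }
        }
        return slowest;
    }

    // Copy of the totals in nanoseconds, in the order stages were first seen.
    Map<String, Long> totals() {
        return new LinkedHashMap<>(nanosByStage);
    }
}
```

Wrapping each database read or write in time("database", ...) and each
transformation in time("processing", ...) over a typical batch gives a rough
split between the two. It measures wall-clock time only, so treat the numbers
as a starting point.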

Steven M. Lewis PhD
 On Mar 1, 2013 9:48 AM, "Aji Janis" <[EMAIL PROTECTED]> wrote:

> Hello,
>
> Current Design: I have a Java object, MyObjectA. MyObjectA goes through
> three processors (jars) that are run in sequence and do a lot of processing
> to beef up A with tons of additional stuff (think ETL), and the final result
> is MyObjectD (note: MyObjectD is really A with more fields added to it,
> but I wanted to clarify here that they are very different).
> MyObjectD, when ready, is saved to my non-relational database (Accumulo).
> Currently, all this is done using Quartz Scheduler: a
> List<MyObjectA> is submitted for processing every N minutes. Everything is
> written in Java and there is a lot of talking back and forth with Accumulo
> (to access tables that will help convert A to D).
>
> We split the processing into three processors just because it was more
> convenient and we didn't want everything rolled up in one processor. Having
> said that I can definitely merge the three into ONE processor. But my
> question is, what are all the things (generically speaking) I need to be
> concerned about or look into to make this a MapReduce job? I am
> asking for pointers on where to even start here.
>
> Let's say all my processing is done in mappers. So my input will be
> MyObjectA and my output will be MyObjectD from each mapper. And then my
> reducer simply writes all MyObjectD objects to Accumulo. Is achieving this
> as easy as just submitting the jar to Hadoop?
>
> I guess overall, I want to know how one goes about blindly submitting a
> .jar (Java app) and making it a MapReduce task.
> We are going this route because multi-threading won't solve our problem.
> We have to process objects in batches now, and they are coming in every
> minute.
>
> Thank you in advance for any and all help.