MapReduce >> mail # user >> Execution handover in map/reduce pipeline


Execution handover in map/reduce pipeline
Hi...

I have an application that processes large amounts of proprietary
binary-encoded text data in the following sequence:

   1. Gets a URL to a file or a directory as input
   2. Reads the list of the binary files found under the input URL
   3. Extracts the text data from each of those files
   4. Saves the text data into new files
   5. Informs the application about newly extracted files
   6. Processes each of the extracted text files
   7. Submits the processing results to a proprietary data repository
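
To make the sequence concrete, here is a minimal single-machine sketch of
steps 1-7 in Python, with no Hadoop involved. The names run_pipeline,
extract_text, process_text and submit are hypothetical placeholders: the
real decoder, the processing logic and the repository client are
proprietary and not shown here. The sketch also assumes file names are
unique across the input directory.

```python
from pathlib import Path
from typing import Callable, List


def list_binary_files(root: Path) -> List[Path]:
    """Step 2: a file yields itself; a directory yields every file under it."""
    if root.is_file():
        return [root]
    return sorted(p for p in root.rglob("*") if p.is_file())


def run_pipeline(
    root: Path,                              # step 1: the input location
    out_dir: Path,
    extract_text: Callable[[bytes], str],    # hypothetical decoder (step 3)
    process_text: Callable[[str], object],   # hypothetical processing (step 6)
    submit: Callable[[str, object], None],   # hypothetical repository client (step 7)
) -> List[Path]:
    out_dir.mkdir(parents=True, exist_ok=True)
    extracted: List[Path] = []
    for src in list_binary_files(root):                 # step 2
        text = extract_text(src.read_bytes())           # step 3
        dst = out_dir / (src.name + ".txt")             # step 4 (names assumed unique)
        dst.write_text(text, encoding="utf-8")
        extracted.append(dst)                           # step 5: record the new file
    for txt in extracted:
        result = process_text(txt.read_text(encoding="utf-8"))  # step 6
        submit(txt.name, result)                                 # step 7
    return extracted
```

The two loops mark the natural split: everything up to step 5 produces
text files, and steps 6-7 consume them independently of each other.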

The whole process is highly CPU-intensive and can be partially
parallelized, so I am thinking of trying Hadoop to achieve higher
performance.

So, assuming that all of the above takes place in HDFS (including the input
URL being an HDFS one), a MapReduce implementation could use:

   - A lightweight non-Hadoop thread to kick-start the execution flow, i.e.
   implement step 1
   - A Mapper that would implement steps 2-4
   - A Reducer that would implement step 5 (receive the notifications)
   - A Mapper that would implement step 6
   - A Reducer that would implement step 7

The first mapper (for steps 2-4) will probably need to do its processing in
a single, non-parallelized step.

My question is, how is the first reducer going to hand over execution to
the second mapper, once done?
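
One possible way to picture the handover, as far as I understand chained
jobs: the first reducer does not call the second mapper itself. Instead, a
driver waits for the first job to finish and then starts the second job
with the first job's output as its input. Below is an in-process Python
sketch of that idea. It is not Hadoop code; run_job and drive are
hypothetical names, and each stage is just a (mapper, reducer) pair of
plain functions.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple

KV = Tuple[str, object]
Mapper = Callable[[object], Iterable[KV]]
Reducer = Callable[[str, List[object]], Iterable[KV]]


def run_job(records: Iterable[object], mapper: Mapper, reducer: Reducer) -> List[KV]:
    """Run one map/reduce stage in-process: map, group by key, then reduce."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in mapper(rec):
            groups[key].append(value)
    out: List[KV] = []
    for key in sorted(groups):          # deterministic key order for the sketch
        out.extend(reducer(key, groups[key]))
    return out


def drive(inputs: Iterable[object],
          stage1: Tuple[Mapper, Reducer],
          stage2: Tuple[Mapper, Reducer]) -> List[KV]:
    """The driver, not the reducer, performs the handover between stages."""
    extracted = run_job(inputs, *stage1)   # job 1 runs to completion first
    return run_job(extracted, *stage2)     # its output becomes job 2's input
```

For example, stage1 could map (name, bytes) records to (name, text) pairs
and stage2 could map each (name, text) pair to whatever step 6 produces.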

Or, is there a better way of implementing the above scenario?

Thanks!