On 09/10/2010 02:18 AM, Angus Helm wrote:
> Hi all, I have a task that involves loading a large amount of data
> from a database and then using that data to process a large number of
> small files. I'm trying to split up the file processing via MapReduce,
> so each file is handled in its own map task. However, the "loading
> from a database" part takes a long time and does not need to be
> repeated for each map task. Ideally it would be done once on each task
> node, and that node would then run all of its map tasks against the
> same data set. Currently the load from the database takes several
> minutes while the processing of the files takes a few seconds, so
> reloading the data for every task makes the job orders of magnitude
> slower.
> My question is whether there is a well-known best practice for
> handling something like this.
The way I handled this type of situation recently was to extract the
data from the DB into an HDFS file (a SequenceFile), rather than have
the individual tasks hit the database. That removes the DB from the
picture and works much better with Hadoop.
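For anyone who wants a concrete starting point, below is a rough sketch of that extraction step: a one-off program that runs a query over JDBC and writes each row as a key/value record into a SequenceFile on HDFS. The JDBC URL, table, and column names, and the output path are all made up for illustration; substitute your own. It uses the classic `SequenceFile.createWriter(fs, conf, path, keyClass, valueClass)` form of the API.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DbToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical HDFS output path.
        Path out = new Path("/data/reference-data.seq");
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, Text.class);

        // Hypothetical JDBC URL, credentials, and table; adjust to your schema.
        Connection db = DriverManager.getConnection(
                "jdbc:mysql://dbhost/mydb", "user", "pass");
        Statement stmt = db.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, payload FROM reference_data");

        // Reuse the Writable instances; append one record per row.
        Text key = new Text();
        Text value = new Text();
        while (rs.next()) {
            key.set(rs.getString("id"));
            value.set(rs.getString("payload"));
            writer.append(key, value);
        }

        writer.close();
        db.close();
    }
}
```

Run something like this once, before submitting the job; the map tasks then read the SequenceFile from HDFS (for example via DistributedCache, or by opening a SequenceFile.Reader in the mapper's setup) instead of each opening its own database connection.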