On 02/27/2013 01:42 PM, Adam Phelps wrote:
> We have a job that uses a large lookup structure that gets created as a
> static class during the map setup phase (and we have the JVM reused so
> this only takes place once). However of late this structure has grown
> drastically (due to items beyond our control) and we've seen a
> substantial increase in map time due to the lower available memory.
> Are there any easy solutions to this sort of problem? My first thought
> was to see if it was possible to have all tasks for a job execute in
> parallel within the same JVM, but I'm not seeing any setting that would
> allow that. Beyond that my only ideas are to move that data into an
> external one-per-node key-value store like memcached, but I'm worried
> the additional overhead of sending a query for each value being mapped
> would also kill the job performance.
> - Adam
We use a solution similar to the one you suggested to address this issue.
However, the in-memory app we run on each datanode is a proprietary one
that allows pipelining of queries, which cuts the per-query overhead.
Still, even with off-the-shelf memcached and the overhead of one query
per value, the speed might turn out to be more acceptable than you
think. It may be worth benchmarking on a small sample first.
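
To make the query-per-value versus batched trade-off concrete, here is a
minimal sketch, not our actual code. Everything in it is hypothetical:
CountingStore is an in-memory stand-in for a per-node key-value store
client, and its get/get_many methods are assumed names chosen for
illustration (many memcached clients offer some form of multi-key get,
but check yours). It only counts round trips, so it shows the shape of a
small benchmark rather than real timings.

```python
from typing import Dict, Iterable, List, Optional, Tuple


class CountingStore:
    """Hypothetical stand-in for a per-node key-value store client.

    It keeps data in a local dict and counts how many requests it served,
    so the two lookup strategies below can be compared without a server.
    """

    def __init__(self, data: Dict[str, int]) -> None:
        self._data = dict(data)
        self.round_trips = 0

    def get(self, key: str) -> Optional[int]:
        # One request per key (the query-per-value case).
        self.round_trips += 1
        return self._data.get(key)

    def get_many(self, keys: Iterable[str]) -> Dict[str, int]:
        # One request for a whole batch of keys; missing keys are omitted.
        self.round_trips += 1
        return {k: self._data[k] for k in keys if k in self._data}


Result = List[Tuple[str, Optional[int]]]


def map_one_by_one(store: CountingStore, records: Iterable[str]) -> Result:
    """Look up every record with its own request."""
    return [(r, store.get(r)) for r in records]


def _flush(store: CountingStore, batch: List[str]) -> Result:
    # De-duplicate keys for the request, then restore the input order.
    found = store.get_many(set(batch))
    return [(r, found.get(r)) for r in batch]


def map_batched(store: CountingStore, records: Iterable[str],
                batch_size: int = 100) -> Result:
    """Buffer records and look them up one batch at a time."""
    out: Result = []
    batch: List[str] = []
    for r in records:
        batch.append(r)
        if len(batch) >= batch_size:
            out.extend(_flush(store, batch))
            batch = []
    if batch:  # flush the final, partially filled batch
        out.extend(_flush(store, batch))
    return out


if __name__ == "__main__":
    table = {"a": 1, "b": 2}
    records = ["a", "b", "c", "a", "b"]
    single, batched = CountingStore(table), CountingStore(table)
    map_one_by_one(single, records)
    map_batched(batched, records, batch_size=2)
    print(single.round_trips, batched.round_trips)
```

Both functions return the same results; only the number of requests
differs. Swapping CountingStore for a real client and timing each
function on a sample of your input would give the small benchmark
suggested above.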