Are all of the tables used by all of the processes? Are all of the tables used all of the time, or are some used infrequently? Does the data in these lookup tables change often, or is it very stable? What is the actual size of the data: yes, 1 million entries, but are those entries 1 kB, 100 kB, or 1 MB each? Also, how critical is it that your map/reduce jobs be reproducible later on? If you use a shared resource like HBase, it can change after a job runs, or even while a job is running, and you may never be able to reproduce that exact same result again.
Generally, the processing I have done is dominated by that last question, so we tend to use cache archives to pull in versioned DBs that are optimized for reads, and we use tools like CDB (http://en.wikipedia.org/wiki/Constant_Data_Base) to do the lookups. Most of these tables tend to be small enough to fit into memory, so we don't have to worry about that too much. But I have seen other use cases too.
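To illustrate the pattern: each task JVM loads the read-only table once (in a real job the file would arrive locally via the DistributedCache or a cache archive, and the load would happen in Mapper.setup()), then serves all lookups from memory. This is only a minimal sketch; the class name and the tab-separated file format are my own assumptions, and a production setup would use CDB or another read-optimized format rather than a plain HashMap.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Sketch of the per-task lookup pattern: load a versioned, read-only
// table once per JVM, then answer lookups from memory. In Hadoop the
// file would be shipped to each node via the DistributedCache.
public class LookupTable {
    private final Map<String, String> table = new HashMap<>();

    // Load a "key<TAB>value" file into memory. In a MapReduce job
    // this would run once in Mapper.setup(), not per record.
    public LookupTable(Path file) throws IOException {
        for (String line : Files.readAllLines(file)) {
            int tab = line.indexOf('\t');
            if (tab > 0) {
                table.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
    }

    public String get(String key) {
        return table.get(key);
    }

    public int size() {
        return table.size();
    }
}
```

Because the file is a fixed, versioned artifact rather than a live store, re-running the job against the same archive reproduces the same results, which is the property a shared HBase table cannot guarantee.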
On 12/12/11 11:33 AM, "Mark Kerzner" <[EMAIL PROTECTED]> wrote:
I am planning a system to process information with Hadoop, and I will have
a few look-up tables that each processing node will need to query. There
are perhaps 20-50 such tables, and each has on the order of one million
entries. What is the best mechanism for this look-up: Memcached, HBase,
JavaSpaces, a Lucene index, or something else?