I am trying to figure out a good solution for the following scenario.
1. I have a 2T file (let's call it A) filled with key/value pairs,
stored in HDFS with the default 64M block size. In A, each key is
less than 1K and each value is about 20M.
2. Occasionally, I will run an analysis using a different kind of data
(usually less than 10G; let's call it B) and do lookup-table-like
operations using the values in A. B resides in HDFS as well.
3. This analysis would require loading only a small number of values
from A (usually fewer than 1000 of them) into memory for fast lookup
against the data in B. B finds the few values it needs by looking up
their keys in A.
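To make point 3 concrete, here is a minimal sketch (not my actual code, and the record layout is invented for the demo) of the lookup pattern I have in mind: build a small key-to-byte-offset index once, then seek directly to the handful of needed records instead of scanning the whole file. On Hadoop, a SequenceFile/MapFile would play the role of the indexed format.

```python
import io

def build_index(f):
    """Scan the file once, recording the byte offset of each record.

    Record layout (assumed for this sketch): one record per line,
    'key<TAB>value'. A real 2T file with ~20M values would use a
    binary format such as a Hadoop MapFile instead.
    """
    index = {}
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        key = line.split(b"\t", 1)[0]
        index[key] = pos
    return index

def lookup(f, index, key):
    """Seek straight to the record for `key` and return its value."""
    f.seek(index[key])
    return f.readline().rstrip(b"\n").split(b"\t", 1)[1]

# Demo on an in-memory file standing in for A.
data = io.BytesIO(b"k1\tvalue-one\nk2\tvalue-two\nk3\tvalue-three\n")
idx = build_index(data)
print(lookup(data, idx, b"k2").decode())  # value-two
```

With an index like this, the analysis touches only the ~1000 needed values (roughly 20G of data) rather than all 2T.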
Is there an efficient way to do this?
I was thinking that if I could identify the locality of the blocks that
contain the few values, I might be able to push B to the few nodes that
hold those values of A. Since I only need to do this occasionally,
maintaining a distributed database such as HBase can't be justified.
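The locality idea above can be sketched as follows: once the byte offset of each needed value is known (e.g. from an index over A), the fixed block size determines which HDFS block(s) hold it; Hadoop's `FileSystem.getFileBlockLocations` would then report which datanodes host those blocks. The offsets in the demo are invented.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # the default 64M block size from the setup

def blocks_for_value(offset, length, block_size=BLOCK_SIZE):
    """Return the range of block indices a value spans.

    A ~20M value can straddle a 64M block boundary, so a single
    value may touch more than one block.
    """
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))

# A ~20M value starting 60M into the file straddles blocks 0 and 1.
print(blocks_for_value(60 * 1024 * 1024, 20 * 1024 * 1024))  # [0, 1]
```

Mapping the ~1000 needed values to block indices this way would give the small set of nodes that B needs to be pushed to.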