William Kang 2013-02-13, 05:24
Please do not use the general@ lists for any user-oriented questions.
Please redirect them to [EMAIL PROTECTED] lists, which is where
the user community and questions lie.
I've moved your post there and have added you on CC in case you
haven't subscribed there. Please reply back only to the user@
addresses. The general@ list is for Apache Hadoop project-level
management and release-oriented discussions only.
On Wed, Feb 13, 2013 at 10:54 AM, William Kang <[EMAIL PROTECTED]> wrote:
> Hi All,
> I am trying to figure out a good solution for the following scenario.
> 1. I have a 2T file (let's call it A), filled with key/value pairs,
> which is stored in HDFS with the default 64M block size. In A,
> each key is less than 1K and each value is about 20M.
> 2. Occasionally, I will run an analysis using a different type of data
> (usually less than 10G, and let's call it B) and do lookup-table-like
> operations using the values in A. B resides in HDFS as well.
> 3. This analysis would require loading only a small number of values
> from A (usually fewer than 1000 of them) into memory for fast
> look-up against the data in B. B finds the few values it needs in A
> by looking up their keys in A.
> Is there an efficient way to do this?
> I was thinking that if I could identify the locality of the blocks that
> contain the few values, I might be able to push B to the few
> nodes that hold those values in A. Since I only need to do this
> occasionally, maintaining a distributed database such as HBase can't be
> Many thanks.
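
The locality idea in the quoted message can be sketched as plain arithmetic. This is only an illustration under assumptions that the original post does not state: it assumes the byte offset and length of each needed value in A are already known (for example from a separately maintained key-to-offset index), and `blocks_for_range` and `blocks_needed` are hypothetical helper names, not Hadoop APIs. With 64M blocks and values of about 20M, a value touches either one block or two adjacent blocks.

```python
# Sketch only: map byte ranges in file A to HDFS block indices.
# Assumption: offsets/lengths of the wanted values are known in advance.

BLOCK_SIZE = 64 * 1024 * 1024  # default 64M block size mentioned in the post


def blocks_for_range(offset, length, block_size=BLOCK_SIZE):
    """Return indices of the blocks touched by bytes [offset, offset + length)."""
    if length <= 0:
        return []
    first = offset // block_size
    last = (offset + length - 1) // block_size
    return list(range(first, last + 1))


def blocks_needed(ranges, block_size=BLOCK_SIZE):
    """Return the sorted union of block indices over (offset, length) pairs."""
    needed = set()
    for offset, length in ranges:
        needed.update(blocks_for_range(offset, length, block_size))
    return sorted(needed)
```

The block indices alone do not say which nodes hold the data. On a real cluster the hosts for a byte range would have to come from HDFS itself, for instance through the Java `FileSystem.getFileBlockLocations` call, and whether scheduling work on those nodes is worthwhile for an occasional job is a separate question.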