Thanks a lot for your reply and great suggestions.
In the practical cases, the values usually do not reside in the same
data node. Instead, they are mostly distributed by the key range
itself. So, it does require 20G of memory, but distributed in
The MapFile solution is very intriguing. I am not very familiar with
it though. I assume it kinda resemble the basic idea of HBase? I will
certainly try it out and follow up if there are questions.
I agree that using HBase would be much easier. But the value size
makes worry if it is going to push it to the edge. If I do this more
often, I will definitely consider using HBase.
Many thanks for the great reply.
On Wed, Feb 13, 2013 at 12:38 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> My reply to your questions is inline.
> On Wed, Feb 13, 2013 at 10:59 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>> Please do not use the general@ lists for any user-oriented questions.
>> Please redirect them to [EMAIL PROTECTED] lists, which is where
>> the user community and questions lie.
>> I've moved your post there and have added you on CC in case you
>> haven't subscribed there. Please reply back only to the user@
>> addresses. The general@ list is for Apache Hadoop project-level
>> management and release oriented discussions alone.
>> On Wed, Feb 13, 2013 at 10:54 AM, William Kang <[EMAIL PROTECTED]> wrote:
>>> Hi All,
>>> I am trying to figure out a good solution for such a scenario as following.
>>> 1. I have a 2T file (let's call it A), filled by key/value pairs,
>>> which is stored in the HDFS with the default 64M block size. In A,
>>> each key is less than 1K and each value is about 20M.
>>> 2. Occasionally, I will run analysis by using a different type of data
>>> (usually less than 10G, and let's call it B) and do look-up table
>>> alike operations by using the values in A. B resides in HDFS as well.
>>> 3. This analysis would require loading only a small number of values
>>> from A (usually less than 1000 of them) into the memory for fast
>>> look-up against the data in B. The way B finds the few values in A is
>>> by looking up for the key in A.
> About 1000 such rows would equal a memory expense of near 20 GB, given
> the value size of A you've noted above. The solution may need to be
> considered with this in mind, if the whole lookup table is to be
> eventually generated into the memory and never discarded until the end
> of processing.
>>> Is there an efficient way to do this?
> Since HBase may be too much for your simple needs, have you instead
> considered using MapFiles, which allow fast key lookups at a file
> level over HDFS/MR? You can have these files either highly replicated
> (if their size is large), or distributed via the distributed cache in
> the lookup jobs (if they are infrequently used and small sized), and
> be able to use the MapFile reader API to perform lookups of keys and
> read values only when you want them.
>>> I was thinking if I could identify the locality of the block that
>>> contains the few values, I might be able to push the B into the few
>>> nodes that contains the few values in A? Since I only need to do this
>>> occasionally, maintaining a distributed database such as HBase cant be
> I agree that HBase may not be wholly suited to be run just for this
> purpose (unless A's also gonna be scaling over time).
> Maintaining value -> locality mapping would need to be done by you. FS
> APIs provide locality info calls, and your files may be
> key-partitioned enough to identify each one's range, and you can
> combine the knowledge of these two to do something along these lines.
> Using HBase may also turn out to be "easier", but thats upto you. You
> can also choose to tear it down (i.e. the services) when not needed,
>>> Many thanks.
>> Harsh J
> Harsh J