My reply to your questions is inline.
On Wed, Feb 13, 2013 at 10:59 AM, Harsh J <[EMAIL PROTECTED]> wrote:
> Please do not use the general@ lists for any user-oriented questions.
> Please redirect them to [EMAIL PROTECTED] lists, which is where
> the user community and questions lie.
> I've moved your post there and have added you on CC in case you
> haven't subscribed there. Please reply back only to the user@
> addresses. The general@ list is for Apache Hadoop project-level
> management and release oriented discussions alone.
> On Wed, Feb 13, 2013 at 10:54 AM, William Kang <[EMAIL PROTECTED]> wrote:
>> Hi All,
>> I am trying to figure out a good solution for such a scenario as following.
>> 1. I have a 2T file (let's call it A), filled by key/value pairs,
>> which is stored in the HDFS with the default 64M block size. In A,
>> each key is less than 1K and each value is about 20M.
>> 2. Occasionally, I will run analysis by using a different type of data
>> (usually less than 10G, and let's call it B) and do look-up table
>> alike operations by using the values in A. B resides in HDFS as well.
>> 3. This analysis would require loading only a small number of values
>> from A (usually less than 1000 of them) into the memory for fast
>> look-up against the data in B. The way B finds the few values in A is
>> by looking up for the key in A.
About 1000 such rows would equal a memory expense of near 20 GB, given
the value size of A you've noted above. The solution may need to be
considered with this in mind, if the whole lookup table is to be
eventually generated into the memory and never discarded until the end
>> Is there an efficient way to do this?
Since HBase may be too much for your simple needs, have you instead
considered using MapFiles, which allow fast key lookups at a file
level over HDFS/MR? You can have these files either highly replicated
(if their size is large), or distributed via the distributed cache in
the lookup jobs (if they are infrequently used and small sized), and
be able to use the MapFile reader API to perform lookups of keys and
read values only when you want them.
>> I was thinking if I could identify the locality of the block that
>> contains the few values, I might be able to push the B into the few
>> nodes that contains the few values in A? Since I only need to do this
>> occasionally, maintaining a distributed database such as HBase cant be
I agree that HBase may not be wholly suited to be run just for this
purpose (unless A's also gonna be scaling over time).
Maintaining value -> locality mapping would need to be done by you. FS
APIs provide locality info calls, and your files may be
key-partitioned enough to identify each one's range, and you can
combine the knowledge of these two to do something along these lines.
Using HBase may also turn out to be "easier", but thats upto you. You
can also choose to tear it down (i.e. the services) when not needed,
>> Many thanks.
> Harsh J