-Re: Namespace partitioning using Locality Sensitive Hashing
springring 2010-03-02, 14:03
I have a question.
Now that hadoop have "maper" and "reducer" , how about the solution like Map<Tree(nodepair set), root(node) > and Reduce<root(node), Tree(nodepair set)>
or that directly Reduce<root(node), nodepair> here nodepair can be look as a branch...
----- Original Message -----
From: "Konstantin Shvachko" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, March 02, 2010 10:21 AM
Subject: Re: Namespace partitioning using Locality Sensitive Hashing
> Symlinks is a brand new feature in HDFS.
> You can read about it in
> Documentation is here:
> Symbolic links in HDFS can point to a directory in a different file system,
> particularly on a different HDSF cluster. They are like mount points in this case.
> So you can create a symlink on cluster C1 pointing to the root of cluster C2.
> This makes C2 a sub-namespace of C1.
> On 3/1/2010 5:42 PM, Ketan Dixit wrote:
>> Thank you Konstantin and Allen for your reply. The information
>> provided really helped to improve my understanding.
>> However I still have few questions.
>> How Symlinks/ soft links are used to solve the probem of partitioning.
>> (Where do the symlinks point to? All the mapping is
>> stored in memory but symlinks point to file objects? This is little
>> confusing to me)
>> Can you please provide insight into this?
>> On Mon, Mar 1, 2010 at 3:26 PM, Konstantin Shvachko<[EMAIL PROTECTED]> wrote:
>>> Hi Ketan,
>>> AFAIU, hashing is used to map files and directories into different name-nodes.
>>> Suppose you use a simple hash function on a file path h(path), and that files
>>> with the same hash value (or within a hash range) are mapped to the same name-node.
>>> Then files with the same parent will be randomly mapped into different
>>> name-nodes: Pr(h(/dir/file1) = h(/dir/file2)) - is small.
>>> The ides with LSH is to add some locality factor in the hash function in order
>>> to increase probability of placing files from the same directory (or a subtree)
>>> into the same name-node.
>>> Example 1.
>>> Suppose that you apply MD5 only to the path to the parent rather
>>> then to the entire file path: h(/root/dir/file) = MD5(/root/dir)
>>> Then all files of the same directory will have the same hash value and therefore
>>> will be mapped into the same name-node.
>>> Example 2.
>>> If a path consists of path components pi, where i = 1,..,n
>>> Lets define the following hash function:
>>> h(/p1/.../pn) = 0, if n< 7
>>> h(/p1/.../pn) = MD5(/p1/.../p7), if n>= 7
>>> With this hash function each subtree rooted at level 7 of the namepsace hierarchy
>>> will be entirely mapped to the same name-node.
>>> There could be more elaborate examples.
>>> Symlinks do provide a way to partition the namespace, as Allen points out,
>>> although this is a static partitioning. Static partitions as opposed to dynamic
>>> ones do not guarantee that the partitions will be "equal" in size, where "size"
>>> may have different meanings (like number of files, or space occupied by the
>>> files, or number of blocks).
>>> A good hash function need to conform to some equal partitioning requirement.
>>> Function from Example 2 would be considered bad in this sense, while Example 1
>>> defines a good function.
>>> This is my take on the problem. Hope it makes sense,
>>> On 3/1/2010 8:48 AM, Ketan Dixit wrote:
>>>> I am a graduate student in Computer Science department at SUNY Stony Brook.
>>>> I am thinking of doing a project on Hadoop for my course "Cloud Computing"
>>>> conducted by Prof. Radu Sion.
>>>> While going through the links of the "Yahoo open source projects for
>>>> students" page I found the idea
>>>> "Research on new hashing schemes for filesystem namespace partitioning"
>>>> interesting. It looks to me the idea is