Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop >> mail # dev >> Namespace partitioning using Locality Sensitive Hashing


Copy link to this message
-
Re: Namespace partitioning using Locality Sensitive Hashing
I have a question.

Now that hadoop have "maper" and "reducer" , how about the solution like Map<Tree(nodepair set), root(node) > and Reduce<root(node), Tree(nodepair set)>
or that directly Reduce<root(node), nodepair> here nodepair can be look as a branch...
----- Original Message -----
From: "Konstantin Shvachko" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, March 02, 2010 10:21 AM
Subject: Re: Namespace partitioning using Locality Sensitive Hashing
> Symlinks is a brand new feature in HDFS.
> You can read about it in
> https://issues.apache.org/jira/browse/HDFS-245
> Documentation is here:
> https://issues.apache.org/jira/secure/attachment/12434745/design-doc-v4.txt
>
> Symbolic links in HDFS can point to a directory in a different file system,
> particularly on a different HDSF cluster. They are like mount points in this case.
> So you can create a symlink on cluster C1 pointing to the root of cluster C2.
> This makes C2 a sub-namespace of C1.
>
> --Konstantin
>
> On 3/1/2010 5:42 PM, Ketan Dixit wrote:
>> Hello,
>> Thank you Konstantin and  Allen for your reply. The information
>> provided really helped to improve my understanding.
>> However I still have few questions.
>> How Symlinks/ soft links are used to solve the probem of partitioning.
>> (Where do the symlinks point to? All the mapping is
>> stored in memory but symlinks point to file objects? This is little
>> confusing to me)
>> Can you please provide insight into this?
>>
>> Thanks,
>> Ketan
>>
>> On Mon, Mar 1, 2010 at 3:26 PM, Konstantin Shvachko<[EMAIL PROTECTED]>  wrote:
>>>
>>> Hi Ketan,
>>>
>>> AFAIU, hashing is used to map files and directories into different name-nodes.
>>> Suppose you use a simple hash function on a file path h(path), and that files
>>> with the same hash value (or within a hash range) are mapped to the same name-node.
>>> Then files with the same parent will be randomly mapped into different
>>> name-nodes: Pr(h(/dir/file1) = h(/dir/file2)) - is small.
>>>
>>> The ides with LSH is to add some locality factor in the hash function in order
>>> to increase probability of placing files from the same directory (or a subtree)
>>> into the same name-node.
>>>
>>> Example 1.
>>> Suppose that you apply MD5 only to the path to the parent rather
>>> then to the entire file path: h(/root/dir/file) = MD5(/root/dir)
>>> Then all files of the same directory will have the same hash value and therefore
>>> will be mapped into the same name-node.
>>>
>>> Example 2.
>>> If a path consists of path components pi, where i = 1,..,n
>>> Lets define the following hash function:
>>> h(/p1/.../pn) = 0, if n<  7
>>> h(/p1/.../pn) = MD5(/p1/.../p7), if n>= 7
>>> With this hash function each subtree rooted at level 7 of the namepsace hierarchy
>>> will be entirely mapped to the same name-node.
>>>
>>> There could be more elaborate examples.
>>>
>>> Symlinks do provide a way to partition the namespace, as Allen points out,
>>> although this is a static partitioning. Static partitions as opposed to dynamic
>>> ones do not guarantee that the partitions will be "equal" in size, where "size"
>>> may have different meanings (like number of files, or space occupied by the
>>> files, or number of blocks).
>>> A good hash function need to conform to some equal partitioning requirement.
>>> Function from Example 2 would be considered bad in this sense, while Example 1
>>> defines a good function.
>>>
>>> This is my take on the problem. Hope it makes sense,
>>> --Konstantin
>>>
>>>
>>> On 3/1/2010 8:48 AM, Ketan Dixit wrote:
>>>>
>>>> Hi,
>>>> I am a graduate student in Computer Science department at SUNY Stony Brook.
>>>>   I am thinking of doing a project on Hadoop for my course "Cloud Computing"
>>>> conducted by Prof. Radu Sion.
>>>> While going through the links of the "Yahoo open source projects for
>>>> students"  page I found the idea
>>>> "Research on new hashing schemes for filesystem namespace partitioning"
>>>> interesting. It looks to me the idea is
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB