|
Claudio Martella
2011-07-15, 12:32
Michael Segel
2011-07-15, 14:48
Claudio Martella
2011-07-15, 14:58
Stack
2011-07-15, 16:24
Claudio Martella
2011-07-15, 17:06
Stack
2011-07-16, 20:08
Claudio Martella
2011-07-18, 11:04
Stack
2011-07-18, 16:05
Claudio Martella
2011-07-18, 16:22
Stack
2011-07-18, 16:32
Casey Stella
2011-07-19, 01:02
Claudio Martella
2011-07-19, 10:15
Casey Stella
2011-07-19, 12:33
Eric Charles
2011-07-16, 08:38
Michel Segel
2011-07-16, 12:11
Eric Charles
2011-07-17, 04:30
|
-
Hash indexing of HFilesClaudio Martella 2011-07-15, 12:32
Hello list,
at SIGMOD this year i've seen a spreading of different storage files for HBase, with different techniques. My scenario and usage doesn't really require range queries, so I thought I'd take advantage of even faster random i/o from hash indexing of data in each sequence file. Does anybody know if anybody has developed other indexing techniques for sequence files other than Btrees? Thanks! -- Claudio Martella Free Software & Open Technologies Analyst TIS innovation park Via Siemens 19 | Siemensstr. 19 39100 Bolzano | 39100 Bozen Tel. +39 0471 068 123 Fax +39 0471 068 129 [EMAIL PROTECTED] http://www.tis.bz.it Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it. +
Claudio Martella 2011-07-15, 12:32
-
RE: Hash indexing of HFilesMichael Segel 2011-07-15, 14:48
Claudio, I'm not sure on how to answer this... Yes, we've got a prototype of a Lucene on HBase w Spatial that we're starting to test. With respect to hashing... In one project we just hashed the key using the SHA-1 hash already in Java. This gave us the randomness without having to try to build a separate index. But we're still using the base key for the row. Its not like we're creating a secondary index on a column value. There are a couple of other projects out there on Git Hub so you may want to check them out. HTH -Mike > Date: Fri, 15 Jul 2011 14:32:50 +0200 > From: [EMAIL PROTECTED] > To: [EMAIL PROTECTED] > Subject: Hash indexing of HFiles > > Hello list, > > at SIGMOD this year i've seen a spreading of different storage files for > HBase, with different techniques. My scenario and usage doesn't really > require range queries, so I thought I'd take advantage of even faster > random i/o from hash indexing of data in each sequence file. > > Does anybody know if anybody has developed other indexing techniques for > sequence files other than Btrees? > > > Thanks! > > -- > Claudio Martella > Free Software & Open Technologies > Analyst > > TIS innovation park > Via Siemens 19 | Siemensstr. 19 > 39100 Bolzano | 39100 Bozen > Tel. +39 0471 068 123 > Fax +39 0471 068 129 > [EMAIL PROTECTED] http://www.tis.bz.it > > Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it. > > > > +
Michael Segel 2011-07-15, 14:48
-
Re: Hash indexing of HFilesClaudio Martella 2011-07-15, 14:58
Hi Michal,
what I was talking about is more of a vector-of-offsets kind of approach in stead of the Btree created by the "block starting with key x" approach which is used right now. Imagine that after the Records segment you have a vector of N longs (in stead of the block records we have right now), where N=the number of key/value pairs in the file. You get the right item inside of the vector by doing hash(key) % N, and read the exact position of the record inside of the file (which you can use for a direct seek). This is naive, of course, because it doesn't handle collisions, but should make the idea simple to understand. F.e. to handle collisions the offset could be to the bucket (a linked-list) after the vector. I've implemented this approach here: https://github.com/claudiomartella/sketches and it has very good random read performance (faster than leveldb, in my preliminary micro-benchmarks). On 7/15/11 4:48 PM, Michael Segel wrote: > Claudio, > > I'm not sure on how to answer this... > > Yes, we've got a prototype of a Lucene on HBase w Spatial that we're starting to test. > > With respect to hashing... > In one project we just hashed the key using the SHA-1 hash already in Java. This gave us the randomness without having to try to build a separate index. > But we're still using the base key for the row. Its not like we're creating a secondary index on a column value. > > There are a couple of other projects out there on Git Hub so you may want to check them out. > > HTH > > -Mike > > >> Date: Fri, 15 Jul 2011 14:32:50 +0200 >> From: [EMAIL PROTECTED] >> To: [EMAIL PROTECTED] >> Subject: Hash indexing of HFiles >> >> Hello list, >> >> at SIGMOD this year i've seen a spreading of different storage files for >> HBase, with different techniques. My scenario and usage doesn't really >> require range queries, so I thought I'd take advantage of even faster >> random i/o from hash indexing of data in each sequence file. >> >> Does anybody know if anybody has developed other indexing techniques for >> sequence files other than Btrees? >> >> >> Thanks! >> >> -- >> Claudio Martella >> Free Software & Open Technologies >> Analyst >> >> TIS innovation park >> Via Siemens 19 | Siemensstr. 19 >> 39100 Bolzano | 39100 Bozen >> Tel. +39 0471 068 123 >> Fax +39 0471 068 129 >> [EMAIL PROTECTED] http://www.tis.bz.it >> >> Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it. >> >> >> >> > -- Claudio Martella Free Software & Open Technologies Analyst TIS innovation park Via Siemens 19 | Siemensstr. 19 39100 Bolzano | 39100 Bozen Tel. +39 0471 068 123 Fax +39 0471 068 129 [EMAIL PROTECTED] http://www.tis.bz.it Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it. +
Claudio Martella 2011-07-15, 14:58
-
Re: Hash indexing of HFilesStack 2011-07-15, 16:24
How do you figure the N in the below Claudio? And the hash is a
function that respects the sort? Hadoop MapFile has something similar where the index entry is made every M entries (irrespective of size). Any chance of you trying out your suggestion in hfile? IIRC, we have performance evaluation for various file types (You might be interested in this recent posting by Mikhail Bautin of an hfile v2). St.Ack On Fri, Jul 15, 2011 at 7:58 AM, Claudio Martella <[EMAIL PROTECTED]> wrote: > Hi Michal, > > > what I was talking about is more of a vector-of-offsets kind of approach > in stead of the Btree created by the "block starting with key x" > approach which is used right now. Imagine that after the Records segment > you have a vector of N longs (in stead of the block records we have > right now), where N=the number of key/value pairs in the file. You get > the right item inside of the vector by doing hash(key) % N, and read the > exact position of the record inside of the file (which you can use for a > direct seek). This is naive, of course, because it doesn't handle > collisions, but should make the idea simple to understand. F.e. to > handle collisions the offset could be to the bucket (a linked-list) > after the vector. I've implemented this approach here: > > https://github.com/claudiomartella/sketches > > and it has very good random read performance (faster than leveldb, in my > preliminary micro-benchmarks). > > > On 7/15/11 4:48 PM, Michael Segel wrote: >> Claudio, >> >> I'm not sure on how to answer this... >> >> Yes, we've got a prototype of a Lucene on HBase w Spatial that we're starting to test. >> >> With respect to hashing... >> In one project we just hashed the key using the SHA-1 hash already in Java. This gave us the randomness without having to try to build a separate index. >> But we're still using the base key for the row. Its not like we're creating a secondary index on a column value. >> >> There are a couple of other projects out there on Git Hub so you may want to check them out. >> >> HTH >> >> -Mike >> >> >>> Date: Fri, 15 Jul 2011 14:32:50 +0200 >>> From: [EMAIL PROTECTED] >>> To: [EMAIL PROTECTED] >>> Subject: Hash indexing of HFiles >>> >>> Hello list, >>> >>> at SIGMOD this year i've seen a spreading of different storage files for >>> HBase, with different techniques. My scenario and usage doesn't really >>> require range queries, so I thought I'd take advantage of even faster >>> random i/o from hash indexing of data in each sequence file. >>> >>> Does anybody know if anybody has developed other indexing techniques for >>> sequence files other than Btrees? >>> >>> >>> Thanks! >>> >>> -- >>> Claudio Martella >>> Free Software & Open Technologies >>> Analyst >>> >>> TIS innovation park >>> Via Siemens 19 | Siemensstr. 19 >>> 39100 Bolzano | 39100 Bozen >>> Tel. +39 0471 068 123 >>> Fax +39 0471 068 129 >>> [EMAIL PROTECTED] http://www.tis.bz.it >>> >>> Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it. >>> >>> >>> >>> >> > > > -- +
Stack 2011-07-15, 16:24
-
Re: Hash indexing of HFilesClaudio Martella 2011-07-15, 17:06
On 7/15/11 6:24 PM, Stack wrote:
> How do you figure the N in the below Claudio? N is the total amount of pairs in the sequence file. You know that when you finish flushing a memstore or compacting files. N can also be N*loadFactor, in case you want to go for performance. > And the hash is a function that respects the sort? As I said, this might be interesting when you don't need range queries (so when you need to implement java.util.Map). As a matter of fact, you can still do range queries starting from an existing key, as you can seek there and start the scan until you pass the stopKey. It is a problem though when you have multiple sequence files as older files might contain entries in the range that are not the startKey (so where do you seek to?). Records are still ordered, I'm just proposing a different indexing of the ordered set. > Hadoop MapFile has something similar > where the index entry is made every M entries (irrespective of size). That is interesting, the principle looks like the same as in HFile (blocks and Btree) but I have to understand the difference further. > Any chance of you trying out your suggestion in hfile? I implemented the concept in my project which uses the same architecture of memstore+flush+compaction. I wrote here as I want to test it out on HFile, but first I wanted to hear if such an effort had been done already. > IIRC, we have > performance evaluation for various file types (You might be interested > in this recent posting by Mikhail Bautin of an hfile v2). I'd be interested in that, do you have a reference to it? > St.Ack > > On Fri, Jul 15, 2011 at 7:58 AM, Claudio Martella > <[EMAIL PROTECTED]> wrote: >> Hi Michal, >> >> >> what I was talking about is more of a vector-of-offsets kind of approach >> in stead of the Btree created by the "block starting with key x" >> approach which is used right now. Imagine that after the Records segment >> you have a vector of N longs (in stead of the block records we have >> right now), where N=the number of key/value pairs in the file. You get >> the right item inside of the vector by doing hash(key) % N, and read the >> exact position of the record inside of the file (which you can use for a >> direct seek). This is naive, of course, because it doesn't handle >> collisions, but should make the idea simple to understand. F.e. to >> handle collisions the offset could be to the bucket (a linked-list) >> after the vector. I've implemented this approach here: >> >> https://github.com/claudiomartella/sketches >> >> and it has very good random read performance (faster than leveldb, in my >> preliminary micro-benchmarks). >> >> >> On 7/15/11 4:48 PM, Michael Segel wrote: >>> Claudio, >>> >>> I'm not sure on how to answer this... >>> >>> Yes, we've got a prototype of a Lucene on HBase w Spatial that we're starting to test. >>> >>> With respect to hashing... >>> In one project we just hashed the key using the SHA-1 hash already in Java. This gave us the randomness without having to try to build a separate index. >>> But we're still using the base key for the row. Its not like we're creating a secondary index on a column value. >>> >>> There are a couple of other projects out there on Git Hub so you may want to check them out. >>> >>> HTH >>> >>> -Mike >>> >>> >>>> Date: Fri, 15 Jul 2011 14:32:50 +0200 >>>> From: [EMAIL PROTECTED] >>>> To: [EMAIL PROTECTED] >>>> Subject: Hash indexing of HFiles >>>> >>>> Hello list, >>>> >>>> at SIGMOD this year i've seen a spreading of different storage files for >>>> HBase, with different techniques. My scenario and usage doesn't really >>>> require range queries, so I thought I'd take advantage of even faster >>>> random i/o from hash indexing of data in each sequence file. >>>> >>>> Does anybody know if anybody has developed other indexing techniques for >>>> sequence files other than Btrees? >>>> >>>> >>>> Thanks! >>>> >>>> -- >>>> Claudio Martella >>>> Free Software & Open Technologies >> Claudio Martella Free Software & Open Technologies Analyst TIS innovation park Via Siemens 19 | Siemensstr. 19 39100 Bolzano | 39100 Bozen Tel. +39 0471 068 123 Fax +39 0471 068 129 [EMAIL PROTECTED] http://www.tis.bz.it Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it. +
Claudio Martella 2011-07-15, 17:06
-
Re: Hash indexing of HFilesStack 2011-07-16, 20:08
On Fri, Jul 15, 2011 at 10:06 AM, Claudio Martella
<[EMAIL PROTECTED]> wrote: > On 7/15/11 6:24 PM, Stack wrote: >> How do you figure the N in the below Claudio? > N is the total amount of pairs in the sequence file. You know that when > you finish flushing a memstore or compacting files. So a perfect index? If this is what you mean, won't the index often be often be close in size to the data size? (And index lives in memory at the moment; we have yet to make it so unused indices are let go if unused. And if this is what you mean, you can have a perfect index now by setting hfile block size < your smallest cell size. > That is interesting, the principle looks like the same as in HFile > (blocks and Btree) but I have to understand the difference further. > In HFile, the index is for every ~64k blocks or so; one block could have one entry only while the next block could have hundreds of entries; either has one entry in index only. So, its not the same as mapfile. It will write an index entry every 128 (IIRC) entries whether the entries are 10M each or 10bytes each (So, the indices could be wildly different in size based off cell size). >> IIRC, we have >> performance evaluation for various file types (You might be interested >> in this recent posting by Mikhail Bautin of an hfile v2). > I'd be interested in that, do you have a reference to it? > Here is Mikhail's post of an hfile v2 https://issues.apache.org/jira/browse/HBASE-3857 The HFile perf test tool is in code base at src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFilePerformance.java (It was used comparing hfile vs tfile vs mapfile and then hfile tweaks; it may be a little stale). Thanks for bringing up this topic Claudio and for taking the time to do a bit of research. St.Ack +
Stack 2011-07-16, 20:08
-
Re: Hash indexing of HFilesClaudio Martella 2011-07-18, 11:04
On 7/16/11 10:08 PM, Stack wrote:
> On Fri, Jul 15, 2011 at 10:06 AM, Claudio Martella > <[EMAIL PROTECTED]> wrote: >> On 7/15/11 6:24 PM, Stack wrote: >>> How do you figure the N in the below Claudio? >> N is the total amount of pairs in the sequence file. You know that when >> you finish flushing a memstore or compacting files. > So a perfect index? If this is what you mean, won't the index often > be often be close in size to the data size? (And index lives in memory > at the moment; we have yet to make it so unused indices are let go if > unused. And if this is what you mean, you can have a perfect index > now by setting hfile block size < your smallest cell size. No, you can have collisions, so the index is not perfect (which means you can have buckets for colliding keys and empty unused entries in the hashtable directory). Also, the index stores only offsets, not the keys. So they overhead of the index is sizeof(long) * number of key-value pairs in the file (in the optimal case, the overhead is 16 bytes for each colliding key-value pair as it's managed through a linked-list). For this reason using a load-factor > 1 can actually save memory and increase performance. Inside of the bucket, managed through a linked list, you scan the linked-list seeking to the key-value pair in the data segment and checking for key equality. this can require some seeks for colliding keys (as many as the number of entries in the bucket before the right one), but should not be the average case and can be decreased by increasing the load factor. This strategy is also used by cdb (http://cr.yp.to/cdb.html) and is quite effective in my benchmarks. > >> That is interesting, the principle looks like the same as in HFile >> (blocks and Btree) but I have to understand the difference further. >> > In HFile, the index is for every ~64k blocks or so; one block could > have one entry only while the next block could have hundreds of > entries; either has one entry in index only. So, its not the same as > mapfile. It will write an index entry every 128 (IIRC) entries > whether the entries are 10M each or 10bytes each (So, the indices > could be wildly different in size based off cell size). Thanks, this makes it all clear. >>> IIRC, we have >>> performance evaluation for various file types (You might be interested >>> in this recent posting by Mikhail Bautin of an hfile v2). >> I'd be interested in that, do you have a reference to it? >> > Here is Mikhail's post of an hfile v2 > https://issues.apache.org/jira/browse/HBASE-3857 > > The HFile perf test tool is in code base at > src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFilePerformance.java > (It was used comparing hfile vs tfile vs mapfile and then hfile > tweaks; it may be a little stale). > > Thanks for bringing up this topic Claudio and for taking the time to > do a bit of research. > Thank you very much for all this information. I'll try to implement a prototype in September, I'm pretty busy with deadlines right now. > St.Ack > -- Claudio Martella Free Software & Open Technologies Analyst TIS innovation park Via Siemens 19 | Siemensstr. 19 39100 Bolzano | 39100 Bozen Tel. +39 0471 068 123 Fax +39 0471 068 129 [EMAIL PROTECTED] http://www.tis.bz.it Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it. +
Claudio Martella 2011-07-18, 11:04
-
Re: Hash indexing of HFilesStack 2011-07-18, 16:05
On Mon, Jul 18, 2011 at 4:04 AM, Claudio Martella
<[EMAIL PROTECTED]> wrote: > No, you can have collisions, so the index is not perfect (which means > you can have buckets for colliding keys and empty unused entries in the > hashtable directory). Well, if a perfect index is what you are after, you can generate hashing functions that avoid collisions (You know this quality work by your fellow countrymen: http://sux.dsi.unimi.it/docs/it/unimi/dsi/sux4j/mph/MinimalPerfectHashFunction.html?) > Also, the index stores only offsets, not the keys. So they overhead of > the index is sizeof(long) * number of key-value pairs in the file (in > the optimal case, the overhead is 16 bytes for each colliding key-value > pair as it's managed through a linked-list). For this reason using a > load-factor > 1 can actually save memory and increase performance. > OK > Thank you very much for all this information. I'll try to implement a > prototype in September, I'm pretty busy with deadlines right now. > Good stuff, St.Ack +
Stack 2011-07-18, 16:05
-
Re: Hash indexing of HFilesClaudio Martella 2011-07-18, 16:22
On 7/18/11 6:05 PM, Stack wrote:
> On Mon, Jul 18, 2011 at 4:04 AM, Claudio Martella > <[EMAIL PROTECTED]> wrote: >> No, you can have collisions, so the index is not perfect (which means >> you can have buckets for colliding keys and empty unused entries in the >> hashtable directory). > Well, if a perfect index is what you are after, you can generate > hashing functions that avoid collisions (You know this quality work by > your fellow countrymen: > http://sux.dsi.unimi.it/docs/it/unimi/dsi/sux4j/mph/MinimalPerfectHashFunction.html?) > Yes, I had a look at it a while ago. For what I know perfect hashing doesn't work that good for many elements. With millions of items it should be computationally expensive and the probability of finding such a perfect hashing. Did you ever test this out? I think I can easily generate some millions of UUIDs and see how it goes. >> Also, the index stores only offsets, not the keys. So they overhead of >> the index is sizeof(long) * number of key-value pairs in the file (in >> the optimal case, the overhead is 16 bytes for each colliding key-value >> pair as it's managed through a linked-list). For this reason using a >> load-factor > 1 can actually save memory and increase performance. >> > OK > >> Thank you very much for all this information. I'll try to implement a >> prototype in September, I'm pretty busy with deadlines right now. >> > Good stuff, > St.Ack > -- Claudio Martella Free Software & Open Technologies Analyst TIS innovation park Via Siemens 19 | Siemensstr. 19 39100 Bolzano | 39100 Bozen Tel. +39 0471 068 123 Fax +39 0471 068 129 [EMAIL PROTECTED] http://www.tis.bz.it Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it. +
Claudio Martella 2011-07-18, 16:22
-
Re: Hash indexing of HFilesStack 2011-07-18, 16:32
On Mon, Jul 18, 2011 at 9:22 AM, Claudio Martella
<[EMAIL PROTECTED]> wrote: > Yes, I had a look at it a while ago. For what I know perfect hashing > doesn't work that good for many elements. With millions of items it > should be computationally expensive and the probability of finding such > a perfect hashing. Did you ever test this out? I think I can easily > generate some millions of UUIDs and see how it goes. I never tried it. I was just citing the technique. Even if it worked, Sebastiano's work is all LGPL so we would not be able to use it in our Apache project (unfortunately). St.Ack +
Stack 2011-07-18, 16:32
-
Re: Hash indexing of HFilesCasey Stella 2011-07-19, 01:02
I looked into MPH a while ago and came across Sebastiano's work, but was
even more intrigued by CMPH (http://cmph.sourceforge.net/), which claims to work on the order of a billion keys. I attempted a java port of BDZ (acyclic random 3-graphs FTW :) at one point, but gave up as I found something else a bit more suitable for my needs. On Mon, Jul 18, 2011 at 12:32 PM, Stack <[EMAIL PROTECTED]> wrote: > On Mon, Jul 18, 2011 at 9:22 AM, Claudio Martella > <[EMAIL PROTECTED]> wrote: > > Yes, I had a look at it a while ago. For what I know perfect hashing > > doesn't work that good for many elements. With millions of items it > > should be computationally expensive and the probability of finding such > > a perfect hashing. Did you ever test this out? I think I can easily > > generate some millions of UUIDs and see how it goes. > > I never tried it. I was just citing the technique. Even if it > worked, Sebastiano's work is all LGPL so we would not be able to use > it in our Apache project (unfortunately). > > St.Ack > +
Casey Stella 2011-07-19, 01:02
-
Re: Hash indexing of HFilesClaudio Martella 2011-07-19, 10:15
This looks great. Actually, more than BDZ, the intriguing part is CHM as
it's order preserving. I guess how it behaves for unseen keys. Do you know about it? What did you find more intriguing on this topic? :) On 7/19/11 3:02 AM, Casey Stella wrote: > I looked into MPH a while ago and came across Sebastiano's work, but was > even more intrigued by CMPH (http://cmph.sourceforge.net/), > which claims to work on the order of a billion keys. I attempted a java > port of BDZ (acyclic random 3-graphs FTW :) at one point, but gave up as I > found something > else a bit more suitable for my needs. > > On Mon, Jul 18, 2011 at 12:32 PM, Stack <[EMAIL PROTECTED]> wrote: > >> On Mon, Jul 18, 2011 at 9:22 AM, Claudio Martella >> <[EMAIL PROTECTED]> wrote: >>> Yes, I had a look at it a while ago. For what I know perfect hashing >>> doesn't work that good for many elements. With millions of items it >>> should be computationally expensive and the probability of finding such >>> a perfect hashing. Did you ever test this out? I think I can easily >>> generate some millions of UUIDs and see how it goes. >> I never tried it. I was just citing the technique. Even if it >> worked, Sebastiano's work is all LGPL so we would not be able to use >> it in our Apache project (unfortunately). >> >> St.Ack >> -- Claudio Martella Free Software & Open Technologies Analyst TIS innovation park Via Siemens 19 | Siemensstr. 19 39100 Bolzano | 39100 Bozen Tel. +39 0471 068 123 Fax +39 0471 068 129 [EMAIL PROTECTED] http://www.tis.bz.it Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it. +
Claudio Martella 2011-07-19, 10:15
-
Re: Hash indexing of HFilesCasey Stella 2011-07-19, 12:33
I didn't get a chance to investigate thoroughly or get any benchmarks. We
were looking for an alternate indexing strategy to B+ trees (with JDBM2) since we knew the keys a priori, but looking at the source I was a bit daunted at porting it and the license wasn't something that we could use. I started and it bit back enough that I abandoned pursuit. ;) On Tue, Jul 19, 2011 at 6:15 AM, Claudio Martella < [EMAIL PROTECTED]> wrote: > This looks great. Actually, more than BDZ, the intriguing part is CHM as > it's order preserving. > I guess how it behaves for unseen keys. Do you know about it? > > What did you find more intriguing on this topic? :) > > > On 7/19/11 3:02 AM, Casey Stella wrote: > > I looked into MPH a while ago and came across Sebastiano's work, but was > > even more intrigued by CMPH (http://cmph.sourceforge.net/), > > which claims to work on the order of a billion keys. I attempted a java > > port of BDZ (acyclic random 3-graphs FTW :) at one point, but gave up as > I > > found something > > else a bit more suitable for my needs. > > > > On Mon, Jul 18, 2011 at 12:32 PM, Stack <[EMAIL PROTECTED]> wrote: > > > >> On Mon, Jul 18, 2011 at 9:22 AM, Claudio Martella > >> <[EMAIL PROTECTED]> wrote: > >>> Yes, I had a look at it a while ago. For what I know perfect hashing > >>> doesn't work that good for many elements. With millions of items it > >>> should be computationally expensive and the probability of finding such > >>> a perfect hashing. Did you ever test this out? I think I can easily > >>> generate some millions of UUIDs and see how it goes. > >> I never tried it. I was just citing the technique. Even if it > >> worked, Sebastiano's work is all LGPL so we would not be able to use > >> it in our Apache project (unfortunately). > >> > >> St.Ack > >> > > > -- > Claudio Martella > Free Software & Open Technologies > Analyst > > TIS innovation park > Via Siemens 19 | Siemensstr. 19 > 39100 Bolzano | 39100 Bozen > Tel. +39 0471 068 123 > Fax +39 0471 068 129 > [EMAIL PROTECTED] http://www.tis.bz.it > > Short information regarding use of personal data. According to Section 13 > of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we > process your personal data in order to fulfil contractual and fiscal > obligations and also to send you information regarding our services and > events. Your personal data are processed with and without electronic means > and by respecting data subjects' rights, fundamental freedoms and dignity, > particularly with regard to confidentiality, personal identity and the right > to personal data protection. At any time and without formalities you can > write an e-mail to [EMAIL PROTECTED] in order to object the processing of > your personal data for the purpose of sending advertising materials and also > to exercise the right to access personal data and other rights referred to > in Section 7 of Decree 196/2003. The data controller is TIS Techno > Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the > complete information on the web site www.tis.bz.it. > > > > > +
Casey Stella 2011-07-19, 12:33
-
Re: Hash indexing of HFilesEric Charles 2011-07-16, 08:38
On 15/07/11 16:48, Michael Segel wrote:
> > Claudio, > > I'm not sure on how to answer this... > > Yes, we've got a prototype of a Lucene on HBase w Spatial that we're starting to test. > That's Cool Michael. Is there a chance to read more on your prototype ? ;) > With respect to hashing... > In one project we just hashed the key using the SHA-1 hash already in Java. This gave us the randomness without having to try to build a separate index. > But we're still using the base key for the row. Its not like we're creating a secondary index on a column value. > > There are a couple of other projects out there on Git Hub so you may want to check them out. > > HTH > > -Mike > > >> Date: Fri, 15 Jul 2011 14:32:50 +0200 >> From: [EMAIL PROTECTED] >> To: [EMAIL PROTECTED] >> Subject: Hash indexing of HFiles >> >> Hello list, >> >> at SIGMOD this year i've seen a spreading of different storage files for >> HBase, with different techniques. My scenario and usage doesn't really >> require range queries, so I thought I'd take advantage of even faster >> random i/o from hash indexing of data in each sequence file. >> >> Does anybody know if anybody has developed other indexing techniques for >> sequence files other than Btrees? >> >> >> Thanks! >> >> -- >> Claudio Martella >> Free Software& Open Technologies >> Analyst >> >> TIS innovation park >> Via Siemens 19 | Siemensstr. 19 >> 39100 Bolzano | 39100 Bozen >> Tel. +39 0471 068 123 >> Fax +39 0471 068 129 >> [EMAIL PROTECTED] http://www.tis.bz.it >> >> Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web site www.tis.bz.it. >> >> >> >> > -- Eric Charles http://about.echarles.net +
Eric Charles 2011-07-16, 08:38
-
Re: Hash indexing of HFilesMichel Segel 2011-07-16, 12:11
Eric,
It depends... 1) does it work well enough? 2) I'm a contractor so it's not my call. It's up to my client. Having said that, I think that if our tests go well, it might get out to git hub. At least that's our plan. But there's definitely more work to be done in terms of testing and tweaking the code. Sent from a remote device. Please excuse any typos... Mike Segel On Jul 16, 2011, at 3:38 AM, Eric Charles <[EMAIL PROTECTED]> wrote: > On 15/07/11 16:48, Michael Segel wrote: >> >> Claudio, >> >> I'm not sure on how to answer this... >> >> Yes, we've got a prototype of a Lucene on HBase w Spatial that we're starting to test. >> > > That's Cool Michael. > Is there a chance to read more on your prototype ? ;) > > >> With respect to hashing... >> In one project we just hashed the key using the SHA-1 hash already in Java. This gave us the randomness without having to try to build a separate index. >> But we're still using the base key for the row. Its not like we're creating a secondary index on a column value. >> >> There are a couple of other projects out there on Git Hub so you may want to check them out. >> >> HTH >> >> -Mike >> >> >>> Date: Fri, 15 Jul 2011 14:32:50 +0200 >>> From: [EMAIL PROTECTED] >>> To: [EMAIL PROTECTED] >>> Subject: Hash indexing of HFiles >>> >>> Hello list, >>> >>> at SIGMOD this year i've seen a spreading of different storage files for >>> HBase, with different techniques. My scenario and usage doesn't really >>> require range queries, so I thought I'd take advantage of even faster >>> random i/o from hash indexing of data in each sequence file. >>> >>> Does anybody know if anybody has developed other indexing techniques for >>> sequence files other than Btrees? >>> >>> >>> Thanks! >>> >>> -- >>> Claudio Martella >>> Free Software& Open Technologies >>> Analyst >>> >>> TIS innovation park >>> Via Siemens 19 | Siemensstr. 19 >>> 39100 Bolzano | 39100 Bozen >>> Tel. +39 0471 068 123 >>> Fax +39 0471 068 129 >>> [EMAIL PROTECTED] http://www.tis.bz.it >>> >>> Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the web > site www.tis.bz.it. >>> >>> >>> >>> >> > > > -- > Eric Charles > http://about.echarles.net > +
Michel Segel 2011-07-16, 12:11
-
Re: Hash indexing of HFilesEric Charles 2011-07-17, 04:30
On 16/07/11 14:11, Michel Segel wrote:
> Eric, > It depends... > 1) does it work well enough? > 2) I'm a contractor so it's not my call. It's up to my client. Having said that, I think that if our tests go well, it might get out to git hub. At least that's our plan. > > But there's definitely more work to be done in terms of testing and tweaking the code. > Hi Mike, Thx for the answer and have fun with the prototype :) Eric > Sent from a remote device. Please excuse any typos... > > Mike Segel > > On Jul 16, 2011, at 3:38 AM, Eric Charles<[EMAIL PROTECTED]> wrote: > >> On 15/07/11 16:48, Michael Segel wrote: >>> >>> Claudio, >>> >>> I'm not sure on how to answer this... >>> >>> Yes, we've got a prototype of a Lucene on HBase w Spatial that we're starting to test. >>> >> >> That's Cool Michael. >> Is there a chance to read more on your prototype ? ;) >> >> >>> With respect to hashing... >>> In one project we just hashed the key using the SHA-1 hash already in Java. This gave us the randomness without having to try to build a separate index. >>> But we're still using the base key for the row. Its not like we're creating a secondary index on a column value. >>> >>> There are a couple of other projects out there on Git Hub so you may want to check them out. >>> >>> HTH >>> >>> -Mike >>> >>> >>>> Date: Fri, 15 Jul 2011 14:32:50 +0200 >>>> From: [EMAIL PROTECTED] >>>> To: [EMAIL PROTECTED] >>>> Subject: Hash indexing of HFiles >>>> >>>> Hello list, >>>> >>>> at SIGMOD this year i've seen a spreading of different storage files for >>>> HBase, with different techniques. My scenario and usage doesn't really >>>> require range queries, so I thought I'd take advantage of even faster >>>> random i/o from hash indexing of data in each sequence file. >>>> >>>> Does anybody know if anybody has developed other indexing techniques for >>>> sequence files other than Btrees? >>>> >>>> >>>> Thanks! >>>> >>>> -- >>>> Claudio Martella >>>> Free Software& Open Technologies >>>> Analyst >>>> >>>> TIS innovation park >>>> Via Siemens 19 | Siemensstr. 19 >>>> 39100 Bolzano | 39100 Bozen >>>> Tel. +39 0471 068 123 >>>> Fax +39 0471 068 129 >>>> [EMAIL PROTECTED] http://www.tis.bz.it >>>> >>>> Short information regarding use of personal data. According to Section 13 of Italian Legislative Decree no. 196 of 30 June 2003, we inform you that we process your personal data in order to fulfil contractual and fiscal obligations and also to send you information regarding our services and events. Your personal data are processed with and without electronic means and by respecting data subjects' rights, fundamental freedoms and dignity, particularly with regard to confidentiality, personal identity and the right to personal data protection. At any time and without formalities you can write an e-mail to [EMAIL PROTECTED] in order to object the processing of your personal data for the purpose of sending advertising materials and also to exercise the right to access personal data and other rights referred to in Section 7 of Decree 196/2003. The data controller is TIS Techno Innovation Alto Adige, Siemens Street n. 19, Bolzano. You can find the complete information on the w eb >> site www.tis.bz.it. >>>> >>>> >>>> >>>> >>> >> >> >> -- >> Eric Charles >> http://about.echarles.net >> -- Eric Charles http://about.echarles.net +
Eric Charles 2011-07-17, 04:30
|