Otis Gospodnetic 2012-10-05, 03:34
Re: Lucene instead of HFiles?
Adrien Mogenet 2012-10-05, 06:36
"Don't bother trying this in production" ;-) 1. Are you sure lookup by key are faster ? 2. Updating Lucene files in a lock-free maneer and ensuring good concurrency can be a bit tricky 3. AFAIK, Lucene files don't fit in HDFS and thus another distributed storage is required. Katta does not look as powerful as Hadoop. On Fri, Oct 5, 2012 at 5:34 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Hi, > > Has anyone attempted using Lucene instead of HFiles (see > https://twitter.com/otisg/status/254047978174701568 )? > > Is that a completely crazy, bad, would-never-work, > don't-bother-trying-this-at-home, it's-too-late-go-to-sleep idea? Or > not? > > Thanks, > Otis > -- > Search Analytics - http://sematext.com/search-analytics/index.html> Performance Monitoring - http://sematext.com/spm/index.html-- Adrien Mogenet 06.59.16.64.22 http://www.mogenet.me
Re: Lucene instead of HFiles?
Renaud Delbru 2012-10-05, 08:48
Hi,

With respect to point 3, I know there is a new codec in Lucene 4.0 for append-only filesystems such as HDFS (LUCENE-2373).

It would also depend on the use case. At the moment, for storing data, I would expect HFile to be much more efficient in terms of compression than the Lucene file system (in fact, there is no real compression, apart from compressing the field byte stream yourself before storing it). There is some work to try to make Lucene more efficient for small and medium-sized fields (LUCENE-4226 - block-style compression and storing), but I think HFile is far more optimised for this task.

In fact, another interesting idea would be to investigate the use of HFile as a StoredFieldsFormat in Lucene. Efficient storage of data in Lucene is imho quite a missing feature.

my2c

Regards
--
Renaud Delbru
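As a rough illustration of the compression point in this message, the Python sketch below compares compressing each small stored field on its own with compressing many fields together as one block (the general idea behind block-style compression). It uses only zlib from the standard library, not Lucene or HBase code; the function names and the sample records are made up for illustration.

```python
import zlib


def compress_individually(fields):
    """Compress every field value separately; return the total compressed size."""
    return sum(len(zlib.compress(f)) for f in fields)


def compress_as_block(fields):
    """Concatenate length-prefixed fields and compress them in one go."""
    block = b"".join(len(f).to_bytes(4, "big") + f for f in fields)
    return len(zlib.compress(block))


# Hypothetical small stored fields that share a lot of structure.
fields = [
    f'{{"user": "user{i}", "status": "active", "country": "FR"}}'.encode()
    for i in range(1000)
]

raw_size = sum(len(f) for f in fields)
individual_size = compress_individually(fields)
block_size = compress_as_block(fields)
print(raw_size, individual_size, block_size)
```

Small values leave a compressor little redundancy to exploit and pay its fixed overhead on every value, whereas a block lets it reuse patterns shared across values. The price is that reading one value means decompressing its whole block.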
Re: Lucene instead of HFiles?
Otis Gospodnetic 2012-10-06, 02:31
Hi Renaud,

On Fri, Oct 5, 2012 at 4:48 AM, Renaud Delbru <[EMAIL PROTECTED]> wrote:
> With respect to point 3, I know there is a new codec in Lucene 4.0 for
> append-only filesystems such as HDFS (LUCENE-2373).

Yeah. Though I think nobody wants to search indices directly in HDFS, for performance reasons.

> It would also depend on the use case. At the moment, for storing data,
> I would expect HFile to be much more efficient in terms of compression than
> the Lucene file system (in fact, there is no real compression, apart from
> compressing the field byte stream yourself before storing it). There is some
> work to try to make Lucene more efficient for small and medium-sized fields
> (LUCENE-4226 - block-style compression and storing), but I think HFile is
> far more optimised for this task.

I wouldn't know... though I was under the impression there has been other work around packing things tightly both on disk and in memory. Check http://www.slideshare.net/lucenerevolution/willnauer-simon-doc-values-column-stride-fields-in-lucene... slide 16, etc.

> In fact, another interesting idea would be to investigate the use of HFile
> as a StoredFieldsFormat in Lucene. Efficient storage of data in Lucene is
> imho quite a missing feature.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html
Re: Lucene instead of HFiles?
Otis Gospodnetic 2012-10-06, 02:21
Hi,

On Fri, Oct 5, 2012 at 2:36 AM, Adrien Mogenet <[EMAIL PROTECTED]> wrote:
> "Don't bother trying this in production" ;-)
>
> 1. Are you sure lookups by key are faster?

No clue. But I also didn't say it's faster, just fast. :)

> 2. Updating Lucene files in a lock-free manner and ensuring good
> concurrency can be a bit tricky.

AFAIK Lucene files are immutable. Updates are a delete followed by an add. Deletes are flags, like tombstone markers in HBase.

> 3. AFAIK, Lucene files don't fit in HDFS, and thus another distributed
> storage is required. Katta does not look as powerful as Hadoop.

Katta and Hadoop are two different tools, though. From what I recall, Katta simply used HDFS for storing indices, but would push them elsewhere for searching purposes.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html
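The "immutable files, update = delete + add, deletes are flags" model described in this reply can be sketched with a toy in-memory structure. This is not Lucene's or HBase's API; the class and method names (Segment, ToyIndex, merge) are hypothetical and only illustrate the idea, written here in Python.

```python
class Segment:
    """A write-once batch of documents plus a set of tombstone flags."""

    def __init__(self, docs):
        self._docs = dict(docs)   # key -> value, never modified after creation
        self.deleted = set()      # tombstones live beside the data, not in it

    def get(self, key):
        if key in self._docs and key not in self.deleted:
            return self._docs[key]
        return None

    def mark_deleted(self, key):
        if key in self._docs:
            self.deleted.add(key)

    def live_items(self):
        return [(k, v) for k, v in self._docs.items() if k not in self.deleted]


class ToyIndex:
    def __init__(self):
        self.segments = []

    def add(self, key, value):
        # Every write creates a new segment; existing ones are left untouched.
        self.segments.append(Segment({key: value}))

    def delete(self, key):
        # A delete only flags the key in the segments that contain it.
        for segment in self.segments:
            segment.mark_deleted(key)

    def update(self, key, value):
        # An update is a delete of the old version followed by an add.
        self.delete(key)
        self.add(key, value)

    def get(self, key):
        for segment in reversed(self.segments):
            value = segment.get(key)
            if value is not None:
                return value
        return None

    def merge(self):
        # Merging rewrites the live documents into one new segment and
        # drops the flagged ones for good.
        live = {}
        for segment in self.segments:
            live.update(segment.live_items())
        self.segments = [Segment(live)]


index = ToyIndex()
index.add("a", 1)
index.update("a", 2)
print(index.get("a"), len(index.segments))
```

Because existing segments are never rewritten in place, readers do not need to lock them; the cost is that flagged entries keep occupying space until a merge rewrites the live data.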
RE: Lucene instead of HFiles?
Fuad Efendi 2012-10-06, 02:35
Lucene sucks with traditional "secondary indices" for traditional tables... too much engineering overhead... and you indeed already have a kind of "secondary index" with the HFile and Bloom filter structure... just design "secondary" Bloom filters, etc.
Yes, Lucene/Solr already implement this functionality. But we can improve it for "non-tokenized" secondary indices. -Fuad
RE: Lucene instead of HFiles?
Fuad Efendi 2012-10-06, 02:41
If you don't like HFiles and prefer Solr instead, consider Map. It is very nice... :-)
What about EhCache? Still synchronized?... Use LinkedHashMap...
You just need an "inverted table" for a search by secondary index, and you are comparing Lucene with HTable... wow... everything depends on the use case... I prefer auxiliary tables in HBase with extra, very fast FIFO in-memory caches, and if I don't need transactions, I don't use them...
-Fuad
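The "inverted table plus FIFO in-memory cache" approach mentioned in this message can be sketched as follows. This is a minimal Python illustration using plain dictionaries in place of HBase tables; the names FifoCache, IndexedTable, by_city and the "city" column are assumptions made up for the example, and OrderedDict stands in for the role LinkedHashMap would play in Java.

```python
from collections import OrderedDict


class FifoCache:
    """Bounded cache that evicts the entry inserted first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        return self._items.get(key)

    def put(self, key, value):
        if key not in self._items and len(self._items) >= self.capacity:
            self._items.popitem(last=False)   # drop the oldest insertion
        self._items[key] = value

    def discard(self, key):
        self._items.pop(key, None)


class IndexedTable:
    """Primary table plus an auxiliary table keyed by a secondary column."""

    def __init__(self, cache_size=2):
        self.rows = {}       # primary table: row key -> {column: value}
        self.by_city = {}    # auxiliary "inverted table": city -> row keys
        self.cache = FifoCache(cache_size)

    def put(self, row_key, row):
        old = self.rows.get(row_key)
        if old is not None:
            # Keep the auxiliary table consistent when a row changes city.
            self.by_city[old["city"]].discard(row_key)
            self.cache.discard(old["city"])
        self.rows[row_key] = row
        self.by_city.setdefault(row["city"], set()).add(row_key)
        self.cache.discard(row["city"])

    def find_by_city(self, city):
        cached = self.cache.get(city)
        if cached is not None:
            return cached
        keys = sorted(self.by_city.get(city, ()))
        self.cache.put(city, keys)
        return keys


table = IndexedTable()
table.put("r1", {"city": "Paris"})
table.put("r2", {"city": "Berlin"})
table.put("r3", {"city": "Paris"})
print(table.find_by_city("Paris"))
```

The two writes in put are not atomic here, which mirrors the remark about skipping transactions when they are not needed: a real deployment would have to decide how much inconsistency between the main and auxiliary tables it can tolerate.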
Re: Lucene instead of HFiles?
Lars George 2012-10-05, 09:11
Hi Otis,

My initial reaction was, "interesting idea". On second thought, though, I do not see how this makes more sense compared to what we have now. HFiles combined with Bloom filters are fast to look up anyway. Adding Lucene as another "Storage Engine" (getting us close to Voldemort or MySQL with replaceable storage backends) does not seem to add any value and, more so, might even have a few drawbacks. Range scans especially will suffer, as HFiles and their block-oriented layout plus caching make for really fast I/O. Lucene is for search, not xyzbytes of data transfers. And simply replacing the block index and Blooms with Lucene is, I think, also overkill. Just saying.

Lars
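To make the "Bloom filter plus block index" argument in this message concrete, here is a toy Python model of a sorted, block-oriented file. It is a simplified sketch and not the actual HFile format; BloomFilter, ToySortedFile, the block size and the hash scheme are all assumptions chosen for illustration.

```python
import bisect
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


class ToySortedFile:
    """Sorted key/value pairs split into blocks, with a block index and a Bloom filter."""

    def __init__(self, items, block_size=4):
        pairs = sorted(items.items())
        self.blocks = [pairs[i:i + block_size]
                       for i in range(0, len(pairs), block_size)]
        self.first_keys = [block[0][0] for block in self.blocks]  # block index
        self.bloom = BloomFilter()
        for key, _ in pairs:
            self.bloom.add(key)
        self.blocks_read = 0

    def _block_for(self, key):
        # Last block whose first key is <= key (or block 0 if none is).
        return max(bisect.bisect_right(self.first_keys, key) - 1, 0)

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None               # skipped without touching any block
        self.blocks_read += 1
        for k, v in self.blocks[self._block_for(key)]:
            if k == key:
                return v
        return None

    def scan(self, start, stop):
        """Yield pairs with start <= key < stop by walking consecutive blocks."""
        if not self.blocks:
            return
        for block in self.blocks[self._block_for(start):]:
            for k, v in block:
                if k >= stop:
                    return
                if k >= start:
                    yield k, v


data = ToySortedFile({f"row{i:03d}": i for i in range(20)})
print(data.get("row007"), list(data.scan("row005", "row008")))
```

A point lookup reads at most one block, and the Bloom filter lets most lookups for absent keys skip the file entirely. A range scan is a single seek through the block index followed by sequential reads, which is the access pattern the message argues a search-oriented index is not designed around.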
Re: Lucene instead of HFiles?
Michael Segel 2012-10-05, 11:14
Actually, I think you'd want to do the reverse: store your Lucene index in HBase, which is what we did a while back. This could be extended to Solr, but we never had time to do it.
Re: Lucene instead of HFiles?
Otis Gospodnetic 2012-10-06, 02:38
Hi Lars,

Yeah, maybe. Somewhere in the back of my head was a completely fuzzy idea that if one were to sneak in Lucene at that low level, one could get the full-text search over HBase data that comes up periodically. Also, I was thinking that having Lucene down there could make it possible to get ad-hoc reports on data in HBase without having to figure out the key structure ahead of time.

But I think Jacques makes a good point - there are already ElasticSearch and Solr. They are full-text search engines, but people also use them for pure boolean matching, as key-value stores, etc.

Otis
--
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html
Re: Lucene instead of HFiles?
Jacques 2012-10-05, 13:43
Abstractly, isn't this what Elastic Search and Katta already are: range-sharded data stores built on top of Lucene?

J