-
HBase and Lucene for realtime search
Jason Rutherglen 2011-02-11, 23:10
Hello,

I'm curious as to what a 'good' approach would be for implementing search in HBase (using Lucene) with the end goal being the integration of realtime search into HBase. I think the use case makes sense as HBase is realtime and has a write-ahead log, performs automatic partitioning, splitting of data, failover, redundancy, etc. These are all things Lucene does not have out of the box, that we'd essentially get for 'free'.

For starters: Where would be the right place to store Lucene segments or postings? Eg, we need to be able to efficiently perform a linear iteration of the per-term posting list(s).

Thanks!

Jason Rutherglen
-
Re: HBase and Lucene for realtime search
Ted Dunning 2011-02-11, 23:27
Jason,

I can't imagine that the speed achieved by using HBase would be even within orders of magnitude of what you can do in Lucene 4 (or even 3).

For reference, I think that Michi Busch's search based on flexible indexing is able to handle >10,000 inserts and >40,000 searches per second on a laptop. Each search involves a number of scans of posting vectors, so this is roughly equivalent to >100,000 scans per second (on a single host).

The rumor is that the insert speed is so high that it is quicker to re-index 500 million documents than to load an index.

I don't think that HBase is intended to be anywhere near this kind of speed.
-
Re: HBase and Lucene for realtime search
Jason Rutherglen 2011-02-11, 23:50
> I can't imagine that the speed achieved by using Hbase would be even within
> orders of magnitude of what you can do in Lucene 4 (or even 3).

The indexing speed in Lucene hasn't changed in quite a while. Are you saying HBase would somehow be overloaded? That doesn't seem to jibe with the sequential writes HBase performs?

On the query side, I think they should be fine as well? At rock bottom, all we need to be able to do is sequentially scan the posting lists?

The speed of indexing is a function of creating segments. With flexible indexing, the underlying segment files (and postings) may be significantly altered from the default file structures, eg, placed into HBase in various ways. The posting lists could even be split along with HBase regions?

> For reference, I think that Michi Busch's search based on flexible indexing

You mean for Twitter? I can't comment on that, however as far as I know the internals don't use Lucene, eg, it's an entirely new inverted index structure specifically for Twitter. I think this is illustrated in these slides:

http://www.lucenerevolution.org/sites/default/files/Lucene%20Rev%20Preso%20Busch%20Realtime_Search_LR1010.pdf
-
Re: HBase and Lucene for realtime search
Ted Dunning 2011-02-12, 00:13
On Fri, Feb 11, 2011 at 3:50 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:

> > I can't imagine that the speed achieved by using Hbase would be even within
> > orders of magnitude of what you can do in Lucene 4 (or even 3).
>
> The indexing speed in Lucene hasn't changed in quite a while, are you
> saying HBase would somehow be overloaded? That doesn't seem to jive
> with the sequential writes HBase performs?

Michi's stuff uses flexible indexing with a zero lock architecture. The speed *is* much higher.

The real problem is that HBase repeats keys.

If you were to store entire posting vectors as values with terms as keys, you might be OK. Very long posting vectors or add-ons could be added using a key+serial number trick.

Short queries would involve reading and merging several posting vectors. In that mode, query speeds might be OK, but there isn't a lot of Lucene left at that point. For updates, speed would only be acceptable if you batch up a lot of updates or possibly if you build in a value append function as a co-processor.

> The speed of indexing is a function of creating segments, with
> flexible indexing, the underlying segment files (and postings) may be
> significantly altered from the default file structures, eg, placed
> into HBase in various ways. The posting lists could even be split
> along with HBase regions?

Possibly. But if you use term + counter and posting vectors of limited length you might be OK.
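The term + counter layout described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `SortedKV` is a hypothetical in-memory stand-in for a store with sorted row keys, not the HBase client API, and the chunk size, NUL separator and zero-padded counter width are arbitrary choices made for the example.

```python
from bisect import bisect_left, insort

CHUNK = 4  # max doc ids per stored value; tiny on purpose, for illustration


class SortedKV:
    """Hypothetical in-memory stand-in for a store with sorted row keys."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._vals = {}   # row key -> value

    def put(self, key, value):
        if key not in self._vals:
            insort(self._keys, key)
        self._vals[key] = value

    def scan_prefix(self, prefix):
        # Seek to the first key >= prefix, then read forward sequentially.
        i = bisect_left(self._keys, prefix)
        while i < len(self._keys) and self._keys[i].startswith(prefix):
            yield self._keys[i], self._vals[self._keys[i]]
            i += 1


def row_key(term, seq):
    # NUL separator keeps "car" rows apart from "cart" rows;
    # the zero-padded counter makes chunks sort in write order.
    return f"{term}\x00{seq:08d}"


def write_postings(store, term, doc_ids):
    doc_ids = sorted(doc_ids)
    for seq, start in enumerate(range(0, len(doc_ids), CHUNK)):
        store.put(row_key(term, seq), doc_ids[start:start + CHUNK])


def read_postings(store, term):
    # Linear iteration over one term's posting list, chunk by chunk.
    for _, chunk in store.scan_prefix(f"{term}\x00"):
        yield from chunk


def intersect(a, b):
    # Merge two sorted doc-id streams (a two-term AND query).
    a, b = iter(a), iter(b)
    out = []
    x, y = next(a, None), next(b, None)
    while x is not None and y is not None:
        if x == y:
            out.append(x)
            x, y = next(a, None), next(b, None)
        elif x < y:
            x = next(a, None)
        else:
            y = next(b, None)
    return out


store = SortedKV()
write_postings(store, "hbase", range(1, 11))
write_postings(store, "lucene", [2, 4, 6, 8, 10, 12])
both = intersect(read_postings(store, "hbase"), read_postings(store, "lucene"))
```

Because each value holds a bounded number of doc ids, the repeated row key is paid once per chunk rather than once per posting, which is the point of storing whole posting vectors as values.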
-
Re: HBase and Lucene for realtime search
Jason Rutherglen 2011-02-12, 00:44
> Michi's stuff uses flexible indexing with a zero lock architecture. The
> speed *is* much higher.

The speed's higher, and there isn't much Lucene left there either, as I believe it was built specifically for the 140 characters use case (eg, not the general use case). I don't think most indexes can be compressed to only exist in RAM on a single server? The Twitter use case isn't one that the HBase RT search solution is useful for?

> If you were to store entire posting vectors as values with terms as keys,
> you might be OK. Very long posting vectors or add-ons could be added using
> a key+serial number trick.

This sounds like the right approach to try. Also, the Lucene terms dict is sorted anyway, so moving the terms into HBase's sorted keys probably makes sense.

> For updates, speed would only be acceptable if you batch up a
> lot updates or possibly if you build in a value append function as a
> co-processor.

Hmm... I think the main issue would be the way Lucene implements deletes (eg, today as a BitVector). I think we'd keep that functionality. The new docs/updates would be added to the in-RAM buffer. I think there'd be a RAM-size-based flush as there is today. Where that'd be flushed to is an open question.

I think the key advantage of the RT + HBase architecture is that the index would live alongside HBase columns, and so all other scaling problems (especially those related to scaling RT, such as synchronization of distributed data and updates) go away.

A distributed query would remain the same, eg, it'd hit N servers?

In addition, Lucene offers a wide variety of new query types which HBase'd get in realtime for free.
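The combination mentioned above (deletes kept as a bit vector, new documents held in an in-RAM buffer, and a flush triggered by buffer size) might look roughly like this. It is a toy sketch, not Lucene's implementation: the class name `BufferedIndex`, the byte-size estimate and the list of flushed segments are all assumptions made for illustration, and where a flush would actually write to is left open, as in the message.

```python
class BufferedIndex:
    """Toy inverted index: RAM buffer + flushed segments + delete bits."""

    def __init__(self, flush_threshold_bytes):
        self.flush_threshold = flush_threshold_bytes
        self.buffer = {}            # term -> [doc ids], not yet flushed
        self.buffer_bytes = 0       # crude estimate of buffer size
        self.segments = []          # flushed, immutable term -> (doc ids)
        self.deleted = bytearray()  # one bit per doc id
        self.next_doc = 0

    def add(self, text):
        doc_id = self.next_doc
        self.next_doc += 1
        for term in set(text.lower().split()):
            self.buffer.setdefault(term, []).append(doc_id)
            self.buffer_bytes += len(term) + 4  # rough cost per posting
        if self.buffer_bytes >= self.flush_threshold:
            self.flush()
        return doc_id

    def flush(self):
        # Freeze the buffer as a segment; the destination is an open question.
        if self.buffer:
            self.segments.append({t: tuple(d) for t, d in self.buffer.items()})
        self.buffer, self.buffer_bytes = {}, 0

    def delete(self, doc_id):
        # Deleting only sets a bit; postings are left untouched.
        byte, bit = divmod(doc_id, 8)
        if byte >= len(self.deleted):
            self.deleted.extend(b"\x00" * (byte + 1 - len(self.deleted)))
        self.deleted[byte] |= 1 << bit

    def is_deleted(self, doc_id):
        byte, bit = divmod(doc_id, 8)
        return byte < len(self.deleted) and bool(self.deleted[byte] & (1 << bit))

    def search(self, term):
        # Scan flushed segments, then the live buffer, skipping deleted docs.
        hits = []
        for seg in self.segments + [self.buffer]:
            hits.extend(d for d in seg.get(term, ()) if not self.is_deleted(d))
        return hits


idx = BufferedIndex(flush_threshold_bytes=20)
idx.add("hbase lucene")    # doc 0, stays in the buffer
idx.add("lucene search")   # doc 1, pushes the buffer over the threshold
idx.add("hbase")           # doc 2, lands in a fresh buffer
```

Keeping deletes as bits means flushed postings never need rewriting; the cost is a bit test per candidate hit at query time.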
-
Re: HBase and Lucene for realtime search
Ted Dunning 2011-02-12, 02:20
Go for it!
-
Re: HBase and Lucene for realtime search
Jason Rutherglen 2011-02-12, 02:56
Thanks! In browsing the HBase code, I think it'd be optimal to stream the posting/binary data directly from the underlying storage (instead of loading the entire byte[]). It doesn't look like there's a way to do this (yet)?
-
Re: HBase and Lucene for realtime search
Ted Dunning 2011-02-12, 03:00
No. And I doubt there ever will be.

That was one reason to split the larger posting vectors. That way you can multi-thread the fetching and the scoring.
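Splitting a long posting vector into chunks so that fetching and scoring can run on several threads could be sketched as below. The store is a plain dict keyed by a hypothetical `(term, seq)` tuple, and the constant per-term weight is a placeholder for a real scoring function; neither reflects an actual HBase or Lucene API.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(store, key):
    # Stand-in for a remote read of one posting-vector chunk.
    return store[key]


def score_chunk(doc_ids, weight):
    # Placeholder scoring: every doc in the chunk gets the term's weight.
    return {doc_id: weight for doc_id in doc_ids}


def parallel_term_scores(store, term, weight, workers=4):
    # Each chunk of the term's posting vector is fetched and scored
    # independently, then the partial results are merged.
    keys = sorted(k for k in store if k[0] == term)
    scores = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(
            lambda k: score_chunk(fetch_chunk(store, k), weight), keys
        )
        for partial in partials:
            scores.update(partial)
    return scores


chunks = {
    ("hbase", 0): [1, 2],
    ("hbase", 1): [3],
    ("lucene", 0): [2],
}
result = parallel_term_scores(chunks, "hbase", 1.5)
```

Since chunks of one term hold disjoint doc ids, merging the partial score maps needs no coordination between workers.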
-
Re: HBase and Lucene for realtime search
Jason Rutherglen 2011-02-12, 03:21
> No. And I doubt there ever will be

Hmm... Because of the use of blocks at a low level? This isn't too much different from an OS filesystem, however I wonder how much overhead there is in the use of HBase blocks? If the posting exceeded the block size, yeah, that'd be an issue. Spanning key-value pairs for a posting, that sounds a little scary. However, it seems possible to provide direct access to the underlying filesystem in a separate API? I'm surprised this isn't a more requested feature given HBase is 'based' on BigTable, which can store large BLOBs?

If the query performance degrades at all, then this isn't a viable solution. Though the advantages of storing the indexes in HBase, and then leveraging the data storage, replication, and distribution capabilities, would seem to make sense.
-
Re: HBase and Lucene for realtime search
Bruno Dumon 2011-02-12, 11:02
Hi,

AFAIU scaling fulltext search is usually done by processing partitions of posting lists concurrently. That is essentially what you get with sharded solr/katta/elasticsearch. I wonder how you would map things to HBase so that this would be possible. HBase scales on the row key, so if you use the term as row key you can have a quasi-unlimited number of terms, but not unlimitedly long posting lists (i.e., documents) for those terms. The posting lists would not be sharded. If you use a 'term+seqnr' approach (manual sharding), the rows for a term will usually end up in the same region, so reading them will all touch the same server.

There is something to be said for keeping the fulltext index for all rows stored in one HBase region alongside the region, but when a region splits, splitting the fulltext index would be expensive.

BTW, here is another attempt to build fulltext search on top of HBase:

http://bizosyshsearch.sourceforge.net/

But from what I understood, their approach to scalability is partitioning by term (instead of by document), and sharding over multiple HBase clusters:

http://sourceforge.net/projects/bizosyshsearch/forums/forum/1295149/topic/4006417

Bruno Dumon
Outerthought
http://outerthought.org/
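The difference between the two partitioning schemes can be made concrete with a small sketch. The split points below are hypothetical region boundaries and `region_for` is a simplified model of range partitioning on sorted row keys, not HBase's actual region lookup; the shards are plain dicts standing in for per-server document-partitioned indexes.

```python
from bisect import bisect_right

# Hypothetical region boundaries over sorted row keys -> 4 regions.
SPLIT_POINTS = ["g", "n", "t"]


def region_for(row_key):
    # Range partitioning: a key belongs to the region whose range contains it.
    return bisect_right(SPLIT_POINTS, row_key)


def term_partitioned_regions(term, chunks):
    # 'term+seqnr' rows share a prefix, so they sort next to each other
    # and normally fall into a single region (one server does all the work).
    return {region_for(f"{term}\x00{seq:08d}") for seq in range(chunks)}


def doc_partitioned_search(shards, term):
    # Document partitioning: every shard holds its own slice of each
    # posting list, so the query fans out and the partial hits are merged.
    hits = []
    for shard in shards:  # each lookup could run on a different server
        hits.extend(shard.get(term, []))
    return sorted(hits)


shards = [{"lucene": [1, 5]}, {"lucene": [2]}, {"hbase": [3]}]
regions = term_partitioned_regions("lucene", 50)
hits = doc_partitioned_search(shards, "lucene")
```

All 50 chunks of the term map to one region here, so reads for that term are served by a single server, while the document-partitioned query spreads the same work over every shard. A region boundary that happens to fall inside a term's chunk range would split it across two regions, which is why the rows only "usually" share a region.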
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-12, 15:13
Bruno,

Thanks for the response.

> solr/katta/elasticsearch

These don't have a distributed solution for realtime search [yet]. E.g., a transaction log is required, plus a place to store the versioned documents, which sounds a lot like HBase? Query sharding/partitioning is fairly trivial, and something this solution would need to leverage as well.

> http://bizosyshsearch.sourceforge.net/

I looked. I'm a little confused as to why this and things like Lucandra/Solandra create their own indexes, as this is [probably] going to yield unpredictable RAM and performance inefficiencies that Lucene solved long ago. The user will [likely] want queries that are as fast as possible. This is why Lucene 4.x's flexible indexing is interesting to use in conjunction with HBase: there shouldn't be a slowdown in queries unless IO overhead is added by the low-level use of HBase to store and iterate the postings.

I'd imagine the documents pertaining to an index would 'stick' with that index, meaning they'd stay in the same region. I'm not sure how that'd be implemented in HBase.

> HBase scales on the row key, so if you use the term
> as row key you can have an quasi-unlimited amount of terms, but not
> unlimited long posting lists (i.e., documents) for those terms. The posting
> lists would not be sharded. If you use a 'term+seqnr' approach (manual
> sharding), the terms will usually end up in the same region, so reading them
> will all touch the same server.

The posting list will need to stay in the same region, and the [few] posting lists that span rows likely won't impact performance much, e.g., they'll probably only need to span once? That'll need to be tested. I'm not sure how we'd efficiently map doc-ids to keys to the actual document data.

> There is something to say for keeping the fulltext index for all rows stored
> in one HBase region alongside the region, but when a region splits,
> splitting the fulltext index would be expensive.

Right, splitting postings was briefly discussed in Lucene-land, and is probably implementable in an efficient way.

Jason
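To make the 'term+seqnr' (manual sharding) layout concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: a plain dict stands in for an HBase table, and `row_key`, `CHUNK` and the NUL separator are hypothetical choices, not anything defined by HBase or Lucene.

```python
from typing import Dict, Iterator, List

CHUNK = 4  # hypothetical max doc IDs per row; a real value would be far larger


def row_key(term: str, seqnr: int) -> bytes:
    # Zero-padded sequence number so the rows of one term sort in chunk order.
    return f"{term}\x00{seqnr:08d}".encode("utf-8")


def chunk_postings(term: str, doc_ids: List[int], chunk: int = CHUNK) -> Dict[bytes, List[int]]:
    """Split one sorted posting list into rows keyed by term + seqnr."""
    rows: Dict[bytes, List[int]] = {}
    for seqnr, start in enumerate(range(0, len(doc_ids), chunk)):
        rows[row_key(term, seqnr)] = doc_ids[start:start + chunk]
    return rows


def iterate_postings(table: Dict[bytes, List[int]], term: str) -> Iterator[int]:
    """Linear iteration over a term's postings via a sorted prefix scan."""
    prefix = term.encode("utf-8") + b"\x00"
    for key in sorted(table):  # stands in for an ordered range scan
        if key.startswith(prefix):
            yield from table[key]


table = chunk_postings("hbase", list(range(1, 11)))
table.update(chunk_postings("hbasene", [3, 7]))
```

Because all rows of a term share a key prefix, they sort next to each other, which is why they would usually land in the same region and be read from the same server.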
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-12, 21:01
So in giving this a day of breathing room, it looks like HBase loads values as it's scanning a column? I think that'd be a killer for some Lucene queries, e.g., we'd be loading entire (or partial) posting lists just for a linear scan of the terms dict? Or we'd probably instead want to place the posting list into its own column?

Another approach would be to feed off the HLog and place updates into a dedicated RT Lucene index (i.e., outside of HBase). With the latter system we'd get transactional consistency, and we wouldn't need to work so hard to force Lucene's index into HBase columns etc. (which is extremely high risk). Once built, the indexes could be offloaded automatically into HDFS. This architecture would be more of a Lucene index running in 'parallel' to HBase. We'd still gain the removal of doc-stores, we wouldn't need to worry about tacking on new HBase-specific merge policies, and we'd gain [probably most importantly] a consistent transactional view of the data, while also being able to query that data using con/disjunction and phrase queries, amongst others. A delete or update in HBase would cascade into a Lucene delete, and this would be performed atomically, and vice versa.
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-12, 21:14
I really think that putting update semantics into Katta would be much easier.

Building the write-ahead log for the Lucene case isn't all that hard. If you follow the Zookeeper model of having a WAL thread that writes batches of log entries you can get pretty high speed as well. The basic idea is that update requests are put into a queue of pending log writes, but are written to the index immediately. When the WAL thread finishes the previous tranche of log items, it comes back around and takes everything that is pending. When it finishes a tranche of writes, it releases all of the pending updates in a batch. If updates are not frequent, then you lose no latency. If your updates are very high speed, then you transition seamlessly to a bandwidth-oriented scheme of large updates while latency is roughly bounded to 2-3x the original case.

If you put the write-ahead log on a reliable replicated file system then, as you say, much of the complexity of write-ahead logging goes away.

But this verges off topic for hbase.
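The batching WAL thread described above can be sketched roughly as follows. This is a toy under stated assumptions, not Zookeeper's or Katta's code: `GroupCommitLog` and its methods are hypothetical names, an in-memory list stands in for the durable log, and the immediate write to the index is left out.

```python
import threading
from typing import List


class GroupCommitLog:
    """Toy write-ahead log with one writer thread that commits in batches."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._pending: List[tuple] = []      # (entry, done_event) not yet logged
        self._closed = False
        self.batches: List[List[str]] = []   # stands in for the durable log
        self._writer = threading.Thread(target=self._run, daemon=True)
        self._writer.start()

    def append(self, entry: str) -> None:
        """Queue an entry and block until the batch containing it is written."""
        done = threading.Event()
        with self._cond:
            self._pending.append((entry, done))
            self._cond.notify()
        done.wait()

    def _run(self) -> None:
        while True:
            with self._cond:
                while not self._pending and not self._closed:
                    self._cond.wait()
                if not self._pending and self._closed:
                    return
                # Take everything that queued up during the previous write.
                batch, self._pending = self._pending, []
            self.batches.append([entry for entry, _ in batch])  # one write per batch
            for _, done in batch:  # release the whole batch at once
                done.set()

    def close(self) -> None:
        with self._cond:
            self._closed = True
            self._cond.notify()
        self._writer.join()


log = GroupCommitLog()
workers = [threading.Thread(target=log.append, args=(f"update-{i}",)) for i in range(20)]
for w in workers:
    w.start()
for w in workers:
    w.join()
log.close()
logged = sorted(entry for batch in log.batches for entry in batch)
```

With few writers each batch holds a single entry, so nothing waits on batching; under load the batches grow on their own, which is the latency-to-bandwidth transition described above.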
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-12, 21:31
Right, the concepts aren't that hard (write-ahead log etc.), however keeping the data transactionally consistent with another datastore across servers [I believe] is a little more difficult? Also, with RT there needs to be a primary data store somewhere outside of Lucene, otherwise we'd be storing the same data twice, e.g., in HBase and Lucene, which is inefficient. I'm guessing it'll be easier to keep Lucene indexes in parallel with HBase regions across servers, and then use the Coprocessor architecture etc. to keep them in sync on the same server. When a region is split, we'd need to also split the Lucene index; this would be the only 'new' technology that'd need to be created on the Lucene side.

I think it's advantageous to build a distributed search system that mirrors the underlying data. If the search indices are on their own servers, I think there are always going to be syncing problems?
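As a rough illustration of what splitting an index alongside a region could involve, here is a small Python sketch. It rests on a strong assumption that is not established in this thread: doc IDs are assigned in row-key order, so a region split point maps to a single doc-ID boundary. `split_index` is a hypothetical helper, not a Lucene API.

```python
from bisect import bisect_left
from typing import Dict, List, Tuple

Postings = Dict[str, List[int]]  # term -> sorted doc IDs


def split_index(index: Postings, split_doc_id: int) -> Tuple[Postings, Postings]:
    """Split every posting list at split_doc_id (the lower half excludes it)."""
    lower: Postings = {}
    upper: Postings = {}
    for term, doc_ids in index.items():
        cut = bisect_left(doc_ids, split_doc_id)  # binary search, no re-sorting
        if doc_ids[:cut]:
            lower[term] = doc_ids[:cut]
        if doc_ids[cut:]:
            upper[term] = doc_ids[cut:]
    return lower, upper


index = {"hbase": [1, 4, 9, 12], "lucene": [2, 9], "katta": [15]}
lower, upper = split_index(index, 9)
```

If doc IDs do not follow row-key order, each posting list would have to be filtered and rewritten rather than cut in one place, which is where the cost of a split would come from.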
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-13, 09:36
Transactional consistency isn't going to happen if you even involve more than one hbase row.

I haven't seen any search sites that absolutely need transactional consistency. What they need is that documents can be found very shortly after they are inserted and that crashes won't compromise that.
-
Re: HBase and Lucene for realtime searchBruno Dumon 2011-02-13, 13:13
On Sat, Feb 12, 2011 at 10:31 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:

> Right, the concepts aren't that hard (write ahead log etc), however to
> keep the data transactionally consistent with another datastore across
> servers [I believe] is a little more difficult?

I assume you don't really need ACID transactions, but only the guarantee that when you update an HBase row, its index will eventually be updated too (possibly with a little "RT" delay)?

[As you probably know,] the basic solution to do this across systems is a write-ahead log outside of these systems, i.e. the sequence to perform an update would be:

(1) write update to the WAL
(2) perform update on HBase
(3) perform update on Lucene

If it fails anywhere in between, one can always replay from the WAL. If you add a write-ahead log just to e.g. Katta, that won't help yet with the consistency across the systems, as it could fail between doing the update to HBase and writing to the Katta WAL.

We do have something like this in Lily (http://lilyproject.org, check the 'rowlog' thing), though it is somewhat different from the above: to the "WAL" we only write the ID of the row, since we consider the update to the HBase row to be the main action and all that follows just secondary side-effects (i.e. there's no rollback).

Slightly similar ideas can be found in Google's Percolator paper.

> Also with RT there
> needs to be a primary data store somewhere outside of Lucene,
> otherwise we'd be storing the same data twice, eg, in HBase and
> Lucene, that's inefficient. I'm guessing it'll be easier to keep
> Lucene indexes in parallel with HBase regions across servers, and then
> use the Coprocessor architecture etc, to keep them in'sync, on the
> same server. When a region is split, we'd need to also split the
> Lucene index, this'd be the only 'new' technology that'd need to be
> created on the Lucene side.

That would definitely be interesting, but I guess for it to work with good performance the ordering of the HBase row keys should be the same as that of the Lucene doc IDs (so that posting lists can be split in the middle rather than having to rearrange everything), and I don't see how that could be the case.

Another issue is that the scalability needs for search might be different. An HBase region is always active in only one region server; there are no active replicas, while for search you often need replicas to scale, since a search will typically hit all partitions.

--
Bruno Dumon
Outerthought
http://outerthought.org/
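The three-step sequence with an external write-ahead log can be sketched as below. This is a minimal illustration, not Lily's rowlog: `Store`, `apply_update` and `replay` are hypothetical names, dicts stand in for HBase and Lucene, and the sketch assumes that re-applying an update is harmless (a put simply overwrites).

```python
from typing import Dict, List, Tuple

Update = Tuple[str, str]  # (row_id, document text)


class Store:
    """Stand-in for HBase or a Lucene index; put() overwrites, so it is idempotent."""

    def __init__(self, fail_on: int = -1) -> None:
        self.rows: Dict[str, str] = {}
        self._calls = 0
        self._fail_on = fail_on  # simulate a crash on the Nth put() call

    def put(self, row_id: str, doc: str) -> None:
        self._calls += 1
        if self._calls == self._fail_on:
            raise RuntimeError("simulated crash")
        self.rows[row_id] = doc


def apply_update(wal: List[Update], hbase: Store, lucene: Store, update: Update) -> None:
    wal.append(update)   # (1) write update to the WAL
    hbase.put(*update)   # (2) perform update on HBase
    lucene.put(*update)  # (3) perform update on Lucene


def replay(wal: List[Update], hbase: Store, lucene: Store) -> None:
    """Re-apply every logged update to both stores."""
    for update in wal:
        hbase.put(*update)
        lucene.put(*update)


wal: List[Update] = []
hbase, lucene = Store(), Store(fail_on=2)
apply_update(wal, hbase, lucene, ("row1", "first doc"))
try:
    apply_update(wal, hbase, lucene, ("row2", "second doc"))
except RuntimeError:
    pass  # failure between step (2) and step (3)
out_of_sync = hbase.rows != lucene.rows
replay(wal, hbase, lucene)
```

A real log would also track which entries are fully applied so that replay does not start from the beginning every time; that bookkeeping is omitted here.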
-
Re: HBase and Lucene for realtime searchThomas Koch 2011-02-13, 16:26
Hi Jason,

I had the same idea around last year but didn't continue it since I'm leaving the company right now.

Do you want to do term or document partitioning? Both have advantages and disadvantages. You can get a very good introduction in chapter 14.1 of this book:
http://www.ir.uwaterloo.ca/book

The following lecture gives a very interesting insight into Google's index architecture:
http://videolectures.net/wsdm09_dean_cblirs

Projects that do document partitioning:
distributed solr, katta, elasticsearch, linkedin's Sensei

Projects that do term partitioning:
lucandra/solandra (using cassandra), hbasene (which has been abandoned for a year)

I very much thought that hbasene would be a perfect solution for scalable search, but the above book and video convinced me that improving katta would be the way to go:
- implement an indexing solution for katta
- serve the index shards from memory, as google apparently does

Hope I could help, please keep us posted,

Thomas Koch, http://www.koch.ro
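The difference between the two partitioning schemes can be shown with a toy Python sketch. Everything in it is an illustrative assumption: the four documents, the two shards, routing documents by `doc_id % NUM_SHARDS`, and routing terms by their first character. None of it reflects how the projects listed above actually route data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_SHARDS = 2
Shard = Dict[str, List[int]]  # term -> doc IDs

docs = {
    0: "hbase realtime search",
    1: "lucene search index",
    2: "hbase region split",
    3: "lucene realtime index",
}


def term_shard(term: str) -> int:
    # Toy routing rule: the first character decides the shard.
    return ord(term[0]) % NUM_SHARDS


def build(docs: Dict[int, str], by_term: bool) -> List[Shard]:
    shards: List[Shard] = [defaultdict(list) for _ in range(NUM_SHARDS)]
    for doc_id, text in sorted(docs.items()):
        for term in set(text.split()):
            target = term_shard(term) if by_term else doc_id % NUM_SHARDS
            shards[target][term].append(doc_id)
    return shards


def search_doc_partitioned(shards: List[Shard], terms: List[str]) -> Tuple[List[int], int]:
    hits = set()
    for shard in shards:  # every shard holds part of every posting list
        hits |= set.intersection(*[set(shard.get(t, [])) for t in terms])
    return sorted(hits), len(shards)


def search_term_partitioned(shards: List[Shard], terms: List[str]) -> Tuple[List[int], int]:
    touched = {term_shard(t) for t in terms}  # only shards owning a query term
    lists = [set(shards[term_shard(t)].get(t, [])) for t in terms]
    return sorted(set.intersection(*lists)), len(touched)


query = ["hbase", "realtime"]
doc_result = search_doc_partitioned(build(docs, by_term=False), query)
term_result = search_term_partitioned(build(docs, by_term=True), query)
```

Both schemes return the same document; the second element of each result is the number of shards the query had to contact. With document partitioning every shard does a small share of the work, while with term partitioning whole posting lists sit on one shard and must be shipped to wherever they are intersected.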
-
Re: HBase and Lucene for realtime searchSean Bigdatafun 2011-02-13, 17:37
On Fri, Feb 11, 2011 at 4:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> Short queries would involve reading and merging several posting vectors. In
> that mode, query speeds might be OK, but there isn't a lot of Lucene left at
> that point. For updates, speed would only be acceptable if you batch up a
> lot updates or possibly if you build in a value append function as a
> co-processor.

"speed would only be acceptable if you batch up" -- I understand what you are talking about here (without batching up, HBase simply becomes very sluggish). Can you comment on whether Cassandra needs a batch-up mode? (I recall Twitter said they just keep putting results into Cassandra for its analytics application.)

--
--Sean
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-13, 20:07
The situation here is particularly nice since the update to hbase and the update to lucene are both idempotent. Adding the same document twice or deleting it twice has essentially the same effect as doing it once.

On Sun, Feb 13, 2011 at 5:13 AM, Bruno Dumon <[EMAIL PROTECTED]> wrote:

> [As you probably know, ] the basic solution to do this across systems is a
> write-ahead-log outside of these systems, i.e. the sequence to perform an
> update would be:
> (1) write update to the WAL
> (2) perform update on HBase
> (3) perform update on Lucene
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-13, 20:10
I really can't comment on Cassandra, but the flight time of transactions is likely to be too slow for updates not to be batched. With a server round-trip in the way, you are looking at hundreds of microseconds at least and you need dozens to thousands of these to add a document to the index.

You also would like a single document addition to be roughly transactional. That is a really hard thing to do for an inverted index with any noSQL solution I have heard of.
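The round-trip estimate can be turned into rough arithmetic. The figures below are assumptions picked from inside the stated ranges (200 microseconds per round trip, 24 or 1000 round trips per document), the client is assumed to issue round trips one at a time, and the batch case optimistically assumes the round trips are shared evenly across the batch.

```python
def docs_per_second(round_trip_us: float, round_trips_per_doc: float,
                    batch_size: int = 1) -> float:
    """Indexing rate for a single client issuing round trips sequentially."""
    per_doc_us = round_trip_us * round_trips_per_doc / batch_size
    return 1_000_000 / per_doc_us


few_terms = docs_per_second(200, 24)               # "dozens" of round trips
many_terms = docs_per_second(200, 1000)            # "thousands" of round trips
batched = docs_per_second(200, 1000, batch_size=100)

print(f"{few_terms:.0f} docs/s, {many_terms:.0f} docs/s, {batched:.0f} docs/s batched")
```

Under these assumptions a document with a thousand distinct terms indexes at about 5 docs/s unbatched, and batching 100 documents per round trip raises that to about 500 docs/s.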
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-13, 23:21
> Transactional consistency isn't going to happen if you even involve more
> than one hbase row.

What does this mean? Or rather, can you elaborate?

> What they need is that documents can be found very shortly
> after they are inserted and that crashes won't compromise that.

Right. I think HBase is built for this case? Adding the ability to 'search' over HBase without setting up a separate search cluster could be compelling if it's extremely convenient?
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-13, 23:37
> Google's percolator paper.

Can you post a link?

> Another issue is that maybe the scalability needs for search might be
> different. An HBase region is always only active in one region server, there
> are no active replica's, while often for search you need replicas to scale,
> since a search will typically hit all partitions.

Really? That seems odd.

> I assume you don't really need ACID transactions, but only the guarantee
> that when you update an HBase row, its index will eventually be updated too?
> (possibly with a little "RT" delay).

While not "needed" it's definitely a worthy goal? E.g., with the newer RT functionality in Lucene this will more or less be available out of the box, hopefully with no delay.

> If it fails anywhere in between, one can always replay from the WAL. If you
> add a write-ahead-log just to e.g. Katta, that won't help yet with the
> consistency across the systems

Right, I think this is a real problem. My guess is it'll be easier to develop a scalable RT search system around HBase, then separate it out if that's possible/needed.

> to be the main action and all what follows just secondary side-effects (i.e.
> there's no rollback).

I think inside a Coprocessor you could block the HBase 'commit' until a successful updateDoc call to Lucene (which is only an update to RAM anyways)?

> That would definitely be interesting, but I guess for it to work with good
> performance the ordering of the HBase row keys should be the same as that of
> the Lucene doc IDs

That'd be ideal, and/or being able to write the HBase key value file pointer into Lucene, though that seems a little far-fetched.
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 02:01
> Do you want to do Term- or Document partitioning?

It sounds like no one uses term partitioning; doc-partitioning seems to be the most logical default?

> serve the index shards from memory

In Lucene-land this is a function of allocating enough RAM for the system IO cache.
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 02:09
I think there's another way to look at this, and that is: what types of
queries do HBase users perform that search can enhance? Eg, given we can
index extremely quickly with Lucene, and with RT we can search with
near-zero latency, perhaps there are new queries that would be of
interest/useful to HBase users? Things like traditional SQLish queries
with multiple clauses should be possible?

> I haven't seen any search sites that absolutely need transactional
> consistency.

While this is true, databases usually require this? And so this is
somewhat of an out-of-the-box view on search, and that's why it's
perhaps better to frame it more in the context of databases, eg,
transactions, consistency, and complex queries.

On Sun, Feb 13, 2011 at 1:36 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Transactional consistency isn't going to happen if you even involve more
> than one hbase row.
>
> I haven't seen any search sites that absolutely need transactional
> consistency. What they need is that documents can be found very shortly
> after they are inserted and that crashes won't compromise that.
>
> On Sat, Feb 12, 2011 at 1:31 PM, Jason Rutherglen <
> [EMAIL PROTECTED]> wrote:
>
>> Right, the concepts aren't that hard (write ahead log etc), however to
>> keep the data transactionally consistent with another datastore across
>> servers [I believe] is a little more difficult? Also with RT there
>> needs to be a primary data store somewhere outside of Lucene,
>> otherwise we'd be storing the same data twice, eg, in HBase and
>> Lucene, that's inefficient. I'm guessing it'll be easier to keep
>> Lucene indexes in parallel with HBase regions across servers, and then
>> use the Coprocessor architecture etc, to keep them in'sync, on the
>> same server. When a region is split, we'd need to also split the
>> Lucene index, this'd be the only 'new' technology that'd need to be
>> created on the Lucene side.
>> >> I think it's advantageous to build a distributed search system that >> mirrors the underlying data, if the search indices are on their own >> servers, I think there's always going to be sync'ing problems? >> >> On Sat, Feb 12, 2011 at 1:14 PM, Ted Dunning <[EMAIL PROTECTED]> >> wrote: >> > I really think that putting update semantics into Katta would be much >> > easier. >> > >> > Building the write-ahead log for the lucene case isn't all that hard. If >> > you follow the Zookeeper model of having a WAL thread that writes batches >> of >> > log entries you can get pretty high speed as well. The basic idea is >> that >> > update requests are put into a queue of pending log writes, but are >> written >> > to the index immediately. When the WAL thread finishes the previous >> trenche >> > of log items, it comes back around and takes everything that is pending. >> > When it finishes a trenche of writes, it releases all of the pending >> > updates in a batch. If updates are lot frequent, then you lose no >> latency. >> > If you updates are very high speed, then you transition seamlessly to a >> > bandwidth oriented scheme of large updates while latency is roughly >> bounded >> > to 2-3x the original case. >> > >> > If you put the write-ahead log on a reliable replicated file system then, >> as >> > you say, much of the complexity of write ahead logging goes away. >> > >> > But this verges off topic for hbase. >> > >> > On Sat, Feb 12, 2011 at 1:01 PM, Jason Rutherglen < >> > [EMAIL PROTECTED]> wrote: >> > >> >> So in giving this a day of breathing room, it looks like HBase loads >> >> values as it's scanning a column? I think that'd be a killer to some >> >> Lucene queries, eg, we'd be loading entire/part-of posting lists just >> >> for a linear scan of the terms dict? Or we'd probably instead want to >> >> place the posting list into it's own column? 
>> >> >> >> Another approach would be to feed off the HLog, place updates into a >> >> dedicated RT Lucene index (eg, outside of HBase). With the latter >> >> system we'd get transactional consistency, and we wouldn't need to
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-14, 06:47
Row updates are atomic.

Nothing else is.

On Sun, Feb 13, 2011 at 3:21 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> > Transactional consistency isn't going to happen if you even involve more
> > than one hbase row.
>
> What does this mean? Or rather, can you elaborate?
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-14, 06:49
Doc-partitioning has much better failure modes and is universal in my
experience for serious applications.

On Sun, Feb 13, 2011 at 6:01 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> > Do you want to do Term- or Document partitioning?
>
> It sounds like no one uses term partitioning, doc-partitioning seems
> to be the most logical default?
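
To make the failure-mode point concrete, here is a toy sketch in plain Python. It is not Lucene, Katta or HBase code; the shard layout, the modulo placement and every function name are invented for illustration. With document partitioning each shard holds a complete small index over its own documents, so a query fans out to all shards, and a dead shard only hides that shard's documents rather than whole terms.

```python
from collections import defaultdict


def build_index(docs):
    """Build a tiny inverted index: term -> sorted list of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def doc_partition(docs, num_shards):
    """Document partitioning: each shard indexes a disjoint subset of docs."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[doc_id % num_shards][doc_id] = text  # hypothetical placement rule
    return [build_index(shard_docs) for shard_docs in shards]


def search(shards, term, down=()):
    """Scatter the query to every live shard and merge the hits."""
    hits = []
    for shard_no, index in enumerate(shards):
        if shard_no in down:
            continue  # a failed shard only hides its own documents
        hits.extend(index.get(term, []))
    return sorted(hits)


DOCS = {
    0: "hbase stores rows",
    1: "lucene indexes documents",
    2: "hbase and lucene together",
    3: "realtime search in hbase",
}
SHARDS = doc_partition(DOCS, 2)
```

Calling search(SHARDS, "hbase") touches both shards; passing down={1} simulates losing one shard, and the query still returns the matches held by the surviving shard. Under term partitioning the same failure would make every term owned by the lost shard unsearchable.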
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-14, 06:51
I would avoid this, personally.

Serious transactions and complex queries are pretty much incompatible with
simple implementation and large scale.

Flow based updates and write-behind are more the norm.

On Sun, Feb 13, 2011 at 6:09 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> > I haven't seen any search sites that absolutely need transactional
> > consistency.
>
> While this is true, databases usually require this? And so this is
> somewhat of an out-of-the-box view on search, and this's why it's
> perhaps better to frame it more in the context databases, eg,
> transactions, consistency, and complex queries.
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 14:22
> Serious transactions and complex queries are pretty much incompatible with
> simple implementation and large scale.

Right, that's the design motivation behind HBase and BigTable. That
being said, Google's Percolator shows that more complex transactional
systems can be built successfully on top of BigTable.

On Sun, Feb 13, 2011 at 10:51 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> I would avoid this, personally.
>
> Serious transactions and complex queries are pretty much incompatible with
> simple implementation and large scale.
>
> Flow based updates and write-behind are more the norm.
>
> On Sun, Feb 13, 2011 at 6:09 PM, Jason Rutherglen <
> [EMAIL PROTECTED]> wrote:
>
>> > I haven't seen any search sites that absolutely need transactional
>> > consistency.
>>
>> While this is true, databases usually require this? And so this is
>> somewhat of an out-of-the-box view on search, and this's why it's
>> perhaps better to frame it more in the context databases, eg,
>> transactions, consistency, and complex queries.
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 15:08
> Row updates are atomic.
>
> Nothing else is.

Well, that's perfect! Lucene's IW.updateDoc (IndexWriter.updateDocument)
is atomic per row/doc. As a row is added to HBase, we'd add/update a doc
in Lucene. Part of the integration would entail keeping enough of the
HBase write-ahead log (WAL) intact, so that if a region server failed,
the Lucene index that was still in RAM could be rebuilt on startup.
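
A rough sketch of that idea, in plain Python rather than against the real HBase or Lucene APIs: RamIndex, Region, put and recover are all hypothetical names, and the "WAL" here is just a list. Each put is logged, then applied to the row store and to an in-memory index for the same row; after a simulated crash, replaying the retained log rebuilds both.

```python
from collections import defaultdict


class RamIndex:
    """In-memory inverted index keyed by row key (stand-in for an RT index)."""

    def __init__(self):
        self.docs = {}                      # row key -> indexed text
        self.postings = defaultdict(set)    # term -> row keys

    def update_doc(self, row_key, text):
        """Replace any existing document for row_key (delete, then add)."""
        old = self.docs.get(row_key)
        if old is not None:
            for term in old.split():
                self.postings[term].discard(row_key)
        self.docs[row_key] = text
        for term in text.split():
            self.postings[term].add(row_key)

    def search(self, term):
        return sorted(self.postings.get(term, ()))


class Region:
    """Toy region: log the edit first, then apply it to store and index."""

    def __init__(self, wal=None):
        self.wal = wal if wal is not None else []
        self.rows = {}
        self.index = RamIndex()

    def put(self, row_key, text):
        self.wal.append((row_key, text))      # 1. log entry
        self._apply(row_key, text)

    def _apply(self, row_key, text):
        self.rows[row_key] = text             # 2. row update
        self.index.update_doc(row_key, text)  # 3. index update for the same row

    @classmethod
    def recover(cls, wal):
        """Rebuild the row store and RAM index by replaying the retained log."""
        region = cls(wal=list(wal))
        for row_key, text in wal:
            region._apply(row_key, text)
        return region


region = Region()
region.put("r1", "hello world")
region.put("r2", "hello hbase")
region.put("r1", "goodbye world")          # update replaces the old doc for r1
recovered = Region.recover(region.wal)     # simulate restart after a failure
```

The sketch only works because the log is kept until the index contents are safely persisted elsewhere, which is the "keeping enough of the WAL intact" part of the proposal.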
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 17:19
Based on the discussion I opened an issue that outlines how we can add
search to HBase at a low-level, which I think is the key:

https://issues.apache.org/jira/browse/HBASE-3529

On Fri, Feb 11, 2011 at 3:10 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I'm curious as to what a 'good' approach would be for implementing
> search in HBase (using Lucene) with the end goal being the integration
> of realtime search into HBase. I think the use case makes sense as
> HBase is realtime and has a write-ahead log, performs automatic
> partitioning, splitting of data, failover, redundancy, etc. These are
> all things Lucene does not have out of the box, that we'd essentially
> get for 'free'.
>
> For starters: Where would be the right place to store Lucene segments
> or postings? Eg, we need to be able to efficiently perform a linear
> iteration of the per-term posting list(s).
>
> Thanks!
>
> Jason Rutherglen
-
Re: HBase and Lucene for realtime searchBruno Dumon 2011-02-14, 17:28
On Mon, Feb 14, 2011 at 12:37 AM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:

> > Another issue is that maybe the scalability needs for search might be
> > different. An HBase region is always only active in one region server,
> > there are no active replica's, while often for search you need replicas
> > to scale, since a search will typically hit all partitions.
>
> Really? That seems odd.

Yep, really. The replication is [only] on the HDFS-level. For HBase, this is
not much of a problem as long as the requests are not strongly skewed
towards one region (this requires good consideration from users when
choosing row keys), but for search this could be a real issue.

Also, HBase and Lucene might differ in how many rows/documents they
can handle on one server, or in one region (an HBase region is typically
only 256MB), leading to difficult choices (optimize region size for hbase vs
for lucene).

> > to be the main action and all what follows just secondary side-effects
> > (i.e. there's no rollback).
>
> I think inside a Coprocessor you could block the HBase 'commit' until
> a successful updateDoc call to Lucene (which is only an update to RAM
> anyways)?

Yes, that should work. But doesn't it assume that the index is updated
synchronously with the HBase row? I can imagine this will sometimes be an
issue, e.g. if it would involve performing expensive content extraction
(tika) or analysis.

BTW, something we do in Lily, and which might be interesting to think about
in this context as well, is denormalization: the Lucene document of an
HBase row also stores information from related (linked) rows. This
requires that, when one row changes, you find out which other rows
denormalize info from this row, and update the Lucene documents of those
rows as well. Just bringing this up as a random feature to think about ;-)

--
Bruno Dumon
Outerthought
http://outerthought.org/
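
The denormalization bookkeeping described above can be sketched as a reverse-link map. This is a toy model in plain Python, not how Lily actually implements it; DenormalizingIndexer, its fields and the "title"/"links" row shape are assumptions made up for the example. The point is only that a change to one row must fan out to every row whose index document embeds data from it.

```python
from collections import defaultdict


class DenormalizingIndexer:
    """Toy indexer: each index document embeds the titles of linked rows."""

    def __init__(self):
        self.rows = {}                          # row key -> {"title", "links"}
        self.referenced_by = defaultdict(set)   # row key -> rows embedding it
        self.index_docs = {}                    # row key -> denormalized doc

    def put(self, key, title, links=()):
        """Store a row, then reindex it and every row that embeds its data."""
        old = self.rows.get(key)
        if old is not None:
            for target in old["links"]:
                self.referenced_by[target].discard(key)  # drop stale links
        self.rows[key] = {"title": title, "links": list(links)}
        for target in links:
            self.referenced_by[target].add(key)
        to_reindex = [key] + sorted(self.referenced_by[key])
        for row_key in to_reindex:
            self._reindex(row_key)
        return to_reindex

    def _reindex(self, key):
        row = self.rows[key]
        linked_titles = [
            self.rows[target]["title"]
            for target in row["links"]
            if target in self.rows
        ]
        self.index_docs[key] = {
            "title": row["title"],
            "linked_titles": linked_titles,
        }


indexer = DenormalizingIndexer()
indexer.put("author:1", "Ted")
indexer.put("post:1", "WAL batching", links=["author:1"])
changed = indexer.put("author:1", "Ted D.")  # also refreshes post:1's document
```

One design consequence: a single row update can trigger many index updates, which is another argument for doing the index maintenance asynchronously rather than inside the row write.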
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 17:48
> Yep, really. The replication is [only] on the HDFS-level. For HBase, this is
> not much of a problem as long as the requests are not strongly skewed
> towards one region (this requires good consideration from users when
> choosing row keys), but for search this could be a real issue.

I think this can be solved rather easily? Or is there an underlying
design rationale?

> Also, HBase and Lucene might be different in how much rows/documents they
> can handle on one server, or in one region (an HBase region is typically
> only 256MB), leading to difficult choices (optimize region size for hbase vs
> for lucene).

I think in that case, either we can map multiple regions to a Lucene index
or increase the size of the HBase region. Either way'd be fine.

> Yes, that should work. But doesn't it assume that the index is updated
> synchronously with the HBase row? I can imagine this will sometimes be an
> issue, e.g. if it would involve performing expensive content extraction
> (tika) or analysis.

I don't understand here. You mean that the delay in indexing a
document will adversely affect the HBase row insert because it's all
in the same transaction? I think that's fine, eg, it's just how the
system'd work?

On Mon, Feb 14, 2011 at 9:28 AM, Bruno Dumon <[EMAIL PROTECTED]> wrote:
> On Mon, Feb 14, 2011 at 12:37 AM, Jason Rutherglen <
> [EMAIL PROTECTED]> wrote:
>
>> > Another issue is that maybe the scalability needs for search might be
>> > different. An HBase region is always only active in one region server,
>> > there are no active replica's, while often for search you need replicas
>> > to scale, since a search will typically hit all partitions.
>>
>> Really? That seems odd.
>
> Yep, really. The replication is [only] on the HDFS-level. For HBase, this is
> not much of a problem as long as the requests are not strongly skewed
> towards one region (this requires good consideration from users when
> choosing row keys), but for search this could be a real issue.
> > Also, HBase and Lucene might be different in how much rows/documents they > can handle on one server, or in one region (an HBase region is typically > only 256MB), leading to difficult choices (optimize region size for hbase vs > for lucene). > > >> > to be the main action and all what follows just secondary side-effects >> (i.e. >> > there's no rollback). >> >> I think inside a Coprocessor you could block the HBase 'commit' until >> a successful updateDoc call to Lucene (which is only an update to RAM >> anyways)? >> > > Yes, that should work. But doesn't it assume that the index is updated > synchronously with the HBase row? I can imagine this will sometimes be an > issue, e.g. if it would involve performing expensive content extraction > (tika) or analysis. > > BTW, something we do in Lily, and which might be interesting to think about > in this context as well, is denormalization, thus in the Lucene document of > some HBase row information is stored from related (linked) rows. This > requires that, when one row changes, you need to find out what other rows > denormalize info from this row, and update the Lucene documents of those > rows as well. Just bringing this up as a random feature to think about ;-) > > -- > Bruno Dumon > Outerthought > http://outerthought.org/ >
-
Re: HBase and Lucene for realtime searchJean-Daniel Cryans 2011-02-14, 17:51
> "speed would only be acceptable if you batch up " -- I understand what you
> are talking about here (without batching-up, HBase simply become very
> sluggish). Can you comment if Cassandra needs a batch-up mode? (I recall
> Twitter said they just keep putting results into Cassandra for its analytics
> application)

Sean,

I guess you are talking about rainbird? If so then check slide 26:
http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011

In short, they batch 1 minute worth of data before inserting it. Like
Ted said, without batching you have a server round-trip for every row
update, and the speed of light cannot be improved so...
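
A back-of-the-envelope version of the round-trip argument, as a small Python calculation. The figures (1 ms per round trip, 10 microseconds of server work per row) are invented for illustration and are not measurements of HBase, Cassandra or Rainbird; load_time_seconds is a hypothetical helper.

```python
import math


def load_time_seconds(num_rows, batch_size, round_trip_us, per_row_us):
    """Total time to send num_rows updates, batch_size rows per round trip."""
    round_trips = math.ceil(num_rows / batch_size)
    return (round_trips * round_trip_us + num_rows * per_row_us) / 1_000_000


ROWS = 1_000_000
# One round trip per row: the network latency dominates everything else.
unbatched = load_time_seconds(ROWS, 1, round_trip_us=1000, per_row_us=10)
# 1000 rows per round trip: the per-row work dominates instead.
batched = load_time_seconds(ROWS, 1000, round_trip_us=1000, per_row_us=10)
```

Under these assumed numbers the unbatched load spends roughly a hundred times longer waiting on round trips than doing work, which is why batching matters far more than any tuning of the per-row cost.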
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-14, 18:55
Fixing this is likely quite difficult since it requires distributed
transactions. It would also typically kill update performance because of
the distributed transaction problem.

On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> > Yep, really. The replication is [only] on the HDFS-level. For HBase, this is
> > not much of a problem as long as the requests are not strongly skewed
> > towards one region (this requires good consideration from users when
> > choosing row keys), but for search this could be a real issue.
>
> I think this can be solved rather easily? Or is there an underlying
> design rationale?
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-14, 18:57
I would find that unacceptable for many systems I have worked on. Lucene
update-behind would be fine, but holding up the insert until all of the
Lucene stuff happened would not be acceptable.

I would much rather that Lucene update from the write log in batches that
are as big as needed to catch/keep up.

On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> > Yes, that should work. But doesn't it assume that the index is updated
> > synchronously with the HBase row? I can imagine this will sometimes be an
> > issue, e.g. if it would involve performing expensive content extraction
> > (tika) or analysis.
>
> I don't understand here. You mean that the delay in indexing a
> document will adversely affect the HBase row insert because it's all
> in the same transaction? I think that fine, eg, it's just how the
> system'd work?
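
A minimal sketch of that write-behind arrangement, in plain Python with invented names (WriteBehindIndexer, catch_up, lag); it models neither the HBase WAL nor a Lucene IndexWriter. The write is acknowledged once it is in the log, and a separate indexing pass takes whatever has accumulated since the last pass, so the batch size grows automatically when indexing falls behind.

```python
class WriteBehindIndexer:
    """Toy write-behind indexer that follows an append-only write log."""

    def __init__(self):
        self.log = []            # append-only log of (row_key, text)
        self.indexed_upto = 0    # log position the index has caught up to
        self.index = {}          # row key -> latest indexed text

    def put(self, row_key, text):
        """Acknowledge as soon as the write is logged; index it later."""
        self.log.append((row_key, text))
        return len(self.log)     # sequence number of this write

    def catch_up(self):
        """Index everything logged since the last pass as one batch."""
        batch = self.log[self.indexed_upto:]
        for row_key, text in batch:
            self.index[row_key] = text
        self.indexed_upto += len(batch)
        return len(batch)

    def lag(self):
        """Number of acknowledged writes that are not yet searchable."""
        return len(self.log) - self.indexed_upto


indexer = WriteBehindIndexer()
for n in range(3):
    indexer.put(f"row{n}", f"document {n}")
lag_before = indexer.lag()         # writes acknowledged but not yet indexed
first_batch = indexer.catch_up()   # one pass indexes the whole backlog
indexer.put("row3", "document 3")
second_batch = indexer.catch_up()  # light load: batch of one, minimal delay
```

The trade-off is visible in lag(): writers never wait on analysis, but a reader may briefly miss a document that has already been acknowledged.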
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 19:06
> Fixing this is likely quite difficult since it requires distributed
> transactions. It would also typically kill update performance because of
> the distributed transaction problem.

Hmm... Looks like it's in the works?

https://issues.apache.org/jira/browse/HBASE-1295

On Mon, Feb 14, 2011 at 10:55 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Fixing this is likely quite difficult since it requires distributed
> transactions. It would also typically kill update performance because of
> the distributed transaction problem.
>
> On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <
> [EMAIL PROTECTED]> wrote:
>
>> > Yep, really. The replication is [only] on the HDFS-level. For HBase, this is
>> > not much of a problem as long as the requests are not strongly skewed
>> > towards one region (this requires good consideration from users when
>> > choosing row keys), but for search this could be a real issue.
>>
>> I think this can be solved rather easily? Or is there an underlying
>> design rationale?
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 19:09
Older versions of Lucene NRT indexing are slow; the newer version
with RT will be as fast as Lucene's batch indexing is today, which I'm
guessing will be fast enough for many/most users? Eg, it's simply
analyzing and throwing the data into a RAM buffer (there's no IO or
segment merging happening).

On Mon, Feb 14, 2011 at 10:57 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> I would find that unacceptable for many systems I have worked on. Lucene
> update-behind would be fine, but waiting the insert until all of the Lucene
> stuff happened would not be acceptable.
>
> I would much rather that Lucene update from the write log in batches that
> are as big as needed to catch/keep up.
>
> On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <
> [EMAIL PROTECTED]> wrote:
>
>> > Yes, that should work. But doesn't it assume that the index is updated
>> > synchronously with the HBase row? I can imagine this will sometimes be an
>> > issue, e.g. if it would involve performing expensive content extraction
>> > (tika) or analysis.
>>
>> I don't understand here. You mean that the delay in indexing a
>> document will adversely affect the HBase row insert because it's all
>> in the same transaction? I think that fine, eg, it's just how the
>> system'd work?
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-14, 19:18
The analysis can be very slow if you are doing Tika things and named entity
extraction and PDF interpretation and so on.

On Mon, Feb 14, 2011 at 11:09 AM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> The older versions of Lucene NRT indexing is slow, the newer version
> with RT will be as fast as Lucene's batch indexing is today, which I'm
> guessing will be fast enough for many/most users? Eg, it's simply
> analyzing and throwing the data into a RAM buffer (there's no IO or
> segment merging happening).
>
> On Mon, Feb 14, 2011 at 10:57 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > I would find that unacceptable for many systems I have worked on. Lucene
> > update-behind would be fine, but waiting the insert until all of the Lucene
> > stuff happened would not be acceptable.
> >
> > I would much rather that Lucene update from the write log in batches that
> > are as big as needed to catch/keep up.
> >
> > On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <
> > [EMAIL PROTECTED]> wrote:
> >
> >> > Yes, that should work. But doesn't it assume that the index is updated
> >> > synchronously with the HBase row? I can imagine this will sometimes be an
> >> > issue, e.g. if it would involve performing expensive content extraction
> >> > (tika) or analysis.
> >>
> >> I don't understand here. You mean that the delay in indexing a
> >> document will adversely affect the HBase row insert because it's all
> >> in the same transaction? I think that fine, eg, it's just how the
> >> system'd work?
-
Re: HBase and Lucene for realtime searchBruno Dumon 2011-02-14, 19:21
One option might be to introduce replication only for the indexes, and leave
the regions as they are today, at the cost of some imbalance in the design
(meaning, hbase master would need to be aware of the two different
concepts).

On Mon, Feb 14, 2011 at 7:55 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Fixing this is likely quite difficult since it requires distributed
> transactions. It would also typically kill update performance because of
> the distributed transaction problem.
>
> On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <
> [EMAIL PROTECTED]> wrote:
>
> > > Yep, really. The replication is [only] on the HDFS-level. For HBase, this is
> > > not much of a problem as long as the requests are not strongly skewed
> > > towards one region (this requires good consideration from users when
> > > choosing row keys), but for search this could be a real issue.
> >
> > I think this can be solved rather easily? Or is there an underlying
> > design rationale?

--
Bruno Dumon
Outerthought
http://outerthought.org/
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 19:28
> The analysis can be very slow if you are doing Tika things and named entity
> extraction and PDF interpretation and so on.

I'd consider those different/separate use cases where likely realtime
isn't important? If large [static] documents are being stored in
HBase why would expediency be required?

On Mon, Feb 14, 2011 at 11:18 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> The analysis can be very slow if you are doing Tika things and named entity
> extraction and PDF interpretation and so on.
>
> On Mon, Feb 14, 2011 at 11:09 AM, Jason Rutherglen <
> [EMAIL PROTECTED]> wrote:
>
>> The older versions of Lucene NRT indexing is slow, the newer version
>> with RT will be as fast as Lucene's batch indexing is today, which I'm
>> guessing will be fast enough for many/most users? Eg, it's simply
>> analyzing and throwing the data into a RAM buffer (there's no IO or
>> segment merging happening).
>>
>> On Mon, Feb 14, 2011 at 10:57 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> > I would find that unacceptable for many systems I have worked on. Lucene
>> > update-behind would be fine, but waiting the insert until all of the Lucene
>> > stuff happened would not be acceptable.
>> >
>> > I would much rather that Lucene update from the write log in batches that
>> > are as big as needed to catch/keep up.
>> >
>> > On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <
>> > [EMAIL PROTECTED]> wrote:
>> >
>> >> > Yes, that should work. But doesn't it assume that the index is updated
>> >> > synchronously with the HBase row? I can imagine this will sometimes be an
>> >> > issue, e.g. if it would involve performing expensive content extraction
>> >> > (tika) or analysis.
>> >>
>> >> I don't understand here. You mean that the delay in indexing a
>> >> document will adversely affect the HBase row insert because it's all
>> >> in the same transaction? I think that fine, eg, it's just how the
>> >> system'd work?
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-14, 20:04
As you like.

My experience is that analyzing a document takes longer than I want to cause
the user to wait when inserting it.

I almost always prefer write-behind indexing of some kind.

On Mon, Feb 14, 2011 at 11:28 AM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> > The analysis can be very slow if you are doing Tika things and named entity
> > extraction and PDF interpretation and so on.
>
> I'd consider those different/separate use cases where likely realtime
> isn't important? If large [static] documents are being stored in
> HBase why would expediency be required?
>
> On Mon, Feb 14, 2011 at 11:18 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> > The analysis can be very slow if you are doing Tika things and named entity
> > extraction and PDF interpretation and so on.
> >
> > On Mon, Feb 14, 2011 at 11:09 AM, Jason Rutherglen <
> > [EMAIL PROTECTED]> wrote:
> >
> >> The older versions of Lucene NRT indexing is slow, the newer version
> >> with RT will be as fast as Lucene's batch indexing is today, which I'm
> >> guessing will be fast enough for many/most users? Eg, it's simply
> >> analyzing and throwing the data into a RAM buffer (there's no IO or
> >> segment merging happening).
> >>
> >> On Mon, Feb 14, 2011 at 10:57 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >> > I would find that unacceptable for many systems I have worked on. Lucene
> >> > update-behind would be fine, but waiting the insert until all of the Lucene
> >> > stuff happened would not be acceptable.
> >> >
> >> > I would much rather that Lucene update from the write log in batches that
> >> > are as big as needed to catch/keep up.
> >> >
> >> > On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <
> >> > [EMAIL PROTECTED]> wrote:
> >> >
> >> >> > Yes, that should work. But doesn't it assume that the index is updated
> >> >> > synchronously with the HBase row? I can imagine this will sometimes be an
> >> >> > issue, e.g. if it would involve performing expensive content extraction
> >> >> > (tika) or analysis.
> >> >>
> >> >> I don't understand here. You mean that the delay in indexing a
> >> >> document will adversely affect the HBase row insert because it's all
> >> >> in the same transaction? I think that fine, eg, it's just how the
> >> >> system'd work?
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 20:18
> I almost always prefer write-behind indexing of some kind.

I think that's the easier of the two methods, and while it can be
accomplished in this system, it would require some sort of 'queue' etc.
For things like messaging, eg, email, a database write and the subsequent
document analysis should be fast, and the document should, upon success,
be searchable?
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-14, 20:20
There is already going to be a serious imbalance because the number of
indexes is highly unlikely to be the same as the number of regions in an
optimal setup.

On Mon, Feb 14, 2011 at 11:21 AM, Bruno Dumon <[EMAIL PROTECTED]> wrote:
> One option might be to introduce replication only for the indexes, and leave
> the regions as they are today, at the cost of some imbalance in the design
> (meaning, hbase master would need to be aware of the two different
> concepts).
>
> On Mon, Feb 14, 2011 at 7:55 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> > Fixing this is likely quite difficult since it requires distributed
> > transactions. It would also typically kill update performance because of
> > the distributed transaction problem.
> >
> > On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <
> > [EMAIL PROTECTED]> wrote:
> >
> > > > Yep, really. The replication is [only] on the HDFS-level. For HBase, this is
> > > > not much of a problem as long as the requests are not strongly skewed
> > > > towards one region (this requires good consideration from users when
> > > > choosing row keys), but for search this could be a real issue.
> > >
> > > I think this can be solved rather easily? Or is there an underlying
> > > design rationale?
>
> --
> Bruno Dumon
> Outerthought
> http://outerthought.org/
-
Re: HBase and Lucene for realtime searchTed Dunning 2011-02-14, 20:22
Upon success, the composer of the message should be told as soon as possible
that their message has been committed. If it is indexed before they can
formulate a query, then all is well. There is no need to delay completion
of the update IMO.

On Mon, Feb 14, 2011 at 12:18 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> > I almost always prefer write-behind indexing of some kind.
>
> I think that's the easier of the two methods and while it can be
> accomplished in this system, would require some sort of 'queue' etc.
> For things like messaging, eg, email, a database write and subsequent
> document analyze should be fast, and should upon success, be
> searchable?
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 20:37
I'm not sure how an alternative architecture would look, where there'd
be multiple replicated indexes pointing at only one region? What if
that region went down, that's a single point of failure?

It's far easier to simply keep the region in sync with an attached
index than to replicate the data from multiple regions to an index on
another server? That sounds extremely redundant and problematic, eg, all
of the database transactional functionality probably goes out the window.
One can easily build a multi-tier batch indexing application today with
HBase and Lucene, though it'd still need to scale [somehow], and with
realtime search all of these issues encroach again.

On Mon, Feb 14, 2011 at 12:20 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> There is already going to be a serious imbalance because the number of
> indexes is highly unlikely to be the same as the number of regions in an
> optimal setup.
>
> On Mon, Feb 14, 2011 at 11:21 AM, Bruno Dumon <[EMAIL PROTECTED]> wrote:
>
>> One option might be to introduce replication only for the indexes, and leave
>> the regions as they are today, at the cost of some imbalance in the design
>> (meaning, hbase master would need to be aware of the two different
>> concepts).
>>
>> On Mon, Feb 14, 2011 at 7:55 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>>
>> > Fixing this is likely quite difficult since it requires distributed
>> > transactions. It would also typically kill update performance because of
>> > the distributed transaction problem.
>> >
>> > On Mon, Feb 14, 2011 at 9:48 AM, Jason Rutherglen <
>> > [EMAIL PROTECTED]> wrote:
>> >
>> > > > Yep, really. The replication is [only] on the HDFS-level. For HBase, this is
>> > > > not much of a problem as long as the requests are not strongly skewed
>> > > > towards one region (this requires good consideration from users when
>> > > > choosing row keys), but for search this could be a real issue.
>> > >
>> > > I think this can be solved rather easily? Or is there an underlying
>> > > design rationale?
>>
>> --
>> Bruno Dumon
>> Outerthought
>> http://outerthought.org/
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 21:03
Indexing is pretty fast these days as you noted regarding (on the
extreme end) Twitter, so I highly doubt this'd be an issue for most
apps. If it is then maybe they should use Kestrel
(https://github.com/robey/kestrel) from Twitter or some other similar MQ
system?

On Mon, Feb 14, 2011 at 12:22 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Upon success, the composer of the message should be told as soon as possible
> that their message has been committed. If it is indexed before they can
> formulate a query, then all is well. There is no need to delay completion
> of the update IMO.
>
> On Mon, Feb 14, 2011 at 12:18 PM, Jason Rutherglen <
> [EMAIL PROTECTED]> wrote:
>
>> > I almost always prefer write-behind indexing of some kind.
>>
>> I think that's the easier of the two methods and while it can be
>> accomplished in this system, would require some sort of 'queue' etc.
>> For things like messaging, eg, email, a database write and subsequent
>> document analyze should be fast, and should upon success, be
>> searchable?
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-02-14, 22:04
To beat a dead horse, yet another way to view adding even simple search functionality to HBase is that I think it'd put it on equal ground with something like MongoDB?

On Fri, Feb 11, 2011 at 3:10 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I'm curious as to what a 'good' approach would be for implementing
> search in HBase (using Lucene) with the end goal being the integration
> of realtime search into HBase. I think the use case makes sense as
> HBase is realtime and has a write-ahead log, performs automatic
> partitioning, splitting of data, failover, redundancy, etc. These are
> all things Lucene does not have out of the box, that we'd essentially
> get for 'free'.
>
> For starters: Where would be the right place to store Lucene segments
> or postings? Eg, we need to be able to efficiently perform a linear
> iteration of the per-term posting list(s).
>
> Thanks!
>
> Jason Rutherglen
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-04-15, 01:18
Since posting this I started working on HBASE-3529, the goal of which is to integrate Lucene into HBase, with an eye towards fully integrating realtime search when it's available in Lucene. RT'll give immediate consistency of HBase puts into the search index. The first challenge has been how to perform queries on index files stored in HDFS without speed degradation.

To solve that problem, I took the general notion of HDFS-347 and instead now directly obtain a single block's java.io.File and memory map it for Lucene's usage. The benchmarks show that this system is viable for Lucene queries. The code is still rough; I will be cleaning it up and making it easier for others to assemble and try on their own.

There is work to be done on splitting the indexes and moving Lucene indexes (to the local data node) when HBase rebalances a region. Perhaps we can discuss issues on the dev list. Comments are welcome.
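
As an illustration of the memory-mapping step described in this message, here is a minimal sketch using only the standard java.nio API. It assumes the caller already holds a `java.io.File` for a block stored on the local data node; how HBASE-3529 obtains that file is not shown in this thread, and `BlockFileMapper` and `mapReadOnly` are hypothetical names, not code from the patch.

```java
import java.io.File;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: memory-maps a local file read-only. */
class BlockFileMapper {
    /**
     * Maps the whole file into memory for reading.
     * blockFile is assumed to be a block file on the local disk.
     * FileChannel.map rejects sizes above Integer.MAX_VALUE bytes,
     * so a larger file would have to be mapped in several pieces.
     */
    static MappedByteBuffer mapReadOnly(File blockFile) throws IOException {
        try (FileChannel channel = FileChannel.open(blockFile.toPath(), StandardOpenOption.READ)) {
            // The mapping remains valid after the channel is closed.
            return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }
}
```

Reads against the returned buffer are served from the operating system's page cache, so repeated access to the same block avoids a round trip through the HDFS client read path, which is the point of mapping the block file directly.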
-
Re: HBase and Lucene for realtime searchTed Yu 2011-04-15, 02:41
Jason:
I logged https://issues.apache.org/jira/browse/HBASE-3786
Feel free to comment there.

On Thu, Apr 14, 2011 at 6:18 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> Since posting this I started working on HBASE-3529, the goal of which
> is to integrate Lucene into HBase, with an eye towards fully
> integrating realtime search when it's available in Lucene. RT'll give
> immediate consistency of HBase puts into the search index. The first
> challenge has been how to perform queries on index files stored in
> HDFS without speed degradation.
>
> To solve that problem, I took the general notion of HDFS-347 and
> instead now directly obtain a single block's java.io.File and memory
> map it for Lucene's usage. The benchmarks show that this system is
> viable for Lucene queries. The code is still rough; I will be
> cleaning it up and making it easier for others to assemble and try on
> their own.
>
> There is work to be done on splitting the indexes and moving Lucene
> indexes (to the local data node) when HBase rebalances a region.
> Perhaps we can discuss issues on the dev list. Comments are welcome.
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-04-15, 13:19
Ted, thanks!

On Thu, Apr 14, 2011 at 7:41 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> Jason:
> I logged https://issues.apache.org/jira/browse/HBASE-3786
> Feel free to comment there.
>
> On Thu, Apr 14, 2011 at 6:18 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
>
>> Since posting this I started working on HBASE-3529, the goal of which
>> is to integrate Lucene into HBase, with an eye towards fully
>> integrating realtime search when it's available in Lucene. RT'll give
>> immediate consistency of HBase puts into the search index. The first
>> challenge has been how to perform queries on index files stored in
>> HDFS without speed degradation.
>>
>> To solve that problem, I took the general notion of HDFS-347 and
>> instead now directly obtain a single block's java.io.File and memory
>> map it for Lucene's usage. The benchmarks show that this system is
>> viable for Lucene queries. The code is still rough; I will be
>> cleaning it up and making it easier for others to assemble and try on
>> their own.
>>
>> There is work to be done on splitting the indexes and moving Lucene
>> indexes (to the local data node) when HBase rebalances a region.
>> Perhaps we can discuss issues on the dev list. Comments are welcome.
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-04-15, 16:15
Previously in this thread there was concern about the indexing speed of Lucene vs. HBase. While the throughput will certainly not be as high when building a search index in conjunction with HBase, it should be quite good nonetheless. Here's a link to a discussion on this: http://bit.ly/dGxlEp

Here are the two links at the bottom of the thread:

http://blog.jteam.nl/2011/04/01/gimme-all-resources-you-have-i-can-use-them/
http://blog.mikemccandless.com/2010/09/lucenes-indexing-is-fast.html
-
Re: HBase and Lucene for realtime searchtsuna 2011-04-20, 06:50
On Sat, Feb 12, 2011 at 7:13 AM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
>> solr/katta/elasticsearch
>
> These don't have a distributed solution for realtime search [yet].

Sorry if this is a naive question but can you explain why you consider that ElasticSearch isn't a distributed solution for realtime search?

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
-
Re: HBase and Lucene for realtime searchOtis Gospodnetic 2011-04-20, 12:06
That's some old email.... :)

I think what Jason is doing is not so much about trying to get (N)RT search (which already exists in raw Lucene, in ES, in Zoie, Sensei, and eventually will be in Solr), but trying to get full-text search via Lucene tightly integrated with data storage via HBase. When data is added to HBase it should be *indexed* immediately, in RT, at the level of this code Jason wrote instead of at the application level ("add to HBase, then index to ES"), or by periodically polling the DB for changes and updating the index.

At least that is what I think Jason's goal with this effort was.

Otis
--
We're hiring HBase hackers for Data Mining and Analytics
http://blog.sematext.com/2011/04/18/hiring-data-mining-analytics-machine-learning-hackers/

----- Original Message ----
> From: tsuna <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wed, April 20, 2011 2:50:43 AM
> Subject: Re: HBase and Lucene for realtime search
>
> On Sat, Feb 12, 2011 at 7:13 AM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
> >> solr/katta/elasticsearch
> >
> > These don't have a distributed solution for realtime search [yet].
>
> Sorry if this is a naive question but can you explain why you consider
> that ElasticSearch isn't a distributed solution for realtime search?
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ www.StumbleUpon.com
-
Re: HBase and Lucene for realtime searchtsuna 2011-04-20, 20:25
On Wed, Apr 20, 2011 at 5:06 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> That's some old email.... :)

Sorry I'm catching up just now :D

> I think what Jason is doing is not so much about trying to get (N)RT search
> (which already exists in raw Lucene, in ES, in Zoie, Sensei, and eventually will
> be in Solr), but trying to get full-text search via Lucene tightly integrated
> with data storage via HBase. When data is added to HBase it should be *indexed*
> immediately, in RT, at the level of this code Jason wrote instead of at the
> application level ("add to HBase, then index to ES"), or by periodically polling
> the DB for changes and updating the index.

Ah, OK. So I see someone already recommended the Coprocessor approach to send updates as things are written to HBase. When I talked with Shay (the author of ElasticSearch) he was interested in something like this to build a "river" to stream updates from HBase to ES. Another alternative would be to have a sink that uses HBase replication to replicate edits to ES.

--
Benoit "tsuna" Sigoure
Software Engineer @ www.StumbleUpon.com
-
Re: HBase and Lucene for realtime searchJason Rutherglen 2011-04-20, 20:55
> Sorry if this is a naive question but can you explain why you consider
> that ElasticSearch isn't a distributed solution for realtime search?

I wasn't referring just to ES, mainly to Katta and Solr. Taking a step back, RT in Lucene should enable immediate consistency, making it symmetrical with HBase? Outside of that there are 'containers' for Lucene, some of which are Katta, Solr, and ES. My opinion is that they each have drawbacks compared to HBase as a Lucene container. If one is running HBase in production, then adding a Lucene index on that data shouldn't add more complexity to operating HBase. And so if one's primary data store is HBase, my opinion is that one'd be adding significant additional complexity by adding 'another' cluster server system alongside, especially given the requirements and symmetrical (e.g., write-once, immediate consistency) nature of HBase and Lucene.

Once everything is polished I think it'll be a nice solution that can replace many MySQL deployments for realtime data access, one that'd offer even more types of queries and scalability than MySQL. If the user wishes to perform joins they can use Hive?