|
|
-
Re: execute millions of "grep"Eugene Kirpichov 2011-11-03, 12:16
Hi,
Hm - I was not assuming that you'll have >20mln queries. However, you can partition the query space into as many parts as needed to fit each part in memory, and accordingly increase the number of passes over a document - or just do it in parallel with mapreduce. I am not sure I understand how your inverted index solution works. Can you elaborate? On Thu, Nov 3, 2011 at 3:55 PM, Oliver Krohne <[EMAIL PROTECTED]>wrote: > Hi Eugene, > > thanks for the hint. I looked at automaton and of course 100k are not > consuming a lot of memory and a single pass through the document is > perfect. But how will it work with 20 million and more queries? > In addition there will be different sets of queries for each language. So > we can expect to have 20 million + queries for each language. Of course > they would be different automatons. > > Another approach I am thinking of is to build an inverted index of the > queries and during map I will lookup the queries for each word and during > reduce do normal query processing. What do you think about that ? > > regards, > Oliver > > Am 03.11.2011 um 12:37 schrieb Eugene Kirpichov: > > > Hi Oliver, > > I have solved a similar problem before, and it seems to me that the best > solution is to build an automaton (with whole words, not letters on edges) > on the set of queries (basically, a "trie") that will allow you to find all > queries matching a document by a single pass over the document. > > For 100k queries, the true will be very fast and very easy to build, and > it will not consume much memory. > > The document scan is as follows for each word: > > - add root of trie to "active nodes" set > > - for each node in the active set, replace the node with its child > corresponding to the scanned word. > > > > Then whenever you add a leaf node to the active set, you are seeing a > match of some query that ends at the current word. > > > > Please tell if this description needs more detail. > > > > 03.11.2011, в 15:20, Oliver Krohne <[EMAIL PROTECTED]> > написал(а): > > > >> Hi Eugene, > >> > >> thanks for the quick reply. > >> > >> Not only the queries are changing but also the documents: > >> > >> a)The long time goal would be able to process a document or a batch of > documents after a certain event which a mean with "realtime". There will be > different types of documents which are updated differently. One set > documents will be updated every 1-2 hours. Other documents will be updated > once a week. So an event means that the queries have to run against that > set documents which are updated. So probably I need to run the queries > every hour against against a subset and one a week the other documents need > to be searched. > >> > >> b)The queries are changing a couple of times a day. At this stage we > have around 100k queries but the will grow every day. A huge set of the > queries remain the same but I expect that after 3 month 10% of the queries > will change every day. Either they are deleted or new ones are added. > >> > >> One of the main question am still thinking of is where to search in. > Search the queries in the documents or search the document terms in the > queries. > >> > >> I will need at first exact match of a phrase query and in a next step > phrase match given a precision/slop. > >> > >> Thanks for your help, > >> Oliver > >> > >> > >> Am 03.11.2011 um 11:52 schrieb Eugene Kirpichov: > >> > >>> If you really need to do millions of exact text queries against > millions of > >>> documents in realtime, a simple grep is not going to be sufficient for > you. > >>> You'll need smarter datastructures and algorithms. > >>> > >>> Please specify how frequently the set of *queries* changes and what you > >>> consider "real time". > >>> > >>> On Thu, Nov 3, 2011 at 2:46 PM, Oliver Krohne < > [EMAIL PROTECTED]>wrote: > >>> > >>>> Hi, > >>>> > >>>> I' am evaluating different solutions for massive phrase query > execution. I > >>>> need to execute millions of greps or more precise phrase queries Eugene Kirpichov Principal Engineer, Mirantis Inc. http://www.mirantis.com/ Editor, http://fprog.ru/ |