-Re: execute millions of "grep"
Eugene Kirpichov 2011-11-03, 12:16
Hm - I was not assuming that you'll have >20mln queries.
However, you can partition the query space into as many parts as needed to
fit each part in memory, and accordingly increase the number of passes over
a document - or just do it in parallel with mapreduce.
I am not sure I understand how your inverted index solution works. Can you
On Thu, Nov 3, 2011 at 3:55 PM, Oliver Krohne <[EMAIL PROTECTED]>wrote:
> Hi Eugene,
> thanks for the hint. I looked at automaton and of course 100k are not
> consuming a lot of memory and a single pass through the document is
> perfect. But how will it work with 20 million and more queries?
> In addition there will be different sets of queries for each language. So
> we can expect to have 20 million + queries for each language. Of course
> they would be different automatons.
> Another approach I am thinking of is to build an inverted index of the
> queries and during map I will lookup the queries for each word and during
> reduce do normal query processing. What do you think about that ?
> Am 03.11.2011 um 12:37 schrieb Eugene Kirpichov:
> > Hi Oliver,
> > I have solved a similar problem before, and it seems to me that the best
> solution is to build an automaton (with whole words, not letters on edges)
> on the set of queries (basically, a "trie") that will allow you to find all
> queries matching a document by a single pass over the document.
> > For 100k queries, the true will be very fast and very easy to build, and
> it will not consume much memory.
> > The document scan is as follows for each word:
> > - add root of trie to "active nodes" set
> > - for each node in the active set, replace the node with its child
> corresponding to the scanned word.
> > Then whenever you add a leaf node to the active set, you are seeing a
> match of some query that ends at the current word.
> > Please tell if this description needs more detail.
> > 03.11.2011, в 15:20, Oliver Krohne <[EMAIL PROTECTED]>
> >> Hi Eugene,
> >> thanks for the quick reply.
> >> Not only the queries are changing but also the documents:
> >> a)The long time goal would be able to process a document or a batch of
> documents after a certain event which a mean with "realtime". There will be
> different types of documents which are updated differently. One set
> documents will be updated every 1-2 hours. Other documents will be updated
> once a week. So an event means that the queries have to run against that
> set documents which are updated. So probably I need to run the queries
> every hour against against a subset and one a week the other documents need
> to be searched.
> >> b)The queries are changing a couple of times a day. At this stage we
> have around 100k queries but the will grow every day. A huge set of the
> queries remain the same but I expect that after 3 month 10% of the queries
> will change every day. Either they are deleted or new ones are added.
> >> One of the main question am still thinking of is where to search in.
> Search the queries in the documents or search the document terms in the
> >> I will need at first exact match of a phrase query and in a next step
> phrase match given a precision/slop.
> >> Thanks for your help,
> >> Oliver
> >> Am 03.11.2011 um 11:52 schrieb Eugene Kirpichov:
> >>> If you really need to do millions of exact text queries against
> millions of
> >>> documents in realtime, a simple grep is not going to be sufficient for
> >>> You'll need smarter datastructures and algorithms.
> >>> Please specify how frequently the set of *queries* changes and what you
> >>> consider "real time".
> >>> On Thu, Nov 3, 2011 at 2:46 PM, Oliver Krohne <
> [EMAIL PROTECTED]>wrote:
> >>>> Hi,
> >>>> I' am evaluating different solutions for massive phrase query
> execution. I
> >>>> need to execute millions of greps or more precise phrase queries
Principal Engineer, Mirantis Inc. http://www.mirantis.com/