Hadoop >> mail # user >> execute millions of "grep"


Re: execute millions of "grep"
Hi,

Hm - I was not assuming that you'll have >20mln queries.
However, you can partition the query space into as many parts as needed to
fit each part in memory, and accordingly increase the number of passes over
a document - or just do it in parallel with mapreduce.
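A rough Python sketch of this partitioned, multi-pass idea (the function name and the phrase-set matcher are illustrative stand-ins, not anything from this thread; a real implementation would use the word automaton per part):

```python
# Partition the query set so each part fits in memory, then make one
# pass over the document per part.

def match_in_parts(queries, document, part_size=1_000_000, max_len=5):
    """Yield every query phrase found in the document,
    processing part_size queries per pass."""
    queries = list(queries)
    words = document.split()
    for i in range(0, len(queries), part_size):
        # in-memory structure for this part only (a plain set stands
        # in here for the word trie/automaton)
        part = set(queries[i:i + part_size])
        # one pass over the document for this part
        for n in range(1, max_len + 1):          # phrase lengths
            for j in range(len(words) - n + 1):
                phrase = " ".join(words[j:j + n])
                if phrase in part:
                    yield phrase
```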

I am not sure I understand how your inverted index solution works. Can you
elaborate?
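For concreteness, here is a minimal Python sketch of the word-level trie scan from the earlier mail quoted below (the class and method names are made up for illustration):

```python
# Word-level trie: edges are whole words, a leaf marker records a
# complete query phrase. Scanning a document keeps a set of "active"
# trie nodes and advances each of them by the current word.

class WordTrie:
    def __init__(self):
        self.root = {}
        self.END = object()   # sentinel key marking a complete query

    def add(self, phrase):
        node = self.root
        for word in phrase.split():
            node = node.setdefault(word, {})
        node[self.END] = phrase

    def scan(self, document):
        """Single pass over the document; yields every query phrase
        that ends at each scanned word."""
        active = []
        for word in document.split():
            next_active = []
            # advance each active node, and also try to start a new
            # match at the root on every word
            for node in active + [self.root]:
                child = node.get(word)
                if child is not None:
                    next_active.append(child)
                    if self.END in child:   # a query ends here
                        yield child[self.END]
            active = next_active
```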

On Thu, Nov 3, 2011 at 3:55 PM, Oliver Krohne <[EMAIL PROTECTED]> wrote:

> Hi Eugene,
>
> thanks for the hint. I looked at automaton and of course 100k are not
> consuming a lot of memory and a single pass through the document is
> perfect. But how will it work with 20 million and more queries?
> In addition there will be different sets of queries for each language. So
> we can expect to have 20 million + queries for each language. Of course
> they would be different automatons.
>
> Another approach I am thinking of is to build an inverted index of the
> queries and during map I will lookup the queries for each word and during
> reduce do normal query processing. What do you think about that ?
>
> regards,
> Oliver
>
> On 03.11.2011 at 12:37, Eugene Kirpichov wrote:
>
> > Hi Oliver,
> > I have solved a similar problem before, and it seems to me that the best
> solution is to build an automaton (with whole words, not letters on edges)
> on the set of queries (basically, a "trie") that will allow you to find all
> queries matching a document by a single pass over the document.
> > For 100k queries, the trie will be very fast and very easy to build, and
> it will not consume much memory.
> > The document scan is as follows for each word:
> > - add root of trie to "active nodes" set
> > - for each node in the active set, replace the node with its child
> corresponding to the scanned word.
> >
> > Then whenever you add a leaf node to the active set, you are seeing a
> match of some query that ends at the current word.
> >
> > Please tell me if this description needs more detail.
> >
> > On 03.11.2011, at 15:20, Oliver Krohne <[EMAIL PROTECTED]> wrote:
> >
> >> Hi Eugene,
> >>
> >> thanks for the quick reply.
> >>
> >> Not only the queries are changing but also the documents:
> >>
> >> a) The long-term goal is to be able to process a document or a batch of
> documents after a certain event, which is what I mean by "realtime". There
> will be different types of documents which are updated differently. One set
> of documents will be updated every 1-2 hours; other documents will be
> updated once a week. An event means that the queries have to run against
> the set of documents which were updated. So I probably need to run the
> queries every hour against one subset, and once a week the other documents
> need to be searched.
> >>
> >> b) The queries change a couple of times a day. At this stage we have
> around 100k queries, but they will grow every day. A large set of the
> queries remains the same, but I expect that after 3 months 10% of the
> queries will change every day: either they are deleted or new ones are
> added.
> >>
> >> One of the main questions I am still thinking about is what to search
> in: search for the queries in the documents, or search for the document
> terms in the queries.
> >>
> >> At first I will need exact phrase-query matching, and in a next step
> phrase matching given a precision/slop.
> >>
> >> Thanks for your help,
> >> Oliver
> >>
> >>
> >> On 03.11.2011 at 11:52, Eugene Kirpichov wrote:
> >>
> >>> If you really need to do millions of exact text queries against
> millions of
> >>> documents in realtime, a simple grep is not going to be sufficient for
> you.
> >>> You'll need smarter data structures and algorithms.
> >>>
> >>> Please specify how frequently the set of *queries* changes and what you
> >>> consider "real time".
> >>>
> >>> On Thu, Nov 3, 2011 at 2:46 PM, Oliver Krohne <
> [EMAIL PROTECTED]> wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> I am evaluating different solutions for massive phrase query
> execution. I
> >>>> need to execute millions of greps or more precise phrase queries
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/