Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop >> mail # user >> execute millions of "grep"


Copy link to this message
-
Re: execute millions of "grep"
Hi Oliver,
I have solved a similar problem before, and it seems to me that the best solution is to build an automaton (with whole words, not letters on edges) on the set of queries (basically, a "trie") that will allow you to find all queries matching a document by a single pass over the document.
For 100k queries, the true will be very fast and very easy to build, and it will not consume much memory.
The document scan is as follows for each word:
 - add root of trie to "active nodes" set
 - for each node in the active set, replace the node with its child corresponding to the scanned word.

Then whenever you add a leaf node to the active set, you are seeing a match of some query that ends at the current word.

Please tell if this description needs more detail.

03.11.2011, в 15:20, Oliver Krohne <[EMAIL PROTECTED]> н��писал(а):

> Hi Eugene,
>
> thanks for the quick reply.
>
> Not only the queries are changing but also the documents:
>
> a)The long time goal would be able to process a document or a batch of documents after a certain event which a mean with "realtime". There will be different types of documents which are updated differently. One set documents will be updated every 1-2 hours. Other documents will be updated once a week. So an event means that the queries have to run against that set documents which are updated. So probably I need to run the queries every hour against against a subset and one a week the other documents need to be searched.
>
> b)The queries are changing a couple of times a day. At this stage we have around 100k queries but the will grow every day. A huge set of the queries remain the same but I expect that after 3 month 10% of the queries will change every day. Either they are deleted or new ones are added.
>
> One of the main question am still thinking of is where to search in. Search the queries in the documents or search the document terms in the queries.
>
> I will need at first exact match of a phrase query and in a next step phrase match given a precision/slop.
>
> Thanks for your help,
> Oliver
>
>
> Am 03.11.2011 um 11:52 schrieb Eugene Kirpichov:
>
>> If you really need to do millions of exact text queries against millions of
>> documents in realtime, a simple grep is not going to be sufficient for you.
>> You'll need smarter datastructures and algorithms.
>>
>> Please specify how frequently the set of *queries* changes and what you
>> consider "real time".
>>
>> On Thu, Nov 3, 2011 at 2:46 PM, Oliver Krohne <[EMAIL PROTECTED]>wrote:
>>
>>> Hi,
>>>
>>> I' am evaluating different solutions for massive phrase query execution. I
>>> need to execute millions of greps or more precise phrase queries consisting
>>> of 1-4 terms against millions of documents. I saw the hadoop grep example
>>> but this is executing grep with one regex.
>>>
>>> I also saw the "Side data distribution" / "Distributed Cache" possibility
>>> of hadoop. So I could pass them to the mapper and execute each query agains
>>> the input line. The input line would be the entire text of an document
>>> (usually 50-500 words).
>>>
>>> As I am aiming to  have these information almost in realtime another
>>> questions arises about adhoc map/reduce jobs. Is there a limit of running a
>>> lot of jobs in parallel, lets say if I would fire a new job once a new
>>> document arises. This job would only process that particular document. Or I
>>> would batch 100-1000 documents and then fire the job.
>>>
>>> Can anyone advise an approach of doing it with hadoop?
>>>
>>> Thanks in advance,
>>> Oliver
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> Eugene Kirpichov
>> Principal Engineer, Mirantis Inc. http://www.mirantis.com/
>> Editor, http://fprog.ru/
>
> ---
>
> Oliver Krohne
> Founder & CTO
>
> YieldKit UG (haftungsbeschränkt)
> Mittelweg 161
> 20148 Hamburg
>
> T +49 40 209 349 771
> F +49 40 209 349 779
> E [EMAIL PROTECTED]
>
> http://www.yieldkit.com