Re: execute millions of "grep"
If you really need to run millions of exact text queries against millions of
documents in real time, a simple grep is not going to be sufficient for you.
You'll need smarter data structures and algorithms.

Please specify how frequently the set of *queries* changes and what you
consider "real time".
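
To illustrate what a smarter data structure could look like here, below is a
minimal sketch, not a recommendation from this thread. It assumes the queries
are exact phrases of 1-4 whitespace-separated terms (as described in the
original question), that case-insensitive matching is acceptable, and that the
whole query set fits in memory. The names build_phrase_index and
match_document are hypothetical. Instead of running every query against every
document, it hashes the queries once and then looks up each 1-4 word window of
a document.

```python
from collections import defaultdict

MAX_TERMS = 4  # assumption: phrases have 1-4 terms, per the original question


def tokenize(text):
    """Lowercase and split on whitespace (a real system needs a proper tokenizer)."""
    return text.lower().split()


def build_phrase_index(phrases):
    """Map each normalized phrase (a tuple of terms) to the ids of queries using it."""
    index = defaultdict(list)
    for query_id, phrase in enumerate(phrases):
        terms = tuple(tokenize(phrase))
        if 1 <= len(terms) <= MAX_TERMS:
            index[terms].append(query_id)
    return index


def match_document(index, text):
    """Return the set of query ids whose phrase occurs in the document."""
    words = tokenize(text)
    hits = set()
    for start in range(len(words)):
        for n in range(1, MAX_TERMS + 1):
            if start + n > len(words):
                break
            # One hash lookup per window; .get avoids growing the defaultdict.
            hits.update(index.get(tuple(words[start:start + n]), ()))
    return hits


phrases = ["hadoop grep", "distributed cache", "real time", "grep"]
index = build_phrase_index(phrases)
print(match_document(index, "The Hadoop grep example uses one regex"))
```

With this layout, a 500-word document costs at most about 2,000 hash lookups
(4 window sizes per word position) no matter how many queries are indexed, so
the work per document no longer grows with the number of queries.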

On Thu, Nov 3, 2011 at 2:46 PM, Oliver Krohne <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am evaluating different solutions for massive phrase query execution. I
> need to execute millions of greps or, more precisely, phrase queries
> consisting of 1-4 terms against millions of documents. I saw the Hadoop grep
> example, but it executes grep with a single regex.
>
> I also saw the "Side data distribution" / "Distributed Cache" possibility
> of Hadoop. So I could pass the queries to the mapper and execute each query
> against the input line. The input line would be the entire text of a
> document (usually 50-500 words).
>
> As I am aiming to have this information almost in real time, another
> question arises about ad hoc map/reduce jobs. Is there a limit on running a
> lot of jobs in parallel, let's say if I fired a new job whenever a new
> document arrives? That job would process only that particular document. Or I
> would batch 100-1000 documents and then fire the job.
>
> Can anyone advise on an approach for doing this with Hadoop?
>
> Thanks in advance,
> Oliver
>
--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/