Hadoop >> mail # user >> execute millions of "grep"

Re: execute millions of "grep"
If you really need to do millions of exact text queries against millions of
documents in real time, a simple grep is not going to be sufficient for you.
You'll need smarter data structures and algorithms.
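
To make "smarter data structures" a bit more concrete, here is a minimal sketch of one possible approach, not necessarily the one that fits your setup: index the *queries* instead of the documents. Each phrase is stored in a hash map keyed by its terms, and every document is scanned once for all word n-grams of length 1-4 (the phrase length given in the original question). The function names, the whitespace tokenizer, and the sample queries are assumptions made up for this illustration.

```python
from collections import defaultdict

MAX_TERMS = 4  # the original question mentions phrase queries of 1-4 terms


def tokenize(text):
    """Lowercase and split on whitespace (assumption: a real system needs a proper analyzer)."""
    return text.lower().split()


def build_phrase_index(queries):
    """Map each normalized phrase (a tuple of terms) to the ids of the queries using it."""
    index = defaultdict(set)
    for query_id, phrase in queries.items():
        terms = tuple(tokenize(phrase))
        if 1 <= len(terms) <= MAX_TERMS:
            index[terms].add(query_id)
    return index


def match_document(index, text):
    """Return the ids of all indexed queries whose phrase occurs in the document."""
    tokens = tokenize(text)
    matched = set()
    for start in range(len(tokens)):
        for length in range(1, MAX_TERMS + 1):
            if start + length > len(tokens):
                break
            # .get() avoids inserting empty entries into the defaultdict
            hits = index.get(tuple(tokens[start:start + length]))
            if hits:
                matched |= hits
    return matched


# Hypothetical sample data for illustration only
queries = {1: "map reduce", 2: "distributed cache", 3: "hadoop", 4: "side data distribution"}
index = build_phrase_index(queries)
print(match_document(index, "Hadoop ships a Distributed Cache for side data"))
```

With this layout a document of n words costs at most 4n hash lookups, independent of how many queries are indexed, whereas running each query as a separate grep scales with the number of queries.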

Please specify how frequently the set of *queries* changes and what you
consider "real time".

On Thu, Nov 3, 2011 at 2:46 PM, Oliver Krohne <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I am evaluating different solutions for massive phrase query execution. I
> need to execute millions of greps or, more precisely, phrase queries
> consisting of 1-4 terms against millions of documents. I saw the Hadoop grep
> example, but it executes grep with a single regex.
>
> I also saw the "Side data distribution" / "Distributed Cache" possibility
> of Hadoop. So I could pass the queries to the mappers and execute each query
> against the input line. The input line would be the entire text of a document
> (usually 50-500 words).
>
> As I am aiming to have this information almost in real time, another
> question arises about ad hoc map/reduce jobs. Is there a limit on running a
> lot of jobs in parallel, let's say if I fired a new job whenever a new
> document arrives? This job would only process that particular document. Or I
> could batch 100-1000 documents and then fire the job.
>
> Can anyone advise an approach for doing this with Hadoop?
>
> Thanks in advance,
> Oliver
>
--
Eugene Kirpichov
Principal Engineer, Mirantis Inc. http://www.mirantis.com/
Editor, http://fprog.ru/