HBase >> mail # user >> use hbase as distributed crawl's scheduler


Re: use hbase as distributed crawl's scheduler
Hi,

Yes. I'm sure that would be a welcome addition.  Topic for user@nutch.a.o...

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
On Fri, Jan 3, 2014 at 1:23 AM, James Taylor <[EMAIL PROTECTED]> wrote:

> Otis,
> I didn't realize Nutch uses HBase underneath. Might be interesting if you
> serialized data in a Phoenix-compliant manner, as you could run SQL queries
> directly on top of it.
>
> Thanks,
> James
>
>
> On Thu, Jan 2, 2014 at 10:17 PM, Otis Gospodnetic <
> [EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > Have a look at http://nutch.apache.org .  Version 2.x uses HBase under the
> > hood.
> >
> > Otis
> > --
> > Performance Monitoring * Log Analytics * Search Analytics
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <[EMAIL PROTECTED]> wrote:
> >
> > > Hi all,
> > >      I want to use HBase to store all URLs (crawled or not crawled).
> > > Each URL will have a column named priority, which represents the
> > > priority of the URL. I want to get the top N URLs ordered by priority
> > > (if the priority is the same, the URL with the earlier timestamp is
> > > preferred).
> > >      With something like MySQL, my client application might look like:
> > >      while true:
> > >          select url from url_db where status='not_crawled'
> > >              order by priority, addedTime limit 1000;
> > >          do something with these urls;
> > >          extract more urls and insert them into url_db;
> > >      How should I design the HBase schema for this application? Is HBase
> > > suitable for this?
> > >      I found in this article,
> > > http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/ ,
> > > that they use Redis to store URLs. I think HBase originated from
> > > Bigtable, and Google uses Bigtable to store web pages, so for a huge
> > > number of URLs I prefer a distributed system like HBase.
> > >
> >
>
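
One possible way to approach the schema question in the original post, offered only as a rough sketch and not as anything proposed in this thread: HBase returns rows in byte order of the row key, so a key built from the priority, then the added time, then the URL would let a plain scan with a row limit stand in for "order by priority, addedTime limit N". The example below does not talk to HBase at all; it only shows that such keys sort in the wanted order. The names queue_row_key, parse_row_key and top_n are made up for illustration, and it assumes priorities and timestamps are non-negative integers, that a smaller priority value should come first (as in the SQL above), and that only not-yet-crawled URLs are kept under these keys.

```python
import struct

# Hypothetical key layout (an assumption, not from the thread):
#   4-byte big-endian priority | 8-byte big-endian added time (ms) | URL bytes
KEY_PREFIX = ">IQ"
PREFIX_LEN = struct.calcsize(KEY_PREFIX)  # 12 bytes, no padding with ">"


def queue_row_key(priority: int, added_ms: int, url: str) -> bytes:
    """Build a row key whose byte order matches (priority, added time, url)."""
    # Fixed-width big-endian unsigned integers compare bytewise in numeric order.
    return struct.pack(KEY_PREFIX, priority, added_ms) + url.encode("utf-8")


def parse_row_key(key: bytes):
    """Split a row key back into (priority, added_ms, url)."""
    priority, added_ms = struct.unpack(KEY_PREFIX, key[:PREFIX_LEN])
    return priority, added_ms, key[PREFIX_LEN:].decode("utf-8")


def top_n(keys, n):
    """Stand-in for a scan with a row limit: sort the keys, take the first n."""
    return [parse_row_key(k) for k in sorted(keys)[:n]]


pending = [
    queue_row_key(2, 1388700000000, "http://example.com/c"),
    queue_row_key(1, 1388700005000, "http://example.com/b"),
    queue_row_key(1, 1388700001000, "http://example.com/a"),
]
first_two = top_n(pending, 2)
for priority, added_ms, url in first_two:
    print(priority, added_ms, url)
```

With keys like these, marking a URL as crawled would mean deleting its row from the pending range (and recording it elsewhere) rather than updating a status column. One thing to watch is that every client reads from the same low end of the key space, which can concentrate load on one region; whether that matters depends on the crawl rate.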