HBase, mail # user - use hbase as distributed crawl's scheduler


Re: use hbase as distributed crawl's scheduler
Otis Gospodnetic 2014-01-03, 06:33
Hi,

Yes. I'm sure that would be a welcome addition.  Topic for user@nutch.a.o...

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/
On Fri, Jan 3, 2014 at 1:23 AM, James Taylor <[EMAIL PROTECTED]> wrote:

> Otis,
> I didn't realize Nutch uses HBase underneath. Might be interesting if you
> serialized data in a Phoenix-compliant manner, as you could run SQL queries
> directly on top of it.
>
> Thanks,
> James
>
>
> On Thu, Jan 2, 2014 at 10:17 PM, Otis Gospodnetic <
> [EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > Have a look at http://nutch.apache.org .  Version 2.x uses HBase under
> the
> > hood.
> >
> > Otis
> > --
> > Performance Monitoring * Log Analytics * Search Analytics
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <[EMAIL PROTECTED]> wrote:
> >
> > > hi all,
> > >      I want to use HBase to store all URLs (crawled or not crawled).
> > > Each URL will have a column named priority, which represents the
> > > priority of that URL. I want to get the top N URLs ordered by
> > > priority (if the priority is the same, the URL with the earlier
> > > timestamp is preferred).
> > >      If I were using something like MySQL, my client application
> > > might look like:
> > >      while true:
> > >          select url from url_db where status='not_crawled'
> > >          order by priority, addedTime limit 1000;
> > >          do something with these urls;
> > >          extract more urls and insert them into url_db;
> > >      How should I design an HBase schema for this application? Is
> > > HBase suitable for me?
> > >      I found an article,
> > > http://blog.semantics3.com/how-we-built-our-almost-distributed-web-crawler/,
> > > in which they use Redis to store URLs. I think HBase originated
> > > from Bigtable, and Google used Bigtable to store webpages, so for a
> > > huge number of URLs I prefer a distributed system like HBase.
> > >
> >
>
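One common way to approach the schema question above (my own sketch, not from the original messages) is to encode priority and timestamp into the row key itself, so that HBase's lexicographic byte ordering makes a plain Scan return URLs in (priority asc, addedTime asc) order. The class and field names below are illustrative assumptions; a real table would also need a way to mark rows as crawled (e.g. deleting them from a queue table after fetching).

```java
import java.nio.ByteBuffer;

// Sketch: build an HBase-style row key of the form
//   [4-byte big-endian priority][8-byte big-endian timestamp][url bytes]
// so that unsigned lexicographic byte order matches
// (priority asc, addedTime asc). Assumes non-negative priorities;
// signed values would need their sign bit flipped before encoding.
public class UrlRowKey {

    static byte[] rowKey(int priority, long addedTime, String url) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 8 + url.length());
        buf.putInt(priority);    // ByteBuffer writes big-endian by default
        buf.putLong(addedTime);
        buf.put(url.getBytes());
        return buf.array();
    }

    // Unsigned lexicographic compare, the same ordering HBase applies
    // to row keys (equivalent to Bytes.compareTo in the HBase client).
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] k1 = rowKey(1, 2000L, "http://a.example/");
        byte[] k2 = rowKey(1, 1000L, "http://b.example/");
        byte[] k3 = rowKey(2, 500L, "http://c.example/");
        // Same priority: the earlier timestamp sorts first.
        System.out.println(compare(k2, k1) < 0);  // true
        // A lower priority value sorts first regardless of timestamp.
        System.out.println(compare(k1, k3) < 0);  // true
    }
}
```

With keys laid out this way, the "top 1000 not-crawled URLs" query from the original post becomes a single Scan from the start of the table with a row limit, rather than a sort. The usual caveat applies: all readers hit the region holding the lowest keys, so a busy crawler may want to salt or partition the key space.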