HBase >> mail # user >> use hbase as distributed crawl's scheduler


Re: use hbase as distributed crawl's scheduler
Yes, sorry ;) Thanks for the correction.

Should have been:
"One table with the URLs already crawled (80 million), one table with the
URLs to crawl (2 billion) and one table with the URLs being processed. I'm
not running any SQL requests against my dataset but I have MR jobs doing many
different things. I have many other tables to help with the work on the
URLs."
2014/1/3 Ted Yu <[EMAIL PROTECTED]>

> bq. One URL ...
>
> I guess you mean one table ...
>
> Cheers
>
> On Jan 3, 2014, at 4:19 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]>
> wrote:
>
> > Interesting. This is exactly what I'm doing ;)
> >
> > I'm using 3 tables to achieve this.
> >
> > One table with the URLs already crawled (80 million), one URL with the
> > URLs to crawl (2 billion) and one URL with the URLs being processed. I'm not
> > running any SQL requests against my dataset but I have MR jobs doing many
> > different things. I have many other tables to help with the work on the
> > URLs.
> >
> > I'm "salting" the keys using the URL hash so I can find them again very
> > quickly. There can be some collisions, so I also store the URL itself in
> > the key. Very small scans returning 1 or sometimes 2 rows then let me
> > quickly find a row knowing the URL.
> >
> > I also have secondary index tables to store the CRCs of the pages to
> > identify duplicate pages based on this value.
> >
> > And so on ;) I have been working on that for 2 years now. I might have
> > been able to use Nutch and others, but my goal was to learn and do that
> > with a distributed client on a single dataset...
> >
> > Enjoy.
> >
> > JM
> >
>
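
For readers who want to see the key layout described above, here is a minimal sketch of a hash-prefixed row key with the full URL appended, plus the small prefix scan used to find a row from its URL. It is not JM's actual code: the 4-byte SHA-256 prefix, the `row_key` and `find_url` helpers, and the in-memory `SortedTable` (a stand-in for an HBase table, which keeps rows sorted by key) are all assumptions made for illustration.

```python
import bisect
import hashlib

PREFIX_LEN = 4  # assumed prefix width; the thread does not say how many hash bytes are used


def row_key(url: str) -> bytes:
    """Build a row key: fixed-width hash prefix followed by the full URL."""
    raw = url.encode("utf-8")
    prefix = hashlib.sha256(raw).digest()[:PREFIX_LEN]
    return prefix + raw


class SortedTable:
    """Tiny in-memory stand-in for a table whose rows are sorted by key bytes."""

    def __init__(self):
        self._keys = []   # sorted list of row keys
        self._rows = {}   # row key -> value

    def put(self, key: bytes, value) -> None:
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def scan_prefix(self, prefix: bytes):
        """Return every (key, value) pair whose key starts with prefix."""
        start = bisect.bisect_left(self._keys, prefix)
        found = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break  # keys are sorted, so matching rows are contiguous
            found.append((key, self._rows[key]))
        return found


def find_url(table: SortedTable, url: str):
    """Look up a URL with a short prefix scan, resolving hash collisions."""
    key = row_key(url)
    # The scan usually returns 1 row; a prefix collision returns a few more.
    for candidate, value in table.scan_prefix(key[:PREFIX_LEN]):
        if candidate == key:  # the URL stored in the key disambiguates
            return value
    return None


table = SortedTable()
table.put(row_key("http://example.com/a"), "crawled")
table.put(row_key("http://example.com/b"), "to crawl")
print(find_url(table, "http://example.com/a"))
```

Because the hash prefix is derived from the URL, any client can recompute it and go straight to the right key range, while the prefix also spreads URLs from the same host across the key space. Keeping the URL in the key is what makes a prefix collision harmless.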