HBase user mailing list: use hbase as distributed crawl's scheduler


Re: use hbase as distributed crawl's scheduler
A couple of notes:
1. When you update the status you essentially add a new row key into HBase, so I
would give that approach up altogether. The essential requirement seems to be
retrieving a list of URLs in a certain order (see the sketch after these notes).
2. Wouldn't salting ruin the required sort order (priority, then date added)?
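
To make note 1 concrete, here is a minimal sketch with the plain HBase Java client:
because status sits in the row key, an "update" of the status is really a Put under
a new key plus a Delete of the old one. The table name, the column family f, and the
buildRowKey helper are illustrative assumptions, not part of the original design.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StatusUpdateSketch {

    // Hypothetical helper: serializes (status, priority, addedTime, url) into a row key.
    static byte[] buildRowKey(int status, int priority, long addedTime, String url) {
        return Bytes.add(new byte[] { (byte) status },
                Bytes.add(Bytes.toBytes(priority), Bytes.toBytes(addedTime)),
                Bytes.toBytes(url));
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("url_db"))) {
            int priority = 5;
            long addedTime = System.currentTimeMillis();
            String url = "http://example.com/";

            // "Updating" status 0 -> 1 means writing a brand-new row under a new key...
            Put put = new Put(buildRowKey(1, priority, addedTime, url));
            put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("url"), Bytes.toBytes(url));
            table.put(put);

            // ...and deleting the old row; HBase cannot change a row key in place.
            table.delete(new Delete(buildRowKey(0, priority, addedTime, url)));
        }
    }
}

This double write is the overhead the note suggests avoiding by focusing the design
on ordered retrieval instead.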

On Friday, January 3, 2014, James Taylor wrote:

> Sure, no problem. One addition: depending on the cardinality of your
> priority column, you may want to salt your table to prevent hotspotting,
> since you'll have a monotonically increasing date in the key. To do that,
> just add " SALT_BUCKETS=<n>" on to your query, where <n> is the number of
> machines in your cluster. You can read more about salting here:
> http://phoenix.incubator.apache.org/salted.html
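
As a rough sketch of where that option goes: SALT_BUCKETS is a table option on the
CREATE TABLE statement itself. In the example below the ZooKeeper quorum in the JDBC
URL and the bucket count of 8 are placeholders for your cluster.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateSaltedUrlDb {
    public static void main(String[] args) throws Exception {
        // Phoenix JDBC URL; "localhost" stands in for the real ZooKeeper quorum.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
             Statement stmt = conn.createStatement()) {
            // Same url_db table as below, with SALT_BUCKETS appended so a salt byte
            // spreads the monotonically increasing keys across region servers.
            // 8 is a placeholder for the number of machines in the cluster.
            stmt.execute(
                "CREATE TABLE url_db (" +
                "  status TINYINT," +
                "  priority INTEGER NOT NULL," +
                "  added_time DATE," +
                "  url VARCHAR NOT NULL" +
                "  CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url))" +
                " SALT_BUCKETS = 8");
        }
    }
}

Phoenix prepends a salt byte derived from the rest of the key; a query with an
explicit ORDER BY, like the paging query further down, is still merged back into
sorted order on the client.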
>
>
> On Thu, Jan 2, 2014 at 11:36 PM, Li Li <[EMAIL PROTECTED]> wrote:
>
> > thank you. it's great.
> >
> > On Fri, Jan 3, 2014 at 3:15 PM, James Taylor <[EMAIL PROTECTED]>
> > wrote:
> > > Hi LiLi,
> > > Have a look at Phoenix (http://phoenix.incubator.apache.org/). It's a SQL
> > > skin on top of HBase. You can model your schema and issue your queries
> > > just like you would with MySQL. Something like this:
> > >
> > > // Create table that optimizes for your most common query
> > > // (i.e. the PRIMARY KEY constraint should be ordered as you'd want
> > > // your rows ordered)
> > > CREATE TABLE url_db (
> > >     status TINYINT,
> > >     priority INTEGER NOT NULL,
> > >     added_time DATE,
> > >     url VARCHAR NOT NULL
> > >     CONSTRAINT pk PRIMARY KEY (status, priority, added_time, url));
> > >
> > > int lastStatus = 0;
> > > int lastPriority = 0;
> > > Date lastAddedTime = new Date(0);
> > > String lastUrl = "";
> > >
> > > while (true) {
> > >     // Use row value constructor to page through results in batches of 1000
> > >     String query = "
> > >         SELECT * FROM url_db
> > >         WHERE status=0 AND (status, priority, added_time, url) > (?, ?, ?, ?)
> > >         ORDER BY status, priority, added_time, url
> > >         LIMIT 1000"
> > >     PreparedStatement stmt = connection.prepareStatement(query);
> > >
> > >     // Bind parameters
> > >     stmt.setInt(1, lastStatus);
> > >     stmt.setInt(2, lastPriority);
> > >     stmt.setDate(3, lastAddedTime);
> > >     stmt.setString(4, lastUrl);
> > >     ResultSet resultSet = stmt.executeQuery();
> > >
> > >     while (resultSet.next()) {
> > >         // Remember last row processed so that you can start after that for next batch
> > >         lastStatus = resultSet.getInt(1);
> > >         lastPriority = resultSet.getInt(2);
> > >         lastAddedTime = resultSet.getDate(3);
> > >         lastUrl = resultSet.getString(4);
> > >
> > >         doSomethingWithUrls();
> > >
> > >         UPSERT INTO url_db(status, priority, added_time, url)
> > >         VALUES (1, ?, CURRENT_DATE(), ?);
> > >
> > >     }
> > > }
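
A more fleshed-out version of that paging loop, again assuming the Phoenix JDBC
driver, could look roughly like the sketch below. The connection URL, the batch size
of 1000, and the doSomethingWithUrl callback are placeholders, and the UPSERTs are
committed explicitly because Phoenix connections do not auto-commit by default.

import java.sql.Connection;
import java.sql.Date;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CrawlQueuePager {

    public static void main(String[] args) throws Exception {
        Connection connection = DriverManager.getConnection("jdbc:phoenix:localhost");

        int lastStatus = 0;
        int lastPriority = 0;
        Date lastAddedTime = new Date(0);
        String lastUrl = "";

        while (true) {
            // Page through unfetched urls in key order using a row value constructor.
            PreparedStatement select = connection.prepareStatement(
                "SELECT status, priority, added_time, url FROM url_db" +
                " WHERE status = 0 AND (status, priority, added_time, url) > (?, ?, ?, ?)" +
                " ORDER BY status, priority, added_time, url" +
                " LIMIT 1000");
            select.setInt(1, lastStatus);
            select.setInt(2, lastPriority);
            select.setDate(3, lastAddedTime);
            select.setString(4, lastUrl);

            ResultSet rs = select.executeQuery();
            boolean gotRows = false;
            while (rs.next()) {
                gotRows = true;
                // Remember the last row so the next batch starts right after it.
                lastStatus = rs.getInt(1);
                lastPriority = rs.getInt(2);
                lastAddedTime = rs.getDate(3);
                lastUrl = rs.getString(4);

                doSomethingWithUrl(lastUrl);  // placeholder for the actual crawl work

                // Record the url as fetched by writing a row with status = 1.
                PreparedStatement upsert = connection.prepareStatement(
                    "UPSERT INTO url_db (status, priority, added_time, url)" +
                    " VALUES (1, ?, CURRENT_DATE(), ?)");
                upsert.setInt(1, lastPriority);
                upsert.setString(2, lastUrl);
                upsert.executeUpdate();
            }
            connection.commit();  // flush the batch of upserts to HBase

            if (!gotRows) {
                break;  // nothing left with status = 0
            }
        }
        connection.close();
    }

    static void doSomethingWithUrl(String url) {
        // placeholder: fetch, parse, schedule, etc.
    }
}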
> > >
> > > If you need to efficiently query on url, add a secondary index like this:
> > >
> > > CREATE INDEX url_index ON url_db (url);
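
Once that index exists, a point lookup by url like the sketch below can be served
from the index, since a Phoenix global index also carries the data table's primary
key columns; the connection URL and the literal URL are just examples.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class LookupByUrl {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
            // Only primary key columns are selected, so the query can be answered
            // entirely from the url_index table.
            PreparedStatement ps = conn.prepareStatement(
                "SELECT status, priority, added_time FROM url_db WHERE url = ?");
            ps.setString(1, "http://example.com/");
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getInt(1) + " " + rs.getInt(2) + " " + rs.getDate(3));
            }
        }
    }
}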
> > >
> > > Please let me know if you have questions.
> > >
> > > Thanks,
> > > James
> > >
> > >
> > >
> > >
> > > On Thu, Jan 2, 2014 at 10:22 PM, Li Li <[EMAIL PROTECTED]> wrote:
> > >
> > >> Thank you, but I can't use Nutch. Could you tell me how HBase is used
> > >> in Nutch? Or is HBase only used to store web pages?
> > >>
> > >> On Fri, Jan 3, 2014 at 2:17 PM, Otis Gospodnetic
> > >> <[EMAIL PROTECTED]> wrote:
> > >> > Hi,
> > >> >
> > >> > Have a look at http://nutch.apache.org . Version 2.x uses HBase under
> > >> > the hood.
> > >> >
> > >> > Otis
> > >> > --
> > >> > Performance Monitoring * Log Analytics * Search Analytics
> > >> > Solr & Elasticsearch Support * http://sematext.com/
> > >> >
> > >> >
> > >> > On Fri, Jan 3, 2014 at 1:12 AM, Li Li <