We have a crawl table and here are a couple quick thoughts:
- I'd suggest that you use reverse url as your primary key. Specifically,
reversed host name but normal path and query string.
- Rather than maintaining separate rows for the same url using timestamp or
similar, I'd recommend that you use hbase versions if you want to maintain
multiple version of the same url.
On Wed, Feb 22, 2012 at 7:32 AM, Ian Varley <[EMAIL PROTECTED]> wrote:
> HBase doesn't have the concept of a globally unique auto-incrementing "ID"
> column; that would require that all PUTs to any region of a table first go
> through some central ID authority to get a unique ID, and that sort of goes
> against the general HBase approach (in which operations on regions are
> totally independent of each other, for unbounded horizontal scalability).
> That said, there are a couple ways to achieve what it seems like you want:
> - You could create a natural compound row key composed of (for example) a
> hash of the URL plus a timestamp. That way, you would be guaranteed that
> two crawls of the same URL appear as different rows (assuming they can't
> happen at the same millisecond).
> - You could alternately use a UUID of some sort as the row key, but the
> advantage of using URL_hash + timestamp is that you can find all the rows
> for a particular URL just by knowing the URL; you don't need any external
> - You could also "roll your own" global ID creation counter in HBase
> using a table with a counter in it, and use the atomic increment function
> to get unique values. That would still serialize all PUT operations, but it
> would be done in your code (not automatically in HBase).
> Remember that HBase doesn't have any secondary indexes, like the 3 you've
> added below. If you want to be able to access the data in HBase by these
> fields, you must either write it in that order according to the row key, or
> else manually write the information, denormalized, into "index-like" tables
> that you maintain yourself (noting that there's no transactional protection
> on this operation like in a relational database, so you must account for
> more failure scenarios). These are reminders that unless your data size is
> so massive that a relational database simply can't accommodate it, you're
> likely giving yourself more problems by using HBase rather than an RDBMS.
> Also: you might see the O'Reilly book, "HBase: The Definitive Guide" by
> the esteemed Mr. Lars George; in it, he uses a running example of a URL
> shortener application that might give you some ideas about your use case.
> On Feb 21, 2012, at 11:33 PM, Adarsh Sharma wrote:
> Dear all,
> After some R n D on schema design in hbase. I am confused how to design
> corresponding schema of a table in mysql.
> CREATE TABLE `page_content` (
> `crawled_page_id` bigint(20) NOT NULL DEFAULT '0' 'unique value for
> each url,
> `link_level` tinyint(4) DEFAULT NULL,
> `isprocessable` tinyint(2) NOT NULL DEFAULT '1',
> `isvalid` tinyint(4) NOT NULL DEFAULT '1',
> `isanalyzed` tinyint(4) NOT NULL DEFAULT '0' COMMENT ,
> `islocked` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'set 1 when the
> records are in analyzing phase',
> `content_language` varchar(10) DEFAULT NULL,
> `url_id` bigint(20) NOT NULL,
> `publishing_date` varchar(40) DEFAULT NULL,
> `heading` varchar(150) DEFAULT NULL,
> `category` varchar(150) DEFAULT NULL,
> `crawled_page_url` varchar(500) NOT NULL,
> `keywords` varchar(500) DEFAULT NULL,
> `dt_stamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
> `content` mediumtext,
> PRIMARY KEY (`crawled_page_id`),
> KEY `idx_url` (`crawled_page_url`),
> KEY `idx_head` (`heading`),
> KEY `idx_dtstamp` (`dt_stamp`)
> ) ENGINE=InnoDB DEFAULT CHARSET=latin1
> In all the examples , I find the reverse url is the row key in hbase but
> in mysql i create an auto increment column that uniquly locate a document.
> Can anyone suggest what is the corresponding table in hbase.