HBase doesn't have the concept of a globally unique auto-incrementing "ID" column. That would require every PUT to any region of a table to first go through some central ID authority to get a unique ID, which goes against the general HBase approach (in which operations on regions are totally independent of each other, for unbounded horizontal scalability).
That said, there are a couple of ways to achieve what it seems like you want:
- You could create a natural compound row key composed of (for example) a hash of the URL plus a timestamp. That way, you would be guaranteed that two crawls of the same URL appear as different rows (assuming they can't happen at the same millisecond).
- You could alternately use a UUID of some sort as the row key, but the advantage of using URL_hash + timestamp is that you can find all the rows for a particular URL just by knowing the URL; you don't need any external index.
- You could also "roll your own" global ID counter in HBase using a table with a counter cell in it, and use the atomic increment function to get unique values. That would still serialize all PUT operations, because every writer has to increment the same counter row before it can write, but the serialization would be done in your code (not automatically in HBase).
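To make the first option concrete, here is a minimal sketch of how such a compound key could be laid out as bytes. The function names, the choice of MD5, and the 8-byte big-endian timestamp are my assumptions for illustration, not anything HBase prescribes; HBase only sees an opaque byte array and sorts rows by it.

```python
import hashlib
import struct

HASH_LEN = 16  # MD5 digest length in bytes (assumed hash choice)

def url_prefix(url: str) -> bytes:
    """Fixed-width hash of the URL; every crawl of this URL shares this prefix."""
    return hashlib.md5(url.encode("utf-8")).digest()

def make_row_key(url: str, crawl_millis: int) -> bytes:
    """Row key = hash(URL) followed by the crawl time in milliseconds.

    The timestamp is packed as a big-endian unsigned 64-bit integer so that
    byte-wise (lexicographic) ordering matches chronological ordering.
    """
    return url_prefix(url) + struct.pack(">Q", crawl_millis)

def split_row_key(row_key: bytes):
    """Recover (url_hash, crawl_millis) from a key built by make_row_key."""
    (millis,) = struct.unpack(">Q", row_key[HASH_LEN:])
    return row_key[:HASH_LEN], millis
```

With this layout, all crawls of one URL sit next to each other in the table, so a scan over keys that start with url_prefix(url) returns them in time order without any external index.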
Remember that HBase doesn't have any secondary indexes, like the 3 you've added below. If you want to be able to access the data in HBase by these fields, you must either build the row key so that rows sort by those fields, or else manually write the information, denormalized, into "index-like" tables that you maintain yourself. Note that there's no transactional protection on that second write like there is in a relational database, so you must account for more failure scenarios. These are reminders that unless your data size is so massive that a relational database simply can't accommodate it, you're likely giving yourself more problems by using HBase rather than an RDBMS.
Also: you might look at the O'Reilly book, "HBase: The Definitive Guide" by the esteemed Mr. Lars George; in it, he uses a running example of a URL shortener application that might give you some ideas about your use case.
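As a rough illustration of the "index-like" table idea, one possible key layout puts the indexed value first and the main table's row key after it. The names, the NUL separator, and the scan-range helper below are assumptions of this sketch (it presumes the indexed value never contains a NUL byte), not an HBase feature.

```python
SEP = b"\x00"  # assumed separator; indexed values must not contain this byte

def index_row_key(field_value: str, main_row_key: bytes) -> bytes:
    """Key for a hand-maintained index table: value, separator, main row key.

    Appending the main row key keeps index rows unique when several main
    rows share the same field value (e.g. the same heading).
    """
    return field_value.encode("utf-8") + SEP + main_row_key

def index_scan_range(field_value: str):
    """(start, stop) byte range covering every index row for an exact value.

    start is inclusive and stop is exclusive, which is how a range scan over
    sorted keys is usually expressed.
    """
    encoded = field_value.encode("utf-8")
    return encoded + SEP, encoded + b"\x01"
```

Your code would write the main row and the index row as two separate PUTs, so a crash between them can leave one without the other; that is the failure scenario mentioned above.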
On Feb 21, 2012, at 11:33 PM, Adarsh Sharma wrote:
After some R&D on schema design in HBase, I am confused about how to design
the HBase schema corresponding to a table in MySQL.
CREATE TABLE `page_content` (
`crawled_page_id` bigint(20) NOT NULL DEFAULT '0' 'unique value for
`link_level` tinyint(4) DEFAULT NULL,
`isprocessable` tinyint(2) NOT NULL DEFAULT '1',
`isvalid` tinyint(4) NOT NULL DEFAULT '1',
`isanalyzed` tinyint(4) NOT NULL DEFAULT '0' COMMENT ,
`islocked` tinyint(4) NOT NULL DEFAULT '0' COMMENT 'set 1 when the records are in analyzing phase',
`content_language` varchar(10) DEFAULT NULL,
`url_id` bigint(20) NOT NULL,
`publishing_date` varchar(40) DEFAULT NULL,
`heading` varchar(150) DEFAULT NULL,
`category` varchar(150) DEFAULT NULL,
`crawled_page_url` varchar(500) NOT NULL,
`keywords` varchar(500) DEFAULT NULL,
`dt_stamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (`crawled_page_id`),
KEY `idx_url` (`crawled_page_url`),
KEY `idx_head` (`heading`),
KEY `idx_dtstamp` (`dt_stamp`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
In all the examples, I find that the reversed URL is the row key in HBase, but
in MySQL I create an auto-increment column that uniquely locates a document.
Can anyone suggest what the corresponding table in HBase would be?