|
|
-
Re: RDBMS to HBASE schema migrationDoug Meil 2012-06-28, 20:54
Hi there- I commend your enthusiasm for the Hbase project. For the "ground rules of Hbase" you probably want to read this closelyŠ http://hbase.apache.org/book.html#datamodel Š as it covers things like having one PK per table, no secondary indexes, etc. With a solid understanding of these rules the table relationship and search questions you're asking will become clearer. On 6/28/12 3:01 PM, "grashmi13" <[EMAIL PROTECTED]> wrote: > >Assets table is having numeric sequential ID and a one number out of >(1,2,3,4,5,6,7, 8, 9, 10) for AssetName. This is a master table with say >10 >rows only. > > >hmmm.. after some more surfing, i came to know that we have to manually >denormalize a relational DB. there are no preset rules for >denormalization. > >today I was doing a denormalization.. can you please confirm if rdbms to >hbase change i have done is correct or not: >These tables are from stock trading application: > >1. I have two tables... >Assets >========= >AssetID | assetName > > >second table is: > >INSTRUMENTS >============ >InstrumentID | assetID | Symbol | name | created | modifiedAt > > >While converting them into Hbase two questions were there in my mind: >1. Will there be two tables or one? If we have a web GUI and have two >screens, one for assets management and one for instuments management. In >instrument management, there would be a combo having assets. An asset can >be >assigned to multiple instruments. > >So if user will first populate all assets thru assets screen. So there >should be one master table for Assets. because it is not dependent on any >other entity. Am I right here? > >And now user will populate instruments thru instrument management screen. >Here another table would come into picture. "INSTRUMENTS". this table >would >be denormalized, as hbase doesnt support JOIN. Also, to maintain ACID >properties, all assets information should be duplicated here too. Now >question here, if I delete or update some asset in future, and there are >corresponding records in INSTRUMENTS table. what if asset is deletion or >updation goes successful for assets table but fails for instruments table? >How we can maintain consistancy and integrity here? > > >HBASE table conversion here: > >ASSETS >=================>row key - AssetID >================ >Columns: CF1 - AssetName >===============================================> >INSTRUMENTS >====================================>row key-- instrumentID(a sequential numeric value) >===================================>InstrumentInfo(family) - >Symbol >name >AssetsInfo(family2) - >AssetID >AssetName >====================================> > >Please confirm, two tables would be required or one only? If one only, how >assets population and after that instruments population would be possible? >If two, how we can maintain integrity and consistancy during assets >deletion\updation? > >Please confirm if my approach is correct? > >Yes, as read everywhere, in hbase, we must think about purpose of data in >our application before creating tables. So purpose here is, to >populate\delete\update assets and instruments thru web UI and then run a >job >which extract instruments by asset name. Also, assets by instrument name. > >=====================================> >After above conversion, I got stuck on one more point. "SELF-JOIN" > >Lets say, I have a table InstrumentsStock having 10 columns in it, out of >it >one primary key stockID and other normal column BaseCurrencyID, both >points >to an instrumentID. > >StockID is an ID of an instrument whose assetID is 1. and BaseCurrencyID >is >ID of an instrument whose assetID is 3. > >InstrumentsStock >======================>stockID | AssetID | BaseCurrencyID | 10 more columns here >=======================================> >where StockID = InstrumentID of Instrument with AssetID 1 >and BaseCurrencyID = InstrumentID of Instrument with AssetID 3 > >StockID is primary key > > >E.g. I have below data in tables: >Instruments: |