HBase, mail # user - Timestamp as a key good practice?


Re: Timestamp as a key good practice?
Michael Segel 2012-06-14, 19:46
Ok...

Makes sense.

You don't need to worry about Coprocessors in your initial PoC. It just makes it easier instead of relying on the application managing all of the database updates.
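
For when you do get to coprocessors, a RegionObserver doing this might look
roughly like the following, against the 0.94-era API (signatures changed in
later releases). The table name, column family, and the "old_ts"/"new_ts"
Put attributes are made up for the sketch; the idea is that the client
attaches both timestamps to the Put so the observer knows which index cell
to drop.

import java.io.IOException;

import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class LastUpdateIndexObserver extends BaseRegionObserver {
  private static final byte[] INDEX_TABLE = Bytes.toBytes("file_index");
  private static final byte[] CF = Bytes.toBytes("f");

  @Override
  public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    byte[] fileId = put.getRow();               // the file's row key
    byte[] oldTs = put.getAttribute("old_ts");  // previous last_update, if any
    byte[] newTs = put.getAttribute("new_ts");  // new last_update
    if (newTs == null) {
      return;                                   // nothing to index
    }
    HTableInterface index = ctx.getEnvironment().getTable(INDEX_TABLE);
    try {
      if (oldTs != null) {
        Delete d = new Delete(oldTs);           // drop the stale index cell
        d.deleteColumns(CF, fileId);
        index.delete(d);
      }
      Put p = new Put(newTs);                   // file ID becomes a column
      p.add(CF, fileId, new byte[0]);           // under the new timestamp row
      index.put(p);
    } finally {
      index.close();
    }
  }
}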

A billion rows shouldn't be a problem for an RDBMS but that's a different issue.

To start with, you update the base table; your app then deletes the old column in the index and inserts the column value at the new timestamp.

Note the following: you may want to simplify the timestamp by rounding up to the nearest second rather than going down to the millisecond. This would give you more columns per row.
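
In code, against the 0.94-era client API, one update pass would look
something like this (table, family, and column names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FileIndexUpdater {
  private static final byte[] CF = Bytes.toBytes("f");

  // Round the millisecond timestamp to second granularity so one
  // index row collects a full second's worth of file IDs as columns.
  static byte[] toSecondKey(long timestampMs) {
    return Bytes.toBytes((timestampMs / 1000L) * 1000L);
  }

  public static void touchFile(byte[] fileId, long oldTsMs, long newTsMs)
      throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable base = new HTable(conf, "files");
    HTable index = new HTable(conf, "file_index");
    try {
      // 1. Update the base table row for this file.
      Put basePut = new Put(fileId);
      basePut.add(CF, Bytes.toBytes("last_update"), Bytes.toBytes(newTsMs));
      base.put(basePut);

      // 2. Delete the old column in the index...
      Delete d = new Delete(toSecondKey(oldTsMs));
      d.deleteColumns(CF, fileId);
      index.delete(d);

      // 3. ...and insert the column under the new timestamp row.
      Put idxPut = new Put(toSecondKey(newTsMs));
      idxPut.add(CF, fileId, new byte[0]);
      index.put(idxPut);
    } finally {
      base.close();
      index.close();
    }
  }
}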

On Jun 14, 2012, at 1:47 PM, Jean-Marc Spaggiari wrote:

> Hi Michael,
>
> For now this is more a proof of concept than a production application.
> And if it's working, it should grow a lot, and the database at the
> end will easily be over 1B rows. Each individual server will have to
> send its own information to one centralized server, which will insert
> that into a database. That's why it needs to be very quick, and that's
> why I'm looking in HBase's direction. I tried with some relational
> databases with 4M rows in the table, but the insert time is too slow
> when I have to introduce entries in bulk. Also, the ability for HBase
> to keep only the cells with values will allow me to save a lot of
> disk space (future projects).
>
> I'm not yet used to HBase and there are still many things I need to
> understand, but until I'm able to create a solution and test it, I will
> continue to read, learn and try that way. Then at the end I will be
> able to compare the two options I have (HBase or relational) and decide
> based on the results.
>
> So yes, your reply helped because it gives me a way to achieve this
> goal (using co-processors). I don't know yet how this part works,
> so I will dig into the documentation for it.
>
> Thanks,
>
> JM
>
> 2012/6/14, Michael Segel <[EMAIL PROTECTED]>:
>> Jean-Marc,
>>
>> You do realize that this really isn't a good use case for HBase, assuming
>> that what you are describing is a stand-alone system.
>> It would be easier and better if you just used a simple relational database.
>>
>> Then you would have your table with an ID, and a secondary index on the
>> timestamp.
>> Retrieve the data in ascending order by timestamp and take the top 500 off
>> the list.
>>
>> If you insist on using HBase, yes, you will need a secondary table.
>> Then using co-processors...
>> When you update the row in your base table, you
>> then get() the row in your index by timestamp, removing the column for that
>> rowid.
>> Add the new column to the timestamp row.
>>
>> As you put it.
>>
>> Now you can just do a partial scan on your index. Because your index table
>> is so small... you shouldn't worry about hotspots.
>> You may just want to rebuild your index every so often...
>>
>> HTH
>>
>> -Mike
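
For reference, the partial scan Mike describes above might look like the
following, assuming the index layout sketched earlier (one row per
timestamp, one column per file ID); the 500 cap matches his "top 500"
example, and the table and family names are placeholders.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

// Scan the index from the lowest (oldest) timestamp row upward and
// collect file IDs until we have 500 of them.
public class OldestFiles {
  public static List<byte[]> oldest500() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable index = new HTable(conf, "file_index");
    List<byte[]> fileIds = new ArrayList<byte[]>();
    ResultScanner scanner = index.getScanner(new Scan());
    try {
      for (Result row : scanner) {
        // Each column qualifier in an index row is a file ID.
        for (KeyValue kv : row.raw()) {
          fileIds.add(kv.getQualifier());
          if (fileIds.size() >= 500) {
            return fileIds;
          }
        }
      }
      return fileIds;
    } finally {
      scanner.close();
      index.close();
    }
  }
}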
>>
>> On Jun 14, 2012, at 7:22 AM, Jean-Marc Spaggiari wrote:
>>
>>> Hi Michael,
>>>
>>> Thanks for your feedback. Here are more details to describe what I'm
>>> trying to achieve.
>>>
>>> My goal is to store information about files into the database. I need
>>> to check the oldest files in the database to refresh the information.
>>>
>>> The key is an 8-byte ID of the server in the network hosting the
>>> file + the MD5 of the file path. Total is a 24-byte key.
>>>
>>> So each time I look at a file and gather its information, I update its
>>> row in the database based on the key, including a "last_update" field.
>>> I can calculate this key for any file on the drives.
>>>
>>> In order to know which files I need to check in the network, I need to
>>> scan the table by the "last_update" field. So the idea is to build another
>>> table which contains the last_update as the key and the file IDs as
>>> columns. (Here is the hotspotting.)
>>>
>>> Each time I work on a file, I will have to update the main table by ID
>>> and remove the cell from the second table (the index) and put it back
>>> with the new "last_update" key.
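
For completeness, the 24-byte key JM describes (an 8-byte server ID plus
the 16-byte MD5 of the file path) could be built like this; the class and
method names are made up for the sketch.

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

// Build the 24-byte row key: an 8-byte server ID followed by the
// 16-byte MD5 digest of the file path.
public class FileKey {
  public static byte[] rowKey(long serverId, String filePath) throws Exception {
    byte[] md5 = MessageDigest.getInstance("MD5")
        .digest(Bytes.toBytes(filePath));      // 16 bytes
    return Bytes.add(Bytes.toBytes(serverId),  // 8 bytes
        md5);                                  // total: 24 bytes
  }
}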