HBase >> mail # user >> Unexpected Data insertion time and Data size explosion

Re: Unexpected Data insertion time and Data size explosion
May I ask whether you pre-split your table before loading?
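
For context: a new HBase table starts with a single region unless split keys are supplied at creation time, so an unsplit table initially sends every write to one region server. The sketch below shows one way to pick evenly spaced split points. It assumes, hypothetically, that row IDs are zero-padded decimal strings of fixed width; the original message does not state the key format, and the function name, the width of 10, and the choice of 12 regions are illustrative only.

```python
def split_points(max_row_id, num_regions, width):
    """Return num_regions - 1 split keys dividing [0, max_row_id) evenly.

    Keys are zero-padded so that their lexicographic (byte) order, which is
    what HBase uses, matches their numeric order.
    """
    if num_regions < 2:
        return []
    # Integer arithmetic avoids floating-point rounding on large row counts.
    return [
        str(max_row_id * i // num_regions).zfill(width)
        for i in range(1, num_regions)
    ]


# Hypothetical values: 1.3 billion rows, 12 regions, 10-character keys.
keys = split_points(1_300_000_000, 12, 10)
for key in keys:
    print(key)
```

Keys like these would be passed, as byte arrays, in the split-keys argument when the table is created (for example via `HBaseAdmin.createTable` in the Java client). If the real row IDs are not zero-padded, the split points would need to follow the lexicographic order of the actual keys instead.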

On Dec 4, 2011, at 6:19 AM, kranthi reddy <[EMAIL PROTECTED]> wrote:

> Hi all,
>    I am a newbie to HBase and Hadoop. I have set up a cluster of 4 machines
> and am trying to insert data. 3 of the machines are tasktrackers, with 4
> map tasks each.
>    My data consists of about 1.3 billion rows with 4 columns each (a 100GB
> txt file). The column structure is "rowID, word1, word2, word3". My DFS
> replication in Hadoop and HBase is set to 3 each. I have defined only one
> column family, with 3 qualifiers, one for each word* field.
>    I am using the SampleUploader present in the HBase distribution. It has
> taken around 21 hrs to complete 40% of the insertion, and it is still
> running. I have 12 map tasks running. I would like to know whether this
> insertion time is in line with expectations, because when I used Lucene
> I was able to insert the entire data set in about 8 hours.
>    Also, there seems to be a huge explosion in data size here. With a
> replication factor of 3 for HBase, I was expecting the inserted table to
> take around 350-400GB (300GB for replicating my 100GB txt file 3 times,
> plus 50+ GB for additional storage information). But even at 40%
> completion of the insertion, the space occupied is around 550GB (it looks
> like it might take around 1.2TB for a 100GB file). I have used a String
> for the rowID instead of a Long. Would that account for such a rapid
> increase in data storage?
>
> Regards,
> Kranthi