HBase >> mail # user >> Unexpected Data insertion time and Data size explosion


Re: Unexpected Data insertion time and Data size explosion
May I ask whether you pre-split your table before loading?
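
Some background on why this matters: a newly created HBase table starts with a single region, so until that region splits, every write from all the map tasks lands on one region server. Pre-splitting creates the table with several regions up front. The sketch below only computes candidate split keys. It assumes the rowIDs are zero-padded 10-digit decimal strings spread roughly evenly between 0 and 1.3 billion, which the original post does not confirm; the class name, method name, and region count of 13 are hypothetical. The resulting `byte[][]` would then be handed to `HBaseAdmin.createTable(HTableDescriptor, byte[][])`.

```java
import java.nio.charset.StandardCharsets;

public class PreSplitSketch {

    /**
     * Returns numRegions - 1 evenly spaced split keys for row IDs that are
     * zero-padded decimal strings in the range [0, maxId).
     * ASSUMPTION: the key format and the uniform distribution are guesses,
     * not details taken from the original post.
     */
    static byte[][] splitKeys(long maxId, int numRegions, int width) {
        byte[][] keys = new byte[numRegions - 1][];
        long step = maxId / numRegions;
        for (int i = 1; i < numRegions; i++) {
            // Zero-padding keeps lexicographic order equal to numeric order.
            String key = String.format("%0" + width + "d", step * i);
            keys[i - 1] = key.getBytes(StandardCharsets.UTF_8);
        }
        return keys;
    }

    public static void main(String[] args) {
        // 1.3 billion rows, 13 regions (hypothetical), 10-digit keys.
        for (byte[] key : splitKeys(1_300_000_000L, 13, 10)) {
            System.out.println(new String(key, StandardCharsets.UTF_8));
        }
        // The array would be passed as the splitKeys argument of
        // HBaseAdmin.createTable(HTableDescriptor, byte[][]).
    }
}
```

If the real rowIDs are not padded or not evenly distributed, the boundaries are better taken from a sample of the actual keys. Note also that if the input file is sorted by rowID, writes will still tend to hit one region at a time even with a pre-split table.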

On Dec 4, 2011, at 6:19 AM, kranthi reddy <[EMAIL PROTECTED]> wrote:

> Hi all,
>
>    I am new to HBase and Hadoop. I have set up a cluster of 4 machines
> and am trying to insert data. Three of the machines are tasktrackers, with
> 4 map tasks each.
>
>    My data consists of about 1.3 billion rows with 4 columns each (a 100GB
> text file). The column structure is "rowID, word1, word2, word3". DFS
> replication is set to 3 in both Hadoop and HBase. The table has a single
> column family with 3 qualifiers, one per field (word*).
>
>    I am using the SampleUploader that ships with the HBase distribution.
> It has taken around 21 hours to complete 40% of the insertion, and the job
> is still running with 12 map tasks. Is this insertion time in line with
> what I should expect? When I used Lucene, I was able to insert the entire
> data set in about 8 hours.
>
>    Also, there seems to be a huge explosion in data size. With a
> replication factor of 3 for HBase, I was expecting the table to occupy
> around 350-400GB for my 100GB text file: 300GB for storing the data 3
> times, plus 50GB or more for additional storage information. But at only
> 40% completion the space occupied is already around 550GB, so it looks
> like the full load might take around 1.2TB. I have used a String for the
> rowID instead of a Long. Would that account for such a rapid increase in
> data storage?
>
> Regards,
> Kranthi