HBase >> mail # user >> Block size of HBase files


Praveen Bysani 2013-05-13, 06:40
Amandeep Khurana 2013-05-13, 07:03
Praveen Bysani 2013-05-13, 09:45
Ted Yu 2013-05-13, 10:36
Praveen Bysani 2013-05-13, 11:48
Anoop John 2013-05-13, 11:54
Praveen Bysani 2013-05-13, 12:19
Anoop John 2013-05-13, 12:22
Praveen Bysani 2013-05-14, 02:23

Re: Block size of HBase files
Praveen,

How many regions are there in your table, and how many CFs?
Under /hbase/<table-name> you will see many files and directories. There is
a .tableinfo file, every region has a .regioninfo file, and the data files
(HFiles) sit under each CF directory. Your total data is about 250GB. If
your block size is 1GB and you had a single 250GB file, then the ~250
blocks you expected would make sense. But that is not how HBase stores
data.

HFiles are created per CF per region. Also, as data comes in (writes),
HBase flushes the memstore to HDFS as a new file once it reaches the flush
size, 128MB by default. In your case each such file is smaller than the 1GB
block size, so it occupies a single HDFS block. Later these smaller files
get merged into bigger ones (compaction). Had any major compactions run by
the time you checked? A major compaction merges all files under a CF within
a region into one HFile. So if you have 100 regions and 2 CFs in the table,
you will have 200 HFiles after a major compaction. (Remember that under
/hbase/<table-name> you will also see some files other than the HFiles.)
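
The counting above can be sketched in a few lines of Python. This is only
an illustration under the assumptions stated in this mail (one HFile per CF
per region after a major compaction, a 128MB flush size, a 1GB
dfs.block.size). The function names are made up for the example and are
not part of any HBase or HDFS API.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def hfiles_after_major_compaction(regions: int, column_families: int) -> int:
    """One HFile per column family per region once a major compaction has run."""
    return regions * column_families


def hdfs_blocks_for_file(file_size: int, block_size: int) -> int:
    """Number of HDFS blocks a file of file_size bytes occupies (ceiling division)."""
    return -(-file_size // block_size)


# The example from this mail: 100 regions and 2 CFs.
print(hfiles_after_major_compaction(100, 2))   # 200

# A freshly flushed 128MB file with a 1GB block size fits in one block.
print(hdfs_blocks_for_file(128 * MB, 1 * GB))  # 1
```

So a large dfs.block.size only reduces the block count for files that are
themselves larger than one block.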

The file count and average block size displayed below show this (it is why
you have so many blocks): 1459 files and 1459 blocks, so each file occupies
exactly one block.
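
As a rough check, the arithmetic behind the fsck numbers quoted below can
be reproduced in Python. The three totals are copied from that output.
Treating the 1GB dfs.block.size as 1024^3 bytes is an assumption about how
it was configured.

```python
# Values copied from the fsck output quoted further down in this mail.
total_size = 265_727_504_820   # "Total size", in bytes
total_files = 1459             # "Total files"
total_blocks = 1459            # "Total blocks (validated)"

block_size = 1024 ** 3         # assumed: dfs.block.size of 1GB

# Average block size as fsck reports it (integer bytes).
avg_block_size = total_size // total_blocks

# One block per file on average: no file spans more than one block.
blocks_per_file = total_blocks / total_files

# Block count if all the data lived in one single file.
blocks_if_single_file = -(-total_size // block_size)  # ceiling division

print(avg_block_size)         # 182129886
print(blocks_per_file)        # 1.0
print(blocks_if_single_file)  # 248
```

The average file is about 182MB, well under the 1GB block size, which is
why the block count equals the file count rather than the ~250 expected
for one large file.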

The HFile size Amandeep was referring to is the maximum size for an HFile
(and thus for a region), which is configured with
hbase.hregion.max.filesize. If you keep writing data to a region and its
size crosses this maximum, HBase will split that region into two.

Can you try checking the file count and block count after running a major
compaction?

What MR job are you trying to run with HBase? Also, why do you run MR
directly on the HFiles? When you run an MR job over HBase (like a Scan
using MR), it is not the number of files or blocks that decides the number
of mappers. It is based on the number of regions in the table.
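
To make the difference concrete, here is a deliberately simplified Python
sketch. It assumes a job that scans the table through HBase gets one map
task per region, as described above, while a job reading plain HDFS files
gets roughly one map task per block. The function, its parameters, and the
figure of 100 regions are hypothetical and only for illustration.

```python
def estimated_map_tasks(regions: int, hdfs_blocks: int, scan_via_hbase: bool) -> int:
    """Rough map-task estimate for the two ways of reading the data.

    scan_via_hbase=True  -> one map task per region (files and blocks ignored)
    scan_via_hbase=False -> about one map task per HDFS block (simplification)
    """
    return regions if scan_via_hbase else hdfs_blocks


# Hypothetical table with 100 regions; 1459 blocks as in the fsck output.
print(estimated_map_tasks(regions=100, hdfs_blocks=1459, scan_via_hbase=True))   # 100
print(estimated_map_tasks(regions=100, hdfs_blocks=1459, scan_via_hbase=False))  # 1459
```

Under these assumptions, changing dfs.block.size does not change the mapper
count of a scan over HBase. Only the number of regions does.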

-Anoop-

On Mon, May 13, 2013 at 3:15 PM, Praveen Bysani <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I wanted to minimize the number of map reduce tasks generated while
> processing a job, hence I configured it to a larger value.
>
> I don't think I have configured the HFile size in the cluster. I use
> Cloudera Manager to manage my cluster, and the only configuration I can
> relate to is hfile.block.cache.size, which is set to 0.25. How do I
> change the HFile size?
>
> On 13 May 2013 15:03, Amandeep Khurana <[EMAIL PROTECTED]> wrote:
>
> > On Sun, May 12, 2013 at 11:40 PM, Praveen Bysani <[EMAIL PROTECTED]> wrote:
> >
> > > Hi,
> > >
> > > I have the dfs.block.size value set to 1 GB in my cluster
> > > configuration.
> >
> >
> > Just out of curiosity - why do you have it set at 1GB?
> >
> >
> > > I have around 250 GB of data stored in HBase over this cluster. But
> > > when I check the number of blocks, it doesn't correspond to the block
> > > size value I set. From what I understand, I should only have ~250
> > > blocks. But instead, when I did an fsck on /hbase/<table-name>, I got
> > > the following:
> > >
> > > Status: HEALTHY
> > >  Total size:    265727504820 B
> > >  Total dirs:    1682
> > >  Total files:   1459
> > >  Total blocks (validated):      1459 (avg. block size 182129886 B)
> > >  Minimally replicated blocks:   1459 (100.0 %)
> > >  Over-replicated blocks:        0 (0.0 %)
> > >  Under-replicated blocks:       0 (0.0 %)
> > >  Mis-replicated blocks:         0 (0.0 %)
> > >  Default replication factor:    3
> > >  Average block replication:     3.0
> > >  Corrupt blocks:                0
> > >  Missing replicas:              0 (0.0 %)
> > >  Number of data-nodes:          5
> > >  Number of racks:               1
> > >
> > > Are there any other configuration parameters that need to be set ?
> >
> >
> > What is your HFile size set to? The HFiles that get persisted would be
> > bound by that number. Thereafter each HFile would be split into blocks,
> > the size of which you configure using the dfs.block.size configuration
> > parameter.
> >
> >
> > >
> > > --
> > > Regards,
> > > Praveen Bysani
> > > http://www.praveenbysani.com
> > >
> >
>
>
>
> --
> Regards,
> Praveen Bysani
> http://www.praveenbysani.com
>