When you set time range on Scan, some files can get skipped based on the
max min ts values in that file. Said this, when u do major compact and do
scan based on time range, dont think u will get some advantage.
On Wed, Jun 5, 2013 at 10:11 AM, Rahul Ravindran <[EMAIL PROTECTED]> wrote:
> Our row-keys do not contain time. By time-based scans, I mean, an MR over
> the Hbase table where the scan object has no startRow or endRow but has a
> startTime and endTime.
> Our row key format is <MD5 of UUID>+UUID, so, we expect good distribution.
> We have pre-split initially to prevent any initial hotspotting.
> From: anil gupta <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; Rahul Ravindran <[EMAIL PROTECTED]>
> Sent: Tuesday, June 4, 2013 9:31 PM
> Subject: Re: Scan + Gets are disk bound
> On Tue, Jun 4, 2013 at 11:48 AM, Rahul Ravindran <[EMAIL PROTECTED]>
> >We are relatively new to Hbase, and we are hitting a roadblock on our
> scan performance. I searched through the email archives and applied a bunch
> of the recommendations there, but they did not improve much. So, I am
> hoping I am missing something which you could guide me towards. Thanks in
> >We are currently writing data and reading in an almost continuous mode
> (stream of data written into an HBase table and then we run a time-based MR
> on top of this Table). We currently were backed up and about 1.5 TB of data
> was loaded into the table and we began performing time-based scan MRs in 10
> minute time intervals(startTime and endTime interval is 10 minutes). Most
> of the 10 minute interval had about 100 GB of data to process.
> >Our workflow was to primarily eliminate duplicates from this table. We
> have maxVersions = 5 for the table. We use TableInputFormat to perform the
> time-based scan to ensure data locality. In the mapper, we check if there
> exists a previous version of the row in a time period earlier to the
> timestamp of the input row. If not, we emit that row.
> >We looked at https://issues.apache.org/jira/browse/HBASE-4683 and hence
> turned off block cache for this table with the expectation that the block
> index and bloom filter will be cached in the block cache. We expect
> duplicates to be rare and hence hope for most of these checks to be
> fulfilled by the bloom filter. Unfortunately, we notice very slow
> performance on account of being disk bound. Looking at jstack, we notice
> that most of the time, we appear to be hitting disk for the block index. We
> performed a major compaction and retried and performance improved some, but
> not by much. We are processing data at about 2 MB per second.
> > We are using CDH 4.2.1 HBase 0.94.2 and HDFS 2.0.0 running with 8
> datanodes/regionservers(each with 32 cores, 4x1TB disks and 60 GB RAM).
> Anil: You dont have the right balance between disk,cpu and ram. You have
> too much of CPU, RAM but very less NUMBER of disks. Usually, its better to
> have a Disk/Cpu_core ratio near 0.6-0.8. Your's is around 0.13. This seems
> to be the biggest reason of your problem.
> HBase is running with 30 GB Heap size, memstore values being capped at 3
> GB and flush thresholds being 0.15 and 0.2. Blockcache is at 0.5 of total
> heap size(15 GB). We are using SNAPPY for our tables.
> >A couple of questions:
> > * Is the performance of the time-based scan bad after a major
> Anil: In general, TimeBased(i am assuming you have built your rowkey on
> timestamp) scans are not good for HBase because of region hot-spotting.
> Have you tried setting the ScannerCaching to a higher number?
> > * What can we do to help alleviate being disk bound? The typical
> answer of adding more RAM does not seem to have helped, or we are missing
> some other config
> Anil: Try adding more disks to your machines.
> >Below are some of the metrics from a Regionserver webUI: