Thanks for that confirmation. This is what we hypothesized as well.
So, if we are dependent on timerange scans, we need to completely avoid major compaction and depend only on minor compactions? Is there any downside? We do have a TTL set on all the rows in the table.
From: Anoop John <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; Rahul Ravindran <[EMAIL PROTECTED]>
Cc: anil gupta <[EMAIL PROTECTED]>
Sent: Tuesday, June 4, 2013 10:44 PM
Subject: Re: Scan + Gets are disk bound
When you set time range on Scan, some files can get skipped based on the
max min ts values in that file. Said this, when u do major compact and do
scan based on time range, dont think u will get some advantage.
On Wed, Jun 5, 2013 at 10:11 AM, Rahul Ravindran <[EMAIL PROTECTED]> wrote:
> Our row-keys do not contain time. By time-based scans, I mean, an MR over
> the Hbase table where the scan object has no startRow or endRow but has a
> startTime and endTime.
> Our row key format is <MD5 of UUID>+UUID, so, we expect good distribution.
> We have pre-split initially to prevent any initial hotspotting.
> From: anil gupta <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; Rahul Ravindran <[EMAIL PROTECTED]>
> Sent: Tuesday, June 4, 2013 9:31 PM
> Subject: Re: Scan + Gets are disk bound
> On Tue, Jun 4, 2013 at 11:48 AM, Rahul Ravindran <[EMAIL PROTECTED]>
> >We are relatively new to Hbase, and we are hitting a roadblock on our
> scan performance. I searched through the email archives and applied a bunch
> of the recommendations there, but they did not improve much. So, I am
> hoping I am missing something which you could guide me towards. Thanks in
> >We are currently writing data and reading in an almost continuous mode
> (stream of data written into an HBase table and then we run a time-based MR
> on top of this Table). We currently were backed up and about 1.5 TB of data
> was loaded into the table and we began performing time-based scan MRs in 10
> minute time intervals(startTime and endTime interval is 10 minutes). Most
> of the 10 minute interval had about 100 GB of data to process.
> >Our workflow was to primarily eliminate duplicates from this table. We
> have maxVersions = 5 for the table. We use TableInputFormat to perform the
> time-based scan to ensure data locality. In the mapper, we check if there
> exists a previous version of the row in a time period earlier to the
> timestamp of the input row. If not, we emit that row.
> >We looked at https://issues.apache.org/jira/browse/HBASE-4683 and hence
> turned off block cache for this table with the expectation that the block
> index and bloom filter will be cached in the block cache. We expect
> duplicates to be rare and hence hope for most of these checks to be
> fulfilled by the bloom filter. Unfortunately, we notice very slow
> performance on account of being disk bound. Looking at jstack, we notice
> that most of the time, we appear to be hitting disk for the block index. We
> performed a major compaction and retried and performance improved some, but
> not by much. We are processing data at about 2 MB per second.
> > We are using CDH 4.2.1 HBase 0.94.2 and HDFS 2.0.0 running with 8
> datanodes/regionservers(each with 32 cores, 4x1TB disks and 60 GB RAM).
> Anil: You dont have the right balance between disk,cpu and ram. You have
> too much of CPU, RAM but very less NUMBER of disks. Usually, its better to
> have a Disk/Cpu_core ratio near 0.6-0.8. Your's is around 0.13. This seems
> to be the biggest reason of your problem.
> HBase is running with 30 GB Heap size, memstore values being capped at 3
> GB and flush thresholds being 0.15 and 0.2. Blockcache is at 0.5 of total
> heap size(15 GB). We are using SNAPPY for our tables.
> >A couple of questions:
> > * Is the performance of the time-based scan bad after a major