HBase >> mail # user >> HBase Put


Thanks. I should have read there first. :)

Thanks,
Abhishek
-----Original Message-----
From: Jason Frantz [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 22, 2012 2:05 PM
To: [EMAIL PROTECTED]
Subject: Re: HBase Put

Abhishek,

Setting your column family's bloom filter to ROWCOL will include qualifiers:

http://hbase.apache.org/book.html#schema.bloom

-Jason
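A minimal sketch of the ROWCOL setup Jason describes, assuming the 0.94-era Java admin API; the table name "wide_table" and the family "d" are made-up placeholders. For an existing table, the shell's alter command with BLOOMFILTER => 'ROWCOL' achieves the same thing.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.StoreFile;

    public class RowColBloomExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Hypothetical table/family names, used only for this sketch.
        HColumnDescriptor cf = new HColumnDescriptor("d");
        // ROWCOL blooms are keyed on row + column qualifier, so a Get that
        // names specific qualifiers can skip HFiles that cannot contain them.
        cf.setBloomFilterType(StoreFile.BloomType.ROWCOL);

        HTableDescriptor table = new HTableDescriptor("wide_table");
        table.addFamily(cf);
        admin.createTable(table);
        admin.close();
      }
    }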

On Wed, Aug 22, 2012 at 1:49 PM, Pamecha, Abhishek <[EMAIL PROTECTED]> wrote:

> Can I enable bloom filters per block at column qualifier levels too?
> That way, with small block sizes, I can selectively load only a few
> data blocks into memory. Then I can do some trade-off between block
> size and bloom filter false positive rate.
>
> I am designing for a wide-table scenario with thousands to millions
> of columns, and thus I don't really want to stress over checks for
> blocks containing more than one row key.
>
> Thanks,
> Abhishek
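Continuing the sketch above: the block-size knob being weighed against the bloom filter's false-positive rate also lives on the column family descriptor (same assumed 0.94-era API and hypothetical family "d"). The 8 KB value below is purely illustrative, not a recommendation; the default HFile block size is 64 KB.

    // Sketch: pairing a smaller block size with ROWCOL blooms on one family.
    HColumnDescriptor cf = new HColumnDescriptor("d");
    cf.setBlocksize(8 * 1024);                          // default is 64 KB
    cf.setBloomFilterType(StoreFile.BloomType.ROWCOL);  // row + qualifier blooms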
>
>
> -----Original Message-----
> From: Mohit Anchlia [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, August 22, 2012 11:09 AM
> To: [EMAIL PROTECTED]
> Subject: Re: HBase Put
>
> On Wed, Aug 22, 2012 at 10:20 AM, Pamecha, Abhishek <[EMAIL PROTECTED]>
> wrote:
>
> > So then a GET query means one needs to look in every HFile where the
> > key falls within the min/max range of the file.
> >
> > From another parallel thread, I gather that an HFile comprises blocks,
> > which, I think, are the atomic unit of persisted data in HDFS (please
> > correct me if not).
> >
> > And each block of an HFile covers a range of keys. My key may fall
> > within a block's range and yet not be present, so all the blocks whose
> > range matches my key will need to be scanned. There is one block index
> > per HFile, which sorts blocks by their key ranges. This index helps
> > reduce the number of blocks to scan by selecting only those blocks
> > whose ranges could contain the key.
> >
> > In this case, if puts arrive in random key order, each block may cover
> > a similar range, and it may turn out that HBase needs to scan every
> > block in the file. This may not be good for performance.
> >
> > I just want to validate my understanding.
> >
> >
> If you have such a use case, I think the best practice is to use bloom
> filters. In general it's a good idea to at least enable the bloom
> filter at the row level.
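The read pattern under discussion is a point Get that names specific qualifiers; here is a minimal client-side sketch, again assuming the 0.94-era API, with the table, family, row key, and qualifier names invented for illustration. Restricting the Get to an exact column is what lets a ROWCOL bloom rule out HFiles without touching their data blocks.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PointGetExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "wide_table");     // hypothetical table

        Get get = new Get(Bytes.toBytes("row-12345"));     // hypothetical row key
        // Restrict the Get to one family/qualifier pair.
        get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("q42"));

        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("q42"));
        System.out.println(value == null ? "miss" : Bytes.toString(value));

        table.close();
      }
    }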
>
> > Thanks,
> > Abhishek
> >
> >
> > -----Original Message-----
> > From: lars hofhansl [mailto:[EMAIL PROTECTED]]
> >  Sent: Tuesday, August 21, 2012 5:55 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: HBase Put
> >
> > That is correct.
> >
> >
> >
> > ________________________________
> >  From: "Pamecha, Abhishek" <[EMAIL PROTECTED]>
> > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; lars hofhansl <
> > [EMAIL PROTECTED]>
> > Sent: Tuesday, August 21, 2012 4:45 PM
> > Subject: RE: HBase Put
> >
> > Hi Lars,
> >
> > Thanks for the explanation. I still have a little doubt:
> >
> > Based on your description, given that Gets do a merge sort, the data
> > on disk is not kept sorted across files, only within each file.
> >
> > So, basically if on two separate days, say these keys get inserted:
> >
> > Day1: File1:   A B J M
> > Day2: File2:  C D K P
> >
> > Then each file is sorted within itself, but scanning both files will
> > require HBase to use a merge sort to produce a sorted result. Right?
> >
> > Also, File1 and File2 are immutable, and during compactions, File1
> > and File2 are compacted (merge-sorted) into a bigger File3.
> > Is that correct too?
> >
> > Thanks,
> > Abhishek
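To make the Day1/Day2 picture above concrete, here is a toy illustration in plain Java (not HBase's actual code) of the merge step: two files, each sorted internally, read back as one sorted stream by repeatedly taking the smaller head key.

    import java.util.ArrayList;
    import java.util.List;

    public class TwoWayMergeExample {
      public static void main(String[] args) {
        // Two "files", each sorted within itself, as in the Day1/Day2 example.
        String[] file1 = {"A", "B", "J", "M"};
        String[] file2 = {"C", "D", "K", "P"};

        List<String> merged = new ArrayList<String>();
        int i = 0, j = 0;
        while (i < file1.length && j < file2.length) {
          // Take the smaller head key first, exactly like merge sort's merge step.
          if (file1[i].compareTo(file2[j]) <= 0) {
            merged.add(file1[i++]);
          } else {
            merged.add(file2[j++]);
          }
        }
        while (i < file1.length) merged.add(file1[i++]);
        while (j < file2.length) merged.add(file2[j++]);

        System.out.println(merged);   // [A, B, C, D, J, K, M, P]
      }
    }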
> >
> >
> > -----Original Message-----
> > From: lars hofhansl [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, August 21, 2012 4:07 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: HBase Put
> >
> > In a nutshell:
> > - Puts are collected in memory (in a sorted data structure)
> > - When the collected data reaches a certain size it is flushed to a
> > new file (which is sorted)
> > - Gets do a merge sort between the various files that have been
> > created this way
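As a toy illustration of that nutshell (again plain Java, not HBase's actual code): puts accumulate in a sorted in-memory structure, and once an arbitrary size threshold is reached the structure is written out as a new, immutable, sorted "file".

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class MemstoreFlushSketch {
      // Each flushed "file" is represented as a sorted list of key=value strings.
      private final List<List<String>> flushedFiles = new ArrayList<List<String>>();
      private final TreeMap<String, String> memstore = new TreeMap<String, String>();
      private final int flushThreshold = 4;   // arbitrary size for the sketch

      void put(String key, String value) {
        memstore.put(key, value);             // TreeMap keeps puts sorted by key
        if (memstore.size() >= flushThreshold) {
          flush();
        }
      }

      void flush() {
        List<String> file = new ArrayList<String>();
        for (Map.Entry<String, String> e : memstore.entrySet()) {
          file.add(e.getKey() + "=" + e.getValue());   // written in sorted order
        }
        flushedFiles.add(file);               // immutable from here on
        memstore.clear();
      }

      public static void main(String[] args) {
        MemstoreFlushSketch store = new MemstoreFlushSketch();
        for (String k : new String[] {"A", "B", "J", "M", "C", "D", "K", "P"}) {
          store.put(k, "v-" + k);
        }
        // Two sorted "files": [A=.., B=.., J=.., M=..] and [C=.., D=.., K=.., P=..]
        System.out.println(store.flushedFiles);
      }
    }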