HBase user mailing list: Fastest way to read only the keys of a HTable?


RE: Fastest way to read only the keys of a HTable?
If you only need to consider a single column family, use Scan.addFamily() on your scanner.  Then the other column families will have no impact on the scan.
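
A minimal sketch of that suggestion, assuming the 0.90-era client API. The table name "partner" and family "Info" are taken from the code quoted below; the caching value and the RowKeyScan/rowKeys names are only illustrative. Restricting the scan with addFamily means the stores of the other families are never read, which is the point being made above.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    // Sketch: scan a single family and return only the first cell of each
    // row, so the other (possibly large) families are never read.
    public class RowKeyScan {
        public static List<String> rowKeys() throws IOException {
            HTable table = new HTable(HBaseConfiguration.create(), "partner");
            Scan scan = new Scan();
            scan.addFamily(Bytes.toBytes("Info"));     // restrict the scan to one family
            scan.setFilter(new FirstKeyOnlyFilter());  // only the first KeyValue per row
            scan.setCaching(1000);                     // rows per RPC (value is an example)

            List<String> keys = new ArrayList<String>();
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result result : scanner) {        // ResultScanner is Iterable
                    keys.add(Bytes.toString(result.getRow()));
                }
            } finally {
                scanner.close();                       // release the scanner lease
            }
            return keys;
        }
    }
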

> -----Original Message-----
> From: Something Something [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, February 03, 2011 11:28 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Fastest way to read only the keys of a HTable?
>
> Hmm.. performance hasn't improved at all.  Do you see anything wrong with
> the following code:
>
>
>     public List<Partner> getPartners() {
>       ArrayList<Partner> partners = new ArrayList<Partner>();
>
>       try {
>           HTable table = new HTable("partner");
>           Scan scan = new Scan();
>           scan.setFilter(new FirstKeyOnlyFilter());
>           ResultScanner scanner = table.getScanner(scan);
>           Result result = scanner.next();
>           while (result != null) {
>               Partner partner = new Partner(Bytes.toString(result.getRow()));
>               partners.add(partner);
>               result = scanner.next();
>           }
>       } catch (IOException e) {
>           throw new RuntimeException(e);
>       }
>       return partners;
>   }
>
> Maybe I shouldn't use more than one "column family" in an HTable - but the
> BigTable paper recommends that, doesn't it?  Please advise, and thanks for
> your help.
>
>
>
>
> On Wed, Feb 2, 2011 at 10:55 PM, Stack <[EMAIL PROTECTED]> wrote:
>
> > I don't see a getKey on Result.  Use
> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html#getRow()
> >
> > Here is how it's used in the shell table.rb class:
> >
> >    # Count rows in a table
> >    def count(interval = 1000, caching_rows = 10)
> >      # We can safely set scanner caching with the first key only filter
> >      scan = org.apache.hadoop.hbase.client.Scan.new
> >      scan.cache_blocks = false
> >      scan.caching = caching_rows
> >      scan.setFilter(org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter.new)
> >
> >      # Run the scanner
> >      scanner = @table.getScanner(scan)
> >      count = 0
> >      iter = scanner.iterator
> >
> >      # Iterate results
> >      while iter.hasNext
> >        row = iter.next
> >        count += 1
> >        next unless (block_given? && count % interval == 0)
> >        # Allow command modules to visualize counting process
> >        yield(count, String.from_java_bytes(row.getRow))
> >      end
> >
> >      # Return the counter
> >      return count
> >    end
> >
> >
> > St.Ack
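
A rough Java translation of that shell snippet, as a sketch rather than code from this thread; the RowCounter class name and method signature are just for illustration, while the scan settings mirror the Ruby code above.

    import java.io.IOException;

    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;

    // Count rows the same way the shell does: first-key-only filter,
    // block cache disabled, and scanner caching to cut RPC round trips.
    public class RowCounter {
        public static long count(HTable table, int cachingRows) throws IOException {
            Scan scan = new Scan();
            scan.setCacheBlocks(false);                // a full scan shouldn't churn the block cache
            scan.setCaching(cachingRows);              // rows fetched per RPC
            scan.setFilter(new FirstKeyOnlyFilter());  // only the first cell of each row is returned

            long count = 0;
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    count++;
                    // the row key is available via Bytes.toString(row.getRow()) if needed
                }
            } finally {
                scanner.close();
            }
            return count;
        }
    }
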
> >
> > On Thu, Feb 3, 2011 at 6:47 AM, Something Something
> > <[EMAIL PROTECTED]> wrote:
> > > Thanks.  So I will add this...
> > >
> > >   scan.setFilter(new FirstKeyOnlyFilter());
> > >
> > > But after I do this...
> > >
> > >   Result result = scanner.next();
> > >
> > > There's no...  result.getKey() - so what method would give me the Key value?
> > >
> > >
> > >
> > > On Wed, Feb 2, 2011 at 10:20 PM, Stack <[EMAIL PROTECTED]> wrote:
> > >
> > >> See
> > >> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html
> > >> St.Ack
> > >>
> > >> On Thu, Feb 3, 2011 at 6:01 AM, Something Something
> > >> <[EMAIL PROTECTED]> wrote:
> > >> > I want to read only the keys in a table. I tried this...
> > >> >
> > >> >    try {
> > >> >        HTable table = new HTable("myTable");
> > >> >        Scan scan = new Scan();
> > >> >        scan.addFamily(Bytes.toBytes("Info"));
> > >> >        ResultScanner scanner = table.getScanner(scan);
> > >> >        Result result = scanner.next();
> > >> >        while (result != null) {
> > >> >
> > >> > & so on...
> > >> >
> > >> > This was performing fairly well until I added another family that
> > >> > contains lots of key/value pairs.  My understanding was that adding another
> > >> > family wouldn't affect the performance of this code because I am explicitly
> > >> > using "Info", but it does.
> > >> >
> > >> > Anyway, in this particular use case, I only care about the "Key"