HBase >> mail # user >> Fastest way to read only the keys of a HTable?


RE: Fastest way to read only the keys of a HTable?
If you only need to consider a single column family, call Scan.addFamily() on your Scan object.  The scan then reads only that family, so the other column families have no impact on it.
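
A minimal sketch of that advice combined with the FirstKeyOnlyFilter and scanner-caching settings quoted further down the thread. It assumes an HBase 1.0 or later client (the Connection/Table API rather than the thread's `new HTable("partner")`). The class name `RowKeyReader`, the method `readRowKeys`, and the caching value of 1000 are illustrative choices, not something from the thread. It needs the HBase client library on the classpath and a reachable cluster.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper class; the name is an assumption for illustration.
public class RowKeyReader {

    // Returns the row keys of every row that has at least one cell in `family`.
    public static List<String> readRowKeys(String tableName, String family) throws IOException {
        // Picks up hbase-site.xml from the classpath.
        Configuration conf = HBaseConfiguration.create();
        List<String> keys = new ArrayList<>();

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes(family));     // read only this column family
        scan.setFilter(new FirstKeyOnlyFilter());  // return only the first cell of each row
        scan.setCaching(1000);                     // rows fetched per RPC; 1000 is an assumed value
        scan.setCacheBlocks(false);                // avoid filling the block cache on a full scan

        // try-with-resources closes the scanner, table, and connection in reverse order.
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf(tableName));
             ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                keys.add(Bytes.toString(result.getRow()));
            }
        }
        return keys;
    }

    public static void main(String[] args) throws IOException {
        // "partner" and "Info" are the table and family names used in the thread.
        for (String key : readRowKeys("partner", "Info")) {
            System.out.println(key);
        }
    }
}
```

Note that a scan restricted to one family returns only rows that have at least one cell in that family, so choose a family that every row populates.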

> -----Original Message-----
> From: Something Something [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, February 03, 2011 11:28 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Fastest way to read only the keys of a HTable?
>
> Hmm.. performance hasn't improved at all.  Do you see anything wrong with
> the following code:
>
>
>     public List<Partner> getPartners() {
>       ArrayList<Partner> partners = new ArrayList<Partner>();
>
>       try {
>           HTable table = new HTable("partner");
>           Scan scan = new Scan();
>           scan.setFilter(new FirstKeyOnlyFilter());
>           ResultScanner scanner = table.getScanner(scan);
>           Result result = scanner.next();
>           while (result != null) {
>               Partner partner = new Partner(Bytes.toString(result.getRow()));
>               partners.add(partner);
>               result = scanner.next();
>           }
>       } catch (IOException e) {
>           throw new RuntimeException(e);
>       }
>       return partners;
>   }
>
> Maybe I shouldn't use more than one "column family" in an HTable - but the
> BigTable paper recommends that, doesn't it?  Please advise, and thanks for
> your help.
>
>
>
>
> On Wed, Feb 2, 2011 at 10:55 PM, Stack <[EMAIL PROTECTED]> wrote:
>
> > I don't see a getKey on Result.  Use
> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html#getRow()
> >
> > Here is how it's used in the shell's table.rb class:
> >
> >    # Count rows in a table
> >    def count(interval = 1000, caching_rows = 10)
> >      # We can safely set scanner caching with the first key only filter
> >      scan = org.apache.hadoop.hbase.client.Scan.new
> >      scan.cache_blocks = false
> >      scan.caching = caching_rows
> >      scan.setFilter(org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter.new)
> >
> >      # Run the scanner
> >      scanner = @table.getScanner(scan)
> >      count = 0
> >      iter = scanner.iterator
> >
> >      # Iterate results
> >      while iter.hasNext
> >        row = iter.next
> >        count += 1
> >        next unless (block_given? && count % interval == 0)
> >        # Allow command modules to visualize counting process
> >        yield(count, String.from_java_bytes(row.getRow))
> >      end
> >
> >      # Return the counter
> >      return count
> >    end
> >
> >
> > St.Ack
> >
> > On Thu, Feb 3, 2011 at 6:47 AM, Something Something
> > <[EMAIL PROTECTED]> wrote:
> > > Thanks.  So I will add this...
> > >
> > >   scan.setFilter(new FirstKeyOnlyFilter());
> > >
> > > But after I do this...
> > >
> > >   Result result = scanner.next();
> > >
> > > There's no...  result.getKey() - so what method would give me the Key value?
> > >
> > >
> > >
> > > On Wed, Feb 2, 2011 at 10:20 PM, Stack <[EMAIL PROTECTED]> wrote:
> > >
> > >> See http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html
> > >> St.Ack
> > >>
> > >> On Thu, Feb 3, 2011 at 6:01 AM, Something Something
> > >> <[EMAIL PROTECTED]> wrote:
> > >> > I want to read only the keys in a table. I tried this...
> > >> >
> > >> >    try {
> > >> >      HTable table = new HTable("myTable");
> > >> >      Scan scan = new Scan();
> > >> >      scan.addFamily(Bytes.toBytes("Info"));
> > >> >      ResultScanner scanner = table.getScanner(scan);
> > >> >      Result result = scanner.next();
> > >> >      while (result != null) {
> > >> >
> > >> > & so on...
> > >> >
> > >> > This was performing fairly well until I added another Family that contains
> > >> > lots of key/value pairs.  My understanding was that adding another family
> > >> > wouldn't affect performance of this code because I am explicitly using
> > >> > "Info", but it does.
> > >> >
> > >> > Anyway, in this particular use case, I only care about the "Key"