RE: Fastest way to read only the keys of a HTable?
Jonathan Gray 2011-02-03, 20:15
If you only need to consider a single column family, call Scan.addFamily() on your scan. The other column families will then have no impact on it.

> -----Original Message-----
> From: Something Something [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, February 03, 2011 11:28 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Fastest way to read only the keys of a HTable?
>
> Hmm.. performance hasn't improved at all. Do you see anything wrong with
> the following code:
>
> public List<Partner> getPartners() {
>     ArrayList<Partner> partners = new ArrayList<Partner>();
>
>     try {
>         HTable table = new HTable("partner");
>         Scan scan = new Scan();
>         scan.setFilter(new FirstKeyOnlyFilter());
>         ResultScanner scanner = table.getScanner(scan);
>         Result result = scanner.next();
>         while (result != null) {
>             Partner partner = new Partner(Bytes.toString(result.getRow()));
>             partners.add(partner);
>             result = scanner.next();
>         }
>     } catch (IOException e) {
>         throw new RuntimeException(e);
>     }
>     return partners;
> }
>
> May be I shouldn't use more than one "column family" in a HTable - but the
> BigTable paper recommends that, doesn't it? Please advice and thanks for
> your help.
>
> On Wed, Feb 2, 2011 at 10:55 PM, Stack <[EMAIL PROTECTED]> wrote:
>
> > I don't see a getKey on Result. Use
> >
> > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Result.html#getRow()
> > .
> >
> > Here is how its used in the shell table.rb class:
> >
> >   # Count rows in a table
> >   def count(interval = 1000, caching_rows = 10)
> >     # We can safely set scanner caching with the first key only filter
> >     scan = org.apache.hadoop.hbase.client.Scan.new
> >     scan.cache_blocks = false
> >     scan.caching = caching_rows
> >
> >     scan.setFilter(org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter.new)
> >
> >     # Run the scanner
> >     scanner = @table.getScanner(scan)
> >     count = 0
> >     iter = scanner.iterator
> >
> >     # Iterate results
> >     while iter.hasNext
> >       row = iter.next
> >       count += 1
> >       next unless (block_given? && count % interval == 0)
> >       # Allow command modules to visualize counting process
> >       yield(count, String.from_java_bytes(row.getRow))
> >     end
> >
> >     # Return the counter
> >     return count
> >   end
> >
> > St.Ack
> >
> > On Thu, Feb 3, 2011 at 6:47 AM, Something Something
> > <[EMAIL PROTECTED]> wrote:
> > > Thanks. So I will add this...
> > >
> > > scan.setFilter(new FirstKeyOnlyFilter());
> > >
> > > But after I do this...
> > >
> > > Result result = scanner.next();
> > >
> > > There's no... result.getKey() - so what method would give me the
> > > Key value?
> > >
> > > On Wed, Feb 2, 2011 at 10:20 PM, Stack <[EMAIL PROTECTED]> wrote:
> > >
> > >> See
> > >> http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/FirstKeyOnlyFilter.html
> > >> St.Ack
> > >>
> > >> On Thu, Feb 3, 2011 at 6:01 AM, Something Something
> > >> <[EMAIL PROTECTED]> wrote:
> > >> > I want to read only the keys in a table. I tried this...
> > >> >
> > >> > try {
> > >> >     HTable table = new HTable("myTable");
> > >> >     Scan scan = new Scan();
> > >> >     scan.addFamily(Bytes.toBytes("Info"));
> > >> >     ResultScanner scanner = table.getScanner(scan);
> > >> >     Result result = scanner.next();
> > >> >     while (result != null) {
> > >> >
> > >> > & so on...
> > >> >
> > >> > This was performing fairly well until I added another Family that
> > >> > contains lots of key/value pairs. My understanding was that adding
> > >> > another family wouldn't affect performance of this code because I am
> > >> > explicitly using "Info", but it is.
> > >> >
> > >> > Anyway, in this particular use case, I only care about the "Key"