Accumulo >> mail # user >> scanner question in regards to columns loaded

Re: scanner question in regards to columns loaded
Filters (and more generally, iterators) are executed on the server. There
is an option to run them client side. See

Using fetchColumnFamily will return only keys that have specific column
family values, not rows.

If I have a few keys in a table:

row1 family1: qualifier1
row1 family2: qualifier2
row2 family1: qualifier1

Let's say I call `scanner.fetchColumnFamily("family1")`. My scanner will
return:

row1 family1: qualifier1
row2 family1: qualifier1

Now let's say I want to do a scan, but call
`scanner.fetchColumnFamily("family2")`. My scanner will return:

row1 family2: qualifier2
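
The behaviour above can be sketched with a small plain-Java model. This is not Accumulo client code: the class `FetchColumnFamilyModel` and its methods `table` and `fetchFamily` are hypothetical names made up for illustration, and the table is just an in-memory list holding the three keys from the example. It only shows that the selection is made per entry, not per row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the example above; this is NOT Accumulo client code.
public class FetchColumnFamilyModel {

    // Each entry is {row, columnFamily, columnQualifier}, in sorted key order.
    static List<String[]> table() {
        return Arrays.asList(
            new String[] {"row1", "family1", "qualifier1"},
            new String[] {"row1", "family2", "qualifier2"},
            new String[] {"row2", "family1", "qualifier1"});
    }

    // Models the effect of fetchColumnFamily: keep individual entries whose
    // column family matches, regardless of what else is in the same row.
    static List<String> fetchFamily(List<String[]> entries, String family) {
        List<String> result = new ArrayList<>();
        for (String[] e : entries) {
            if (e[1].equals(family)) {
                result.add(e[0] + " " + e[1] + ": " + e[2]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Print the entries that a "family1" fetch would keep.
        for (String line : fetchFamily(table(), "family1")) {
            System.out.println(line);
        }
    }
}
```

Calling `fetchFamily(table(), "family2")` in this model keeps only the single `row1 family2: qualifier2` entry; the `family1` entry in the same row is dropped.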

If you want whole rows that contain specific column families, then I
believe you'd have to write a custom iterator using the RowFilter
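
For contrast, here is a rough plain-Java model of the "whole rows that contain a specific column family" behaviour. It is only a sketch of the semantics: in Accumulo the per-row decision would live in a server-side iterator, and the class `WholeRowModel` and method `rowsWithFamily` below are hypothetical names that do not come from the Accumulo API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java model of row-level selection; this is NOT an Accumulo iterator.
public class WholeRowModel {

    // Each entry is {row, columnFamily, columnQualifier}, in sorted key order.
    static List<String[]> table() {
        return Arrays.asList(
            new String[] {"row1", "family1", "qualifier1"},
            new String[] {"row1", "family2", "qualifier2"},
            new String[] {"row2", "family1", "qualifier1"});
    }

    // Returns every entry of each row that has at least one entry in `family`.
    static List<String> rowsWithFamily(List<String[]> entries, String family) {
        // First pass: find the rows that contain the requested family.
        Set<String> matchingRows = new HashSet<>();
        for (String[] e : entries) {
            if (e[1].equals(family)) {
                matchingRows.add(e[0]);
            }
        }
        // Second pass: emit all entries of those rows, whatever their family.
        List<String> result = new ArrayList<>();
        for (String[] e : entries) {
            if (matchingRows.contains(e[0])) {
                result.add(e[0] + " " + e[1] + ": " + e[2]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Print every entry of the rows that contain "family2".
        for (String line : rowsWithFamily(table(), "family2")) {
            System.out.println(line);
        }
    }
}
```

In this model, asking for `family2` keeps both entries of `row1` and drops `row2` entirely, which is the difference from a plain `fetchColumnFamily` call.
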
On Sun, Jan 26, 2014 at 7:39 PM, Jamie Johnson <[EMAIL PROTECTED]> wrote:

> After a little reading...if I use fetchColumnFamily does that skip any
> rows that do not have the column family?
> On Jan 26, 2014 7:27 PM, "Jamie Johnson" <[EMAIL PROTECTED]> wrote:
>> Thanks for the ideas.  Filters are client side right?
>> I need to read the documentation more as I don't know how to just query a
>> column family.  Would it be possible to get all terms that start with a
>> particular value?  I was thinking that we would need a special prefix for
>> this but if something could be done without needing it that would work well.
>> On Jan 26, 2014 5:44 PM, "Christopher" <[EMAIL PROTECTED]> wrote:
>>> Ah, I see. Well, you could do that with a custom filter (iterator),
>>> but otherwise, no, not unless you had some other special per-term
>>> entry to query (rather than per-term/document pair). The design of
>>> this kind of table, though, seems focused on finding documents which
>>> contain the given terms, not listing all terms seen. If you
>>> need that additional feature and don't want to write a custom filter,
>>> you could achieve that by putting a special entry in its own row for
>>> each term, in addition to the entries per-term/document pair, as in:
>>> RowID                ColumnFamily    Column Qualifier    Value
>>> <term1>              term            -                   -
>>> <term1>=<doc_id2>    index           count               5
>>> Then, you could list terms by querying the "term" column family
>>> without getting duplicates. And, you could get decent performance with
>>> this scan if you put the "term" column family and the "index" column
>>> family in separate locality groups. You could even make this entry an
>>> aggregated count for all documents (see documentation for combiners),
>>> in case you want corpus-wide term frequencies (for something like
>>> TF-IDF computations).
>>> --
>>> Christopher L Tubbs II
>>> http://gravatar.com/ctubbsii
>>> On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <[EMAIL PROTECTED]> wrote:
>>> > I mean if a user asked for all terms that started with "term" is there a way
>>> > to get term1 and term2 just once while scanning or would I get each twice,
>>> > once for each docid and need to filter client side?
>>> >
>>> > On Jan 26, 2014 1:33 AM, "Christopher" <[EMAIL PROTECTED]> wrote:
>>> >>
>>> >> If you use the Range constructor that takes two arguments, then yes,
>>> >> you'd get two entries. However, "count" would come before "doc_id",
>>> >> because the qualifier is part of the Key, and therefore, part
>>> >> of the sort order. There's also a Range constructor that allows you to
>>> >> specify whether you want the startKey and endKey to be inclusive or
>>> >> exclusive.
>>> >>
>>> >> I don't know of a specific document that outlines various strategies
>>> >> that I can link to. Perhaps I'll put one together, when I get some
>>> >> spare time, if nobody else does. I think most people do a lot of