Jamie Johnson 2014-01-25, 03:23
Christopher 2014-01-25, 04:34
Sean Busbey 2014-01-25, 05:11
Jamie Johnson 2014-01-25, 05:20
Christopher 2014-01-26, 06:33
Jamie Johnson 2014-01-26, 12:56
Christopher 2014-01-26, 22:44
Jamie Johnson 2014-01-27, 00:28
Jamie Johnson 2014-01-27, 00:39
William Slacum 2014-01-27, 02:58
Filters are iterators, which are configured to run on the server-side.
fetchColumnFamily will only return entries in the specified column
families. If a row has no entries with the specified column families,
then no entries for that row will return.
Christopher L Tubbs II
On Sun, Jan 26, 2014 at 7:39 PM, Jamie Johnson <[EMAIL PROTECTED]> wrote:
> After a little reading...if I use fetchColumnFamily does that skip any rows
> that does not have the column family?
> On Jan 26, 2014 7:27 PM, "Jamie Johnson" <[EMAIL PROTECTED]> wrote:
>> Thanks for the ideas. Filters are client side right?
>> I need to read the documentation more as I don't know how to just query a
>> column family. Would it be possible to get all terms that start with a
>> particular value? I was thinking that we would need a special prefix for
>> this but if something could be done without needing it that would work well.
>> On Jan 26, 2014 5:44 PM, "Christopher" <[EMAIL PROTECTED]> wrote:
>>> Ah, I see. Well, you could do that with a custom filter (iterator),
>>> but otherwise, no, not unless you had some other special per-term
>>> entry to query (rather than per-term/document pair). The design of
>>> this kind of table though, seems focused on finding documents which
>>> contain the given terms, though, not listing all terms seen. If you
>>> need that additional feature and don't want to write a custom filter,
>>> you could achieve that by putting a special entry in its own row for
>>> each term, in addition to the entries per-term/document pair, as in:
>>> RowID ColumnFamily Column Qualifier Value
>>> <term1> term -
>>> <term1>=<doc_id2> index count 5
>>> Then, you could list terms by querying the "term" column family
>>> without getting duplicates. And, you could get decent performance with
>>> this scan if you put the "term" column family and the "index" column
>>> family in separate locality groups. You could even make this entry an
>>> aggregated count for all documents (see documentation for combiners),
>>> in case you want corpus-wide term frequencies (for something like
>>> TF-IDF computations).
>>> Christopher L Tubbs II
>>> On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <[EMAIL PROTECTED]> wrote:
>>> > I mean if a user asked for all terms that started with "term" is there
>>> > a way
>>> > to get term1 and term2 just once while scanning or would I get each
>>> > twice,
>>> > once for each docid and need to filter client side?
>>> > On Jan 26, 2014 1:33 AM, "Christopher" <[EMAIL PROTECTED]> wrote:
>>> >> If you use the Range constructor that takes two arguments, then yes,
>>> >> you'd get two entries. However, "count" would come before "doc_id",
>>> >> though, because the qualifier is part of the Key, and therefore, part
>>> >> of the sort order. There's also a Range constructor that allows you to
>>> >> specify whether you want the startKey and endKey to be inclusive or
>>> >> exclusive.
>>> >> I don't know of a specific document that outlines various strategies
>>> >> that I can link to. Perhaps I'll put one together, when I get some
>>> >> spare time, if nobody else does. I think most people do a lot of
>>> >> experimentation to figure out which strategies work best.
>>> >> I'm not entirely sure what you mean about "getting an iterator over
>>> >> all terms without duplicates". I'm assuming you don't mean duplicate
>>> >> versions of a single entry, which is handled by the
>>> >> VersioningIterator, which should be on new tables by default, and set
>>> >> to retain the recent 1 version, to support updates. With the scheme I
>>> >> suggested, your table would look something like the following,
>>> >> instead:
>>> >> RowID ColumnFamily Column Qualifier
>>> >> Value
>>> >> <term1>=<doc_id1> index count