Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Accumulo >> mail # user >> scanner question in regards to columns loaded


Copy link to this message
-
Re: scanner question in regards to columns loaded
Filters are iterators, which are configured to run on the server-side.
fetchColumnFamily will only return entries in the specified column
families. If a row has no entries with the specified column families,
then no entries for that row will return.

--
Christopher L Tubbs II
http://gravatar.com/ctubbsii
On Sun, Jan 26, 2014 at 7:39 PM, Jamie Johnson <[EMAIL PROTECTED]> wrote:
> After a little reading...if I use fetchColumnFamily does that skip any rows
> that does not have the column family?
>
> On Jan 26, 2014 7:27 PM, "Jamie Johnson" <[EMAIL PROTECTED]> wrote:
>>
>> Thanks for the ideas.  Filters are client side right?
>>
>> I need to read the documentation more as I don't know how to just query a
>> column family.  Would it be possible to get all terms that start with a
>> particular value?  I was thinking that we would need a special prefix for
>> this but if something could be done without needing it that would work well.
>>
>> On Jan 26, 2014 5:44 PM, "Christopher" <[EMAIL PROTECTED]> wrote:
>>>
>>> Ah, I see. Well, you could do that with a custom filter (iterator),
>>> but otherwise, no, not unless you had some other special per-term
>>> entry to query (rather than per-term/document pair). The design of
>>> this kind of table though, seems focused on finding documents which
>>> contain the given terms, though, not listing all terms seen. If you
>>> need that additional feature and don't want to write a custom filter,
>>> you could achieve that by putting a special entry in its own row for
>>> each term, in addition to the entries per-term/document pair, as in:
>>>
>>> RowID                       ColumnFamily     Column Qualifier     Value
>>> <term1>                    term                   -
>>> -
>>> <term1>=<doc_id2>   index                  count                     5
>>>
>>> Then, you could list terms by querying the "term" column family
>>> without getting duplicates. And, you could get decent performance with
>>> this scan if you put the "term" column family and the "index" column
>>> family in separate locality groups. You could even make this entry an
>>> aggregated count for all documents (see documentation for combiners),
>>> in case you want corpus-wide term frequencies (for something like
>>> TF-IDF computations).
>>>
>>> --
>>> Christopher L Tubbs II
>>> http://gravatar.com/ctubbsii
>>>
>>>
>>> On Sun, Jan 26, 2014 at 7:55 AM, Jamie Johnson <[EMAIL PROTECTED]> wrote:
>>> > I mean if a user asked for all terms that started with "term" is there
>>> > a way
>>> > to get term1 and term2 just once while scanning or would I get each
>>> > twice,
>>> > once for each docid and need to filter client side?
>>> >
>>> > On Jan 26, 2014 1:33 AM, "Christopher" <[EMAIL PROTECTED]> wrote:
>>> >>
>>> >> If you use the Range constructor that takes two arguments, then yes,
>>> >> you'd get two entries. However, "count" would come before "doc_id",
>>> >> though, because the qualifier is part of the Key, and therefore, part
>>> >> of the sort order. There's also a Range constructor that allows you to
>>> >> specify whether you want the startKey and endKey to be inclusive or
>>> >> exclusive.
>>> >>
>>> >> I don't know of a specific document that outlines various strategies
>>> >> that I can link to. Perhaps I'll put one together, when I get some
>>> >> spare time, if nobody else does. I think most people do a lot of
>>> >> experimentation to figure out which strategies work best.
>>> >>
>>> >> I'm not entirely sure what you mean about "getting an iterator over
>>> >> all terms without duplicates". I'm assuming you don't mean duplicate
>>> >> versions of a single entry, which is handled by the
>>> >> VersioningIterator, which should be on new tables by default, and set
>>> >> to retain the recent 1 version, to support updates. With the scheme I
>>> >> suggested, your table would look something like the following,
>>> >> instead:
>>> >>
>>> >> RowID                       ColumnFamily     Column Qualifier
>>> >> Value
>>> >> <term1>=<doc_id1>   index                  count

 
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB