Accumulo >> mail # user >> accumulo for a bi-map?


Re: accumulo for a bi-map?
Just be aware that if you have extremely wide matches (one record
matching many other records), all of that record's matches are forced
to be hosted on a single tabletserver, because a row cannot be split
across tablets.

Given the size of what you outlined so far, you'd probably have to get
up to the scale of tens of millions before this is a problem.

On 7/18/13 12:15 PM, Marc Reichman wrote:
> I have implemented an approach like Dave Marion's, where on a match
> during search I insert two rows:
>
> Row  | Column Family | Column Qualifier | Value
> -----+---------------+------------------+--------
> abcd | ijkl          | 90               | (empty)
> ijkl | abcd          | 90               | (empty)
>
> This works great for what I need to get, all abcd matches, all ijkl
> matches, specifically abcd->ijkl or reversed. For threshold filtering,
> I'm currently getting all of the results (from these cases) and then not
> retaining items below my threshold. I've looked at some ways to use a
> scan iterator to do this but I'm coming up short. Best idea I've had yet
> is to extend the ColumnQualifierFilter to see if I can do a "greater
> than" instead of an equals to accept or not. Any thoughts?
>
>
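A minimal sketch of the "greater than" filter idea, modelled in plain Python rather than against the Accumulo API. It assumes the layout described in the message above (row = source, column family = target, column qualifier = score stored as text). The names `Entry`, `accept`, and `scan` are hypothetical; `accept` only stands in for the accept-or-reject decision a server-side filter would make per key.

```python
from typing import Iterable, Iterator, NamedTuple


class Entry(NamedTuple):
    """Hypothetical stand-in for one key in the match table."""
    row: str        # source record, e.g. "abcd"
    family: str     # matched record, e.g. "ijkl"
    qualifier: str  # score stored as text, e.g. "90"


def accept(entry: Entry, threshold: float) -> bool:
    """Keep an entry only if its score parses and exceeds the threshold."""
    try:
        return float(entry.qualifier) > threshold
    except ValueError:
        # Assumption: unparseable scores are dropped rather than failing the scan.
        return False


def scan(entries: Iterable[Entry], row: str, threshold: float) -> Iterator[Entry]:
    """Yield the entries for one row whose score is above the threshold."""
    for entry in entries:
        if entry.row == row and accept(entry, threshold):
            yield entry


table = [
    Entry("abcd", "efgh", "88"),
    Entry("abcd", "ijkl", "90"),
    Entry("ijkl", "abcd", "90"),
]
matches = list(scan(table, "abcd", 89))
```

The point of doing this inside an iterator rather than in the client is only where `accept` runs: the same comparison executes next to the data, so entries below the threshold never cross the network.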
>
> On Wed, Jul 17, 2013 at 10:26 AM, Marc Reichman
> <[EMAIL PROTECTED]> wrote:
>
>     Thank you all for your responses. Some follow-up thoughts/questions:
>
>     The use cases I'm chasing right now for retrieval are shaping up to be:
>     1. Get one ABCD->IJKL match score
>     2. Get all ABCD->* match scores
>     3. Either of the above, only greater than a specified threshold.
>
>     It's looking like the results may go into a different table than the
>     original features, so I can work a little more flexibly.
>
>     So far, Dave Marion's approach seems most closely suited to this,
>     but in a different table I wouldn't get the features back if I just
>     did a basic scan for the row key without other factors, which would
>     satisfy use case #2. I can satisfy case #1 easily if I make the
>     targets (IJKL) a qualifier and constrain by it on my scan as needed.
>
>     For #3, I'm a bit confused about the best way to do this. A simple
>     solution would be to just pull all the results from the #1/#2 cases
>     and filter out undesirables in my client-side code. Assuming
>     key:source, fam:target, col:score, is there some form of iterator or
>     filter I could use to process the column names and throw out what I
>     don't want with decent data locality for the processing?
>
>     Would it make any major impact if the scores were not integers but
>     doubles? I'm already anticipating having to parse doubles from the
>     scores as-stored in byte[] string form, but I don't know if the
>     performance impact would make any difference doing that locally
>     after or in an iterator.
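On the integer-versus-double question, one thing that may matter more than parse cost: a score stored as text compares as bytes, not as a number, so any threshold check has to parse first. A small sketch, assuming scores are stored as UTF-8 decimal strings (the helper name `parse_score` is hypothetical):

```python
def parse_score(raw: bytes) -> float:
    """Decode a score stored as a UTF-8 decimal string."""
    return float(raw.decode("utf-8"))


stored = [b"90", b"100", b"88.5", b"9.25"]

# Byte-wise ordering, which is how a sorted key component is ordered.
byte_order = sorted(stored)

# Numeric ordering after parsing each value.
numeric_order = sorted(stored, key=parse_score)

# A byte-wise "greater than 90" comparison misses 100; the parsed one does not.
above_bytes = [s for s in stored if s > b"90"]
above_numeric = [s for s in stored if parse_score(s) > 90]
```

Here `above_bytes` comes out empty while `above_numeric` contains `b"100"`, which is why the comparison is done on parsed values whether it runs client-side or in an iterator.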
>
>     I feel like this is close and I appreciate the guidance.
>
>     Thanks,
>     Marc
>
>
>     On Tue, Jul 16, 2013 at 6:25 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
>
>         Instead of keeping all match scores inside of one Value, have
>         you considered thinking about your data in terms of edges?
>
>         key:abcd->efgh score, value:88%
>         key:abcd->ijkl score, value:90%
>         key:efgh->abcd score, value:88%
>         key:ijkl->abcd score, value:90%
>
>         If you do go the route of storing both directions in Accumulo, a
>         structure like this will likely be much easier to maintain, as
>         you're not trying to manage difficult aggregation rules for
>         multiple updates to the matches for a single record.
>         Additionally, you should get really good compression (and even
>         better in 1.5) when you have large row prefixes (many matches
>         for abcd will equate to abcd being stored "once").
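The edge layout above can be modelled with a sorted list of (source, target) keys, which is roughly how a sorted key-value store presents it. This is a plain-Python sketch, not Accumulo client code; `MatchTable`, `add_match`, `get_score`, and `get_all` are hypothetical names, and scores are kept as integers for brevity.

```python
import bisect
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (source record, target record)


class MatchTable:
    """Toy sorted table holding each match as two directed edges."""

    def __init__(self) -> None:
        self._keys: List[Key] = []          # kept in sorted order
        self._values: Dict[Key, int] = {}

    def add_match(self, a: str, b: str, score: int) -> None:
        """Write both a->b and b->a so either side can be looked up by row."""
        for key in ((a, b), (b, a)):
            if key not in self._values:
                bisect.insort(self._keys, key)
            self._values[key] = score       # a later write simply overwrites

    def get_score(self, source: str, target: str) -> Optional[int]:
        """Look up one source->target score."""
        return self._values.get((source, target))

    def get_all(self, source: str) -> List[Tuple[str, int]]:
        """Read every match for a source as one contiguous key range."""
        start = bisect.bisect_left(self._keys, (source, ""))
        result = []
        for key in self._keys[start:]:
            if key[0] != source:
                break
            result.append((key[1], self._values[key]))
        return result


table = MatchTable()
table.add_match("abcd", "efgh", 88)
table.add_match("abcd", "ijkl", 90)
```

Because every key for `abcd` shares the same leading component, those keys sit next to each other in sort order, so reading all matches for one record is a single range read, and updating one match touches only its own two keys.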
>
>         You could also store all of the features for a record in a key