Accumulo >> mail # user >> accumulo for a bi-map?

Marc Reichman 2013-07-16, 21:28
Dave Marion 2013-07-16, 23:16
David Medinets 2013-07-16, 22:55
Josh Elser 2013-07-16, 23:25
Re: accumulo for a bi-map?
Thank you all for your responses. Some follow-up thoughts/questions:

The use cases I'm chasing right now for retrieval are shaping up to be:
1. Get one ABCD->IJKL match score
2. Get all ABCD->* match scores
3. Either of the above, only greater than a specified threshold.

It's looking like the results may go into a different table than the
original features, so I can work a little more flexibly.

So far, Dave Marion's approach seems most closely suited to this, but in a
different table I wouldn't get the features back if I just did a basic scan
for the row key without other factors, which would satisfy use case #2. I
can satisfy case #1 easily if I make the targets (IJKL) a qualifier and
constrain by it on my scan as needed.
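To make the row/qualifier idea concrete, here is a minimal sketch of use cases #1 and #2. It is an assumption-laden model, not Accumulo code: a java.util.TreeMap stands in for the sorted table, and the class name, method names, and the NUL separator between source and target are all hypothetical. It only illustrates that a sorted key of (source, target) serves both a point lookup and a per-source range scan.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Stand-in for a sorted key/value table; this is NOT the Accumulo client API. */
final class MatchTableSketch {
    // Hypothetical separator between the row (source) and the qualifier (target).
    private static final char SEP = '\u0000';

    private final NavigableMap<String, String> cells = new TreeMap<>();

    /** Stores the score under both directions, since a->b equals b->a. */
    public void putMatch(String a, String b, String score) {
        cells.put(a + SEP + b, score);
        cells.put(b + SEP + a, score);
    }

    /** Use case #1: one source->target score, or null if there is none. */
    public String getScore(String source, String target) {
        return cells.get(source + SEP + target);
    }

    /** Use case #2: all source->* scores, keyed by target. */
    public Map<String, String> getAllScores(String source) {
        String start = source + SEP;
        String end = source + (char) (SEP + 1); // first possible key past this row's prefix
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, String> e : cells.subMap(start, true, end, false).entrySet()) {
            result.put(e.getKey().substring(start.length()), e.getValue());
        }
        return result;
    }
}
```

With a real table, the first lookup would correspond to a scan restricted to one row and one qualifier, and the second to a scan over the whole row.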

For #3, I'm a bit unsure of the best way to do this. A simple solution
would be to just pull all the results from the #1/#2 cases and filter out
undesirables in my client-side code. Assuming key:source, fam:target,
col:score, is there some form of iterator or filter I could use to process
the column names and throw out what I don't want with decent data locality
for the processing?
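The threshold check itself is small; here is a sketch of just that decision, assuming the score is stored in the column qualifier as plain decimal text such as "88" or "0.88" (an assumption, not something settled in this thread). The class and method names are hypothetical. As far as I know, Accumulo's server-side Filter iterators make the same kind of per-entry accept/reject decision, which is where logic like this would need to live to run next to the data.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of a score threshold check; not an Accumulo iterator. */
final class ScoreThresholdSketch {

    /** Returns true when the score, stored as UTF-8 decimal text, exceeds the threshold. */
    static boolean accept(byte[] qualifier, double threshold) {
        try {
            return Double.parseDouble(new String(qualifier, StandardCharsets.UTF_8)) > threshold;
        } catch (NumberFormatException e) {
            return false; // unparseable scores are dropped in this sketch
        }
    }

    /** Keeps only the qualifiers whose score passes the threshold, preserving order. */
    static List<byte[]> filter(List<byte[]> qualifiers, double threshold) {
        List<byte[]> kept = new ArrayList<>();
        for (byte[] q : qualifiers) {
            if (accept(q, threshold)) {
                kept.add(q);
            }
        }
        return kept;
    }
}
```

Run client-side, this is the "pull everything and filter" option; the same accept logic placed in a server-side iterator would discard entries before they are sent to the client.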

Would it make any major impact if the scores were not integers but doubles?
I'm already anticipating having to parse doubles from the scores as stored
in byte[] string form, but I don't know whether it matters for performance
if I do that parsing client-side after the scan or inside an iterator.
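For reference, the two encodings in question can be sketched side by side: the string form described above and a fixed 8-byte binary form. The class and method names are hypothetical, and the binary form is only an alternative for comparison, not something proposed in this thread.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Two ways to carry a double score in a byte[]; names are illustrative only. */
final class ScoreCodecSketch {

    /** Text form: the decimal string as UTF-8 bytes (variable length). */
    static byte[] encodeText(double score) {
        return Double.toString(score).getBytes(StandardCharsets.UTF_8);
    }

    static double decodeText(byte[] bytes) {
        return Double.parseDouble(new String(bytes, StandardCharsets.UTF_8));
    }

    /** Binary form: the IEEE 754 bits, big-endian, always 8 bytes. */
    static byte[] encodeBinary(double score) {
        return ByteBuffer.allocate(Double.BYTES).putDouble(score).array();
    }

    static double decodeBinary(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getDouble();
    }
}
```

Either decode is the same per-entry work whether it runs in the client or in an iterator; what changes is how many entries cross the network before being discarded.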

I feel like this is close and I appreciate the guidance.

On Tue, Jul 16, 2013 at 6:25 PM, Josh Elser <[EMAIL PROTECTED]> wrote:

> Instead of keeping all match scores inside of one Value, have you
> considered thinking about your data in terms of edges?
> key:abcd->efgh score, value:88%
> key:abcd->ijkl score, value:90%
> key:efgh->abcd score, value:88%
> key:ijkl->abcd score, value:90%
> If you do go the route of storing both directions in Accumulo, a structure
> like this will likely be much easier to maintain, as you're not trying to
> manage difficult aggregation rules for multiple updates to the matches for
> a single record. Additionally, you should get really good compression (and
> even better in 1.5) when you have large row prefixes (many matches for abcd
> will equate to abcd being stored "once").
> You could also store all of the features for a record in a key which only
> has the record in the row.
> key:abcd feature:foo1
> key:abcd feature:foo2
> etc.
> Also, I'd encourage you to try to upgrade to 1.5.0 if you can, but, if
> not, definitely update to 1.4.3 as it fixes a fair number of bugs. It's as
> simple as stopping Accumulo, and copying in the 1.4.3 Accumulo jar files to
> $ACCUMULO_HOME/lib, and removing the 1.4.1 jars.
> (apparently Dave Marion and I think alike)
> - Josh
> On 07/16/2013 05:28 PM, Marc Reichman wrote:
>> We are using accumulo as a mechanism to store feature data (binary
>> byte[]) for some simple keys which are used for a search algorithm. We
>> currently search by iterating over the feature space using
>> AccumuloRowInputFormat. Results come out of a reducer into HDFS, currently
>> in a SequenceFile.
>> A customer has asked if we can store our results somewhere in our Hadoop
>> infrastructure, and also perform nightly searches of everything vs
>> everything to keep match results up to date.
>> To me, the storage of the results in alternate column families (from the
>> features) would be a way to store the matches alongside the key rows:
>> (key: abcd, features:{...}, matches:{ 'm0': efgh-88%, 'm1': ijkl-90%, ...,
>> 'mN': etc })
>> (key: ijkl, features:{...}, matches:{ 'm0': efgh-88%, 'm1': abcd-90%, ...,
>> 'mN': etc })
>> Match scores are equal between two items regardless of perspective, so
>> a->b is 90% as b->a is 90%.
>> Is there a way to simply add columns to an existing family without having
>> to name them or keep track of how many there are? Am I better off making a
>> column family for each match key and then store score and other fields in
>> columns? Making one column with the key as the name and the score as the
Marc Reichman 2013-07-18, 16:15
Josh Elser 2013-07-18, 16:48
Adam Fuchs 2013-07-17, 19:03
Jeremy Kepner 2013-07-18, 17:32
Frank Smith 2013-07-21, 14:15
Kepner, Jeremy - 0553 - M... 2013-07-21, 18:11