Accumulo >> mail # user >> accumulo for a bi-map?


accumulo for a bi-map?
We are using Accumulo to store feature data (binary byte[]) for some
simple keys, which are used by a search algorithm. We currently search by
iterating over the feature space using AccumuloRowInputFormat. Results
come out of a reducer into HDFS, currently in a SequenceFile.

A customer has asked if we can store our results somewhere in our Hadoop
infrastructure, and also perform nightly searches of everything vs
everything to keep match results up to date.

To me, storing the results in alternate column families (separate from the
features) would be a way to keep the matches alongside the key rows:
(key: abcd, features: {...}, matches: { 'm0': efgh-88%, 'm1': ijkl-90%, ...,
'mN': etc })
(key: ijkl, features: {...}, matches: { 'm0': efgh-88%, 'm1': abcd-90%, ...,
'mN': etc })

Match scores are equal between two items regardless of perspective, so if
a->b is 90%, then b->a is also 90%.

Is there a way to simply add columns to an existing family without having
to name them or keep track of how many there are? Am I better off making a
column family for each match key and then storing the score and other
fields in columns? Or making one column per match under a single family,
with the matched key as the column name and the score as the value?
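
A minimal sketch of the last option (one column per match, with the matched key as the column name and the score as the value). This is a plain in-memory model, not the Accumulo API: the `Table` class and its `put` and `scan_family` methods are hypothetical stand-ins for a sorted key-value table, and the "raw" feature column is made up for illustration. It only shows that columns named by the matched key never need to be numbered or counted in advance.

```python
from bisect import bisect_left, insort


class Table:
    """Hypothetical stand-in for a sorted (row, family, qualifier) -> value table."""

    def __init__(self):
        self._keys = []    # sorted list of (row, family, qualifier) tuples
        self._values = {}  # same tuple -> stored value

    def put(self, row, family, qualifier, value):
        key = (row, family, qualifier)
        if key not in self._values:
            insort(self._keys, key)  # keep keys in sorted order
        self._values[key] = value    # a repeated put overwrites the old value

    def scan_family(self, row, family):
        """Yield (qualifier, value) for every column in one family of one row."""
        start = bisect_left(self._keys, (row, family, ""))
        for key in self._keys[start:]:
            if key[0] != row or key[1] != family:
                break
            yield key[2], self._values[key]


table = Table()
table.put("abcd", "features", "raw", b"\x00\x01")  # "raw" is an assumed column name
table.put("abcd", "matches", "ijkl", 0.90)
table.put("abcd", "matches", "efgh", 0.88)

# All matches for a key come back from one scan, with no 'm0'..'mN' bookkeeping.
abcd_matches = list(table.scan_family("abcd", "matches"))
```

Because the matched key is the column name, writing the same pair again overwrites the old score instead of adding a duplicate, which suits a nightly re-run.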

Ideally I would have some form of bidirectional map, so I could look up
any key and find all of its matches as other keys, and then follow any of
those matches to find their own matches.

One approach is to simply add both sides of the relationship every time
anything matches anything else, which seems a bit wasteful, space-wise.
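
A rough sketch of that "write both sides" approach, using a plain dictionary of dictionaries rather than Accumulo. The helper names `record_match` and `related` are hypothetical, as is the extra key "mnop". It stores each score under both rows, so any key can be looked up directly, at the cost of two entries per match.

```python
from collections import defaultdict


def record_match(matches, key_a, key_b, score):
    """Store the score under both rows so either key can be looked up directly."""
    matches[key_a][key_b] = score
    matches[key_b][key_a] = score


def related(matches, key):
    """Return (direct matches of key, matches of those matches not already seen)."""
    direct = set(matches.get(key, {}))
    second = set()
    for other in direct:
        second.update(matches.get(other, {}))
    second -= direct | {key}  # drop the starting key and the first-hop keys
    return direct, second


matches = defaultdict(dict)
record_match(matches, "abcd", "ijkl", 0.90)
record_match(matches, "abcd", "efgh", 0.88)
record_match(matches, "ijkl", "mnop", 0.75)  # "mnop" is a made-up key

direct, second = related(matches, "abcd")
stored_entries = sum(len(row) for row in matches.values())
```

Here three matches produce six stored entries. That doubling is the space cost in question, traded for being able to read all of a key's matches from its own row.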

Curious if any pre-existing ideas are out there. Currently on Hadoop
1.0.3/Accumulo 1.4.1, not set in (hard) concrete.

Thanks,
Marc
Dave Marion 2013-07-16, 23:16
David Medinets 2013-07-16, 22:55
Josh Elser 2013-07-16, 23:25
Marc Reichman 2013-07-17, 15:26
Marc Reichman 2013-07-18, 16:15
Josh Elser 2013-07-18, 16:48
Adam Fuchs 2013-07-17, 19:03
Jeremy Kepner 2013-07-18, 17:32
Frank Smith 2013-07-21, 14:15
Kepner, Jeremy - 0553 - M... 2013-07-21, 18:11