Accumulo >> mail # user >> accumulo for a bi-map?


Marc Reichman 2013-07-16, 21:28
Dave Marion 2013-07-16, 23:16
David Medinets 2013-07-16, 22:55
Josh Elser 2013-07-16, 23:25
Marc Reichman 2013-07-17, 15:26
Marc Reichman 2013-07-18, 16:15
Josh Elser 2013-07-18, 16:48
Adam Fuchs 2013-07-17, 19:03
Jeremy Kepner 2013-07-18, 17:32
Frank Smith 2013-07-21, 14:15
Re: accumulo for a bi-map?
D4M is a library that provides a nice MATLAB-like interface to Accumulo for quickly prototyping algorithms (see http://www.mit.edu/~kepner/D4M/).

The D4M Schema described in the paper (see http://www.mit.edu/~kepner/pubs/D4Mschema_HPEC2013_Paper.pdf) is an Accumulo schema for indexing and counting all the unique strings in a data set.  The schema is actually independent of D4M and can be implemented in any environment.
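
As a rough illustration of what "indexing and counting all the unique strings" can look like, here is a minimal in-memory sketch in Python. It is not taken from the paper or from the Accumulo API: plain dictionaries stand in for tables, and the names `StringIndex`, `edge`, `edge_t`, `degree`, and the `field|value` column format are assumptions chosen for the example. The idea shown is that each string is written once in a row-to-column table and once in the transposed table, with a separate counter per string.

```python
from collections import defaultdict


def explode(record):
    """Turn {'field': 'value'} pairs into 'field|value' column keys (assumed format)."""
    return [f"{field}|{value}" for field, value in record.items()]


class StringIndex:
    """Hypothetical stand-in for three tables: forward, transposed, and counts."""

    def __init__(self):
        self.edge = defaultdict(dict)    # row id -> {column key: 1}
        self.edge_t = defaultdict(dict)  # column key -> {row id: 1}
        self.degree = defaultdict(int)   # column key -> number of rows containing it

    def insert(self, row_id, record):
        for col in explode(record):
            if col in self.edge[row_id]:
                continue  # already indexed for this row; do not double count
            self.edge[row_id][col] = 1
            self.edge_t[col][row_id] = 1
            self.degree[col] += 1

    def rows_with(self, col):
        """Look up by string: which rows contain this column key?"""
        return sorted(self.edge_t.get(col, {}))

    def columns_of(self, row_id):
        """Look up by row: which column keys does this row contain?"""
        return sorted(self.edge.get(row_id, {}))


index = StringIndex()
index.insert("doc1", {"color": "red", "shape": "square"})
index.insert("doc2", {"color": "red", "shape": "circle"})
```

Here `index.rows_with("color|red")` answers the string-to-rows question and `index.columns_of("doc2")` answers the row-to-strings question, while `index.degree` holds how many rows each string appears in. In a real table the two dictionaries would be two sorted tables keyed in opposite directions.
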
On Jul 21, 2013, at 10:15 AM, Frank Smith wrote:

Jeremy/Adam,

Forgive me for the noob question, but I come to Accumulo from the relational database / enterprise search application developer side of the business, so its real attraction for me has been scaling, mainly for basic data warehousing use cases.

I am reading and digesting the D4M information, and I find it very interesting, but I was curious whether you could explain how this fits with systems like HPCC (or its proprietary predecessor) in terms of approach, applications, etc.  My initial perception (naive perhaps) is that I could run D4M on top of Accumulo to leverage common infrastructure to serve both purposes (traditional data warehouse applications and statistical methods).  I imagine this assessment cuts some corners on a largely valid hypothesis, and I am really trying to find the conditions where it does.

Thanks,

Frank

> Date: Thu, 18 Jul 2013 13:32:25 -0400
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> Subject: Re: accumulo for a bi-map?
>
> Here is a link to the IEEE HPEC paper we wrote up on our schema work:
>
> http://www.mit.edu/~kepner/pubs/D4Mschema_HPEC2013_Paper.pdf
>
> On Wed, Jul 17, 2013 at 03:03:35PM -0400, Adam Fuchs wrote:
> > Marc,
> > You might also want to check out D4M and the table organization that it
> > uses in Accumulo. D4M stores matrixes and their transforms, which is
> > essentially the same concept as a bidirectional map or a bidirected
> > graph: [1] http://www.mit.edu/~kepner/D4M/
> > Cheers,
> > Adam
> >
> > On Tue, Jul 16, 2013 at 5:28 PM, Marc Reichman
> > <[EMAIL PROTECTED]> wrote:
> >
> > We are using accumulo as a mechanism to store feature data (binary
> > byte[]) for some simple keys which are used for a search algorithm. We
> > currently search by iterating over the feature space using
> > AccumuloRowInputFormat. Results come out of a reducer into HDFS,
> > currently in a SequenceFile.
> > A customer has asked if we can store our results somewhere in our Hadoop
> > infrastructure, and also perform nightly searches of everything vs
> > everything to keep match results up to date.
> > To me, the storage of the results in alternate column families (from the
> > features) would be a way to store the matches alongside the key
> > rows:
> > (key: abcd, features:{...}, matches{ 'm0': efgh-88%, 'm1': ijkl-90%, ...,
> > 'mN': etc })
> > (key: ijkl, features:{...}, matches{ 'm0': efgh-88%, 'm1': abcd-90%, ...,
> > 'mN': etc })
> > Match scores are equal between two items regardless of perspective, so
> > if a->b is 90%, then b->a is also 90%.
> > Is there a way to simply add columns to an existing family without
> > having to name them or keep track of how many there are? Am I better off
> > making a column family for each match key and then storing the score and
> > other fields in columns? Or making one column per match under one family,
> > with the key as the name and the score as the value?
> > Ideally I would have some form of bidirectional map so I could look at
> > any key and find all the results as other keys, and find any results to
> > get other matches.
> > One approach is to simply add both sides of the relationship every time
> > anything matches anything else, which seems a bit wasteful, space-wise.
> > Curious if any pre-existing ideas are out there. Currently on hadoop
> > 1.0.3/accumulo 1.4.1, not set in (hard) concrete.
> > Thanks,
> > Marc
> >
> > References
> >
> > Visible links
> > 1. http://www.mit.edu/~kepner/D4M/
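
One of the options Marc raises, a single "matches" family where each column is named by the matched key and holds the score, combined with writing both sides of the relationship, can be sketched as follows. This is only an in-memory illustration under stated assumptions: a dictionary of dictionaries stands in for the table, and `add_match`, `matches_for`, and the `FAMILY` constant are hypothetical names, not Accumulo API calls. Naming the column after the matched key means there is no counter (m0, m1, ...) to track, and re-running a search simply overwrites the previous score.

```python
from collections import defaultdict

FAMILY = "matches"  # assumed column family name for match results


def add_match(table, a, b, score):
    """Record a symmetric match by writing one column in each item's row."""
    table[a][(FAMILY, b)] = score
    table[b][(FAMILY, a)] = score


def matches_for(table, key):
    """Return {matched key: score} for one row, reading only the matches family."""
    return {
        qualifier: score
        for (family, qualifier), score in table.get(key, {}).items()
        if family == FAMILY
    }


# row key -> {(column family, column qualifier): value}
table = defaultdict(dict)
add_match(table, "abcd", "ijkl", 90)
add_match(table, "abcd", "efgh", 88)
add_match(table, "ijkl", "efgh", 88)
```

With this layout, `matches_for(table, "abcd")` and `matches_for(table, "ijkl")` each return the other key with the same score, so either side of a match can be looked up with a single row read. The cost is the one Marc already notes: every match is stored twice.
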