Accumulo >> mail # user >> joining accumulo tables with mapreduce


joining accumulo tables with mapreduce
Hello,

I would like to learn the best way to join three Accumulo tables with a
MapReduce job, and would appreciate feedback on best practices. Here is
pseudocode for what I want to accomplish:
AccumuloInputFormat accepts tableA
Global variable <table_list> has table names: tableB, tableC

In a mapper, for example, you would do something like this:

for each entry in tableA {
  if (entry.family == "abc" && entry.qualifier == "xyz") {
    value = entry.getValue()
    for each table in table_list {
      scan table with (row == entry.rowid && family == "def")
      for each result found in scan {
        write to final_table (rowid, value_as_family, tablename_as_qualifier, result_as_value_string)
      }
    }
  } // end if value found
} // end for loop
This is a simplified version of what I want to do. In my non-MapReduce Java
code I would do this by using a different scanner for each table in the
list. A couple of questions:
- How good or bad is performance when using scanners within mappers?
- If I get one mapper per range in tableA, do I need to reset the scanners,
and how? Or would I set up the scanners in the mapper's setup()? I have no
clue how this will play out, so I am thinking out loud here.
- Any optimization suggestions, or examples of creating join tables/indexes
that I can refer to?
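
To make the join logic itself concrete, here is a minimal plain-Java sketch that runs without a cluster. It is only an illustration under stated assumptions: in-memory TreeMaps stand in for the Accumulo tables, and the names JoinSketch, SEP, key, scan and join are hypothetical helpers, not Accumulo API. The scan helper mimics a range scan restricted to one row and one column family.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical stand-in: each "table" is a sorted map from
// "row<TAB>family<TAB>qualifier" to a value. Not the Accumulo API.
class JoinSketch {
    // Assumed separator; rows, families and qualifiers must not contain it.
    static final String SEP = "\t";

    static String key(String row, String family, String qualifier) {
        return row + SEP + family + SEP + qualifier;
    }

    // Mimics a scan limited to one row and one column family.
    static SortedMap<String, String> scan(TreeMap<String, String> table,
                                          String row, String family) {
        String prefix = row + SEP + family + SEP;
        return table.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    // For every abc:xyz entry in tableA, look up the same row (family "def")
    // in each lookup table and emit {rowid, value, tableName, foundValue}.
    static List<String[]> join(TreeMap<String, String> tableA,
                               Map<String, TreeMap<String, String>> lookups) {
        List<String[]> out = new ArrayList<>();
        for (Map.Entry<String, String> e : tableA.entrySet()) {
            String[] parts = e.getKey().split(SEP, -1);
            if (!parts[1].equals("abc") || !parts[2].equals("xyz")) {
                continue; // not the column we join on
            }
            String row = parts[0];
            String value = e.getValue();
            for (Map.Entry<String, TreeMap<String, String>> t : lookups.entrySet()) {
                for (String found : scan(t.getValue(), row, "def").values()) {
                    out.add(new String[] {row, value, t.getKey(), found});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<String, String> tableA = new TreeMap<>();
        tableA.put(key("r1", "abc", "xyz"), "v1");
        tableA.put(key("r2", "abc", "other"), "v2");

        TreeMap<String, String> tableB = new TreeMap<>();
        tableB.put(key("r1", "def", "q1"), "b1");

        Map<String, TreeMap<String, String>> lookups = new LinkedHashMap<>();
        lookups.put("tableB", tableB);

        for (String[] m : join(tableA, lookups)) {
            System.out.println(String.join(" | ", m));
        }
    }
}
```

In a real mapper, the scan step would presumably be an Accumulo scanner per lookup table, and the emitted tuples would become mutations written to final_table. Whether creating those scanners once in setup() and re-ranging them per row performs well is exactly the open question above.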
Thank you for all suggestions.
Keith Turner 2013-04-17, 14:59
Aji Janis 2013-04-17, 20:43
Keith Turner 2013-04-17, 23:39
Kurt Christensen 2013-05-04, 14:15
David Medinets 2013-04-18, 01:03