Re: joining accumulo tables with mapreduce
Consider using Pig to perform the join. There is an Accumulo-Pig GitHub
project. You can load all three tables and then join them fairly easily; Pig
basically writes the M/R jobs for you.

Using a common row value, I've run many M/R jobs in parallel to load data
into an Accumulo table, which creates an effective join. This technique was
fast enough for my particular project; its effectiveness depends on many
variables.
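
A minimal sketch of that load-side approach, assuming the join key is already
the row id of each source table and that the driver configures
AccumuloInputFormat over the source and AccumuloOutputFormat for the shared
destination; the table names and the "source name as column family" layout
below are hypothetical:

import java.io.IOException;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One of several parallel M/R jobs, each reading a different source table
// and writing into one shared destination table keyed on the common row id.
public class LoadJoinMapper extends Mapper<Key, Value, Text, Mutation> {

  private static final Text JOINED_TABLE = new Text("joined_table"); // hypothetical
  private static final Text SOURCE = new Text("tableA");             // this job's source

  @Override
  protected void map(Key key, Value value, Context context)
      throws IOException, InterruptedException {
    // Re-key the entry on the shared row id; entries written by the jobs
    // over the other source tables land in the same rows, which is the
    // "effective join" described above.
    Mutation m = new Mutation(key.getRow());
    m.put(SOURCE, key.getColumnQualifier(), new Value(value.get()));
    context.write(JOINED_TABLE, m);
  }
}

Run one such job per source table; because every job keys its mutations on
the same row value, the destination table ends up with all sources' columns
side by side in each row.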
On Tue, Apr 16, 2013 at 5:28 PM, Aji Janis <[EMAIL PROTECTED]> wrote:

> Hello,
>
>  I am interested in learning what the best solutions/practices might be for
> joining 3 Accumulo tables by running a MapReduce job, and in getting
> feedback on best practices in general. Here's pseudocode of what I want to
> accomplish:
>
>
> AccumuloInputFormat accepts tableA
> Global variable <table_list> has table names: tableB, tableC
>
> In a mapper, for example, you would do something like this:
>
> for each row in tableA {
>   if (row.family == "abc" && row.qualifier == "xyz") value = getValue()
>   if (foundvalue) {
>     for each table in table_list {
>       scan table with (this rowid && family == "def")
>       for each entry found in scan
>         write to final_table (rowid, value_as_family,
>                               tablename_as_qualifier, entry_as_value_string)
>     }
>   } // end if foundvalue
> } // end for loop
>
>
> This is a simple version of what I want to do. In my non-MapReduce Java
> code I would do this by using a different scanner per table in the
> list. A couple of questions:
>
>
> - how good or bad is performance when using scanners within mappers?
> - if I get one mapper per range in tableA, do I reset scanners? How? Or
> would I set up a scanner in the mapper's setup()? --> I have no clue how
> this will play out, so I'm thinking out loud here.
> - any optimization suggestions, or examples of creating
> join tables/indexes out there that I can refer to?
>
>
> Thank you for all suggestions.
>
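
To the question about scanners inside mappers: a minimal sketch, not a
definitive answer, that opens one Scanner per lookup table in the mapper's
setup() and reuses it for every matching entry. Connection settings are read
from hypothetical property names in the job Configuration, and the
table/column names come from the pseudocode above:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JoinScanMapper extends Mapper<Key, Value, Text, Mutation> {

  private static final String[] LOOKUP_TABLES = {"tableB", "tableC"};
  private static final Text FINAL_TABLE = new Text("final_table");

  private final List<Scanner> scanners = new ArrayList<Scanner>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    try {
      // "join.instance" etc. are hypothetical property names set by the driver.
      Connector connector =
          new ZooKeeperInstance(conf.get("join.instance"), conf.get("join.zookeepers"))
              .getConnector(conf.get("join.user"), conf.get("join.password").getBytes());
      for (String table : LOOKUP_TABLES) {
        // One scanner per lookup table, created once per mapper and re-ranged per row.
        Scanner scanner = connector.createScanner(table, new Authorizations());
        scanner.fetchColumnFamily(new Text("def"));
        scanners.add(scanner);
      }
    } catch (Exception e) {
      throw new IOException("could not connect to Accumulo", e);
    }
  }

  @Override
  protected void map(Key key, Value value, Context context)
      throws IOException, InterruptedException {
    // Only entries with family "abc" and qualifier "xyz" trigger the lookups.
    if (!key.getColumnFamily().toString().equals("abc")
        || !key.getColumnQualifier().toString().equals("xyz")) {
      return;
    }
    Text rowId = key.getRow();
    Text valueAsFamily = new Text(value.get());

    for (int i = 0; i < LOOKUP_TABLES.length; i++) {
      Scanner scanner = scanners.get(i);
      scanner.setRange(new Range(rowId)); // restrict to just this row id
      Mutation m = new Mutation(rowId);
      for (Map.Entry<Key, Value> entry : scanner) {
        m.put(valueAsFamily, new Text(LOOKUP_TABLES[i]),
            new Value(entry.getValue().get()));
      }
      if (m.size() > 0) {
        context.write(FINAL_TABLE, m);
      }
    }
  }
}

The per-entry lookups are the expensive part; a BatchScanner fed a batch of
row ids per mapper would cut round trips, at the cost of buffering the
matching rows before writing them out.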