Accumulo >> mail # user >> joining accumulo tables with mapreduce


Re: joining accumulo tables with mapreduce

How about three scanners, one for each table? Advance the one with the
least value (sort-wise) and combine when they match.
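
Kurt's three-scanner merge can be sketched with plain sorted lists standing in for Accumulo scanners (which also return entries in sorted order). The class and method names below are illustrative, not Accumulo APIs:

```java
import java.util.*;

// Sorted merge across three "scanners": advance whichever iterators are
// behind the largest current key, and combine when all three keys match.
public class MergeJoin {

    // Returns the row IDs present in all three sorted lists.
    public static List<String> join(List<String> a, List<String> b, List<String> c) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0, k = 0;
        while (i < a.size() && j < b.size() && k < c.size()) {
            String x = a.get(i), y = b.get(j), z = c.get(k);
            // Find the largest of the three current keys.
            String max = x;
            if (y.compareTo(max) > 0) max = y;
            if (z.compareTo(max) > 0) max = z;
            if (x.equals(y) && y.equals(z)) {
                out.add(x);          // all three match: combine and emit
                i++; j++; k++;
            } else {
                // Advance the iterators that are behind the largest key.
                if (x.compareTo(max) < 0) i++;
                if (y.compareTo(max) < 0) j++;
                if (z.compareTo(max) < 0) k++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> joined = join(
            Arrays.asList("row1", "row2", "row4", "row7"),
            Arrays.asList("row2", "row3", "row4", "row7"),
            Arrays.asList("row2", "row4", "row5", "row7"));
        System.out.println(joined); // [row2, row4, row7]
    }
}
```

In real Accumulo code each list would be a `Scanner` over one table; the merge itself is the same single sequential pass.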
On 4/17/13 4:43 PM, Aji Janis wrote:
> Keith,
>
>  You hit the problem that I purposely didn't ask about:
> - the Accumulo InputFormat doesn't support multiple tables at this point, and
> - I can't run three mappers in parallel on different tables and
>   combine/send their output to a reducer (that I know of).
>
> If all three tables had the same rowid (eg: rowA exists in table 1, 2
> and 3) then we can write the row from each table w/a different
> family/qualifier/value to a new table. So it will be three mappers run
> sequentially and end result is a join... this is the best I came up
> with so far. If rowids are different across the three tables then I would
> have to reformat my rowid from all three tables (normalize) prior to
> writing the fourth/final table.
>
> Is calling a scanner on the other two tables from within a mapper
> (that takes the first table as the input) bad? Any clues on how that
> could be done in mapreduce?
>
>
> On Wed, Apr 17, 2013 at 10:59 AM, Keith Turner <[EMAIL PROTECTED]
> <mailto:[EMAIL PROTECTED]>> wrote:
>
>     If I am understanding you correctly, you are proposing that for each row a
>     mapper gets to look that row up in two other tables?  This would
>     result in a lot of little round trip RPC calls and random disk
>     accesses.
>
>     I think a better solution would be to read all three tables into your
>     mappers, and do the join in the reduce.  This solution will avoid all
>     of the little RPC calls and do lots of sequential I/O instead of
>     random accesses.  Between the map and reduce, you could track which
>     table each row came from.  Any filtering could be done in the mapper
>     or by iterators.  Unfortunately Accumulo does not have the needed
>     input format for this out of the box.  There is a ticket,
>     ACCUMULO-391.
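
Keith's tag-and-join approach can be sketched with plain collections standing in for the MapReduce framework (class and method names here are illustrative):

```java
import java.util.*;

// Reduce-side join sketch: each "map" input record is tagged with the table
// it came from, the shuffle groups records by row ID, and the "reduce" step
// keeps only rows that appear in all input tables.
public class ReduceSideJoin {

    // "Map + shuffle": rowId -> list of (tableName, value) pairs,
    // grouped the way the shuffle phase would group them.
    public static Map<String, List<String[]>> mapAndShuffle(
            Map<String, Map<String, String>> tables) {
        Map<String, List<String[]>> grouped = new TreeMap<>();
        for (Map.Entry<String, Map<String, String>> table : tables.entrySet()) {
            for (Map.Entry<String, String> row : table.getValue().entrySet()) {
                grouped.computeIfAbsent(row.getKey(), k -> new ArrayList<>())
                       .add(new String[] { table.getKey(), row.getValue() });
            }
        }
        return grouped;
    }

    // "Reduce": keep only row IDs seen in every input table.
    public static Map<String, List<String[]>> reduce(
            Map<String, List<String[]>> grouped, int tableCount) {
        Map<String, List<String[]>> joined = new TreeMap<>();
        for (Map.Entry<String, List<String[]>> e : grouped.entrySet()) {
            if (e.getValue().size() == tableCount) joined.put(e.getKey(), e.getValue());
        }
        return joined;
    }
}
```

Everything is sequential I/O: each table is read once in the map phase, and the join happens entirely in the reducer, which is the point of Keith's suggestion.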
>
>
>
>     On Tue, Apr 16, 2013 at 5:28 PM, Aji Janis <[EMAIL PROTECTED]
>     <mailto:[EMAIL PROTECTED]>> wrote:
>     > Hello,
>     >
>     >  I am interested in learning what the best solutions/practices
>     > might be to join 3 Accumulo tables by running a MapReduce job, and
>     > in getting feedback on best practices. Here's pseudocode of what I
>     > want to accomplish:
>     >
>     >
>     > AccumuloInputFormat accepts tableA
>     > Global variable <table_list> has table names: tableB, tableC
>     >
>     > In a mapper, for example, you would do something like this:
>     >
>     > for each row in tableA
>     >   if (row.family == "abc" && row.qualifier == "xyz") value = getValue()
>     >   if (foundvalue) {
>     >     for each table in table_list
>     >       scan table with (this rowid && family == "def")
>     >       for each entry found in scan
>     >         write to final_table (rowid, value_as_family,
>     >           tablename_as_qualifier, entry_as_value_string)
>     >   } //end if foundvalue
>     > //end for loop
>     >
>     >
>     > This is a simple version of what I want to do. In my non-MapReduce
>     > Java code I would do this by using a different scanner per table
>     > in the list. A couple of questions:
>     >
>     >
>     > - how bad/good is performance when using scanners within mappers?
>     > - if I get one mapper per range in tableA, do I reset scanners? how?
>     >   or would I set up a scanner in the setup() of the mapper? --> I have
>     >   no clue how this will play out, so thinking out loud here.
>     > - any optimization suggestions? or examples of creating
>     >   join_tables/indexes out there that I can refer to?
>     >
>     >
>     > Thank you for all suggestions.
>
>
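
The lookup-style join in Aji's pseudocode can be sketched in plain Java, with maps standing in for Accumulo tables and scanners (all names here are illustrative):

```java
import java.util.*;

// Lookup join: iterate the driving table (tableA), and for each qualifying
// row, look up the same row ID in the other tables and write the combined
// entries to a final table. In a real mapper each lookup would be a scan
// restricted to that single row ID -- the per-row round trip Keith warns
// about.
public class LookupJoin {

    // tableA: rowid -> value; otherTables: tableName -> (rowid -> entry).
    public static Map<String, String> join(Map<String, String> tableA,
                                           Map<String, Map<String, String>> otherTables) {
        Map<String, String> finalTable = new TreeMap<>();
        for (Map.Entry<String, String> row : tableA.entrySet()) {
            String rowid = row.getKey();
            String value = row.getValue();              // stands in for getValue()
            for (Map.Entry<String, Map<String, String>> t : otherTables.entrySet()) {
                String entry = t.getValue().get(rowid); // stands in for the scan
                if (entry != null) {
                    // Key mirrors (rowid, value_as_family, tablename_as_qualifier).
                    finalTable.put(rowid + ":" + value + ":" + t.getKey(), entry);
                }
            }
        }
        return finalTable;
    }
}
```

This mirrors the pseudocode's output schema; the trade-off against the merge or reduce-side approaches above is that every driving row costs one random lookup per secondary table.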

--

Kurt Christensen
P.O. Box 811
Westminster, MD 21158-0811

------------------------------------------------------------------------
"One of the penalties for refusing to participate in politics is that
you end up being governed by your inferiors."