Accumulo, mail # user - joining accumulo tables with mapreduce


Aji Janis 2013-04-16, 21:28
Keith Turner 2013-04-17, 14:59
Re: joining accumulo tables with mapreduce
Aji Janis 2013-04-17, 20:43
Keith,

 You hit on the problem that I purposely didn't ask about.
- The Accumulo input format doesn't support multiple tables at this point, and
- I can't run three mappers in parallel on different tables and combine/send
their output to a reducer (that I know of).

If all three tables had the same rowid (eg: rowA exists in table 1, 2 and
3) then we can write the row from each table w/a different
family/qualifier/value to a new table. So it will be three mappers run
sequentially and end result is a join... this is the best I came up with so
far. If rowids are different across the three tables then I would have to
reformat my rowid from all three tables (normalize) prior to writing the
fourth/final table.
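
A rough sketch of that idea in plain Java, with sorted in-memory maps standing in for the Accumulo tables. Nothing here uses the Accumulo or Hadoop APIs; the class name, the method names and the lower-casing normalization are all made up for illustration, and a real job would write Mutations to the final table instead.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: in-memory maps stand in for Accumulo tables.
// All names are hypothetical; no Accumulo or Hadoop API is used.
class SequentialJoinSketch {

    // One pass over a single source table: copy each row's value into the
    // final table under a qualifier named after the source table.
    static void copyInto(Map<String, Map<String, String>> finalTable,
                         String sourceName,
                         Map<String, String> sourceRows) {
        for (Map.Entry<String, String> e : sourceRows.entrySet()) {
            String rowId = normalize(e.getKey());
            finalTable.computeIfAbsent(rowId, k -> new TreeMap<>())
                      .put(sourceName, e.getValue());
        }
    }

    // Placeholder normalization (an assumption) so that differently
    // formatted row ids from the three tables land on the same row.
    static String normalize(String rowId) {
        return rowId.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> finalTable = new TreeMap<>();
        // Three passes run one after another, one per source table.
        copyInto(finalTable, "table1", Map.of("rowA", "v1"));
        copyInto(finalTable, "table2", Map.of("ROWA", "v2"));
        copyInto(finalTable, "table3", Map.of(" rowa ", "v3"));
        System.out.println(finalTable);
    }
}
```

Because every pass writes to the same normalized rowid, the join falls out of the final table's row grouping and no pass needs to read the other tables.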

Is calling a scanner on the other two tables from within a mapper (that
takes the first table as the input) bad? Any clues on how that could be
done in mapreduce?
On Wed, Apr 17, 2013 at 10:59 AM, Keith Turner <[EMAIL PROTECTED]> wrote:

> If I am understanding you correctly, you are proposing for each row a
> mapper gets to look that row up in two other tables?  This would
> result in a lot of little round trip RPC calls and random disk
> accesses.
>
> I think a better solution would be to read all three tables into your
> mappers, and do the join in the reduce.  This solution will avoid all
> of the little RPC calls and do lots of sequential I/O instead of
> random accesses.  Between the map and reduce, you could track which
> table each row came from.  Any filtering could be done in the mapper
> or by iterators.  Unfortunately Accumulo does not have the needed
> input format for this out of the box.  There is a ticket,
> ACCUMULO-391.
>
>
>
> On Tue, Apr 16, 2013 at 5:28 PM, Aji Janis <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> >  I am interested in learning what the best solution/practices might be to
> > join 3 accumulo tables by running a map reduce job. Interested in getting
> > feedback on best practices and such. Here's pseudocode of what I want to
> > accomplish:
> >
> >
> > AccumuloInputFormat accepts tableA
> > Global variable <table_list> has table names: tableB, tableC
> >
> > In a mapper, for example, you would do something like this:
> >
> > for each row in TableA {
> >   if (row.family == "abc" && row.qualifier == "xyz") value = getValue()
> >   if (foundvalue) {
> >
> >     for each table in table_list
> >       scan table with (this rowid && family = "def")
> >       for each entry found in scan
> >         write to final_table (rowid, value_as_family,
> >           tablename_as_qualifier, entry_as_value_string)
> >
> >   }//end if foundvalue
> >
> > }//end for loop
> >
> >
> > This is a simple version of what I want to do. In my non-mapreduce java code
> > I would do this by using a different scanner per table in the list.
> > Couple questions:
> >
> >
> > - how bad/good is performance when using scanners within mappers?
> > - if I get one mapper per range in tableA, do I reset scanners? how? or
> > would I set up a scanner in the setup() of mapper ? --> i have no clue how
> > this will play out so thinking out loud here.
> > - any optimization suggestions? or examples of creating join_tables/indexes
> > out there that I can refer to?
> >
> >
> > Thank you for all suggestions.
>
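
For reference, the reduce-side join suggested in the quoted reply (read every table in the mappers, tag each row with its source table, join in the reducer) can be sketched in plain Java. This is only a simulation under stated assumptions: in-memory maps stand in for the tables and for the shuffle, the class and method names are hypothetical, and the inner-join rule (keep a row only if every table contributed) is my own choice, not something specified in the thread. No Hadoop or Accumulo API is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: simulates map, shuffle and reduce in memory.
// All names are hypothetical; no Hadoop or Accumulo API is used.
class ReduceSideJoinSketch {

    // A map-output value tagged with the table it came from.
    static final class Tagged {
        final String table;
        final String value;
        Tagged(String table, String value) {
            this.table = table;
            this.value = value;
        }
    }

    // "Map" phase: emit (rowId, tagged value) for every row of one table.
    // Grouping by key in the shared map plays the role of the shuffle.
    static void map(String table, Map<String, String> rows,
                    Map<String, List<Tagged>> shuffle) {
        for (Map.Entry<String, String> e : rows.entrySet()) {
            shuffle.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                   .add(new Tagged(table, e.getValue()));
        }
    }

    // "Reduce" phase: for each row id, collect values by source table and
    // keep the row only if all tables contributed (inner join, an assumption).
    static Map<String, Map<String, String>> reduce(
            Map<String, List<Tagged>> shuffle, int tableCount) {
        Map<String, Map<String, String>> joined = new TreeMap<>();
        for (Map.Entry<String, List<Tagged>> e : shuffle.entrySet()) {
            Map<String, String> byTable = new TreeMap<>();
            for (Tagged t : e.getValue()) {
                byTable.put(t.table, t.value);
            }
            if (byTable.size() == tableCount) {
                joined.put(e.getKey(), byTable);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, List<Tagged>> shuffle = new TreeMap<>();
        map("tableA", Map.of("row1", "a1", "row2", "a2"), shuffle);
        map("tableB", Map.of("row1", "b1"), shuffle);
        map("tableC", Map.of("row1", "c1", "row2", "c2"), shuffle);
        System.out.println(reduce(shuffle, 3));
    }
}
```

Each table is read once and sequentially, and the per-row work happens after grouping, which is the point of the suggestion: no per-row lookups against the other tables.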
Keith Turner 2013-04-17, 23:39
Kurt Christensen 2013-05-04, 14:15
David Medinets 2013-04-18, 01:03