Accumulo >> mail # user >> Cost of scanner usage in a MapReduce mapper?


Re: Cost of scanner usage in a MapReduce mapper?

For clarification: are you trying to create DAOs from the Key/Value pairs fed
to a Mapper by AccumuloInputFormat, or are you trying to process a
different data set while simultaneously querying through your DAOs?

> Today I had a really nice conversation with billie and vines on #accumulo.
> This email is a follow-up to that conversation, with a little more
> context on my problem.
>
> We have an application that we've developed independently from MapReduce.
> To get away from the low-level keys and values of Accumulo, we quickly
> made a series of DAOs that each take an Accumulo Instance as a constructor
> argument.  These DAOs internally create the necessary scanners and return
> domain-specific objects.  I imagine this is a common practice.
>
> Now we've got a feature that needs to operate on all the data, so we're
> doing some MapReduce.  From the discussions on #accumulo, I think I now
> understand the architecture of AccumuloInputFormat.  What I didn't discuss
> was whether it is reasonable (or unreasonable, because of the performance
> cost) to use one of our DAOs within a mapper.
>
> The mappers need to operate per row, and our system has potentially
> billions of rows.  With my DAOs, I can reuse the same Accumulo instance,
> but each call will create a new scanner from my instance, so a MapReduce
> job using a DAO in the mappers will potentially create billions of
> scanners over the course of operation.  However, the way we've designed
> these DAOs, it's easy to make sure all accesses are tied to the row the
> mapper is tasked with (in an attempt to maintain data locality).
>
> By comparison, I believe AccumuloInputFormat will create roughly as many
> Accumulo scanners as there are tablet servers, so dramatically fewer.
>
> Our current thinking is that creating billions of scanners through these
> DAO accesses might cost too much in performance, but we're not completely
> sure that's the case, given the kind of caching Accumulo does in its
> clients.
>
> If the performance cost is indeed too high, then we're going to have to
> deal with the abstraction challenge of avoiding code duplication between
> our DAOs and our MapReduce jobs.
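For what it's worth, one common way to sidestep the billions-of-scanners problem without duplicating DAO code is to factor the key-to-domain-object conversion out of the DAO into a pure function, so the scanner-backed path and the mapper path share it. A minimal sketch, with a hypothetical `Widget` domain object (the names are illustrative, not from this thread) and a row's columns modeled as a qualifier-to-value map, since in real code that map would be filled either from a `Scanner`'s `Entry<Key, Value>` iterator or from the Key/Value pairs AccumuloInputFormat already hands the mapper:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical domain object; field names are illustrative.
class Widget {
    final String rowId;
    final String name;
    final long size;

    Widget(String rowId, String name, long size) {
        this.rowId = rowId;
        this.name = name;
        this.size = size;
    }

    // Pure conversion from one row's columns (qualifier -> value) to a Widget.
    // Both the scanner-backed DAO and the MapReduce mapper can call this,
    // so the translation logic lives in exactly one place.
    static Widget fromColumns(String rowId, SortedMap<String, String> columns) {
        return new Widget(rowId,
                          columns.getOrDefault("name", ""),
                          Long.parseLong(columns.getOrDefault("size", "0")));
    }
}
```

Under this split, the DAO keeps its `createScanner` call but shrinks to "fetch the columns for a row, then call `fromColumns`", while the mapper skips the scanner entirely and builds the same map from the entries it is already given, so no per-row scanners are created.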
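One more note on the per-row requirement: with plain AccumuloInputFormat the mapper is invoked once per Key/Value entry, not once per row, so operating per row means detecting row boundaries yourself (or letting `AccumuloRowInputFormat` or the `WholeRowIterator` do it for you). Because Accumulo returns entries sorted by key, all entries for a row arrive consecutively, and the boundary detection is just "row changed, start a new group". A self-contained sketch of that grouping, with `{row, qualifier, value}` string triples standing in for real `Key`/`Value` objects:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Single-pass grouping of sorted entries into per-row column maps.
// Correct only because entries for one row are consecutive in sorted order.
class RowGrouper {
    static List<Map.Entry<String, Map<String, String>>> group(List<String[]> entries) {
        List<Map.Entry<String, Map<String, String>>> rows = new ArrayList<>();
        String currentRow = null;
        Map<String, String> columns = null;
        for (String[] e : entries) {          // e = {row, qualifier, value}
            if (!e[0].equals(currentRow)) {   // row boundary: start a new group
                currentRow = e[0];
                columns = new LinkedHashMap<>();
                rows.add(Map.entry(currentRow, columns));
            }
            columns.put(e[1], e[2]);
        }
        return rows;
    }
}
```

In a real job this loop would live across successive `map()` calls (carrying `currentRow` in the Mapper instance and flushing the last group in `cleanup()`), which is essentially what the row-oriented input format does for you.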