Re: Cost of scanner usage in a MapReduce mapper?
skuehn@... 2012-11-01, 08:20
For clarification, are you trying to create DAOs from the Key/Value pairs fed
to a Mapper by AccumuloInputFormat, or are you trying to process a
different data set while simultaneously querying your DAOs?
> Today I had a really nice conversation with billie and vines on #accumulo.
> This email is a follow-up to that conversation, and there's a little more
> context for my problem here.
> We have an application that we've developed independently from MapReduce. To
> get away from the low-level keys and values of Accumulo, we quickly built a
> series of DAOs that each take an Accumulo Instance as a constructor
> argument. These DAOs internally create the necessary scanners and return
> domain-specific objects. I imagine this is a common practice.
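> For concreteness, a DAO along these lines might look roughly like the sketch
> below. The names (WidgetDao, the "widgets" table, the qualifier-to-value
> layout) are hypothetical; only the shape matches what we actually do: the
> Instance is supplied once in the constructor, and every query creates a
> fresh Scanner.
>
> import java.util.Map.Entry;
> import java.util.SortedMap;
> import java.util.TreeMap;
>
> import org.apache.accumulo.core.client.Connector;
> import org.apache.accumulo.core.client.Instance;
> import org.apache.accumulo.core.client.Scanner;
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Range;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.security.Authorizations;
>
> public class WidgetDao {
>     private final Connector connector;
>
>     // The Instance (cluster connection info) is supplied once and reused.
>     public WidgetDao(Instance instance, String user, byte[] password)
>             throws Exception {
>         this.connector = instance.getConnector(user, password);
>     }
>
>     // Every call creates a new Scanner scoped to a single row and maps the
>     // entries into a domain-level view (here just qualifier -> value).
>     public SortedMap<String, String> findByRow(String rowId) throws Exception {
>         Scanner scanner = connector.createScanner("widgets", new Authorizations());
>         scanner.setRange(new Range(rowId));
>         SortedMap<String, String> widget = new TreeMap<String, String>();
>         for (Entry<Key, Value> entry : scanner) {
>             widget.put(entry.getKey().getColumnQualifier().toString(),
>                        entry.getValue().toString());
>         }
>         return widget;
>     }
> }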
> Now we've got a feature that needs to operate on all the data, so we're
> writing some MapReduce. I think I understand the architecture of
> AccumuloInputFormat now from the discussions on #accumulo. What I didn't
> discuss is whether it is reasonable (or unreasonable because of the
> performance cost) to try to use one of our DAOs within a mapper.
> The mappers need to operate per row, and our system has potentially billions
> of rows. With my DAOs, I can reuse the same Accumulo instance, but each call
> will create a new scanner from my instance, so a MapReduce job using a DAO in
> the mappers will potentially create billions of scanners over the course of
> its operation. However, the way we've designed these DAOs, it's easy to make
> sure all accesses are tied to the row the mapper is tasked with (in an
> attempt to maintain data locality).
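> A rough sketch of what that mapper would look like follows. Again the names
> are hypothetical, and it assumes the input has been set up so that each
> map() call corresponds to one row (for example, by configuring something
> like FirstEntryInRowIterator on the input scan):
>
> import java.io.IOException;
>
> import org.apache.accumulo.core.client.Instance;
> import org.apache.accumulo.core.client.ZooKeeperInstance;
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Value;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapreduce.Mapper;
>
> public class WidgetMapper extends Mapper<Key, Value, Text, Text> {
>     private WidgetDao dao;
>
>     @Override
>     protected void setup(Context context) throws IOException {
>         // Instance name, ZooKeepers, and credentials would really come from
>         // the job configuration; they are hardcoded here for brevity.
>         Instance instance = new ZooKeeperInstance("myInstance", "zk1:2181");
>         try {
>             dao = new WidgetDao(instance, "user", "secret".getBytes());
>         } catch (Exception e) {
>             throw new IOException(e);
>         }
>     }
>
>     @Override
>     protected void map(Key key, Value value, Context context)
>             throws IOException, InterruptedException {
>         try {
>             // One DAO call per row means one brand-new Scanner per row.
>             String rowId = key.getRow().toString();
>             context.write(new Text(rowId),
>                           new Text(dao.findByRow(rowId).toString()));
>         } catch (Exception e) {
>             throw new IOException(e);
>         }
>     }
> }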
> By comparison, I believe the AccumuloInputFormat will create about as many
> Accumulo scanners as there are tablet servers, which is dramatically fewer.
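> For reference, the input-format-driven path we're comparing against is just
> the standard job setup, sketched below. The static configuration method
> names have changed across Accumulo versions (older releases use setInputInfo
> instead), so treat the exact calls here as illustrative:
>
> import org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat;
> import org.apache.accumulo.core.client.security.tokens.PasswordToken;
> import org.apache.accumulo.core.security.Authorizations;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.mapreduce.Job;
>
> public class WidgetJob {
>     public static void main(String[] args) throws Exception {
>         Job job = new Job(new Configuration(), "widget-scan");
>         job.setJarByClass(WidgetJob.class);
>
>         // AccumuloInputFormat opens its scanners per input split rather
>         // than per row, so the mappers never create scanners themselves.
>         job.setInputFormatClass(AccumuloInputFormat.class);
>         AccumuloInputFormat.setZooKeeperInstance(job, "myInstance", "zk1:2181");
>         AccumuloInputFormat.setConnectorInfo(job, "user", new PasswordToken("secret"));
>         AccumuloInputFormat.setInputTableName(job, "widgets");
>         AccumuloInputFormat.setScanAuthorizations(job, new Authorizations());
>
>         job.setMapperClass(WidgetMapper.class);
>         job.setNumReduceTasks(0);
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>     }
> }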
> Our current thinking is that creating billions of scanners with these DAOs
> might cost too much in performance, but we're not completely sure that's the
> case, given the kind of caching Accumulo does in its clients. If the
> performance cost is indeed too high, then we're going to have to wrestle with
> the abstraction challenge of trying to avoid code duplication between our
> DAOs and our MapReduce jobs.