On Wed, Oct 31, 2012 at 8:35 PM, Sukant Hajra <[EMAIL PROTECTED]> wrote:
> Today I had a really nice conversation with billie and vines on #accumulo.
> This email is a followup to that conversation, and there's a little more
> context of my problem here.
> We have an application that we've developed independently from MapReduce. To
> get away from the low-level keys and values of Accumulo, we quickly made a
> series of DAOs that each take in an Accumulo Instance as a constructor
> argument. These DAOs internally create the necessary scanners and return
> domain-specific objects. I imagine this is a common practice.
> Now, we've got a feature that needs to operate on all the data, so we're
> writing some MapReduce. I think I understand now the architecture of
> AccumuloInputFormat from discussions on #accumulo. What I didn't discuss was
> whether it was reasonable (or not reasonable because of the performance cost)
> to try to use one of our DAOs within a mapper.
> The mappers need to operate per row, and our system has potentially billions
> of rows. With my DAOs, I can reuse the same Accumulo instance, but each call
> will create a new scanner from my instance, so a MapReduce job using a DAO in
> the mappers will potentially create billions of scanners over the course of
> operation. However, the way we've designed these DAOs, it's easy to make sure
> all accesses are tied to the row the mapper is tasked with (in an attempt to
> maintain data locality).
I would strongly recommend extending the InputFormatBase. There are a few
examples of doing this, namely the AccumuloInputFormat,
AccumuloRowInputFormat, and ChunkInputFormat in the filedata example.
However, instead of taking an Instance, your DAO would take a Scanner that
has already been initialized, and read to the end of the current row
(rather, stop iterating over the scanner when the next row has been reached,
so that the same scanner can be passed to a new DAO).
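The row-boundary handoff described above could be sketched roughly like this. Note that Entry, PeekingIterator, and RowDao here are simplified stand-in types, not the real Accumulo Key/Value or Scanner API; the point is just the pattern of a DAO consuming exactly one row from a shared, sorted iterator and leaving it positioned at the start of the next row:

```java
import java.util.*;

// Hypothetical stand-in for a scanned Accumulo entry (row, column, value).
record Entry(String row, String colQualifier, String value) {}

// A small peeking wrapper so a DAO can detect the row boundary
// without consuming the first entry of the next row.
class PeekingIterator<T> implements Iterator<T> {
    private final Iterator<T> inner;
    private T next;
    PeekingIterator(Iterator<T> inner) { this.inner = inner; advance(); }
    private void advance() { next = inner.hasNext() ? inner.next() : null; }
    public T peek() { return next; }
    public boolean hasNext() { return next != null; }
    public T next() { T t = next; advance(); return t; }
}

// DAO that reads exactly one row from the shared iterator, then stops,
// so the same iterator can be handed to the DAO for the next row.
class RowDao {
    final String row;
    final Map<String, String> columns = new LinkedHashMap<>();
    RowDao(PeekingIterator<Entry> scan) {
        row = scan.peek().row();
        while (scan.hasNext() && scan.peek().row().equals(row)) {
            Entry e = scan.next();
            columns.put(e.colQualifier(), e.value());
        }
    }
}
```

A mapper built this way holds one scanner (the iterator above), and constructs one DAO per row against it, rather than opening a new scanner per DAO, which is the reuse the reply is suggesting.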
> By comparison, I feel the AccumuloInputFormat will create about as many
> Accumulo scanners as there are tablet servers, so dramatically fewer.
> Our current thinking is that creating billions of scanners with these DAOs
> might cost too much in performance, but we're not completely sure this is the
> case with respect to the kind of caching Accumulo does with its clients.
> If the performance cost is indeed too high, then we're going to have to deal
> with the abstraction challenge of trying to avoid code duplication between
> our DAOs and our MapReduce jobs.
> Thanks for your feedback,