I am running a search job that matches a single piece of query data against
potential targets in an Accumulo table, using AccumuloRowInputFormat. In most
cases, the query data itself lives in the same Accumulo table.
To date, my client program has pulled the query data from Accumulo with a
basic scanner, written it to HDFS, and added the file(s) in question to the
distributed cache. My mapper then reads the data from the distributed cache
into a private class member in its setup method and uses it in all of its
map() calls.
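For context, the current arrangement looks roughly like this. The class and field names are my own placeholders, not actual code; the input types match what AccumuloRowInputFormat emits:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map.Entry;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.util.PeekingIterator;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SearchMapper
        extends Mapper<Text, PeekingIterator<Entry<Key, Value>>, Text, Text> {

    // Query data loaded once per mapper in setup() and reused by every map() call.
    private final List<String> queryData = new ArrayList<>();

    @Override
    protected void setup(Context context)
            throws IOException, InterruptedException {
        // The file(s) the client program staged in HDFS and added to the cache.
        URI[] cached = context.getCacheFiles();
        FileSystem fs = FileSystem.get(context.getConfiguration());
        for (URI uri : cached) {
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    fs.open(new Path(uri)), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    queryData.add(line);
                }
            }
        }
    }
}
```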
It occurred to me that I may be spending too much overhead on the client side
doing this, and that my job-submission performance is slow because of all the
HDFS I/O and distributed-cache handling for arguably small files, in the
100-200k range at most.
Does it seem reasonable to skip the client-side preparation and have each
mapper pull the data directly from Accumulo in its setup method instead?
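Concretely, I am imagining something like the following in place of the cache handling, with the connection details passed through the job Configuration. All the "query.*" property names here are made up for illustration:

```java
import java.io.IOException;
import java.util.Map.Entry;

import org.apache.accumulo.core.client.AccumuloException;
import org.apache.accumulo.core.client.AccumuloSecurityException;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.TableNotFoundException;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;
import org.apache.hadoop.conf.Configuration;

// Inside the mapper: scan the query row straight out of Accumulo in setup()
// instead of reading it back from the distributed cache.
@Override
protected void setup(Context context)
        throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    try {
        Connector conn = new ZooKeeperInstance(
                conf.get("query.instance"), conf.get("query.zookeepers"))
                .getConnector(conf.get("query.user"),
                        new PasswordToken(conf.get("query.password")));
        Scanner scanner = conn.createScanner(
                conf.get("query.table"), new Authorizations());
        scanner.setRange(Range.exact(conf.get("query.row")));
        for (Entry<Key, Value> e : scanner) {
            // Same private member the map() calls already use.
            queryData.add(e.getValue().toString());
        }
    } catch (AccumuloException | AccumuloSecurityException
            | TableNotFoundException e) {
        throw new IOException(e);
    }
}
```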
Questions related to this:
1. Does this put a lot of pressure on the tabletserver that holds the data,
with many mappers hitting it at once during setup in the first wave?
2. Is there any way whatsoever for the mapper to reuse the client connection
the job has already established? Or would I have to do the usual setup with
my own ZooKeeper connection, and if so, does that make things significantly
worse?