I am investigating how well Accumulo handles MapReduce jobs. I am
interested in hearing about any known issues from anyone running MapReduce
with Accumulo as their source and sink. Specifically, I would like to hear
your thoughts on the following:
Assume the cluster has 50 nodes.
Accumulo is running on three nodes.
Solr is running on three nodes.
1. How many concurrent mutations can Accumulo handle? More details on how
this works would be extremely helpful.
2. Is there a delay between when MapReduce writes data to a table and when
the data is available for reads?
3. How are concurrent mutations to the same row handled (say, from
different mappers/reducers), since Accumulo isn't transactional?
4. I am trying to index some Accumulo data in Solr. Are there any known
issues on the Accumulo end? On the Solr end? How does one shard vs. multiple
shards affect the MR job?
5. Should I have more Accumulo/Solr nodes (i.e., an instance on each node in
the cluster)? Is that necessary? Are there workarounds?
6. Normally I have log4j statements all over the Java job. Can I still use
them with MapReduce?
I apologize if any of these questions do not belong on this mailing list
(and please point me to where I can ask them, if possible). I am trying to
gather enough information to decide whether this is a good approach for me
and what level of effort it needs, so I realize these are a lot of questions. I
very much appreciate any and all feedback. Thank you for your time in
John Vines 2013-03-02, 20:49
Ed Kohlwey 2013-03-03, 17:32
Aji Janis 2013-03-03, 19:14
Ed Kohlwey 2013-03-04, 05:18