Accumulo >> mail # user >> Mapreduce, Indexing and Logging


Aji Janis 2013-03-02, 20:11
John Vines 2013-03-02, 20:49
Ed Kohlwey 2013-03-03, 17:32
Aji Janis 2013-03-03, 19:14
Re: Mapreduce, Indexing and Logging
As John mentioned, the specs of the three nodes will have a significant
effect. Since you already have hardware selected, I would build it, run it,
and then add nodes if performance is low. You can also use something like the
Accumulo file output format to write directly to map files that you
subsequently import, which makes the performance of MapReduce
mostly independent of Accumulo performance.
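
The file-based approach above can be sketched in a few lines. This is a toy model in Python, not Accumulo code: `SortedFileWriter` and `write_bulk_file` are hypothetical names, and the sketch assumes the file writer expects keys in sorted order, which a MapReduce shuffle normally provides to reducers.

```python
# Toy model (not Accumulo code): a job writes sorted key/value pairs to a
# file that is imported later, so the job never talks to a live server.
from typing import Iterable, List, Tuple

Key = Tuple[str, str, str]   # (row, column family, column qualifier)
Entry = Tuple[Key, str]      # (key, value)


class SortedFileWriter:
    """Hypothetical stand-in for a sorted-file writer; rejects out-of-order keys."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def append(self, key: Key, value: str) -> None:
        # Assumption: the file format requires keys in non-decreasing order.
        if self.entries and key < self.entries[-1][0]:
            raise ValueError(f"key {key} written out of order")
        self.entries.append((key, value))


def write_bulk_file(records: Iterable[Entry]) -> List[Entry]:
    """Sort records by key, then write them all to one in-memory 'file'."""
    writer = SortedFileWriter()
    for key, value in sorted(records, key=lambda entry: entry[0]):
        writer.append(key, value)
    return writer.entries
```

In this model the only work done at write time is sorting and appending, which is why job throughput does not depend on how quickly the tablet servers can absorb mutations.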

On Mar 3, 2013 11:15 AM, "Aji Janis" <[EMAIL PROTECTED]> wrote:
>
> John and Ed thank you both for your responses.
>
> Using Solr for search is a requirement. When we process data, there's quite
> a bit of information we are interested in indexing (dates, locations, etc.),
> and we use Solr for that. All the data will be stored in Accumulo after
> processing and then indexed in Solr. But since I am trying to do all the
> processing in MapReduce, I was interested in hearing about any limitations
> there might be if N (>=60) mappers or reducers try to put things in Solr
> after processing and before writing to Accumulo.
>
>
>
> On Sun, Mar 3, 2013 at 12:32 PM, Ed Kohlwey <[EMAIL PROTECTED]> wrote:
>>
>> With respect to indexing, what are you trying to achieve? I have not
>> used Solr with Accumulo but have done indexing directly in Accumulo,
>> leveraging Lucene libraries as appropriate. You can get very good
>> performance specific to your domain by doing so, and it's less O&M
>> overhead. Of course, then you need to learn all about indexing, so there's
>> a bit of a tradeoff.
>>
>> On Mar 2, 2013 12:50 PM, "John Vines" <[EMAIL PROTECTED]> wrote:
>>>
>>> 1. This is quite variable. It depends on your hardware specs, primarily
>>> CPU and disk throughput. It also depends on how your system is configured
>>> for these resources and on your typical mutation size. How your mutations
>>> are distributed is another factor.
>>> 2. Under the hood, the output format uses a BatchWriter. There is a
>>> guarantee that once a flush comes back from the BatchWriter, the data is
>>> available. Unless flush is called explicitly, the BatchWriter will flush
>>> whenever half of its capacity is full, or when idle for a short period (I
>>> want to say 3 seconds, but I could be mistaken).
>>> 3. If the 2 mutations don't intersect at all, then there's no issue. If
>>> they have identical columns, then whichever one has the newest timestamp
>>> will come up first. If you are explicitly setting timestamps or they arrive
>>> at the same time, the outcome is non-deterministic.
>>> 4. I'm going to defer this question to someone else.
>>> 5. Ideally each datanode should be a tserver, and each will also be a
>>> tasktracker. This will help ensure data locality so you can get around any
>>> network boundaries/overhead.
>>> 5. I don't see why not. There are a few log4j statements in the
>>> Accumulo client, so it would actually make it easier for you to deal with
>>> them there too.
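
Points 2 and 3 above can be illustrated with a small toy model. This is plain Python, not the Accumulo client API: `ToyTable` and `ToyBatchWriter` are hypothetical names, capacity is counted in mutations rather than bytes, and the idle-time flush is left out. It only mirrors the behavior described in the list: buffered writes are invisible until a flush, a flush triggers at half capacity, and the newest timestamp wins for identical columns.

```python
# Toy model (not the Accumulo API) of buffered writes and timestamp ordering.
from typing import Dict, List, Optional, Tuple

Cell = Tuple[str, str]                 # (row, column)
Mutation = Tuple[str, str, int, str]   # (row, column, timestamp, value)


class ToyTable:
    """In-memory stand-in for a table: keeps only the newest version per cell."""

    def __init__(self) -> None:
        self._cells: Dict[Cell, Tuple[int, str]] = {}

    def apply(self, mutation: Mutation) -> None:
        row, column, timestamp, value = mutation
        current = self._cells.get((row, column))
        # Strictly newer timestamps win. On a tie this toy keeps the first
        # arrival; the thread describes the real outcome as non-deterministic.
        if current is None or timestamp > current[0]:
            self._cells[(row, column)] = (timestamp, value)

    def read(self, row: str, column: str) -> Optional[str]:
        entry = self._cells.get((row, column))
        return entry[1] if entry else None


class ToyBatchWriter:
    """Buffers mutations and flushes once the buffer reaches half of capacity."""

    def __init__(self, table: ToyTable, capacity: int) -> None:
        self._table = table
        self._capacity = capacity      # assumption: counted in mutations
        self._buffer: List[Mutation] = []

    def add(self, mutation: Mutation) -> None:
        self._buffer.append(mutation)
        if len(self._buffer) * 2 >= self._capacity:
            self.flush()

    def flush(self) -> None:
        # Data becomes readable only after the flush has completed.
        for mutation in self._buffer:
            self._table.apply(mutation)
        self._buffer.clear()
```

A mapper that needs its writes to be readable by a later step would call `flush()` explicitly in this model rather than rely on the half-capacity trigger.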
>>>
>>> John
>>>
>>>
>>> On Sat, Mar 2, 2013 at 3:11 PM, Aji Janis <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I am investigating how well Accumulo will handle MapReduce jobs. I am
>>>> interested in hearing about any known issues from anyone running
>>>> MapReduce with Accumulo as their source and sink. Specifically, I want to
>>>> hear your thoughts about the following:
>>>>
>>>> Assume cluster has 50 nodes.
>>>> Accumulo is running on three nodes
>>>> Solr is on three nodes
>>>>
>>>>
>>>> 1. How many concurrent mutations can Accumulo handle? More details on
>>>> how this works would be extremely helpful.
>>>> 2. Is there a delay between when MapReduce writes data to a table and
>>>> when the data is available for read?
>>>> 3. How are concurrent mutations to the same row handled (say, from
>>>> different mappers/reducers), since Accumulo isn't transactional?
>>>> 4. I am trying to index some Accumulo data in Solr. Are there any known
>>>> issues on the Accumulo end? On the Solr end? How does one shard vs.
>>>> multiple shards affect the MR job?
>>>> 5. Should I have more Accumulo/Solr nodes (i.e., an instance on each
>>>> node in the cluster)? Is that necessary? Are there workarounds?
>>>> 5. Normally I have log4j statements all over the Java job. Can I still
>>>> use them with MapReduce?
>>>> list (and please point me to where I can ask them, if possible). I am
>>>> trying to gather a lot of information to decide if this is a good approach
>>>> for me and the level of effort needed, so I realize these are a lot of
>>>> questions. I very much appreciate any and all feedback. Thank you in
>>>> advance for your time!