Accumulo, mail # user - Map Reduce on accumulo


Re: Map Reduce on accumulo
Aji Janis 2012-12-07, 14:56
Thank you!

On Fri, Dec 7, 2012 at 9:51 AM, Billie Rinaldi <[EMAIL PROTECTED]> wrote:

> On Thu, Dec 6, 2012 at 2:32 PM, Aji Janis <[EMAIL PROTECTED]> wrote:
>
>> Thank you for the clarification. You mentioned that "Using the input
>> format, unless you override the autosplitting in it, you will get 1 mapper
>> per tablet." in your initial response. Again, pardon me for the newbie
>> question, but how do I find out if autosplitting is overridden or not?
>>
>
> You override autosplitting by calling
> AccumuloInputFormat.disableAutoAdjustRanges(Configuration).  So if you
> haven't done that, it will fit mappers to tablets.
>
> Billie
>
>
>
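For reference, a minimal job setup along these lines might look like the sketch below. This is written against the 1.4-era static-method API; the instance name, zookeepers, credentials, and table name are placeholders, not values from this thread:

```java
// Sketch of configuring a MapReduce job to read from Accumulo
// (1.4-era API; instance/zookeeper/user/table values are placeholders).
Job job = new Job(new Configuration(), "scan-mytable");
job.setInputFormatClass(AccumuloInputFormat.class);
AccumuloInputFormat.setZooKeeperInstance(job.getConfiguration(),
    "myInstance", "zkhost1,zkhost2");
AccumuloInputFormat.setInputInfo(job.getConfiguration(),
    "user", "password".getBytes(), "MyTable", new Authorizations());
// By default, ranges are auto-adjusted to tablet boundaries, giving
// one mapper per tablet. Uncommenting this call disables that:
// AccumuloInputFormat.disableAutoAdjustRanges(job.getConfiguration());
```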
>>
>> Aji
>>
>>
>>
>> On Tue, Dec 4, 2012 at 8:36 PM, John Vines <[EMAIL PROTECTED]> wrote:
>>
>>> Your first two presumptions are correct. You will get 3 mappers and each
>>> mapper will have data for only one tablet.
>>>
>>> Each mapper will function exactly as a scanner for the range of the
>>> tablet, so you will get things in lexicographical order. So the mapper for
>>> tablet A will get all items for rowA in order before getting items for rowB.
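The ordering John describes can be illustrated with plain Java (not the Accumulo API): keys sort lexicographically by row, then column, so when a mapper scans its tablet's range, every rowA entry comes back before any rowB entry.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Simulates the sorted order in which a scanner (and hence a mapper
// over one tablet) sees entries: lexicographic by row/family/qualifier.
public class ScanOrder {
    static List<String> scanOrder(Collection<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted); // lexicographic, like Accumulo keys
        return sorted;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList(
            "rowB f1 q1", "rowA f3 q3", "rowA f1 q1",
            "rowB f1 q2", "rowA f2 q2");
        for (String k : scanOrder(keys)) {
            System.out.println(k);
        }
        // All rowA entries print before any rowB entry.
    }
}
```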
>>>
>>> John
>>>
>>>
>>>
>>> On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <[EMAIL PROTECTED]> wrote:
>>>
>>>> Thank you John for your response. I do have a few followup questions.
>>>> Let me use a better example. Let's say my table and tablet server
>>>> distributions are as follows:
>>>>
>>>> ---------------------------------------------
>>>> MyTable:
>>>>
>>>> rowA | f1 | q1 | v1
>>>> rowA | f2 | q2 | v2
>>>> rowA | f3 | q3 | v3
>>>>
>>>> rowB | f1 | q1 | v1
>>>> rowB | f1 | q2 | v2
>>>>
>>>> rowC | f1 | q1 | v1
>>>>
>>>> rowD | f1 | q1 | v1
>>>> rowD | f1 | q2 | v2
>>>>
>>>> rowE | f1 | q1 | v1
>>>>
>>>> ---------------------------------------------
>>>>
>>>> TabletServer1: Tablet A: rowA, rowC
>>>> TabletServer2: Tablet B: rowB
>>>> TabletServer2: Tablet C: rowD
>>>>
>>>> --------------------------------------------
>>>>
>>>> In this example, suppose I have a map reduce job that reads from the table
>>>> above and writes to a MyTable2 table using
>>>> org.apache.accumulo.core.client.mapreduce.*AccumuloInputFormat*
>>>> and org.apache.accumulo.core.client.mapreduce.*AccumuloOutputFormat*.
>>>>
>>>> Let's not focus on what the map reduce job itself does. From
>>>> your explanation below, it sounds like if autosplitting is not overridden
>>>> then we get *three mappers* total. Is that right?
>>>>
>>>> Further, am I right in assuming that a mapper will NOT get data
>>>> from multiple tablets?
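The "one mapper per tablet" rule can be sketched in plain Java (an illustration only, not Accumulo code), using the hypothetical row-to-tablet assignment from the example above: the mapper count equals the number of distinct tablets holding input rows.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

// Counts mappers for the example layout: one mapper per tablet, and
// each mapper sees rows from exactly one tablet. The row->tablet
// mapping below mirrors the hypothetical layout in this thread.
public class MapperCount {
    static int mapperCount(Map<String, String> rowToTablet) {
        return new HashSet<>(rowToTablet.values()).size();
    }

    public static void main(String[] args) {
        Map<String, String> rowToTablet = new HashMap<>();
        rowToTablet.put("rowA", "Tablet A");
        rowToTablet.put("rowC", "Tablet A");
        rowToTablet.put("rowB", "Tablet B");
        rowToTablet.put("rowD", "Tablet C");
        System.out.println(mapperCount(rowToTablet)); // prints 3
    }
}
```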
>>>>
>>>> I am also very confused about what the *order of input to the mapper* will
>>>> be. Would mapper_at_tabletA get
>>>> - all data from rowA before it gets all data from rowC, or
>>>> - all data from rowC before it gets all data from rowA, or
>>>> - something interleaved, like:
>>>>    rowA | f1 | q1 | v1
>>>>    rowA | f2 | q2 | v2
>>>>    rowC | f1 | q1 | v1
>>>>    rowA | f3 | q3 | v3
>>>>
>>>> I know these are a lot of questions, but I would really like to get a good
>>>> understanding of the architecture. Thank you!
>>>> Aji
>>>>
>>>>
>>>>
>>>> On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> A tablet consists of both an in-memory portion and 0 to many files in
>>>>> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance
>>>>> boost from the natural locality you get when you write data to HDFS, but if
>>>>> a tablet migrates, that locality can be lost until the data is compacted
>>>>> (rewritten). Locality could be retained due to data replication, but
>>>>> Accumulo does not make extraordinary efforts to preserve that bit of
>>>>> locality, since the data will eventually be rewritten and locality restored.
>>>>>
>>>>> As for your example, if all data for a given row is inserted at the
>>>>> same time, then it is guaranteed to be in the same file. There is no
>>>>> atomicity guarantee regarding HDFS blocks though, so depending on the block
>>>>> size and the amount of data in the file (and its distribution), it is
>>>>> possible for a few entries to span files even though they are adjacent.