Accumulo >> mail # user >> Map Reduce on accumulo


Aji Janis 2012-12-04, 22:21
John Vines 2012-12-04, 22:45
Aji Janis 2012-12-04, 23:55
Re: Map Reduce on accumulo
Your first two presumptions are correct. You will get 3 mappers and each
mapper will have data for only one tablet.

Each mapper will function exactly as a scanner for the range of the tablet,
so you will get things in lexicographical order. So the mapper for tablet A
will get all items for rowA in order before getting items for rowB.
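
The sorted-scan behaviour described here can be sketched in plain Java. This is a toy model, not the Accumulo API: the TreeMap stands in for a tablet's sorted key space, with (row, family, qualifier) flattened into one string key, to show why every rowA entry is seen before any rowB entry.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of one mapper's view of a single tablet: entries are kept
// sorted by (row, family, qualifier), so iterating yields all of rowA
// before any of rowB, just as a Scanner over the tablet's range would.
public class TabletScanOrder {
    public static void main(String[] args) {
        TreeMap<String, String> tablet = new TreeMap<>();
        // Inserted out of order on purpose; the map keeps keys sorted.
        tablet.put("rowB|f1|q1", "v1");
        tablet.put("rowA|f2|q2", "v2");
        tablet.put("rowA|f1|q1", "v1");
        for (Map.Entry<String, String> e : tablet.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // Prints rowA|f1|q1, then rowA|f2|q2, then rowB|f1|q1.
    }
}
```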

John
On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <[EMAIL PROTECTED]> wrote:

> Thank you John for your response. I do have a few followup questions.
> Let me use a better example. Let's say my table and tabletserver
> distributions are as follows:
>
> ---------------------------------------------
> MyTable:
>
> rowA | f1 | q1 | v1
> rowA | f2 | q2 | v2
> rowA | f3 | q3 | v3
>
> rowB | f1 | q1 | v1
> rowB | f1 | q2 | v2
>
> rowC | f1 | q1 | v1
>
> rowD | f1 | q1 | v1
> rowD | f1 | q2 | v2
>
> rowE | f1 | q1 | v1
>
> ---------------------------------------------
>
> TabletServer1: Tablet A: rowA, rowC
> TabletServer2: Tablet B: rowB
> TabletServer2: Tablet C: rowD
>
> --------------------------------------------
>
> In this example, suppose I have a map reduce job that reads from the
> table above and writes to the MyTable2 table using
> org.apache.accumulo.core.client.mapreduce.AccumuloInputFormat
> and org.apache.accumulo.core.client.mapreduce.AccumuloOutputFormat.
>
> Let's not focus on what the map reduce job itself is. From
> your explanation below, it sounds like if autosplitting is not overridden
> then we get three mappers total. Is that right?
>
> Further, am I right in assuming that a mapper will NOT get data from
> multiple tablets?
>
> I am also very confused on what the *order of input to the mapper* will
> be. Would the mapper at tablet A get
> -all data from rowA before it gets all data from rowC or
> -all data from rowC before it gets all data from rowA or
> -something like:
>    rowA | f1 | q1 | v1
>    rowA | f2 | q2 | v2
>    rowC | f1 | q1 | v1
>    rowA | f3 | q3 | v3
>
> I know these are a lot of questions, but I would really like to get a good
> understanding of the architecture. Thank you!
> Aji
>
>
>
> On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[EMAIL PROTECTED]> wrote:
>
>> A tablet consists of both an in-memory portion and 0 to many files in
>> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance
>> boost from the natural locality you get when you write data to HDFS, but if
>> a tablet migrates, that locality can be lost until the data is compacted
>> (rewritten). Locality could be retained thanks to data replication, but
>> Accumulo does not go to extraordinary lengths to hold on to a little bit
>> of locality, as the data will eventually be rewritten and locality restored.
>>
>> As for your example, if all data for a given row is inserted at the same
>> time, then it is guaranteed to be in the same file. There is no atomicity
>> guarantee regarding HDFS blocks though, so depending on the block size and
>> the amount of data in the file (and its distribution), it is possible for
>> a few entries to span files even though they are adjacent.
>>
>> Using the input format, unless you override the autosplitting in it, you
>> will get 1 mapper per tablet. If you disable auto-splitting, then you get
>> one mapper per range you specify.
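
The split logic described above can be illustrated with a self-contained sketch. This is not Accumulo's actual InputFormatBase code, and the method name below is made up; it only models the outcome: with auto-adjustment on, a full-table scan is clipped against the tablet boundaries into one split (hence one mapper) per tablet, while with it disabled you get one split per range as supplied.

```java
import java.util.List;

// Sketch of why "one mapper per tablet" falls out of range auto-adjustment.
// Tablets are modeled by their end-row split points: split points
// {"rowA", "rowC"} give tablets (-inf,rowA], (rowA,rowC], (rowC,+inf).
public class SplitSketch {
    static int mappersForFullTableScan(List<String> splitPoints,
                                       boolean autoAdjust, int userRanges) {
        if (!autoAdjust) {
            return userRanges;         // one mapper per range as supplied
        }
        return splitPoints.size() + 1; // one mapper per tablet
    }

    public static void main(String[] args) {
        List<String> splits = List.of("rowA", "rowC"); // 3 tablets total
        System.out.println(mappersForFullTableScan(splits, true, 1));  // 3
        System.out.println(mappersForFullTableScan(splits, false, 1)); // 1
    }
}
```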
>>
>> Hope this helps, let me know if you have other questions or need
>> clarification.
>>
>> John
>>
>>
>>
>> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <[EMAIL PROTECTED]> wrote:
>>
>>> NOTE: I am fairly sure this hasn't been asked on here yet - my apologies
>>> if it was already asked in which case please forward me a link to the
>>> answers. Thank you.
>>>
>>> If my environment set up is as follows:
>>> -64MB HDFS block
>>> -5 tablet servers
>>> -10 tablets of size 1GB each per tablet server
>>>
>>> If I have a table like below:
>>> rowA | f1 | q1 | v1
>>> rowA | f1 | q2 | v2
>>>
>>> rowB | f1 | q1 | v3
>>>
>>> rowC | f1 | q1 | v4
>>> rowC | f2 | q1 | v5
>>> rowC | f3 | q3 | v6
>>>
>>> From the little documentation, I know all data about rowA will go one
Aji Janis 2012-12-06, 22:32
Billie Rinaldi 2012-12-07, 14:51
Aji Janis 2012-12-07, 14:56