Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
Accumulo >> mail # user >> Map Reduce on accumulo


+
Aji Janis 2012-12-04, 22:21
+
John Vines 2012-12-04, 22:45
Copy link to this message
-
Re: Map Reduce on accumulo
Thank you John for your response. I do have a few followup questions.
Let me use a better example. Lets say my table and tabletserver
distributions are as follows:

---------------------------------------------
MyTable:

rowA | f1 | q1 | v1
rowA | f2 | q2 | v2
rowA | f3 | q3 | v3

rowB | f1 | q1 | v1
rowB | f1 | q2 | v2

rowC | f1 | q1 | v1

rowD | f1 | q1 | v1
rowD | f1 | q2 | v2

rowE | f1 | q1 | v1

---------------------------------------------

TabletServer1: Tablet A: rowA, rowC
TabletServer2: Tablet B: rowB
TabletServer2: Tablet C: rowD

--------------------------------------------

In this example, if I have a map reduce job that reads from the table above
and writes to MyTable2 table using
org.apache.accumulo.core.client.mapreduce.*AccumuloInputFormat *
and org.apache.accumulo.core.client.mapreduce.*AccumuloOutputFormat*.

Lets not focus on what the map reduce job itself is. From
your explanation below sounds like if autosplitting is not overriden then
we get *three mappers* total. Is that right?

Further, I will be right in assuming that a mapper will NOT get data from
multiple tablets. Correct?

I am also very confused on what the *order of input to the mapper* will be.
Would mapper_at_tabletA get
-all data from rowA before it gets all data from rowC or
-all data from rowC before it gets all data from rowA or
-something like:
   rowA | f1 | q1 | v1
   rowA | f2 | q2 | v2
   rowC | f1 | q1 | v1
   rowA | f3 | q3 | v3

I know these are a lot of question but I really like to get a good
understanding of the architecture. Thank you!
Aji

On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[EMAIL PROTECTED]> wrote:

> A tablet consists of both an in memory portion and 0 to many files in
> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance
> boost to the natural locality you get when you write data to HDFS, but if a
> tablet migrates that locality could be lost until data is compacted
> (rewritten). Locality could be retained due to data replication, but
> Accumulo does not make extraordinary effort to attempt to get a little bit
> of locality, as data will eventually be rewritten and locality restored.
>
> As for your example, if all data for a given row is inserted at the same
> time, then it is guaranteed to be in the same file. There is no atomicity
> guarantee regarding HDFS blocks though, so depending on the block size and
> the amount of data in the file (and it's distribution), it is possible for
> a few entries to span files even though they are adjacent.
>
> Using the input format, unless you override the autosplitting in it, you
> will get 1 mapper per tablet. If you disable auto-splitting, then you get
> one mapper per range you specify.
>
> Hope this helps, let me know if you have other questions or need
> clarification.
>
> John
>
>
>
> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <[EMAIL PROTECTED]> wrote:
>
>> NOTE: I am fairly sure this hasn't been asked on here yet - my apologies
>> if it was already asked in which case please forward me a link to the
>> answers.Thank you.
>>
>> If my environment set up is as follows:
>> -64MB HDFS block
>> -5 tablet servers
>> -10 tablets of size 1GB each per tablet server
>>
>> If I have a table like below:
>> rowA | f1 | q1 | v1
>> rowA | f1 | q2 | v2
>>
>> rowB | f1 | q1 | v3
>>
>> rowC | f1 | q1 | v4
>> rowC | f2 | q1 | v5
>> rowC | f3 | q3 | v6
>>
>> From the little documentation, I know all data about rowA will go one
>> tablet which may or may not contain data about other rows ie its all or
>> none. So my questions are:
>>
>> How are the tablets mapped to a Datanode or HDFS block? Obviously, One
>> tablet is split into multiple HDFS blocks (8 in this case) so would they be
>> stored on the same or different datanode(s) or does it not matter?
>>
>> In the example above, would all data about RowC (or A or B) go onto the
>> same HDFS block or different HDFS blocks?
>>
>> When executing a map reduce job how many mappers would I get? (one per
>> hdfs block? or per tablet? or per server?)
+
John Vines 2012-12-05, 01:36
+
Aji Janis 2012-12-06, 22:32
+
Billie Rinaldi 2012-12-07, 14:51
+
Aji Janis 2012-12-07, 14:56
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB