|
Aji Janis
2012-12-04, 22:21
John Vines
2012-12-04, 22:45
Aji Janis
2012-12-04, 23:55
John Vines
2012-12-05, 01:36
Aji Janis
2012-12-06, 22:32
Billie Rinaldi
2012-12-07, 14:51
Aji Janis
2012-12-07, 14:56
|
-
Map Reduce on accumuloAji Janis 2012-12-04, 22:21
NOTE: I am fairly sure this hasn't been asked on here yet - my apologies if
it was already asked in which case please forward me a link to the answers.Thank you. If my environment set up is as follows: -64MB HDFS block -5 tablet servers -10 tablets of size 1GB each per tablet server If I have a table like below: rowA | f1 | q1 | v1 rowA | f1 | q2 | v2 rowB | f1 | q1 | v3 rowC | f1 | q1 | v4 rowC | f2 | q1 | v5 rowC | f3 | q3 | v6 >From the little documentation, I know all data about rowA will go one tablet which may or may not contain data about other rows ie its all or none. So my questions are: How are the tablets mapped to a Datanode or HDFS block? Obviously, One tablet is split into multiple HDFS blocks (8 in this case) so would they be stored on the same or different datanode(s) or does it not matter? In the example above, would all data about RowC (or A or B) go onto the same HDFS block or different HDFS blocks? When executing a map reduce job how many mappers would I get? (one per hdfs block? or per tablet? or per server?) Thank you in advance for any and all suggestions.
-
Re: Map Reduce on accumuloJohn Vines 2012-12-04, 22:45
A tablet consists of both an in memory portion and 0 to many files in HDFS.
Each file may be one or many HDFS blocks. Accumulo gets a performance boost to the natural locality you get when you write data to HDFS, but if a tablet migrates that locality could be lost until data is compacted (rewritten). Locality could be retained due to data replication, but Accumulo does not make extraordinary effort to attempt to get a little bit of locality, as data will eventually be rewritten and locality restored. As for your example, if all data for a given row is inserted at the same time, then it is guaranteed to be in the same file. There is no atomicity guarantee regarding HDFS blocks though, so depending on the block size and the amount of data in the file (and it's distribution), it is possible for a few entries to span files even though they are adjacent. Using the input format, unless you override the autosplitting in it, you will get 1 mapper per tablet. If you disable auto-splitting, then you get one mapper per range you specify. Hope this helps, let me know if you have other questions or need clarification. John On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <[EMAIL PROTECTED]> wrote: > NOTE: I am fairly sure this hasn't been asked on here yet - my apologies > if it was already asked in which case please forward me a link to the > answers.Thank you. > > If my environment set up is as follows: > -64MB HDFS block > -5 tablet servers > -10 tablets of size 1GB each per tablet server > > If I have a table like below: > rowA | f1 | q1 | v1 > rowA | f1 | q2 | v2 > > rowB | f1 | q1 | v3 > > rowC | f1 | q1 | v4 > rowC | f2 | q1 | v5 > rowC | f3 | q3 | v6 > > From the little documentation, I know all data about rowA will go one > tablet which may or may not contain data about other rows ie its all or > none. So my questions are: > > How are the tablets mapped to a Datanode or HDFS block? Obviously, One > tablet is split into multiple HDFS blocks (8 in this case) so would they be > stored on the same or different datanode(s) or does it not matter? > > In the example above, would all data about RowC (or A or B) go onto the > same HDFS block or different HDFS blocks? > > When executing a map reduce job how many mappers would I get? (one per > hdfs block? or per tablet? or per server?) > > Thank you in advance for any and all suggestions. >
-
Re: Map Reduce on accumuloAji Janis 2012-12-04, 23:55
Thank you John for your response. I do have a few followup questions.
Let me use a better example. Lets say my table and tabletserver distributions are as follows: --------------------------------------------- MyTable: rowA | f1 | q1 | v1 rowA | f2 | q2 | v2 rowA | f3 | q3 | v3 rowB | f1 | q1 | v1 rowB | f1 | q2 | v2 rowC | f1 | q1 | v1 rowD | f1 | q1 | v1 rowD | f1 | q2 | v2 rowE | f1 | q1 | v1 --------------------------------------------- TabletServer1: Tablet A: rowA, rowC TabletServer2: Tablet B: rowB TabletServer2: Tablet C: rowD -------------------------------------------- In this example, if I have a map reduce job that reads from the table above and writes to MyTable2 table using org.apache.accumulo.core.client.mapreduce.*AccumuloInputFormat * and org.apache.accumulo.core.client.mapreduce.*AccumuloOutputFormat*. Lets not focus on what the map reduce job itself is. From your explanation below sounds like if autosplitting is not overriden then we get *three mappers* total. Is that right? Further, I will be right in assuming that a mapper will NOT get data from multiple tablets. Correct? I am also very confused on what the *order of input to the mapper* will be. Would mapper_at_tabletA get -all data from rowA before it gets all data from rowC or -all data from rowC before it gets all data from rowA or -something like: rowA | f1 | q1 | v1 rowA | f2 | q2 | v2 rowC | f1 | q1 | v1 rowA | f3 | q3 | v3 I know these are a lot of question but I really like to get a good understanding of the architecture. Thank you! Aji On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[EMAIL PROTECTED]> wrote: > A tablet consists of both an in memory portion and 0 to many files in > HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance > boost to the natural locality you get when you write data to HDFS, but if a > tablet migrates that locality could be lost until data is compacted > (rewritten). Locality could be retained due to data replication, but > Accumulo does not make extraordinary effort to attempt to get a little bit > of locality, as data will eventually be rewritten and locality restored. > > As for your example, if all data for a given row is inserted at the same > time, then it is guaranteed to be in the same file. There is no atomicity > guarantee regarding HDFS blocks though, so depending on the block size and > the amount of data in the file (and it's distribution), it is possible for > a few entries to span files even though they are adjacent. > > Using the input format, unless you override the autosplitting in it, you > will get 1 mapper per tablet. If you disable auto-splitting, then you get > one mapper per range you specify. > > Hope this helps, let me know if you have other questions or need > clarification. > > John > > > > On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <[EMAIL PROTECTED]> wrote: > >> NOTE: I am fairly sure this hasn't been asked on here yet - my apologies >> if it was already asked in which case please forward me a link to the >> answers.Thank you. >> >> If my environment set up is as follows: >> -64MB HDFS block >> -5 tablet servers >> -10 tablets of size 1GB each per tablet server >> >> If I have a table like below: >> rowA | f1 | q1 | v1 >> rowA | f1 | q2 | v2 >> >> rowB | f1 | q1 | v3 >> >> rowC | f1 | q1 | v4 >> rowC | f2 | q1 | v5 >> rowC | f3 | q3 | v6 >> >> From the little documentation, I know all data about rowA will go one >> tablet which may or may not contain data about other rows ie its all or >> none. So my questions are: >> >> How are the tablets mapped to a Datanode or HDFS block? Obviously, One >> tablet is split into multiple HDFS blocks (8 in this case) so would they be >> stored on the same or different datanode(s) or does it not matter? >> >> In the example above, would all data about RowC (or A or B) go onto the >> same HDFS block or different HDFS blocks? >> >> When executing a map reduce job how many mappers would I get? (one per >> hdfs block? or per tablet? or per server?)
-
Re: Map Reduce on accumuloJohn Vines 2012-12-05, 01:36
Your first two presumptions are correct. You will get 3 mappers and each
mapper will have data for only one tablet. Each mapper will function exactly as a scanner for the range of the tablet, so you will get things in lexicographical order. So the mapper for tablet A will get all items for rowA in order before getting items for rowB. John On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <[EMAIL PROTECTED]> wrote: > Thank you John for your response. I do have a few followup questions. > Let me use a better example. Lets say my table and tabletserver > distributions are as follows: > > --------------------------------------------- > MyTable: > > rowA | f1 | q1 | v1 > rowA | f2 | q2 | v2 > rowA | f3 | q3 | v3 > > rowB | f1 | q1 | v1 > rowB | f1 | q2 | v2 > > rowC | f1 | q1 | v1 > > rowD | f1 | q1 | v1 > rowD | f1 | q2 | v2 > > rowE | f1 | q1 | v1 > > --------------------------------------------- > > TabletServer1: Tablet A: rowA, rowC > TabletServer2: Tablet B: rowB > TabletServer2: Tablet C: rowD > > -------------------------------------------- > > In this example, if I have a map reduce job that reads from the table > above and writes to MyTable2 table using > org.apache.accumulo.core.client.mapreduce.*AccumuloInputFormat * > and org.apache.accumulo.core.client.mapreduce.*AccumuloOutputFormat*. > > Lets not focus on what the map reduce job itself is. From > your explanation below sounds like if autosplitting is not overriden then > we get *three mappers* total. Is that right? > > Further, I will be right in assuming that a mapper will NOT get data from > multiple tablets. Correct? > > I am also very confused on what the *order of input to the mapper* will > be. Would mapper_at_tabletA get > -all data from rowA before it gets all data from rowC or > -all data from rowC before it gets all data from rowA or > -something like: > rowA | f1 | q1 | v1 > rowA | f2 | q2 | v2 > rowC | f1 | q1 | v1 > rowA | f3 | q3 | v3 > > I know these are a lot of question but I really like to get a good > understanding of the architecture. Thank you! > Aji > > > > On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[EMAIL PROTECTED]> wrote: > >> A tablet consists of both an in memory portion and 0 to many files in >> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance >> boost to the natural locality you get when you write data to HDFS, but if a >> tablet migrates that locality could be lost until data is compacted >> (rewritten). Locality could be retained due to data replication, but >> Accumulo does not make extraordinary effort to attempt to get a little bit >> of locality, as data will eventually be rewritten and locality restored. >> >> As for your example, if all data for a given row is inserted at the same >> time, then it is guaranteed to be in the same file. There is no atomicity >> guarantee regarding HDFS blocks though, so depending on the block size and >> the amount of data in the file (and it's distribution), it is possible for >> a few entries to span files even though they are adjacent. >> >> Using the input format, unless you override the autosplitting in it, you >> will get 1 mapper per tablet. If you disable auto-splitting, then you get >> one mapper per range you specify. >> >> Hope this helps, let me know if you have other questions or need >> clarification. >> >> John >> >> >> >> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <[EMAIL PROTECTED]> wrote: >> >>> NOTE: I am fairly sure this hasn't been asked on here yet - my apologies >>> if it was already asked in which case please forward me a link to the >>> answers.Thank you. >>> >>> If my environment set up is as follows: >>> -64MB HDFS block >>> -5 tablet servers >>> -10 tablets of size 1GB each per tablet server >>> >>> If I have a table like below: >>> rowA | f1 | q1 | v1 >>> rowA | f1 | q2 | v2 >>> >>> rowB | f1 | q1 | v3 >>> >>> rowC | f1 | q1 | v4 >>> rowC | f2 | q1 | v5 >>> rowC | f3 | q3 | v6 >>> >>> From the little documentation, I know all data about rowA will go one
-
Re: Map Reduce on accumuloAji Janis 2012-12-06, 22:32
Thank you for the clarification. You mentioned that "Using the input
format, unless you override the autosplitting in it, you will get 1 mapper per tablet." in your initial response. Again, pardon me for the newbie question, but how do I find out if autosplitting is overriden or not? Aji On Tue, Dec 4, 2012 at 8:36 PM, John Vines <[EMAIL PROTECTED]> wrote: > Your first two presumptions are correct. You will get 3 mappers and each > mapper will have data for only one tablet. > > Each mapper will function exactly as a scanner for the range of the > tablet, so you will get things in lexicographical order. So the mapper for > tablet A will get all items for rowA in order before getting items for rowB. > > John > > > > On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <[EMAIL PROTECTED]> wrote: > >> Thank you John for your response. I do have a few followup questions. >> Let me use a better example. Lets say my table and tabletserver >> distributions are as follows: >> >> --------------------------------------------- >> MyTable: >> >> rowA | f1 | q1 | v1 >> rowA | f2 | q2 | v2 >> rowA | f3 | q3 | v3 >> >> rowB | f1 | q1 | v1 >> rowB | f1 | q2 | v2 >> >> rowC | f1 | q1 | v1 >> >> rowD | f1 | q1 | v1 >> rowD | f1 | q2 | v2 >> >> rowE | f1 | q1 | v1 >> >> --------------------------------------------- >> >> TabletServer1: Tablet A: rowA, rowC >> TabletServer2: Tablet B: rowB >> TabletServer2: Tablet C: rowD >> >> -------------------------------------------- >> >> In this example, if I have a map reduce job that reads from the table >> above and writes to MyTable2 table using >> org.apache.accumulo.core.client.mapreduce.*AccumuloInputFormat * >> and org.apache.accumulo.core.client.mapreduce.*AccumuloOutputFormat*. >> >> Lets not focus on what the map reduce job itself is. From >> your explanation below sounds like if autosplitting is not overriden then >> we get *three mappers* total. Is that right? >> >> Further, I will be right in assuming that a mapper will NOT get data from >> multiple tablets. Correct? >> >> I am also very confused on what the *order of input to the mapper* will >> be. Would mapper_at_tabletA get >> -all data from rowA before it gets all data from rowC or >> -all data from rowC before it gets all data from rowA or >> -something like: >> rowA | f1 | q1 | v1 >> rowA | f2 | q2 | v2 >> rowC | f1 | q1 | v1 >> rowA | f3 | q3 | v3 >> >> I know these are a lot of question but I really like to get a good >> understanding of the architecture. Thank you! >> Aji >> >> >> >> On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[EMAIL PROTECTED]> wrote: >> >>> A tablet consists of both an in memory portion and 0 to many files in >>> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance >>> boost to the natural locality you get when you write data to HDFS, but if a >>> tablet migrates that locality could be lost until data is compacted >>> (rewritten). Locality could be retained due to data replication, but >>> Accumulo does not make extraordinary effort to attempt to get a little bit >>> of locality, as data will eventually be rewritten and locality restored. >>> >>> As for your example, if all data for a given row is inserted at the same >>> time, then it is guaranteed to be in the same file. There is no atomicity >>> guarantee regarding HDFS blocks though, so depending on the block size and >>> the amount of data in the file (and it's distribution), it is possible for >>> a few entries to span files even though they are adjacent. >>> >>> Using the input format, unless you override the autosplitting in it, you >>> will get 1 mapper per tablet. If you disable auto-splitting, then you get >>> one mapper per range you specify. >>> >>> Hope this helps, let me know if you have other questions or need >>> clarification. >>> >>> John >>> >>> >>> >>> On Tue, Dec 4, 2012 at 5:21 PM, Aji Janis <[EMAIL PROTECTED]> wrote: >>> >>>> NOTE: I am fairly sure this hasn't been asked on here yet - my >>>> apologies if it was already asked in which case please forward me a link to
-
Re: Map Reduce on accumuloBillie Rinaldi 2012-12-07, 14:51
On Thu, Dec 6, 2012 at 2:32 PM, Aji Janis <[EMAIL PROTECTED]> wrote:
> Thank you for the clarification. You mentioned that "Using the input > format, unless you override the autosplitting in it, you will get 1 mapper > per tablet." in your initial response. Again, pardon me for the newbie > question, but how do I find out if autosplitting is overriden or not? > You override autosplitting with the command AccumuloInputFormat.disableAutoAdjustRanges(Configuration). So if you haven't done that, it will fit mappers to tablets. Billie > > Aji > > > > On Tue, Dec 4, 2012 at 8:36 PM, John Vines <[EMAIL PROTECTED]> wrote: > >> Your first two presumptions are correct. You will get 3 mappers and each >> mapper will have data for only one tablet. >> >> Each mapper will function exactly as a scanner for the range of the >> tablet, so you will get things in lexicographical order. So the mapper for >> tablet A will get all items for rowA in order before getting items for rowB. >> >> John >> >> >> >> On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <[EMAIL PROTECTED]> wrote: >> >>> Thank you John for your response. I do have a few followup questions. >>> Let me use a better example. Lets say my table and tabletserver >>> distributions are as follows: >>> >>> --------------------------------------------- >>> MyTable: >>> >>> rowA | f1 | q1 | v1 >>> rowA | f2 | q2 | v2 >>> rowA | f3 | q3 | v3 >>> >>> rowB | f1 | q1 | v1 >>> rowB | f1 | q2 | v2 >>> >>> rowC | f1 | q1 | v1 >>> >>> rowD | f1 | q1 | v1 >>> rowD | f1 | q2 | v2 >>> >>> rowE | f1 | q1 | v1 >>> >>> --------------------------------------------- >>> >>> TabletServer1: Tablet A: rowA, rowC >>> TabletServer2: Tablet B: rowB >>> TabletServer2: Tablet C: rowD >>> >>> -------------------------------------------- >>> >>> In this example, if I have a map reduce job that reads from the table >>> above and writes to MyTable2 table using >>> org.apache.accumulo.core.client.mapreduce.*AccumuloInputFormat * >>> and org.apache.accumulo.core.client.mapreduce.*AccumuloOutputFormat*. >>> >>> Lets not focus on what the map reduce job itself is. From >>> your explanation below sounds like if autosplitting is not overriden then >>> we get *three mappers* total. Is that right? >>> >>> Further, I will be right in assuming that a mapper will NOT get data >>> from multiple tablets. Correct? >>> >>> I am also very confused on what the *order of input to the mapper* will >>> be. Would mapper_at_tabletA get >>> -all data from rowA before it gets all data from rowC or >>> -all data from rowC before it gets all data from rowA or >>> -something like: >>> rowA | f1 | q1 | v1 >>> rowA | f2 | q2 | v2 >>> rowC | f1 | q1 | v1 >>> rowA | f3 | q3 | v3 >>> >>> I know these are a lot of question but I really like to get a good >>> understanding of the architecture. Thank you! >>> Aji >>> >>> >>> >>> On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[EMAIL PROTECTED]> wrote: >>> >>>> A tablet consists of both an in memory portion and 0 to many files in >>>> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance >>>> boost to the natural locality you get when you write data to HDFS, but if a >>>> tablet migrates that locality could be lost until data is compacted >>>> (rewritten). Locality could be retained due to data replication, but >>>> Accumulo does not make extraordinary effort to attempt to get a little bit >>>> of locality, as data will eventually be rewritten and locality restored. >>>> >>>> As for your example, if all data for a given row is inserted at the >>>> same time, then it is guaranteed to be in the same file. There is no >>>> atomicity guarantee regarding HDFS blocks though, so depending on the block >>>> size and the amount of data in the file (and it's distribution), it is >>>> possible for a few entries to span files even though they are adjacent. >>>> >>>> Using the input format, unless you override the autosplitting in it, >>>> you will get 1 mapper per tablet. If you disable auto-splitting, then you
-
Re: Map Reduce on accumuloAji Janis 2012-12-07, 14:56
Thank you!
On Fri, Dec 7, 2012 at 9:51 AM, Billie Rinaldi <[EMAIL PROTECTED]> wrote: > On Thu, Dec 6, 2012 at 2:32 PM, Aji Janis <[EMAIL PROTECTED]> wrote: > >> Thank you for the clarification. You mentioned that "Using the input >> format, unless you override the autosplitting in it, you will get 1 mapper >> per tablet." in your initial response. Again, pardon me for the newbie >> question, but how do I find out if autosplitting is overriden or not? >> > > You override autosplitting with the command > AccumuloInputFormat.disableAutoAdjustRanges(Configuration). So if you > haven't done that, it will fit mappers to tablets. > > Billie > > > >> >> Aji >> >> >> >> On Tue, Dec 4, 2012 at 8:36 PM, John Vines <[EMAIL PROTECTED]> wrote: >> >>> Your first two presumptions are correct. You will get 3 mappers and each >>> mapper will have data for only one tablet. >>> >>> Each mapper will function exactly as a scanner for the range of the >>> tablet, so you will get things in lexicographical order. So the mapper for >>> tablet A will get all items for rowA in order before getting items for rowB. >>> >>> John >>> >>> >>> >>> On Tue, Dec 4, 2012 at 6:55 PM, Aji Janis <[EMAIL PROTECTED]> wrote: >>> >>>> Thank you John for your response. I do have a few followup questions. >>>> Let me use a better example. Lets say my table and tabletserver >>>> distributions are as follows: >>>> >>>> --------------------------------------------- >>>> MyTable: >>>> >>>> rowA | f1 | q1 | v1 >>>> rowA | f2 | q2 | v2 >>>> rowA | f3 | q3 | v3 >>>> >>>> rowB | f1 | q1 | v1 >>>> rowB | f1 | q2 | v2 >>>> >>>> rowC | f1 | q1 | v1 >>>> >>>> rowD | f1 | q1 | v1 >>>> rowD | f1 | q2 | v2 >>>> >>>> rowE | f1 | q1 | v1 >>>> >>>> --------------------------------------------- >>>> >>>> TabletServer1: Tablet A: rowA, rowC >>>> TabletServer2: Tablet B: rowB >>>> TabletServer2: Tablet C: rowD >>>> >>>> -------------------------------------------- >>>> >>>> In this example, if I have a map reduce job that reads from the table >>>> above and writes to MyTable2 table using >>>> org.apache.accumulo.core.client.mapreduce.*AccumuloInputFormat * >>>> and org.apache.accumulo.core.client.mapreduce.*AccumuloOutputFormat*. >>>> >>>> Lets not focus on what the map reduce job itself is. From >>>> your explanation below sounds like if autosplitting is not overriden then >>>> we get *three mappers* total. Is that right? >>>> >>>> Further, I will be right in assuming that a mapper will NOT get data >>>> from multiple tablets. Correct? >>>> >>>> I am also very confused on what the *order of input to the mapper*will be. Would mapper_at_tabletA get >>>> -all data from rowA before it gets all data from rowC or >>>> -all data from rowC before it gets all data from rowA or >>>> -something like: >>>> rowA | f1 | q1 | v1 >>>> rowA | f2 | q2 | v2 >>>> rowC | f1 | q1 | v1 >>>> rowA | f3 | q3 | v3 >>>> >>>> I know these are a lot of question but I really like to get a good >>>> understanding of the architecture. Thank you! >>>> Aji >>>> >>>> >>>> >>>> On Tue, Dec 4, 2012 at 5:45 PM, John Vines <[EMAIL PROTECTED]> wrote: >>>> >>>>> A tablet consists of both an in memory portion and 0 to many files in >>>>> HDFS. Each file may be one or many HDFS blocks. Accumulo gets a performance >>>>> boost to the natural locality you get when you write data to HDFS, but if a >>>>> tablet migrates that locality could be lost until data is compacted >>>>> (rewritten). Locality could be retained due to data replication, but >>>>> Accumulo does not make extraordinary effort to attempt to get a little bit >>>>> of locality, as data will eventually be rewritten and locality restored. >>>>> >>>>> As for your example, if all data for a given row is inserted at the >>>>> same time, then it is guaranteed to be in the same file. There is no >>>>> atomicity guarantee regarding HDFS blocks though, so depending on the block >>>>> size and the amount of data in the file (and it's distribution), it is >>>>> possible for a few entries to span files even though they are adjacent. |