|
Bryan Keller
2012-08-02, 06:31
Adrien Mogenet
2012-08-02, 06:39
Bryan Keller
2012-08-02, 17:15
Jean-Daniel Cryans
2012-08-02, 18:37
Bryan Keller
2012-08-02, 21:56
Alex Baranau
2012-08-07, 15:48
|
-
Poor data locality of MR jobBryan Keller 2012-08-02, 06:31
I have an 8 node cluster and a table that is pretty well balanced with on average 36 regions/node. When I run a mapreduce job on the cluster against this table, the data locality of the mappers is poor, e.g 100 rack local mappers and only 188 data local mappers. I would expect nearly all of the mappers to be data local. DNS appears to be fine, i.e. the hostname in the splits is the same as the hostnames in the task attempts.
The performance of the rack local mappers is poor and causes overall scan performance to suffer. The table isn't new and from what I understand, HDFS replication will eventually keep region data blocks local to the regionserver. Are there other reasons for data locality to be poor and any way to fix it?
-
Re: Poor data locality of MR jobAdrien Mogenet 2012-08-02, 06:39
Did you pre split your table or did you let balancer assign regions to
regionservers for you ? Did your regionserver(s) fail ? On Thu, Aug 2, 2012 at 8:31 AM, Bryan Keller <[EMAIL PROTECTED]> wrote: > I have an 8 node cluster and a table that is pretty well balanced with on > average 36 regions/node. When I run a mapreduce job on the cluster against > this table, the data locality of the mappers is poor, e.g 100 rack local > mappers and only 188 data local mappers. I would expect nearly all of the > mappers to be data local. DNS appears to be fine, i.e. the hostname in the > splits is the same as the hostnames in the task attempts. > > The performance of the rack local mappers is poor and causes overall scan > performance to suffer. > > The table isn't new and from what I understand, HDFS replication will > eventually keep region data blocks local to the regionserver. Are there > other reasons for data locality to be poor and any way to fix it? > > -- Adrien Mogenet 06.59.16.64.22 http://www.mogenet.me
-
Re: Poor data locality of MR jobBryan Keller 2012-08-02, 17:15
I presplit the table. The regionservers have gone down on occassion but have been up for a while (weeks). How could that result in having no regions on one node?
On Aug 1, 2012, at 11:39 PM, Adrien Mogenet <[EMAIL PROTECTED]> wrote: > Did you pre split your table or did you let balancer assign regions to > regionservers for you ? > > Did your regionserver(s) fail ? > > On Thu, Aug 2, 2012 at 8:31 AM, Bryan Keller <[EMAIL PROTECTED]> wrote: > >> I have an 8 node cluster and a table that is pretty well balanced with on >> average 36 regions/node. When I run a mapreduce job on the cluster against >> this table, the data locality of the mappers is poor, e.g 100 rack local >> mappers and only 188 data local mappers. I would expect nearly all of the >> mappers to be data local. DNS appears to be fine, i.e. the hostname in the >> splits is the same as the hostnames in the task attempts. >> >> The performance of the rack local mappers is poor and causes overall scan >> performance to suffer. >> >> The table isn't new and from what I understand, HDFS replication will >> eventually keep region data blocks local to the regionserver. Are there >> other reasons for data locality to be poor and any way to fix it? >> >> > > > -- > Adrien Mogenet > 06.59.16.64.22 > http://www.mogenet.me
-
Re: Poor data locality of MR jobJean-Daniel Cryans 2012-08-02, 18:37
On Wed, Aug 1, 2012 at 11:31 PM, Bryan Keller <[EMAIL PROTECTED]> wrote:
> I have an 8 node cluster and a table that is pretty well balanced with on average 36 regions/node. When I run a mapreduce job on the cluster against this table, the data locality of the mappers is poor, e.g 100 rack local mappers and only 188 data local mappers. I would expect nearly all of the mappers to be data local. DNS appears to be fine, i.e. the hostname in the splits is the same as the hostnames in the task attempts. Thanks for looking at this already, it's the first thing that came in mind when looking at the title. > The table isn't new and from what I understand, HDFS replication will eventually keep region data blocks local to the regionserver. Are there other reasons for data locality to be poor and any way to fix it? Block locality doesn't play a role here, TableInputFormat publishes where the region is but then where the data belonging to that region is is another matter that's not taken into account. In any case, even if the region was on node A and all the data happened to be on node B, C, and D, your mapper would still only talk to A since that's where the region is. So only 1 region server serves a region meaning that there's only 1 node where you can send the map to in order to have data locality. I'm ready to guess that the the maps that aren't local are launched towards the end of the job, because you might have slower maps and/or not a perfect balance of regions per region server. For example, let's say one node is full of data-local maps but it still has 2-3 more to process while other nodes have availability. The JT has a locality timeout for each map so that if one node is just too busy it will fall back to rack-local nodes instead. In this example those 2-3 maps might get sent elsewhere. There are ways to tune this depending on which scheduler you are using, but it will mostly involve waiting more for each task to make sure they can get to the right node. At your scale it sounds more to me like over-optimizations. How big of a hit are you taking? J-D
-
Re: Poor data locality of MR jobBryan Keller 2012-08-02, 21:56
I see what you mean about block locality, that is at the regionserver level, transparent to the MR job. This doesn't happen only to the final mappers, some of the early mappers are rack local. The table is reasonably well distributed across the nodes but not perfectly (that is a question I have in a different thread).
Performance of the rack local mappers vs data local is roughly 2x slower, so the performance hit is significant. On Aug 2, 2012, at 11:37 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote: > On Wed, Aug 1, 2012 at 11:31 PM, Bryan Keller <[EMAIL PROTECTED]> wrote: >> I have an 8 node cluster and a table that is pretty well balanced with on average 36 regions/node. When I run a mapreduce job on the cluster against this table, the data locality of the mappers is poor, e.g 100 rack local mappers and only 188 data local mappers. I would expect nearly all of the mappers to be data local. DNS appears to be fine, i.e. the hostname in the splits is the same as the hostnames in the task attempts. > > Thanks for looking at this already, it's the first thing that came in > mind when looking at the title. > >> The table isn't new and from what I understand, HDFS replication will eventually keep region data blocks local to the regionserver. Are there other reasons for data locality to be poor and any way to fix it? > > Block locality doesn't play a role here, TableInputFormat publishes > where the region is but then where the data belonging to that region > is is another matter that's not taken into account. In any case, even > if the region was on node A and all the data happened to be on node B, > C, and D, your mapper would still only talk to A since that's where > the region is. > > So only 1 region server serves a region meaning that there's only 1 > node where you can send the map to in order to have data locality. I'm > ready to guess that the the maps that aren't local are launched > towards the end of the job, because you might have slower maps and/or > not a perfect balance of regions per region server. > > For example, let's say one node is full of data-local maps but it > still has 2-3 more to process while other nodes have availability. The > JT has a locality timeout for each map so that if one node is just too > busy it will fall back to rack-local nodes instead. In this example > those 2-3 maps might get sent elsewhere. > > There are ways to tune this depending on which scheduler you are > using, but it will mostly involve waiting more for each task to make > sure they can get to the right node. > > At your scale it sounds more to me like over-optimizations. How big of > a hit are you taking? > > J-D
-
Re: Poor data locality of MR jobAlex Baranau 2012-08-07, 15:48
My reply may be unrelated to the metric that shows "data local tasks" and
more about data blocks locality (as J-D mentioned, this is separate thing and not reflected on Mapper metrics), but it might be also helpful for understanding performance of feeding data from HBase into Mapper with regard to "data blocks locality". > The regionservers have gone down on occassion > but have been up for a while (weeks) RS failures can be the reason for breaking data (blocks) locality. The way HDFS writes data (let's first assume no failures happened), there will be (likely) one full replica of bolcks of HFiles of the Region assigned to RS. And other replicas (if replication >1) will be distributed randomly over the cluster. So there (likely) will be no other full replica of HFiles of this specific Region at any other server. Now, imagine that this RS fails. After it is reassigned to a different server, it no longer serves only local data: there's no full replica of data of that Region locally. And there's not much you can do about it, since block replicas are distributed randomly over the cluster. Periodic major_compaction which rewrites old files helps to recover blocks data locality (until next RS failure) since it will again write full replica of newly created files locally. But note that it will rewrite not all files: e.g. those old (already major-compacted some time ago) HFiles of regions which were not changed since last major compaction will remain same. I believe someone is working on making replication process (replicas balancer) to be more smart at the moment. Hopes are to see this work soon :) Alex Baranau ------ Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr On Thu, Aug 2, 2012 at 5:56 PM, Bryan Keller <[EMAIL PROTECTED]> wrote: > I see what you mean about block locality, that is at the regionserver > level, transparent to the MR job. This doesn't happen only to the final > mappers, some of the early mappers are rack local. The table is reasonably > well distributed across the nodes but not perfectly (that is a question I > have in a different thread). > > Performance of the rack local mappers vs data local is roughly 2x slower, > so the performance hit is significant. > > On Aug 2, 2012, at 11:37 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> > wrote: > > > On Wed, Aug 1, 2012 at 11:31 PM, Bryan Keller <[EMAIL PROTECTED]> wrote: > >> I have an 8 node cluster and a table that is pretty well balanced with > on average 36 regions/node. When I run a mapreduce job on the cluster > against this table, the data locality of the mappers is poor, e.g 100 rack > local mappers and only 188 data local mappers. I would expect nearly all of > the mappers to be data local. DNS appears to be fine, i.e. the hostname in > the splits is the same as the hostnames in the task attempts. > > > > Thanks for looking at this already, it's the first thing that came in > > mind when looking at the title. > > > >> The table isn't new and from what I understand, HDFS replication will > eventually keep region data blocks local to the regionserver. Are there > other reasons for data locality to be poor and any way to fix it? > > > > Block locality doesn't play a role here, TableInputFormat publishes > > where the region is but then where the data belonging to that region > > is is another matter that's not taken into account. In any case, even > > if the region was on node A and all the data happened to be on node B, > > C, and D, your mapper would still only talk to A since that's where > > the region is. > > > > So only 1 region server serves a region meaning that there's only 1 > > node where you can send the map to in order to have data locality. I'm > > ready to guess that the the maps that aren't local are launched > > towards the end of the job, because you might have slower maps and/or > > not a perfect balance of regions per region server. > > > > For example, let's say one node is full of data-local maps but it Alex Baranau Sematext :: http://blog.sematext.com/ :: Hadoop - HBase - ElasticSearch - Solr |