|
william kinney
2009-07-28, 15:18
Scott Carey
2009-07-28, 16:58
Ted Dunning
2009-07-28, 18:13
william kinney
2009-07-28, 19:15
Ted Dunning
2009-07-28, 20:25
Erik Paulson
2009-07-28, 20:59
Ted Dunning
2009-07-28, 22:15
william kinney
2009-07-28, 21:40
Ted Dunning
2009-07-28, 20:27
Scott Carey
2009-07-28, 22:35
Jason Venner
2009-07-29, 02:01
william kinney
2009-07-30, 04:10
william kinney
2009-07-30, 04:07
Scott Carey
2009-07-30, 05:31
william kinney
2009-07-30, 14:37
william kinney
2009-07-30, 17:42
Scott Carey
2009-07-30, 18:39
Todd Lipcon
2009-07-30, 18:51
Scott Carey
2009-07-30, 20:32
Scott Carey
2009-07-30, 18:19
william kinney
2009-07-30, 21:32
Scott Carey
2009-08-01, 01:07
Scott Carey
2009-07-31, 21:31
Steve Loughran
2009-07-29, 09:17
Todd Lipcon
2009-07-29, 18:47
Scott Carey
2009-07-29, 19:07
Todd Lipcon
2009-07-29, 19:10
|
-
Map performance with custom binary formatwilliam kinney 2009-07-28, 15:18
Hi,
Thanks in advance for the help! I have a performance question relating to how fast I can expect Hadoop to scale. Running Cloudera's 0.18.3-10. I have custom binary format, which is just Google Protocol Buffer (protobuf) serialized data: 669 files, ~30GB total size (ranging 10MB to 100MB each). 128MB block size. 10 Hadoop Nodes. I tested my InputFormat and RecordReader for my input format, and it showed about 56MB/s performance (single thread, no hadoop, passed in test file via FileInputFormat instead of FSDataInputStream) on hardware similar to what I have in my cluster. I also then tested some simple Map logic along w/ the above, and got around 54MB/s. I believe that difference can be accounted for parsing the protobuf data into java objects. Anyways, when I put this logic into a job that has - no reduce (.setNumReduceTasks(0);) - no emit - just protobuf parsing calls (like above) I get a finish time of 10mins, 25sec, which is about 106.24 MB/s. So my question, why is the rate only 2x what I see on a single thread, non-hadoop test? Would it not be: 54MB/s x 10 (Num Nodes) - small hadoop overhead ? Is there any area of my configuration I should look into for tuning? Anyway I could get more accurate performance monitoring of my job? On a side note, I tried the same job after combining the files into about 11 files (still 30GB in size), and actually saw a decrease in performance (~90MB/s). Any help is appreciated. Thanks! Will some hadoop-site.xml values: dfs.replication 3 io.file.buffer.size 65536 dfs.datanode.handler.count 3 mapred.tasktracker.map.tasks.maximum 6 dfs.namenode.handler.count 5 +
william kinney 2009-07-28, 15:18
-
Re: Map performance with custom binary formatScott Carey 2009-07-28, 16:58
Well, the first thing to do in any performance bottleneck investigation is
to look at the machine hardware resource usage. During your test, what is the CPU use and disk usage? What about network utilization? Top, vmstat, iostat, and some network usage monitoring would be useful. It could be many things causing your lack of scalability, but without actually monitoring your machines to see if there is an obvious bottleneck its just random guessing and hunches. On 7/28/09 8:18 AM, "william kinney" <[EMAIL PROTECTED]> wrote: > Hi, > > Thanks in advance for the help! > > I have a performance question relating to how fast I can expect Hadoop > to scale. Running Cloudera's 0.18.3-10. > > I have custom binary format, which is just Google Protocol Buffer > (protobuf) serialized data: > > 669 files, ~30GB total size (ranging 10MB to 100MB each). > 128MB block size. > 10 Hadoop Nodes. > > I tested my InputFormat and RecordReader for my input format, and it > showed about 56MB/s performance (single thread, no hadoop, passed in > test file via FileInputFormat instead of FSDataInputStream) on > hardware similar to what I have in my cluster. > I also then tested some simple Map logic along w/ the above, and got > around 54MB/s. I believe that difference can be accounted for parsing > the protobuf data into java objects. > > Anyways, when I put this logic into a job that has > - no reduce (.setNumReduceTasks(0);) > - no emit > - just protobuf parsing calls (like above) > > I get a finish time of 10mins, 25sec, which is about 106.24 MB/s. > > So my question, why is the rate only 2x what I see on a single thread, > non-hadoop test? Would it not be: > 54MB/s x 10 (Num Nodes) - small hadoop overhead ? > > Is there any area of my configuration I should look into for tuning? > > Anyway I could get more accurate performance monitoring of my job? > > On a side note, I tried the same job after combining the files into > about 11 files (still 30GB in size), and actually saw a decrease in > performance (~90MB/s). > > Any help is appreciated. Thanks! > > Will > > some hadoop-site.xml values: > dfs.replication 3 > io.file.buffer.size 65536 > dfs.datanode.handler.count 3 > mapred.tasktracker.map.tasks.maximum 6 > dfs.namenode.handler.count 5 > +
Scott Carey 2009-07-28, 16:58
-
Re: Map performance with custom binary formatTed Dunning 2009-07-28, 18:13
Scott's advice is sage. There is bound to be a smoking gun here. Run
ganglia and look at the machine and hadoop stats. It looks like you have done a very nice job to this point and if the maps are pulling data from a local disk, then I would expect performance to be comparable. If, for some inexplicable (right now) reason the data is not local to the maps, then performance would be limited. You should be able to see on the job status page how many maps were local to the data. Another issue might be that there are too many maps being started so that the splits are small. It might be nice to log that. On Tue, Jul 28, 2009 at 9:58 AM, Scott Carey <[EMAIL PROTECTED]>wrote: > Well, the first thing to do in any performance bottleneck investigation is > to look at the machine hardware resource usage. > > During your test, what is the CPU use and disk usage? What about network > utilization? > Top, vmstat, iostat, and some network usage monitoring would be useful. It > could be many things causing your lack of scalability, but without actually > monitoring your machines to see if there is an obvious bottleneck its just > random guessing and hunches. > > > > On 7/28/09 8:18 AM, "william kinney" <[EMAIL PROTECTED]> wrote: > > > Hi, > > > > Thanks in advance for the help! > > > > I have a performance question relating to how fast I can expect Hadoop > > to scale. Running Cloudera's 0.18.3-10. > > > > I have custom binary format, which is just Google Protocol Buffer > > (protobuf) serialized data: > > > > 669 files, ~30GB total size (ranging 10MB to 100MB each). > > 128MB block size. > > 10 Hadoop Nodes. > > > > I tested my InputFormat and RecordReader for my input format, and it > > showed about 56MB/s performance (single thread, no hadoop, passed in > > test file via FileInputFormat instead of FSDataInputStream) on > > hardware similar to what I have in my cluster. > > I also then tested some simple Map logic along w/ the above, and got > > around 54MB/s. I believe that difference can be accounted for parsing > > the protobuf data into java objects. > > > > Anyways, when I put this logic into a job that has > > - no reduce (.setNumReduceTasks(0);) > > - no emit > > - just protobuf parsing calls (like above) > > > > I get a finish time of 10mins, 25sec, which is about 106.24 MB/s. > > > > So my question, why is the rate only 2x what I see on a single thread, > > non-hadoop test? Would it not be: > > 54MB/s x 10 (Num Nodes) - small hadoop overhead ? > > > > Is there any area of my configuration I should look into for tuning? > > > > Anyway I could get more accurate performance monitoring of my job? > > > > On a side note, I tried the same job after combining the files into > > about 11 files (still 30GB in size), and actually saw a decrease in > > performance (~90MB/s). > > > > Any help is appreciated. Thanks! > > > > Will > > > > some hadoop-site.xml values: > > dfs.replication 3 > > io.file.buffer.size 65536 > > dfs.datanode.handler.count 3 > > mapred.tasktracker.map.tasks.maximum 6 > > dfs.namenode.handler.count 5 > > > > -- Ted Dunning, CTO DeepDyve +
Ted Dunning 2009-07-28, 18:13
-
Re: Map performance with custom binary formatwilliam kinney 2009-07-28, 19:15
Sorry, forgot to include that detail.
Some data from ganglia: CPU: - on all 10 nodes, I am seeing for the life of the job 85-95% CPU usage, with about 10% of that being "System" CPU, vs "User". - Single node graph: http://imagebin.org/57520 - Cluster graph: http://imagebin.org/57523 Memory: - Memory used before job is about 0.4GB, During job it fluctuates up to 0.6GB and 0.7GB, then back down to 0.4GB. Most of the node memory (8GB) is showing as "Cached". - Single node graph: http://imagebin.org/57522 Network: - IN and OUT: Each node 6-12MB/s, cumulative about 30-44MB/s. - Single node graph: http://imagebin.org/57521 - Cluster graph: http://imagebin.org/57525 iostat (disk) (sampled most of the nodes, below values are ranges I saw): tps: 0.41-1.27 Blk_read/s: 46-58 Blk_wrtn/s: 20-23 (have two disks per node, both SAS, 10k RPM) --- Are those Blk_read/wrtn/s as in block size (4096?) = bytes/second? Also, from the job page (different job, same Map method, just more data...~40GB. 781 files): Map input records 629,738,080 Map input bytes 41,538,992,880 Anything else I can look into? Do my original numbers (only 2x performance) jump out at you as being way off? Or it is common to see that a setup similar to mine? I should also note that given its a custom binary format, I do not support Splitting (isSplittable() is false). I don't think that would count for such a large discrepancy in expected performance, would it? Thanks for the help, Will On Tue, Jul 28, 2009 at 12:58 PM, Scott Carey<[EMAIL PROTECTED]> wrote: > Well, the first thing to do in any performance bottleneck investigation is > to look at the machine hardware resource usage. > > During your test, what is the CPU use and disk usage? What about network > utilization? > Top, vmstat, iostat, and some network usage monitoring would be useful. It > could be many things causing your lack of scalability, but without actually > monitoring your machines to see if there is an obvious bottleneck its just > random guessing and hunches. > > > > On 7/28/09 8:18 AM, "william kinney" <[EMAIL PROTECTED]> wrote: > >> Hi, >> >> Thanks in advance for the help! >> >> I have a performance question relating to how fast I can expect Hadoop >> to scale. Running Cloudera's 0.18.3-10. >> >> I have custom binary format, which is just Google Protocol Buffer >> (protobuf) serialized data: >> >> 669 files, ~30GB total size (ranging 10MB to 100MB each). >> 128MB block size. >> 10 Hadoop Nodes. >> >> I tested my InputFormat and RecordReader for my input format, and it >> showed about 56MB/s performance (single thread, no hadoop, passed in >> test file via FileInputFormat instead of FSDataInputStream) on >> hardware similar to what I have in my cluster. >> I also then tested some simple Map logic along w/ the above, and got >> around 54MB/s. I believe that difference can be accounted for parsing >> the protobuf data into java objects. >> >> Anyways, when I put this logic into a job that has >> - no reduce (.setNumReduceTasks(0);) >> - no emit >> - just protobuf parsing calls (like above) >> >> I get a finish time of 10mins, 25sec, which is about 106.24 MB/s. >> >> So my question, why is the rate only 2x what I see on a single thread, >> non-hadoop test? Would it not be: >> 54MB/s x 10 (Num Nodes) - small hadoop overhead ? >> >> Is there any area of my configuration I should look into for tuning? >> >> Anyway I could get more accurate performance monitoring of my job? >> >> On a side note, I tried the same job after combining the files into >> about 11 files (still 30GB in size), and actually saw a decrease in >> performance (~90MB/s). >> >> Any help is appreciated. Thanks! >> >> Will >> >> some hadoop-site.xml values: >> dfs.replication 3 >> io.file.buffer.size 65536 >> dfs.datanode.handler.count 3 >> mapred.tasktracker.map.tasks.maximum 6 >> dfs.namenode.handler.count 5 >> > > +
william kinney 2009-07-28, 19:15
-
Re: Map performance with custom binary formatTed Dunning 2009-07-28, 20:25
On Tue, Jul 28, 2009 at 12:15 PM, william kinney
<[EMAIL PROTECTED]>wrote: > > Also, from the job page (different job, same Map method, just more > data...~40GB. 781 files): > Map input records 629,738,080 > Map input bytes 41,538,992,880 > > Anything else I can look into? Yes. The number of data local maps and how many maps total. > Do my original numbers (only 2x performance) jump out at you as being > way off? Or it is common to see that a setup similar to mine? It is way off. My experience is that from 5 EC2 nodes, I can sustain 100-200MB / s to the *network*. These are lesser machines than you have and you have twice as many. Moreover, your test program is nicely designed to avoid all of the overhead attendant on running a full program. It is reasonable to expect significant slow down due to startup and due to going through HDFS, but for local blocks I would expect good performance. Is it possible that the 50MB/s on a single node was not a real number? It seems somewhat high but probably reasonable with modern hardware. Was the file already in memory? -- Ted Dunning, CTO DeepDyve +
Ted Dunning 2009-07-28, 20:25
-
Re: Map performance with custom binary formatErik Paulson 2009-07-28, 20:59
On Tue, Jul 28, 2009 at 01:25:49PM -0700, Ted Dunning wrote:
> On Tue, Jul 28, 2009 at 12:15 PM, william kinney > <[EMAIL PROTECTED]>wrote: > > > > > Also, from the job page (different job, same Map method, just more > > data...~40GB. 781 files): > > Map input records 629,738,080 > > Map input bytes 41,538,992,880 > > > > Anything else I can look into? > > > Yes. The number of data local maps and how many maps total. > Do "data local maps" short-circuit to the local filesystem at all, or do they read data over HTTP from the data node's jetty instance over the loopback device? -Erik +
Erik Paulson 2009-07-28, 20:59
-
Re: Map performance with custom binary formatTed Dunning 2009-07-28, 22:15
Don't remember exactly. I do know that they are faster, especially if the
NIC is already saturated. On Tue, Jul 28, 2009 at 1:59 PM, Erik Paulson <[EMAIL PROTECTED]> wrote: > > Yes. The number of data local maps and how many maps total. > > > > Do "data local maps" short-circuit to the local filesystem at all, or do > they read data over HTTP from the data node's jetty instance over the > loopback device? -- Ted Dunning, CTO DeepDyve +
Ted Dunning 2009-07-28, 22:15
-
Re: Map performance with custom binary formatwilliam kinney 2009-07-28, 21:40
Counter Map
Reduce Total File Systems HDFS bytes read 41,538,992,880 0 41,538,992,880 Rack-local map tasks 0 0 49 Job Counters Launched map tasks 0 0 794 Data-local map tasks 0 0 732 Map input records 629,738,080 0 629,738,080 Map-Reduce Framework Map input bytes 41,538,992,880 0 41,538,992,880 Map output records 0 0 0 The 50MB/s was not on a hadoop node, but rather a local java command line program that called the RecordReader with a FileInputStream of a test file (~ 100MB, taken from one of the files on the hdfs used in the job) and looped through it (ie, while((bytesRead gbr.readGPB(BytesWritable, LongWritable)) > 0) ). I then did the protobuf parsing as it appears in my Hadoop job map method. Single thread. performance was ~50MB/s. Ran it locally on one of the boxes that Hadoop is on (to ensure same hardware and JVM). So, it wasn't already in memory, but rather read from disk via FileInputStream (Didn't use BufferedInputStream). Hardware is pretty beefy, Dual Core Xeon 2.6Ghz, 2 x 10K SAS, 8GB RAM. Sun JVM "1.6.0_13", 64-bit HotSpot. On Tue, Jul 28, 2009 at 4:25 PM, Ted Dunning<[EMAIL PROTECTED]> wrote: > On Tue, Jul 28, 2009 at 12:15 PM, william kinney > <[EMAIL PROTECTED]>wrote: > >> >> Also, from the job page (different job, same Map method, just more >> data...~40GB. 781 files): >> Map input records 629,738,080 >> Map input bytes 41,538,992,880 >> >> Anything else I can look into? > > > Yes. The number of data local maps and how many maps total. > > >> Do my original numbers (only 2x performance) jump out at you as being >> way off? Or it is common to see that a setup similar to mine? > > > It is way off. My experience is that from 5 EC2 nodes, I can sustain > 100-200MB / s to the *network*. These are lesser machines than you have and > you have twice as many. Moreover, your test program is nicely designed to > avoid all of the overhead attendant on running a full program. It is > reasonable to expect significant slow down due to startup and due to going > through HDFS, but for local blocks I would expect good performance. > > Is it possible that the 50MB/s on a single node was not a real number? It > seems somewhat high but probably reasonable with modern hardware. Was the > file already in memory? > > > -- > Ted Dunning, CTO > DeepDyve > +
william kinney 2009-07-28, 21:40
-
Re: Map performance with custom binary formatTed Dunning 2009-07-28, 20:27
Depends on the distribution of file sizes.
If you run your single node program on all the files, what speed do you get? On Tue, Jul 28, 2009 at 12:15 PM, william kinney <[EMAIL PROTECTED]>wrote: > I should also note that given its a custom binary format, I do not > support Splitting (isSplittable() is false). I don't think that would > count for such a large discrepancy in expected performance, would it? > -- Ted Dunning, CTO DeepDyve +
Ted Dunning 2009-07-28, 20:27
-
Re: Map performance with custom binary formatScott Carey 2009-07-28, 22:35
See below:
On 7/28/09 12:15 PM, "william kinney" <[EMAIL PROTECTED]> wrote: > Sorry, forgot to include that detail. > > Some data from ganglia: > > CPU: > - on all 10 nodes, I am seeing for the life of the job 85-95% CPU > usage, with about 10% of that being "System" CPU, vs "User". > - Single node graph: http://imagebin.org/57520 > - Cluster graph: http://imagebin.org/57523 Ok, CPU is definitely loaded. Identify which processes are primarily responsible (Tasks? Datanode? Tasktracker?) You'll want to make the processes eating CPU during a run spit out some stack traces to 'profile' the activity. Use either the 'jstack' utility with the JDK, or do a 'kill -3 <pid>' on a java process to spit out the stack trace to stdout. You'll want to do this a handful of times on a single job if possible to identify any trends. > > Memory: > - Memory used before job is about 0.4GB, During job it fluctuates > up to 0.6GB and 0.7GB, then back down to 0.4GB. Most of the node > memory (8GB) is showing as "Cached". > - Single node graph: http://imagebin.org/57522 So the OS is mostly just caching disk files in RAM. > > Network: > - IN and OUT: Each node 6-12MB/s, cumulative about 30-44MB/s. > - Single node graph: http://imagebin.org/57521 > - Cluster graph: http://imagebin.org/57525 > That is a not insignificant, but cumulative across the cluster its not much. > iostat (disk) (sampled most of the nodes, below values are ranges I saw): > tps: 0.41-1.27 > Blk_read/s: 46-58 > Blk_wrtn/s: 20-23 > (have two disks per node, both SAS, 10k RPM) > Did you do iostat with a parameter to have it spit out more than one row? By default, it spits out data averaged since boot time, like vmstat. My favorite iostat params for monitoring are: iostat -mx 5 iostat -dmx 5 (or 10 or 15 or 60 second intervals depending on what I'm doing) Ganglia might have some I/O info -- you want both iops and some sort of bytes/sec measurement. > --- > Are those Blk_read/wrtn/s as in block size (4096?) = bytes/second? > I think its the 512 byte block notion, but I always use -m to put it in useful units. > Also, from the job page (different job, same Map method, just more > data...~40GB. 781 files): > Map input records 629,738,080 > Map input bytes 41,538,992,880 > > Anything else I can look into? Based on your other email: There are almost 800 map tasks, these seem to mostly be data local. The current implementation of the JobTracker schedules rather slowly, and can at best place one new task per node per 2 seconds or so on a small cluster. So, with 10 servers, it will take at least 80 seconds just to schedule all the tasks. If each server can run 8 tasks concurrently, then if the average task doesn¹t take somewhat longer than 16 seconds, the system will not reach full utilization. What does the web interface tell you about the number of concurrent map tasks during the run? Does it approach the max task slots? You can look at the logs for an individual task, and see how much data it read, and how long it took. It might be hitting your 50MB/sec or close in a burst, or perhaps not. Given the sort of bottlenecks I often see, I suspect the scheduling. But, you have almost maxed CPU use, so its probably not that. Getting stack dumps to see what the processor is doing during your test will help narrow it down. > > Do my original numbers (only 2x performance) jump out at you as being > way off? Or it is common to see that a setup similar to mine? > > I should also note that given its a custom binary format, I do not > support Splitting (isSplittable() is false). I don't think that would > count for such a large discrepancy in expected performance, would it? > If the files are all larger than the block size, it would cause a lot more network activity -- but unless your switch or network is broken or not gigabit -- there is a lot of capacity left in the network. > Thanks for the help, +
Scott Carey 2009-07-28, 22:35
-
Re: Map performance with custom binary formatJason Venner 2009-07-29, 02:01
Is it possible that your tasks are not falling evenly over the machines of
your cluster, but piling up on a small number of machines? On Tue, Jul 28, 2009 at 3:35 PM, Scott Carey <[EMAIL PROTECTED]>wrote: > See below: > > > On 7/28/09 12:15 PM, "william kinney" <[EMAIL PROTECTED]> wrote: > > > Sorry, forgot to include that detail. > > > > Some data from ganglia: > > > > CPU: > > - on all 10 nodes, I am seeing for the life of the job 85-95% CPU > > usage, with about 10% of that being "System" CPU, vs "User". > > - Single node graph: http://imagebin.org/57520 > > - Cluster graph: http://imagebin.org/57523 > > Ok, CPU is definitely loaded. Identify which processes are primarily > responsible (Tasks? Datanode? Tasktracker?) You'll want to make the > processes eating CPU during a run spit out some stack traces to 'profile' > the activity. Use either the 'jstack' utility with the JDK, or do a 'kill > -3 <pid>' on a java process to spit out the stack trace to stdout. You'll > want to do this a handful of times on a single job if possible to identify > any trends. > > > > > Memory: > > - Memory used before job is about 0.4GB, During job it fluctuates > > up to 0.6GB and 0.7GB, then back down to 0.4GB. Most of the node > > memory (8GB) is showing as "Cached". > > - Single node graph: http://imagebin.org/57522 > > So the OS is mostly just caching disk files in RAM. > > > > > Network: > > - IN and OUT: Each node 6-12MB/s, cumulative about 30-44MB/s. > > - Single node graph: http://imagebin.org/57521 > > - Cluster graph: http://imagebin.org/57525 > > > > That is a not insignificant, but cumulative across the cluster its not > much. > > > iostat (disk) (sampled most of the nodes, below values are ranges I saw): > > tps: 0.41-1.27 > > Blk_read/s: 46-58 > > Blk_wrtn/s: 20-23 > > (have two disks per node, both SAS, 10k RPM) > > > > Did you do iostat with a parameter to have it spit out more than one row? > By default, it spits out data averaged since boot time, like vmstat. > My favorite iostat params for monitoring are: > iostat -mx 5 > iostat -dmx 5 > (or 10 or 15 or 60 second intervals depending on what I'm doing) Ganglia > might have some I/O info -- you want both iops and some sort of bytes/sec > measurement. > > > --- > > Are those Blk_read/wrtn/s as in block size (4096?) = bytes/second? > > > > I think its the 512 byte block notion, but I always use -m to put it in > useful units. > > > Also, from the job page (different job, same Map method, just more > > data...~40GB. 781 files): > > Map input records 629,738,080 > > Map input bytes 41,538,992,880 > > > > Anything else I can look into? > > Based on your other email: > > There are almost 800 map tasks, these seem to mostly be data local. The > current implementation of the JobTracker schedules rather slowly, and can > at > best place one new task per node per 2 seconds or so on a small cluster. > So, with 10 servers, it will take at least 80 seconds just to schedule all > the tasks. > If each server can run 8 tasks concurrently, then if the average task > doesn¹t take somewhat longer than 16 seconds, the system will not reach > full > utilization. > > What does the web interface tell you about the number of concurrent map > tasks during the run? Does it approach the max task slots? > > You can look at the logs for an individual task, and see how much data it > read, and how long it took. It might be hitting your 50MB/sec or close in > a > burst, or perhaps not. > > Given the sort of bottlenecks I often see, I suspect the scheduling. But, > you have almost maxed CPU use, so its probably not that. Getting stack > dumps to see what the processor is doing during your test will help narrow > it down. > > > > > > Do my original numbers (only 2x performance) jump out at you as being > > way off? Or it is common to see that a setup similar to mine? > > > > I should also note that given its a custom binary format, I do not Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals +
Jason Venner 2009-07-29, 02:01
-
Re: Map performance with custom binary formatwilliam kinney 2009-07-30, 04:10
"Is it possible that your tasks are not falling evenly over the machines of
your cluster, but piling up on a small number of machines?" - TaskTracker web interface on each node showed constant (2) active map tasks ("RUNNING"). Also, CPU is pegged on each node during the job. Thanks, Will On Tue, Jul 28, 2009 at 10:01 PM, Jason Venner<[EMAIL PROTECTED]> wrote: > Is it possible that your tasks are not falling evenly over the machines of > your cluster, but piling up on a small number of machines? > > On Tue, Jul 28, 2009 at 3:35 PM, Scott Carey <[EMAIL PROTECTED]>wrote: > >> See below: >> >> >> On 7/28/09 12:15 PM, "william kinney" <[EMAIL PROTECTED]> wrote: >> >> > Sorry, forgot to include that detail. >> > >> > Some data from ganglia: >> > >> > CPU: >> > - on all 10 nodes, I am seeing for the life of the job 85-95% CPU >> > usage, with about 10% of that being "System" CPU, vs "User". >> > - Single node graph: http://imagebin.org/57520 >> > - Cluster graph: http://imagebin.org/57523 >> >> Ok, CPU is definitely loaded. Identify which processes are primarily >> responsible (Tasks? Datanode? Tasktracker?) You'll want to make the >> processes eating CPU during a run spit out some stack traces to 'profile' >> the activity. Use either the 'jstack' utility with the JDK, or do a 'kill >> -3 <pid>' on a java process to spit out the stack trace to stdout. You'll >> want to do this a handful of times on a single job if possible to identify >> any trends. >> >> > >> > Memory: >> > - Memory used before job is about 0.4GB, During job it fluctuates >> > up to 0.6GB and 0.7GB, then back down to 0.4GB. Most of the node >> > memory (8GB) is showing as "Cached". >> > - Single node graph: http://imagebin.org/57522 >> >> So the OS is mostly just caching disk files in RAM. >> >> > >> > Network: >> > - IN and OUT: Each node 6-12MB/s, cumulative about 30-44MB/s. >> > - Single node graph: http://imagebin.org/57521 >> > - Cluster graph: http://imagebin.org/57525 >> > >> >> That is a not insignificant, but cumulative across the cluster its not >> much. >> >> > iostat (disk) (sampled most of the nodes, below values are ranges I saw): >> > tps: 0.41-1.27 >> > Blk_read/s: 46-58 >> > Blk_wrtn/s: 20-23 >> > (have two disks per node, both SAS, 10k RPM) >> > >> >> Did you do iostat with a parameter to have it spit out more than one row? >> By default, it spits out data averaged since boot time, like vmstat. >> My favorite iostat params for monitoring are: >> iostat -mx 5 >> iostat -dmx 5 >> (or 10 or 15 or 60 second intervals depending on what I'm doing) Ganglia >> might have some I/O info -- you want both iops and some sort of bytes/sec >> measurement. >> >> > --- >> > Are those Blk_read/wrtn/s as in block size (4096?) = bytes/second? >> > >> >> I think its the 512 byte block notion, but I always use -m to put it in >> useful units. >> >> > Also, from the job page (different job, same Map method, just more >> > data...~40GB. 781 files): >> > Map input records 629,738,080 >> > Map input bytes 41,538,992,880 >> > >> > Anything else I can look into? >> >> Based on your other email: >> >> There are almost 800 map tasks, these seem to mostly be data local. The >> current implementation of the JobTracker schedules rather slowly, and can >> at >> best place one new task per node per 2 seconds or so on a small cluster. >> So, with 10 servers, it will take at least 80 seconds just to schedule all >> the tasks. >> If each server can run 8 tasks concurrently, then if the average task >> doesn¹t take somewhat longer than 16 seconds, the system will not reach >> full >> utilization. >> >> What does the web interface tell you about the number of concurrent map >> tasks during the run? Does it approach the max task slots? >> >> You can look at the logs for an individual task, and see how much data it >> read, and how long it took. It might be hitting your 50MB/sec or close in >> a > +
william kinney 2009-07-30, 04:10
-
Re: Map performance with custom binary formatwilliam kinney 2009-07-30, 04:07
OK:
implemented some iotop/iostat monitoring in ganglia. Looks pretty standard (job was 23:00 to 23:06): - Single Node Disk Read: http://imagebin.org/57716 - Single Node Disk Write: http://imagebin.org/57717 On each node, noticed that the two TaskTracker$Child processes were consuming close to 90% of each core. The TaskTracker and DataNode were close to 0%. For the TT children, I did jstack dumps, but didn't really see much that popped out other than a lot of time spent in a SimpleDateFormat section and the protobuf parse. I switched the SDF out with commons.lang FastDateFormat, which reduced the total time for both the Hadoop job and the local/non-hadoop test, so still a discrepancy between local and hadoop runs. "You can look at the logs for an individual task, and see how much data it read, and how long it took. It might be hitting your 50MB/sec or close in a burst, or perhaps not." - I decided to log the performance of each RecordReader use within hadoop, which is essentially 1:1 for TaskTracker$Child process since I have 1 InputSplit per file (ie, no splitting), right?. Saw: Example 1) 527639090 bytes in : 18050 ms. (27.8778 MB/s) Example 2) 533770314 bytes in : 23494 ms. (21.6669 MB/s) Example 3) 529711886 bytes in : 20092 ms. (25.1429 MB/s) ...etc For reference, the non-hadoop/local test: 530710906 bytes in : 9133 ms. (55.41721 MB/s) Regarding the JobTracker only doing 1 task / node / 2 seconds, that will definitely hurt. Although the above discrepancy takes priority for me, for now. "What does the web interface tell you about the number of concurrent map tasks during the run? Does it approach the max task slots?" - Yeah it definitely does, from the TaskTracker page on each node, I'm seeing almost always 2 "RUNNING" tasks (and an accumulating list of "COMMIT_PENDING" tasks under Non-Running, which slowly grows as the job progresses). Normal? Also, I used a profiler to profile a local/non-hadoop test of the RecordReader/Map(): class: %Time org.apache.commons.lang.time.FastDateFormat.format(long): 46% com......parseFrom(byte[]): 42% java.io.FileInputStream.read(byte[], int, int): 5% ...rest are 1%'ish I guess this doesn't show anything helpful. I'll try to attach it to hadoop remotely...anyone have any experience doing this w/ YourKit Java Profiler? Anyways, decided to test the "large files" vs "small files" theory again: Small files (1449 files, ranging 10-100MB. average: 32 MB) - HDFS bytes read 49,057,491,374 - Map input records 737,850,142 - Finished in: 7mins, 26sec ... 104.898 MB/s Large files (22 files, around 500MB. average 514MB) - HDFS bytes read 11,852,421,152 - Map input records 179,657,432 - Finished in: 1mins, 8sec ... 166.225 MB/s Not sure why before the large files were taking longer, perhaps the SimpleDateFormat>FastDateFormat change? Anyways, good to see where I need to take the file sizes too...but still 166 MB is not the rate I was hoping for (given the # of nodes and local performance). So I guess in summary, hadoop TaskTracker$Child processes that are doing the Map() and RecordReader are about 50% slower than the normal, local non-hadoop version. In addition, their rate (~25MB/s) * Num Nodes (10) suggests ~ 250MB/s total job performance, but I'm only seeing ~166MB/s. Will On Tue, Jul 28, 2009 at 6:35 PM, Scott Carey<[EMAIL PROTECTED]> wrote: > See below: > > > On 7/28/09 12:15 PM, "william kinney" <[EMAIL PROTECTED]> wrote: > >> Sorry, forgot to include that detail. >> >> Some data from ganglia: >> >> CPU: >> - on all 10 nodes, I am seeing for the life of the job 85-95% CPU >> usage, with about 10% of that being "System" CPU, vs "User". >> - Single node graph: http://imagebin.org/57520 >> - Cluster graph: http://imagebin.org/57523 > > Ok, CPU is definitely loaded. Identify which processes are primarily > responsible (Tasks? Datanode? Tasktracker?) You'll want to make the > processes eating CPU during a run spit out some stack traces to 'profile' +
william kinney 2009-07-30, 04:07
-
Re: Map performance with custom binary formatScott Carey 2009-07-30, 05:31
What is the JRE for the Hadoop nodes versus local? What are the JVM arguments for the child tasks and the local version (and heap size)? What is the hardware and platform details for the nodes versus the local test?
Is the data compressed in Hadoop (check the config)? You mention the TaskTracker web inerface during a job, but what about the JobTracker interface? This should show the global view of currently scheduled maps versus total slots. Lastly, check out some more stack traces on the tasks. If they are all still in the DateFormat stuff? Surely some of them should be in your parseFrom() method too? On 7/29/09 9:07 PM, "william kinney" <[EMAIL PROTECTED]> wrote: OK: implemented some iotop/iostat monitoring in ganglia. Looks pretty standard (job was 23:00 to 23:06): - Single Node Disk Read: http://imagebin.org/57716 - Single Node Disk Write: http://imagebin.org/57717 On each node, noticed that the two TaskTracker$Child processes were consuming close to 90% of each core. The TaskTracker and DataNode were close to 0%. For the TT children, I did jstack dumps, but didn't really see much that popped out other than a lot of time spent in a SimpleDateFormat section and the protobuf parse. I switched the SDF out with commons.lang FastDateFormat, which reduced the total time for both the Hadoop job and the local/non-hadoop test, so still a discrepancy between local and hadoop runs. "You can look at the logs for an individual task, and see how much data it read, and how long it took. It might be hitting your 50MB/sec or close in a burst, or perhaps not." - I decided to log the performance of each RecordReader use within hadoop, which is essentially 1:1 for TaskTracker$Child process since I have 1 InputSplit per file (ie, no splitting), right?. Saw: Example 1) 527639090 bytes in : 18050 ms. (27.8778 MB/s) Example 2) 533770314 bytes in : 23494 ms. (21.6669 MB/s) Example 3) 529711886 bytes in : 20092 ms. (25.1429 MB/s) ...etc For reference, the non-hadoop/local test: 530710906 bytes in : 9133 ms. (55.41721 MB/s) Regarding the JobTracker only doing 1 task / node / 2 seconds, that will definitely hurt. Although the above discrepancy takes priority for me, for now. "What does the web interface tell you about the number of concurrent map tasks during the run? Does it approach the max task slots?" - Yeah it definitely does, from the TaskTracker page on each node, I'm seeing almost always 2 "RUNNING" tasks (and an accumulating list of "COMMIT_PENDING" tasks under Non-Running, which slowly grows as the job progresses). Normal? Also, I used a profiler to profile a local/non-hadoop test of the RecordReader/Map(): class: %Time org.apache.commons.lang.time.FastDateFormat.format(long): 46% com......parseFrom(byte[]): 42% java.io.FileInputStream.read(byte[], int, int): 5% ...rest are 1%'ish I guess this doesn't show anything helpful. I'll try to attach it to hadoop remotely...anyone have any experience doing this w/ YourKit Java Profiler? Anyways, decided to test the "large files" vs "small files" theory again: Small files (1449 files, ranging 10-100MB. average: 32 MB) - HDFS bytes read 49,057,491,374 - Map input records 737,850,142 - Finished in: 7mins, 26sec ... 104.898 MB/s Large files (22 files, around 500MB. average 514MB) - HDFS bytes read 11,852,421,152 - Map input records 179,657,432 - Finished in: 1mins, 8sec ... 166.225 MB/s Not sure why before the large files were taking longer, perhaps the SimpleDateFormat>FastDateFormat change? Anyways, good to see where I need to take the file sizes too...but still 166 MB is not the rate I was hoping for (given the # of nodes and local performance). So I guess in summary, hadoop TaskTracker$Child processes that are doing the Map() and RecordReader are about 50% slower than the normal, local non-hadoop version. In addition, their rate (~25MB/s) * Num Nodes (10) suggests ~ 250MB/s total job performance, but I'm only seeing ~166MB/s. Will On Tue, Jul 28, 2009 at 6:35 PM, Scott Carey<[EMAIL PROTECTED]> wrote: +
Scott Carey 2009-07-30, 05:31
-
Re: Map performance with custom binary formatwilliam kinney 2009-07-30, 14:37
Local is executed on a Hadoop node (when no job is running), So same
JRE/hardware. JRE: java version "1.6.0_13" Java(TM) SE Runtime Environment (build 1.6.0_13-b03) Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode) JVM arguments for child task: /usr/java/jdk1.6.0_13/jre/bin/java -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64-64:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0048_m_000008_0/work -Xmx486m -Djava.io.tmpdir=/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0048_m_000008_0/work/tmp -classpath /etc/hadoop/conf:/usr/java/default/lib/tools.jar:/usr/lib/hadoop:/usr/lib/hadoop/hadoop-0.18.3-10.cloudera.CH0_3-core.jar:/usr/lib/hadoop/lib/commons-cli-2.0-SNAPSHOT.jar:/usr/lib/hadoop/lib/commons-codec-1.3.jar:/usr/lib/hadoop/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop/lib/commons-net-1.4.1.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-fairscheduler.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-scribe-log4j.jar:/usr/lib/hadoop/lib/hsqldb.jar:/usr/lib/hadoop/lib/jets3t-0.6.1.jar:/usr/lib/hadoop/lib/jetty-5.1.4.jar:/usr/lib/hadoop/lib/junit-4.5.jar:/usr/lib/hadoop/lib/kfs-0.1.3.jar:/usr/lib/hadoop/lib/log4j-1.2.15.jar:/usr/lib/hadoop/lib/mysql-connector-java-5.0.8-bin.jar:/usr/lib/hadoop/lib/oro-2.0.8.jar:/usr/lib/hadoop/lib/servlet-api.jar:/usr/lib/hadoop/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop/lib/xmlenc-0.52.jar:/usr/lib/hadoop/lib/jetty-ext/commons-el.jar:/usr/lib/hadoop/lib/jetty-ext/jasper-compiler.jar:/usr/lib/hadoop/lib/jetty-ext/jasper-runtime.jar:/usr/lib/hadoop/lib/jetty-ext/jsp-api.jar::/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/jars/classes:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/jars:/disk2/hadoop/mapred/local/taskTracker/archive/master.ph-cloud.detica.us.com/tmp/hadoop-hadoop/mapred/system/job_200907242015_0048/libjars/ph-gpb-1.0.5.jar/ph-gpb-1.0.5.jar:/disk2/hadoop/mapred/local/taskTracker/archive/<master-hostname>/tmp/hadoop-hadoop/mapred/system/job_200907242015_0048/libjars/protobuf-java-2.0.3.jar/protobuf-java-2.0.3.jar:/disk1/hadoop/mapred/local/taskTracker/archive/<master-hostname>/tmp/hadoop-hadoop/mapred/system/job_200907242015_0048/libjars/commons-lang-2.4.jar/commons-lang-2.4.jar:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0048_m_000008_0/work -Dhadoop.log.dir=/var/log/hadoop -Dhadoop.root.logger=INFO,TLA -Dhadoop.tasklog.taskid=attempt_200907242015_0048_m_000008_0 -Dhadoop.tasklog.totalLogFileSize=0 org.apache.hadoop.mapred.TaskTracker$Child 127.0.0.1 52698 attempt_200907242015_0048_m_000008_0 child task heap: -Xmx486m (mapred.child.ulimit is 995328) Local call has no JVM arguments, just: java -cp <myjar>.jar com......RecordReaderTest <fileToTest> Hardware platform: Linux 2.6.27.24-170.2.68.fc10.x86_64 8GB Dual Core Xeon E5205 2 x 300GB SAS 10k, no RAID. (HP ProLiant blade) Data is not compressed. JobTracker: Running: Started around 20, but as the job progressed it slowly increased to at the end: 432 (when Pending was 0). Running dropped to 0/Status was marked Succeeded about 10 seconds after that. Is this normal? The total # of Tasks was 1449. Stack Traces. Looked at about 20 stack traces from 2 different nodes. Consistently saw: 2 x org.apache.hadoop.dfs.DFSClient$LeaseChecker @ Thread.sleep() "Comm thread for attempt_200907242015_0050_m_001409_0" @ Thread.sleep() "IPC Client (47) connection to <master-hostname>/192.168.1.100:8020 from wkinney" @ Object.wait() "IPC Client (47) connection to /127.0.0.1:49202 from an unknown user" @ Object.wait() VM, GC, Signal Dispatcher, Low Memory Detector, CompilerThread, Finalizer, Reference Handler... Then would sometimes see FastDateFormat thread, parseFrom(), or somewhere near there (e.g. MapRunner.run()) Finally, I consistently saw this: "Thread-5" daemon prio=10 tid=0x0000000040bbfc00 nid=0x2f87 in Object.wait() [0x00007fb7498ce000..0x00007fb7498cebf0] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(Native Method) - waiting on <0x00007fb769fdec00> (a java.util.LinkedList) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1905) - locked <0x00007fb769fdec00> (a java.util.LinkedList) I'm guessing this is normal DataNode activity... Will On Thu, Jul 30, 2009 at 1:31 AM, Scott Carey<[EMAIL PROTECTED]> wrote: +
william kinney 2009-07-30, 14:37
-
Re: Map performance with custom binary formatwilliam kinney 2009-07-30, 17:42
To simplify things, I removed the protobuf parse and DateFormat calls.
Local execution I noticed was now over 200MB/s....which is much greater than my physical disk performance (hdparm -t shows ~80MB/s). I "sync"ed and flush the RAM cache (echo 3 > /proc/sys/vm/drop_caches), and now get around 110MB/s. Still higher than the disk physical limit, so I'm quite puzzled. When running this pared down Map job, with 22 ~500MB files (~11GB), I see a total job performance of 364.624 MB/s (finished in 31 seconds). When looking at the TaskTracker$Child logs, to see how long each process was taking per file, I saw 30-37 MB/s. So I guess two things: - The total job performance/time is very good, and looks to scale with my 10 nodes (34MB * 10?). Should it have scaled even better, since I have 2 TT Children running per node? - The child performance is still much less than what I'm seeing in a local process (110MB/s vs ~30 MB/s). I'll need to profile some of the processes to find out where it's spending the time, correct? Any other ideas? Thanks! Will On Thu, Jul 30, 2009 at 10:37 AM, william kinney<[EMAIL PROTECTED]> wrote: > Local is executed on a Hadoop node (when no job is running), So same > JRE/hardware. > > JRE: > java version "1.6.0_13" > Java(TM) SE Runtime Environment (build 1.6.0_13-b03) > Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode) > > JVM arguments for child task: > /usr/java/jdk1.6.0_13/jre/bin/java > -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64-64:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0048_m_000008_0/work > -Xmx486m -Djava.io.tmpdir=/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0048_m_000008_0/work/tmp > -classpath /etc/hadoop/conf:/usr/java/default/lib/tools.jar:/usr/lib/hadoop:/usr/lib/hadoop/hadoop-0.18.3-10.cloudera.CH0_3-core.jar:/usr/lib/hadoop/lib/commons-cli-2.0-SNAPSHOT.jar:/usr/lib/hadoop/lib/commons-codec-1.3.jar:/usr/lib/hadoop/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop/lib/commons-net-1.4.1.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-fairscheduler.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-scribe-log4j.jar:/usr/lib/hadoop/lib/hsqldb.jar:/usr/lib/hadoop/lib/jets3t-0.6.1.jar:/usr/lib/hadoop/lib/jetty-5.1.4.jar:/usr/lib/hadoop/lib/junit-4.5.jar:/usr/lib/hadoop/lib/kfs-0.1.3.jar:/usr/lib/hadoop/lib/log4j-1.2.15.jar:/usr/lib/hadoop/lib/mysql-connector-java-5.0.8-bin.jar:/usr/lib/hadoop/lib/oro-2.0.8.jar:/usr/lib/hadoop/lib/servlet-api.jar:/usr/lib/hadoop/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop/lib/xmlenc-0.52.jar:/usr/lib/hadoop/lib/jetty-ext/commons-el.jar:/usr/lib/hadoop/lib/jetty-ext/jasper-compiler.jar:/usr/lib/hadoop/lib/jetty-ext/jasper-runtime.jar:/usr/lib/hadoop/lib/jetty-ext/jsp-api.jar::/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/jars/classes:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/jars:/disk2/hadoop/mapred/local/taskTracker/archive/master.ph-cloud.detica.us.com/tmp/hadoop-hadoop/mapred/system/job_200907242015_0048/libjars/ph-gpb-1.0.5.jar/ph-gpb-1.0.5.jar:/disk2/hadoop/mapred/local/taskTracker/archive/<master-hostname>/tmp/hadoop-hadoop/mapred/system/job_200907242015_0048/libjars/protobuf-java-2.0.3.jar/protobuf-java-2.0.3.jar:/disk1/hadoop/mapred/local/taskTracker/archive/<master-hostname>/tmp/hadoop-hadoop/mapred/system/job_200907242015_0048/libjars/commons-lang-2.4.jar/commons-lang-2.4.jar:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0048_m_000008_0/work > -Dhadoop.log.dir=/var/log/hadoop -Dhadoop.root.logger=INFO,TLA > -Dhadoop.tasklog.taskid=attempt_200907242015_0048_m_000008_0 > -Dhadoop.tasklog.totalLogFileSize=0 > org.apache.hadoop.mapred.TaskTracker$Child 127.0.0.1 52698 > attempt_200907242015_0048_m_000008_0 +
william kinney 2009-07-30, 17:42
-
Re: Map performance with custom binary formatScott Carey 2009-07-30, 18:39
Maybe I should have read this one before commenting on the other :D
On 7/30/09 10:42 AM, "william kinney" <[EMAIL PROTECTED]> wrote: > To simplify things, I removed the protobuf parse and DateFormat calls. > Local execution I noticed was now over 200MB/s....which is much > greater than my physical disk performance (hdparm -t shows ~80MB/s). I > "sync"ed and flush the RAM cache (echo 3 > /proc/sys/vm/drop_caches), > and now get around 110MB/s. Still higher than the disk physical limit, > so I'm quite puzzled. The OS is still caching writes, if this is the read rate it is puzzling unless it is measuring reading some of the same data more than once. Odd. > > When running this pared down Map job, with 22 ~500MB files (~11GB), I > see a total job performance of 364.624 MB/s (finished in 31 seconds). > > When looking at the TaskTracker$Child logs, to see how long each > process was taking per file, I saw 30-37 MB/s. > > So I guess two things: > - The total job performance/time is very good, and looks to scale > with my 10 nodes (34MB * 10?). Should it have scaled even better, > since I have 2 TT Children running per node? > - The child performance is still much less than what I'm seeing in a > local process (110MB/s vs ~30 MB/s). I'll need to profile some of the > processes to find out where it's spending the time, correct? Any other > ideas? Hmm that difference seems a bit troubling. For one, you are running two tasks at once per node -- is there any way to do your local, non MR test with two concurrent processes or threads? There might be some concurrent read issues at the OS level. The single large file map task comparison to the same file locally will still be a useful test. As for read concurrency and the OS, what file system? Make sure your OS readahead on the device is set to a good value (at least 2048 blocks, preferably 8192 ish): # /sbin/blockdev --setra 8192 /dev/<devices> For me <devices> is 'sd*' Use the deadline scheduler: # echo 'deadline' > /sys/block/sda/queue/scheduler (for each device) I have found that XFS is also superior to ext3 with concurrent reads or writes, if tuned right. Mount options: noatime,allocsize=32m,logbufs=8 But since you seem CPU bound and not I/O bound this probably isn't an issue. If even a single task on a single large file is slower in MB/sec than your test program, then I suspect read/write buffer issues or misuse somewhere. (Just a SWAG though). -Scott > > Thanks! > Will > > > > On Thu, Jul 30, 2009 at 10:37 AM, william > kinney<[EMAIL PROTECTED]> wrote: >> Local is executed on a Hadoop node (when no job is running), So same >> JRE/hardware. >> >> JRE: >> java version "1.6.0_13" >> Java(TM) SE Runtime Environment (build 1.6.0_13-b03) >> Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode) >> >> JVM arguments for child task: >> /usr/java/jdk1.6.0_13/jre/bin/java >> -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64-64:/disk1/hadoop/m >> apred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0 >> 048_m_000008_0/work >> -Xmx486m >> -Djava.io.tmpdir=/disk1/hadoop/mapred/local/taskTracker/jobcache/job_20090724 >> 2015_0048/attempt_200907242015_0048_m_000008_0/work/tmp >> -classpath >> /etc/hadoop/conf:/usr/java/default/lib/tools.jar:/usr/lib/hadoop:/usr/lib/had >> oop/hadoop-0.18.3-10.cloudera.CH0_3-core.jar:/usr/lib/hadoop/lib/commons-cli- >> 2.0-SNAPSHOT.jar:/usr/lib/hadoop/lib/commons-codec-1.3.jar:/usr/lib/hadoop/li >> b/commons-httpclient-3.0.1.jar:/usr/lib/hadoop/lib/commons-logging-1.0.4.jar: >> /usr/lib/hadoop/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop/lib/commons >> -net-1.4.1.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-fairschedu >> ler.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-scribe-log4j.jar: >> /usr/lib/hadoop/lib/hsqldb.jar:/usr/lib/hadoop/lib/jets3t-0.6.1.jar:/usr/lib/ >> hadoop/lib/jetty-5.1.4.jar:/usr/lib/hadoop/lib/junit-4.5.jar:/usr/lib/hadoop/ >> lib/kfs-0.1.3.jar +
Scott Carey 2009-07-30, 18:39
-
Re: Map performance with custom binary formatTodd Lipcon 2009-07-30, 18:51
On Thu, Jul 30, 2009 at 11:39 AM, Scott Carey <[EMAIL PROTECTED]>wrote:
> > > Use the deadline scheduler: > # echo 'deadline' > /sys/block/sda/queue/scheduler (for each device) > Have you found the deadline scheduler to be significantly better than the default cfq? I've used deadline for RDBMS workloads, but I wouldn't imagine the difference would be distinct for Hadoop. -Todd > I have found that XFS is also superior to ext3 with concurrent reads or > writes, if tuned right. > Mount options: noatime,allocsize=32m,logbufs=8 > But since you seem CPU bound and not I/O bound this probably isn't an > issue. > > If even a single task on a single large file is slower in MB/sec than your > test program, then I suspect read/write buffer issues or misuse somewhere. > (Just a SWAG though). > > -Scott > > > > > Thanks! > > Will > > > > > > > > On Thu, Jul 30, 2009 at 10:37 AM, william > > kinney<[EMAIL PROTECTED]> wrote: > >> Local is executed on a Hadoop node (when no job is running), So same > >> JRE/hardware. > >> > >> JRE: > >> java version "1.6.0_13" > >> Java(TM) SE Runtime Environment (build 1.6.0_13-b03) > >> Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode) > >> > >> JVM arguments for child task: > >> /usr/java/jdk1.6.0_13/jre/bin/java > >> > -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64-64:/disk1/hadoop/m > >> > apred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0 > >> 048_m_000008_0/work > >> -Xmx486m > >> > -Djava.io.tmpdir=/disk1/hadoop/mapred/local/taskTracker/jobcache/job_20090724 > >> 2015_0048/attempt_200907242015_0048_m_000008_0/work/tmp > >> -classpath > >> > /etc/hadoop/conf:/usr/java/default/lib/tools.jar:/usr/lib/hadoop:/usr/lib/had > >> > oop/hadoop-0.18.3-10.cloudera.CH0_3-core.jar:/usr/lib/hadoop/lib/commons-cli- > >> > 2.0-SNAPSHOT.jar:/usr/lib/hadoop/lib/commons-codec-1.3.jar:/usr/lib/hadoop/li > >> > b/commons-httpclient-3.0.1.jar:/usr/lib/hadoop/lib/commons-logging-1.0.4.jar: > >> > /usr/lib/hadoop/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop/lib/commons > >> > -net-1.4.1.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-fairschedu > >> > ler.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-scribe-log4j.jar: > >> > /usr/lib/hadoop/lib/hsqldb.jar:/usr/lib/hadoop/lib/jets3t-0.6.1.jar:/usr/lib/ > >> > hadoop/lib/jetty-5.1.4.jar:/usr/lib/hadoop/lib/junit-4.5.jar:/usr/lib/hadoop/ > >> > lib/kfs-0.1.3.jar:/usr/lib/hadoop/lib/log4j-1.2.15.jar:/usr/lib/hadoop/lib/my > >> > sql-connector-java-5.0.8-bin.jar:/usr/lib/hadoop/lib/oro-2.0.8.jar:/usr/lib/h > >> > adoop/lib/servlet-api.jar:/usr/lib/hadoop/lib/slf4j-api-1.4.3.jar:/usr/lib/ha > >> > doop/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop/lib/xmlenc-0.52.jar:/usr/lib > >> > /hadoop/lib/jetty-ext/commons-el.jar:/usr/lib/hadoop/lib/jetty-ext/jasper-com > >> > piler.jar:/usr/lib/hadoop/lib/jetty-ext/jasper-runtime.jar:/usr/lib/hadoop/li > >> > b/jetty-ext/jsp-api.jar::/disk1/hadoop/mapred/local/taskTracker/jobcache/job_ > >> > 200907242015_0048/jars/classes:/disk1/hadoop/mapred/local/taskTracker/jobcach > >> > e/job_200907242015_0048/jars:/disk2/hadoop/mapred/local/taskTracker/archive/m > >> > aster.ph-cloud.detica.us.com/tmp/hadoop-hadoop/mapred/system/job_200907242015 > >> > _0048/libjars/ph-gpb-1.0.5.jar/ph-gpb-1.0.5.jar:/disk2/hadoop/mapred/local/ta > >> > skTracker/archive/<master-hostname>/tmp/hadoop-hadoop/mapred/system/job_20090 > >> > 7242015_0048/libjars/protobuf-java-2.0.3.jar/protobuf-java-2.0.3.jar:/disk1/h > >> > adoop/mapred/local/taskTracker/archive/<master-hostname>/tmp/hadoop-hadoop/ma > >> > pred/system/job_200907242015_0048/libjars/commons-lang-2.4.jar/commons-lang-2 > >> > .4.jar:/disk1/hadoop/mapred/local/taskTracker/jobcache/job_200907242015_0048/ > >> attempt_200907242015_0048_m_000008_0/work > >> -Dhadoop.log.dir=/var/log/hadoop -Dhadoop.root.logger=INFO,TLA > >> -Dhadoop.tasklog.taskid=attempt_200907242015_0048_m_000008_0 > >> -Dhadoop.tasklog.totalLogFileSize=0 +
Todd Lipcon 2009-07-30, 18:51
-
Re: Map performance with custom binary formatScott Carey 2009-07-30, 20:32
On 7/30/09 11:51 AM, "Todd Lipcon" <[EMAIL PROTECTED]> wrote: > On Thu, Jul 30, 2009 at 11:39 AM, Scott Carey <[EMAIL PROTECTED]>wrote: > >> >> >> Use the deadline scheduler: >> # echo 'deadline' > /sys/block/sda/queue/scheduler (for each device) >> > > Have you found the deadline scheduler to be significantly better than the > default cfq? I've used deadline for RDBMS workloads, but I wouldn't imagine > the difference would be distinct for Hadoop. > > -Todd On a 29 job dependency chain workflow, it consistently was ~5% faster for me on tasks that did a lot of spilling to disk (XFS). The machines run 13 concurrent maps and 9 concurrent reduces and have 4 disks JBOD. Centos 5.3 So, it wasn't significant, but it did seem to be there. Since deadline is a device level global elevator, it should be better for this type of workload (throughput) than cfq, which is designed to prevent I/O starvation at a per-process level. The difference is greatest when you have multiple concurrent sequential readers or writers mixed with a little random I/O. CFQ tends to favor the random I/O while deadline tends to favor bytes/sec throughput. If those random I/O's are latency sensitive and important, CFQ is better, if the total byte throughput is most important deadline helps more. However, if the readahead value isn't set high, both will favor the random i/o. I have a few 'fio' benchmark profiles and results testing concurrent I/O performance as a function of scheduler and readahead value. Deadline + sizeable readahead gives highest bytes/sec total throughput but will increase latency for small reads. > > >> I have found that XFS is also superior to ext3 with concurrent reads or >> writes, if tuned right. >> Mount options: noatime,allocsize=32m,logbufs=8 >> But since you seem CPU bound and not I/O bound this probably isn't an >> issue. >> >> If even a single task on a single large file is slower in MB/sec than your >> test program, then I suspect read/write buffer issues or misuse somewhere. >> (Just a SWAG though). >> >> -Scott >> >>> >>> Thanks! >>> Will >>> >>> >>> >>> On Thu, Jul 30, 2009 at 10:37 AM, william >>> kinney<[EMAIL PROTECTED]> wrote: >>>> Local is executed on a Hadoop node (when no job is running), So same >>>> JRE/hardware. >>>> >>>> JRE: >>>> java version "1.6.0_13" >>>> Java(TM) SE Runtime Environment (build 1.6.0_13-b03) >>>> Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode) >>>> >>>> JVM arguments for child task: >>>> /usr/java/jdk1.6.0_13/jre/bin/java >>>> >> -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64-64:/disk1/hadoop/m >>>> >> apred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0 >>>> 048_m_000008_0/work >>>> -Xmx486m >>>> >> -Djava.io.tmpdir=/disk1/hadoop/mapred/local/taskTracker/jobcache/job_20090724 >>>> 2015_0048/attempt_200907242015_0048_m_000008_0/work/tmp >>>> -classpath >>>> >> /etc/hadoop/conf:/usr/java/default/lib/tools.jar:/usr/lib/hadoop:/usr/lib/had >>>> >> oop/hadoop-0.18.3-10.cloudera.CH0_3-core.jar:/usr/lib/hadoop/lib/commons-cli- >>>> >> 2.0-SNAPSHOT.jar:/usr/lib/hadoop/lib/commons-codec-1.3.jar:/usr/lib/hadoop/li >>>> >> b/commons-httpclient-3.0.1.jar:/usr/lib/hadoop/lib/commons-logging-1.0.4.jar: >>>> >> /usr/lib/hadoop/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop/lib/commons >>>> >> -net-1.4.1.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-fairschedu >>>> >> ler.jar:/usr/lib/hadoop/lib/hadoop-0.18.3-10.cloudera.CH0_3-scribe-log4j.jar: >>>> >> /usr/lib/hadoop/lib/hsqldb.jar:/usr/lib/hadoop/lib/jets3t-0.6.1.jar:/usr/lib/ >>>> >> hadoop/lib/jetty-5.1.4.jar:/usr/lib/hadoop/lib/junit-4.5.jar:/usr/lib/hadoop/ >>>> >> lib/kfs-0.1.3.jar:/usr/lib/hadoop/lib/log4j-1.2.15.jar:/usr/lib/hadoop/lib/my >>>> >> sql-connector-java-5.0.8-bin.jar:/usr/lib/hadoop/lib/oro-2.0.8.jar:/usr/lib/h >>>> >> adoop/lib/servlet-api.jar:/usr/lib/hadoop/lib/slf4j-api-1.4.3.jar:/usr/lib/ha >>>> >> doop/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop/lib/xmlenc-0.52.jar:/usr/lib +
Scott Carey 2009-07-30, 20:32
-
Re: Map performance with custom binary formatScott Carey 2009-07-30, 18:19
Comments inline:
On 7/30/09 7:37 AM, "william kinney" <[EMAIL PROTECTED]> wrote: > Local is executed on a Hadoop node (when no job is running), So same > JRE/hardware. > > JRE: > java version "1.6.0_13" > Java(TM) SE Runtime Environment (build 1.6.0_13-b03) > Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode) > > JVM arguments for child task: > /usr/java/jdk1.6.0_13/jre/bin/java > -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64-64:/disk1/hadoop/ma > pred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_004 > 8_m_000008_0/work > -Xmx486m You might benefit from JVM 1.6.0_14 with -XX:+UseCompressedOops, but that would probably help both roughly equally. It won't help much if you aren't creating significant work for the garbage collector thouth. > > Local call has no JVM arguments, just: > java -cp <myjar>.jar com......RecordReaderTest <fileToTest> > You might want to try -Xmx486m as an experiment on the local test to see if it affects the behavior. If you are doing a lot of garbage creation it may. > > Data is not compressed. Hmm, that was a random guess, because it would obviously affect CPU use. Another thing to try -- make sure your writer that is writing into HDFS is wrapped with a buffer (try 32k or 64k). That's another random guess for something that might not show up well in stack traces without a profiler -- but also might already be done. > > JobTracker: > Running: Started around 20, but as the job progressed it slowly > increased to at the end: 432 (when Pending was 0). Running dropped to > 0/Status was marked Succeeded about 10 seconds after that. Is this > normal? The total # of Tasks was 1449. This is the "one new task per heartbeat" scheduler slowness. The next version of the Fair Scheduler will schedule many tasks in one heartbeat which should make this faster. Its a big reason that fewer, larger files was faster. Though if you are CPU bound, you only need 2 tasks running at the same time per node on your hardware to be at near top efficiency. Fewer tasks per node (say, 4) with more RAM each (800MB) might do better on this sort of workload. > > Stack Traces. > Looked at about 20 stack traces from 2 different nodes. Consistently saw: > 2 x org.apache.hadoop.dfs.DFSClient$LeaseChecker @ Thread.sleep() > "Comm thread for attempt_200907242015_0050_m_001409_0" @ Thread.sleep() > "IPC Client (47) connection to <master-hostname>/192.168.1.100:8020 > from wkinney" @ Object.wait() > "IPC Client (47) connection to /127.0.0.1:49202 from an unknown user" > @ Object.wait() > VM, GC, Signal Dispatcher, Low Memory Detector, CompilerThread, > Finalizer, Reference Handler... Sounds like the usual threads that don't do much. > > Then would sometimes see FastDateFormat thread, parseFrom(), or > somewhere near there (e.g. MapRunner.run()) The meat of the task. > > Finally, I consistently saw this: > "Thread-5" daemon prio=10 tid=0x0000000040bbfc00 nid=0x2f87 in > Object.wait() [0x00007fb7498ce000..0x00007fb7498cebf0] > java.lang.Thread.State: TIMED_WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > - waiting on <0x00007fb769fdec00> (a java.util.LinkedList) > at > org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.jav > a:1905) > - locked <0x00007fb769fdec00> (a java.util.LinkedList) > I'm guessing this is normal DataNode activity... Yes, this is the normal dfs thread. It can be hard to catch it doing work with just stack traces and no profiler attached. > > Will > > There is definitely a mystery here. I expect the task scheduling delays and some startup inefficiency but the overall difference is odd. What about a local test on a single, larger file versus a hadoop job on that same single, larger file (which would have just one map job)? This test may be very enlightening. > On Thu, Jul 30, 2009 at 1:31 AM, Scott Carey<[EMAIL PROTECTED]> wrote: >> What is the JRE for the Hadoop nodes versus local? What are the JVM +
Scott Carey 2009-07-30, 18:19
-
Re: Map performance with custom binary formatwilliam kinney 2009-07-30, 21:32
Hrm I think I found an issue. In my RecordReader, I do an
Arrays.copyOfRange() to get the protobuf binary section for RecordReader.next(key,value). In the profile dumps from child map processes, this one call takes up ~90% of the CPU Samples time. So, I wrapped the line w/ a System.nanoTime(), and got: Local process: total(ms): 902.138069, avg: 112.145 ns Hadoop Child processes: 1) total(ms): 6953.47106, avg: 726.906 ns 2) total(ms): 6503.962176, avg: 802.270 ns 3) total(ms): 5482.494256, avg: 677.589 ns 4) total(ms): 5291.664592, avg: 661.764 ns 5) total(ms): 5568.289465, avg: 697.353 ns 6) total(ms): 5638.778551, avg: 702.290 ns ...etc So for some reason, that call is taking over 6 times longer in hadoop... The buffer size for it is 65536 for both processes. Any ideas? ------ "You might want to try -Xmx486m as an experiment on the local test to see if it affects the behavior. If you are doing a lot of garbage creation it may." -Tried it, no changes. "Hmm, that was a random guess, because it would obviously affect CPU use. Another thing to try -- make sure your writer that is writing into HDFS is wrapped with a buffer (try 32k or 64k). That's another random guess for something that might not show up well in stack traces without a profiler -- but also might already be done" - So you're saying when writing the file into the HDFS, I should make sure it ends in 64k chunk (ie, zero-out until i reach such a point)? So all file sizes are a multiple of 64kb? "There is definitely a mystery here. I expect the task scheduling delays and some startup inefficiency but the overall difference is odd. What about a local test on a single, larger file versus a hadoop job on that same single, larger file (which would have just one map job)? This test may be very enlightening." - Total job time was 20 seconds for the 506MB file. Task took 19 seconds. Local process on the same file took ~ 3 seconds. "Hmm that difference seems a bit troubling. For one, you are running two tasks at once per node -- is there any way to do your local, non MR test with two concurrent processes or threads?" - Does the above test answer this? Only one task was executed on the node that took 19 seconds. ext3 filesystem. "Make sure your OS readahead on the device is set to a good value (at least 2048 blocks, preferably 8192 ish):" - For RA its showing 256, BSZ is 2048. RA should be 8192 ? Should BSZ then be larger? What about "SSZ"? Since the performance is high for the local process, would that then mean my disk i/o is sufficient, as you suggested? Do I still need to change any of these settings? "If even a single task on a single large file is slower in MB/sec than your test program, then I suspect read/write buffer issues or misuse somewhere." - Do you know of an instance where I'd have buffer issues with the Child process, and not local? The only difference I can think of is of course how the buffer is filled, FileInputStream vs FSDataInputStream. But once it is filled, why would reading portions of that buffer (ie, Arrays.copyOfRange()) take long in one instance but not another? Would it be helpful to get a histogram of the Arrays.copyOfRange(), rather than the average and total? Perhaps for the most part it is fine (~ 120 ns), but chokes sometimes (thefore increasing total time and average). Thanks for the help, Will On Thu, Jul 30, 2009 at 2:19 PM, Scott Carey<[EMAIL PROTECTED]> wrote: > Comments inline: > > On 7/30/09 7:37 AM, "william kinney" <[EMAIL PROTECTED]> wrote: > >> Local is executed on a Hadoop node (when no job is running), So same >> JRE/hardware. >> >> JRE: >> java version "1.6.0_13" >> Java(TM) SE Runtime Environment (build 1.6.0_13-b03) >> Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode) >> >> JVM arguments for child task: >> /usr/java/jdk1.6.0_13/jre/bin/java >> -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64-64:/disk1/hadoop/ma >> pred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_004 +
william kinney 2009-07-30, 21:32
-
Re: Map performance with custom binary formatScott Carey 2009-08-01, 01:07
On 7/30/09 2:32 PM, "william kinney" <[EMAIL PROTECTED]> wrote:
> > "If even a single task on a single large file is slower in MB/sec than your > test program, then I suspect read/write buffer issues or misuse somewhere." > - Do you know of an instance where I'd have buffer issues with the > Child process, and not local? The only difference I can think of is of > course how the buffer is filled, FileInputStream vs FSDataInputStream. > But once it is filled, why would reading portions of that buffer (ie, > Arrays.copyOfRange()) take long in one instance but not another? Is your local test writing to HDFS or local files? Likewise, what happens if your local test reads from HDFS instead of a local file? If both tests use HDFS on both ends and the performance difference is still there, then we have narrowed it down. If the performance becomes much more similar, we have likewise narrowed it down. > > Would it be helpful to get a histogram of the Arrays.copyOfRange(), > rather than the average and total? Perhaps for the most part it is > fine (~ 120 ns), but chokes sometimes (thefore increasing total time > and average). > > Thanks for the help, > Will > > > On Thu, Jul 30, 2009 at 2:19 PM, Scott Carey<[EMAIL PROTECTED]> wrote: >> Comments inline: >> >> On 7/30/09 7:37 AM, "william kinney" <[EMAIL PROTECTED]> wrote: >> >>> Local is executed on a Hadoop node (when no job is running), So same >>> JRE/hardware. >>> >>> JRE: >>> java version "1.6.0_13" >>> Java(TM) SE Runtime Environment (build 1.6.0_13-b03) >>> Java HotSpot(TM) 64-Bit Server VM (build 11.3-b02, mixed mode) >>> >>> JVM arguments for child task: >>> /usr/java/jdk1.6.0_13/jre/bin/java >>> -Djava.library.path=/usr/lib/hadoop/lib/native/Linux-amd64-64:/disk1/hadoop/ >>> ma >>> pred/local/taskTracker/jobcache/job_200907242015_0048/attempt_200907242015_0 >>> 04 >>> 8_m_000008_0/work >>> -Xmx486m >> >> You might benefit from JVM 1.6.0_14 with -XX:+UseCompressedOops, but that >> would probably help both roughly equally. It won't help much if you aren't >> creating significant work for the garbage collector thouth. >> >>> >>> Local call has no JVM arguments, just: >>> java -cp <myjar>.jar com......RecordReaderTest <fileToTest> >>> >> >> You might want to try -Xmx486m as an experiment on the local test to see if >> it affects the behavior. If you are doing a lot of garbage creation it may. >> >> >>> >>> Data is not compressed. >> >> Hmm, that was a random guess, because it would obviously affect CPU use. >> Another thing to try -- make sure your writer that is writing into HDFS is >> wrapped with a buffer (try 32k or 64k). That's another random guess for >> something that might not show up well in stack traces without a profiler -- >> but also might already be done. >> >>> >>> JobTracker: >>> Running: Started around 20, but as the job progressed it slowly >>> increased to at the end: 432 (when Pending was 0). Running dropped to >>> 0/Status was marked Succeeded about 10 seconds after that. Is this >>> normal? The total # of Tasks was 1449. >> >> This is the "one new task per heartbeat" scheduler slowness. The next >> version of the Fair Scheduler will schedule many tasks in one heartbeat >> which should make this faster. >> Its a big reason that fewer, larger files was faster. Though if you are CPU >> bound, you only need 2 tasks running at the same time per node on your >> hardware to be at near top efficiency. Fewer tasks per node (say, 4) with >> more RAM each (800MB) might do better on this sort of workload. >> >> >>> >>> Stack Traces. >>> Looked at about 20 stack traces from 2 different nodes. Consistently saw: >>> 2 x org.apache.hadoop.dfs.DFSClient$LeaseChecker @ Thread.sleep() >>> "Comm thread for attempt_200907242015_0050_m_001409_0" @ Thread.sleep() >>> "IPC Client (47) connection to <master-hostname>/192.168.1.100:8020 >>> from wkinney" @ Object.wait() >>> "IPC Client (47) connection to /127.0.0.1:49202 from an unknown user" >>> @ Object.wait() +
Scott Carey 2009-08-01, 01:07
-
Re: Map performance with custom binary formatScott Carey 2009-07-31, 21:31
On 7/30/09 2:32 PM, "william kinney" <[EMAIL PROTECTED]> wrote: > Hrm I think I found an issue. In my RecordReader, I do an > Arrays.copyOfRange() to get the protobuf binary section for > RecordReader.next(key,value). In the profile dumps from child map > processes, this one call takes up ~90% of the CPU Samples time. > So, I wrapped the line w/ a System.nanoTime(), and got: > > Local process: > total(ms): 902.138069, avg: 112.145 ns > > Hadoop Child processes: > 1) total(ms): 6953.47106, avg: 726.906 ns > 2) total(ms): 6503.962176, avg: 802.270 ns > 3) total(ms): 5482.494256, avg: 677.589 ns > 4) total(ms): 5291.664592, avg: 661.764 ns > 5) total(ms): 5568.289465, avg: 697.353 ns > 6) total(ms): 5638.778551, avg: 702.290 ns > ...etc > > So for some reason, that call is taking over 6 times longer in hadoop... > > The buffer size for it is 65536 for both processes. > > Any ideas? > That is a very interesting result. Try counting the number of times that the above is called to ensure that is the same for both -- if the average size of the copy is much smaller it will be slower. Other ideas -- is one using a native ByteBuffer underneath the covers somewhere and the other not? Is there some other difference in buffering on either side of that copy? > > ------ > "You might want to try -Xmx486m as an experiment on the local test to see if > it affects the behavior. If you are doing a lot of garbage creation it may." > -Tried it, no changes. > > "Hmm, that was a random guess, because it would obviously affect CPU use. > Another thing to try -- make sure your writer that is writing into HDFS is > wrapped with a buffer (try 32k or 64k). That's another random guess for > something that might not show up well in stack traces without a profiler -- > but also might already be done" > - So you're saying when writing the file into the HDFS, I should > make sure it ends in 64k chunk (ie, zero-out until i reach such a > point)? So all file sizes are a multiple of 64kb? No, just that its using something like a BufferedOutputStream when writing from your custom format out (HDFS does this itself so it shouldn't be necessary) and BufferedInputStream for reading. > > "There is definitely a mystery here. I expect the task scheduling delays and > some startup inefficiency but the overall difference is odd. What about a > local test on a single, larger file versus a hadoop job on that same single, > larger file (which would have just one map job)? This test may be very > enlightening." > - Total job time was 20 seconds for the 506MB file. Task took 19 > seconds. Local process on the same file took ~ 3 seconds. Ok, so drilling down here is where we need to look (and what the results above are). Scheduling may be a few seconds of that. > > "Hmm that difference seems a bit troubling. For one, you are running two > tasks at once per node -- is there any way to do your local, non MR test > with two concurrent processes or threads?" > - Does the above test answer this? Only one task was executed on the > node that took 19 seconds. > Yeah, it looks like we have ruled that out. > ext3 filesystem. > > "Make sure your OS readahead on the device is set to a good value (at least > 2048 blocks, preferably 8192 ish):" > - For RA its showing 256, BSZ is 2048. RA should be 8192 ? Should > BSZ then be larger? What about "SSZ"? I'm referring to /sbin/blockdev --getra <device> Which is just "RA". SSZ is sector size -- that can't change, and I think "BSZ" is block size, and is also static. Use /sbin/blockdev/ --setra <value> <device> to set the readahead. This will increase sequential throughput somewhat at the device level, but moreso if there are two or more concurrent reads. It doesn't affect random I/O performance. Basically, if the block layer detects a sequence of I/O's that are sequential, it starts reading "ahead" of the last I/O and keeps increasing the size of this readahead as long as the sequential access continues, up to a max size. You seem CPU bound, especially considering your evidence above. I/O tuning might help somewhere, but not this use case. I am not familiar enough with that part of Hadoop to know. In general, that buffer may be too small, or be backed by a Native ByteBuffer which will be slow for small reads into Java memory. A crude histogram could be useful, but I wouldn't spend too much time on it. Some basic characterization of the difference measured other than averages and totals might be insightful, but I'm unsure how useful the results would be. +
Scott Carey 2009-07-31, 21:31
-
Re: Map performance with custom binary formatSteve Loughran 2009-07-29, 09:17
Scott Carey wrote:
> Well, the first thing to do in any performance bottleneck investigation is > to look at the machine hardware resource usage. > > During your test, what is the CPU use and disk usage? What about network > utilization? > Top, vmstat, iostat, and some network usage monitoring would be useful. It > could be many things causing your lack of scalability, but without actually > monitoring your machines to see if there is an obvious bottleneck its just > random guessing and hunches. iotop is very good here +
Steve Loughran 2009-07-29, 09:17
-
Re: Map performance with custom binary formatTodd Lipcon 2009-07-29, 18:47
On Wed, Jul 29, 2009 at 2:17 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> Scott Carey wrote: > >> Well, the first thing to do in any performance bottleneck investigation is >> to look at the machine hardware resource usage. >> >> During your test, what is the CPU use and disk usage? What about network >> utilization? >> Top, vmstat, iostat, and some network usage monitoring would be useful. >> It >> could be many things causing your lack of scalability, but without >> actually >> monitoring your machines to see if there is an obvious bottleneck its just >> random guessing and hunches. >> > > iotop is very good here > But sadly not available with kernels before 2.6.20 :( I was looking into this recently and the next best thing I could find was blktrace, but it's not nearly as nice. Any other tools people know about, especially that work on RHEL5/CentOS5? -Todd +
Todd Lipcon 2009-07-29, 18:47
-
Re: Map performance with custom binary formatScott Carey 2009-07-29, 19:07
On 7/29/09 11:47 AM, "Todd Lipcon" <[EMAIL PROTECTED]> wrote: > On Wed, Jul 29, 2009 at 2:17 AM, Steve Loughran <[EMAIL PROTECTED]> wrote: > >> Scott Carey wrote: >> >>> Well, the first thing to do in any performance bottleneck investigation is >>> to look at the machine hardware resource usage. >>> >>> During your test, what is the CPU use and disk usage? What about network >>> utilization? >>> Top, vmstat, iostat, and some network usage monitoring would be useful. >>> It >>> could be many things causing your lack of scalability, but without >>> actually >>> monitoring your machines to see if there is an obvious bottleneck its just >>> random guessing and hunches. >>> >> >> iotop is very good here >> > > But sadly not available with kernels before 2.6.20 :( I was looking into > this recently and the next best thing I could find was blktrace, but it's > not nearly as nice. > > Any other tools people know about, especially that work on RHEL5/CentOS5? > Iftop worked for me for network traffic: http://www.ex-parrot.com/pdw/iftop/ For disk monitoring on CentOS/RHEL 5 Linux, every time I try and drill down a problem with iostat I wish I was on Windows or Solaris instead where monitoring and tracing disk I/O isn't a decade behind. On both of those, not only can you see what process is causing the disk activity, you can find out what FILES the activity is on. > -Todd > +
Scott Carey 2009-07-29, 19:07
-
Re: Map performance with custom binary formatTodd Lipcon 2009-07-29, 19:10
On Wed, Jul 29, 2009 at 12:07 PM, Scott Carey <[EMAIL PROTECTED]>wrote:
> > On 7/29/09 11:47 AM, "Todd Lipcon" <[EMAIL PROTECTED]> wrote: > > > On Wed, Jul 29, 2009 at 2:17 AM, Steve Loughran <[EMAIL PROTECTED]> > wrote: > > > >> Scott Carey wrote: > >> > >>> Well, the first thing to do in any performance bottleneck investigation > is > >>> to look at the machine hardware resource usage. > >>> > >>> During your test, what is the CPU use and disk usage? What about > network > >>> utilization? > >>> Top, vmstat, iostat, and some network usage monitoring would be useful. > >>> It > >>> could be many things causing your lack of scalability, but without > >>> actually > >>> monitoring your machines to see if there is an obvious bottleneck its > just > >>> random guessing and hunches. > >>> > >> > >> iotop is very good here > >> > > > > But sadly not available with kernels before 2.6.20 :( I was looking into > > this recently and the next best thing I could find was blktrace, but it's > > not nearly as nice. > > > > Any other tools people know about, especially that work on RHEL5/CentOS5? > > > > Iftop worked for me for network traffic: > http://www.ex-parrot.com/pdw/iftop/ > > For disk monitoring on CentOS/RHEL 5 Linux, every time I try and drill down > a problem with iostat I wish I was on Windows or Solaris instead where > monitoring and tracing disk I/O isn't a decade behind. On both of those, > not only can you see what process is causing the disk activity, you can > find > out what FILES the activity is on. > Yep, dtrace is great for this. I imagine if one were clever they could use strace to look at IO patterns on a per-process level. If anyone knows of any handy wrappers to do this, that would be great for looking at performance issues like this. -Todd +
Todd Lipcon 2009-07-29, 19:10
|