-
Sanity check re: value of 10GbE NICs for Hadoop? Saqib Jang -- Margalla Co... 2011-06-28, 17:16
Folks,

I've been digging into the potential benefits of using 10 Gigabit Ethernet (10GbE) NIC server connections for Hadoop, and wanted to run what I've come up with through initial research past the list for 'sanity check' feedback. I'd very much appreciate your input on the importance (or lack of it) of the following potential benefits of 10GbE server connectivity, as well as other thoughts regarding 10GbE and Hadoop. (My interest is specifically in the value of 10GbE server connections and 10GbE switching infrastructure, over scenarios such as bonded 1GbE server connections with 10GbE switching.)

1. HDFS Data Loading. The higher throughput enabled by 10GbE server and switching infrastructure allows faster processing and distribution of data.
2. Hadoop Cluster Scalability. High performance for initial data processing and distribution directly impacts the degree of parallelism or scalability supported by the cluster.
3. HDFS Replication. Higher speed server connections allow faster file replication.
4. Map/Reduce Shuffle Phase. Improved end-to-end throughput and latency directly impact the shuffle phase of a data set reduction, especially for tasks that operate at the document level (including large documents) with lots of metadata generated by those documents, as well as video analytics and images.
5. Data Reporting. 10GbE server network performance can improve data reporting performance, especially if the Hadoop cluster is running multiple data reductions.
6. Support of Cluster File Systems. With 10GbE NICs, Hadoop could be reorganized to use a cluster or network file system. This would allow Hadoop, even with its Java implementation, to have higher performance I/O and not have to be so concerned with disk drive density in the same server.
7. Others?

thanks,
Saqib

Saqib Jang
Principal/Founder
Margalla Communications, Inc.
1339 Portola Road, Woodside, CA 94062
(650) 274 8745
www.margallacomm.com
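For a rough sense of scale on items 1 and 3, the sketch below computes an idealized time to move a fixed amount of data over a single link at different speeds. The function name `transfer_seconds`, the 90% efficiency factor, and the 1 TB payload are assumptions for illustration, not measurements from any cluster. Real HDFS loads are also limited by disks, the replication pipeline, and switch capacity.

```python
def transfer_seconds(data_bytes, link_gbps, efficiency=0.9):
    """Idealized time to push data_bytes through one network link.

    efficiency is an assumed fraction of line rate left after protocol
    overhead; it is a placeholder, not a measured value.
    """
    usable_bytes_per_sec = link_gbps * 1e9 / 8 * efficiency
    return data_bytes / usable_bytes_per_sec


one_terabyte = 1e12  # hypothetical payload

# 1 = single 1GbE, 2 = two bonded 1GbE links (best case), 10 = 10GbE
for gbps in (1, 2, 10):
    minutes = transfer_seconds(one_terabyte, gbps) / 60
    print(f"{gbps:>2} Gb/s link: {minutes:.1f} minutes per TB")
```

The ratio between the rows is just the ratio of link speeds, so the interesting question is whether the network is actually the limiting resource for a given workload.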
-
Re: Sanity check re: value of 10GbE NICs for Hadoop? Darren Govoni 2011-06-28, 17:21
Hadoop, like other parallel networked computation architectures, is predominantly I/O bound. This means any increase in network bandwidth is "A Good Thing" and can have drastic positive effects on performance. All your points stem from this simple realization.

Although I'm confused by your #6. Hadoop already uses a distributed file system: HDFS.
-
RE: Sanity check re: value of 10GbE NICs for Hadoop? Saqib Jang -- Margalla Co... 2011-06-28, 17:27
Darren,

Thanks. The last point was basically about 10GbE potentially allowing the use of a network file system (e.g. via NFS) as an alternative to HDFS; the question is whether there is any merit in this. Basically, I was exploring whether commercial clustered NAS products offer any high-availability or data management benefits for use with Hadoop.

Saqib
-
Re: Sanity check re: value of 10GbE NICs for Hadoop? Darren Govoni 2011-06-28, 17:41
I see. However, Hadoop is designed to operate best with HDFS because of its inherent striping and blocking strategy, which is tracked by Hadoop. Going outside of that mechanism will probably yield poor results and/or confuse Hadoop.

Just my thoughts.
-
Re: Sanity check re: value of 10GbE NICs for Hadoop? Matthew Foley 2011-06-28, 19:04
Hadoop common provides an abstract FileSystem class, and Hadoop applications should be designed to run on that. HDFS is just one implementation of a valid Hadoop filesystem, and ports to S3 and KFS as well as the OS-supported LocalFileSystem are provided in Hadoop common. Use of NFS-mounted storage would fall under the LocalFileSystem model.

However, one of the core values of Hadoop is the model of "bring the computation to the data". This does not seem viable with an NFS-based NAS-model storage subsystem. Thus, while it will "work" for small clusters and small jobs, it is unlikely to scale with high performance to thousands of nodes and petabytes of data in the way Hadoop can scale with HDFS or S3.

--Matt
-
RE: Sanity check re: value of 10GbE NICs for Hadoop? Saqib Jang -- Margalla Co... 2011-06-28, 22:06
Matt,

Thanks, this is helpful. I was wondering if you have any thoughts on the list of other potential benefits of 10GbE NICs for Hadoop (listed in my original e-mail to the list)?

regards,
Saqib
-
Re: Sanity check re: value of 10GbE NICs for Hadoop? Matei Zaharia 2011-06-28, 23:02
Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile your target Hadoop workload and see whether it's communication-bound. Hadoop jobs can definitely be communication-bound if you shuffle a lot of data between map and reduce, but I've also seen a lot of clusters that are CPU-bound (due to decompression, running python, or just running expensive user code) or disk-IO-bound. You might be surprised at what your bottleneck is.

Matei
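One way to act on this advice is to sample utilization for each resource during a representative job and see which one is pinned. The following is a minimal sketch with made-up sample values; `likely_bottleneck`, the 0.8 threshold, and the numbers are all hypothetical, and in practice the samples would come from whatever monitoring is already in place on the cluster.

```python
from statistics import mean


def likely_bottleneck(samples, threshold=0.8):
    """Pick the resource with the highest average utilization.

    samples maps a resource name to a list of utilization fractions (0..1).
    Returns (name, averages); name is "none saturated" when nothing reaches
    the assumed threshold.
    """
    averages = {name: mean(values) for name, values in samples.items()}
    busiest = max(averages, key=averages.get)
    if averages[busiest] < threshold:
        return "none saturated", averages
    return busiest, averages


# Invented numbers for one node during a job, not real measurements.
samples = {
    "cpu": [0.95, 0.90, 0.97],
    "disk": [0.60, 0.55, 0.70],
    "network": [0.30, 0.25, 0.40],
}
resource, averages = likely_bottleneck(samples)
print("likely bottleneck:", resource)
```

With these invented samples the node looks CPU-bound, in which case a faster NIC would probably not help much.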
-
Re: Sanity check re: value of 10GbE NICs for Hadoop? James Seigel 2011-06-28, 23:04
If you are very ad hoc-y, the more bandwidth the merrier!

James
-
Re: Sanity check re: value of 10GbE NICs for Hadoop? Mathias Herberts 2011-06-28, 23:05
From my experience, jobs that shuffle lots of data are also very often slowed down by the sort phase; compressing the mappers' output is a first step to improve performance. Given the cost of a 10GbE infrastructure with no oversubscription, I'd monitor bandwidth usage very closely prior to investing in that kind of network gear.
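Monitoring bandwidth usage can be as simple as differencing a NIC's cumulative byte counter over time. The sketch below assumes you already have (timestamp, byte count) pairs from some monitoring tool; `link_utilization` and the sample values are hypothetical and only show the arithmetic.

```python
def link_utilization(samples, link_gbps):
    """Fraction of link capacity used between consecutive counter samples.

    samples is a time-ordered list of (timestamp_sec, cumulative_bytes).
    """
    capacity_bits_per_sec = link_gbps * 1e9
    utilization = []
    for (t0, b0), (t1, b1) in zip(samples, samples[1:]):
        bits_per_sec = (b1 - b0) * 8 / (t1 - t0)
        utilization.append(bits_per_sec / capacity_bits_per_sec)
    return utilization


# Invented counter readings taken one minute apart on a 1GbE link.
samples = [(0, 0), (60, 6.0e9), (120, 13.5e9)]
for fraction in link_utilization(samples, link_gbps=1):
    print(f"{fraction:.0%} of line rate")
```

Sustained readings near 100% during the shuffle would support the case for faster links; low readings would suggest looking elsewhere first.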
-
Re: Sanity check re: value of 10GbE NICs for Hadoop? Russell Jurney 2011-06-28, 23:13
Price the cost of 1GbE->10GbE vs. more nodes, using data from monitoring your cluster during peak load. It should be clear which is a better value.

Russ
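The comparison can be reduced to cost per unit of throughput gained. This is only a sketch: `cost_per_gain` is a hypothetical helper, and every price and throughput figure below is invented to show the calculation, not taken from any real quote or benchmark.

```python
def cost_per_gain(upgrade_cost, baseline_throughput, upgraded_throughput):
    """Cost per unit of extra throughput; infinite if nothing is gained."""
    gain = upgraded_throughput - baseline_throughput
    if gain <= 0:
        return float("inf")
    return upgrade_cost / gain


baseline_jobs_per_day = 100  # hypothetical measurement at peak load

# Hypothetical options: (name, cost, jobs per day after the change)
options = [
    ("10GbE NICs and switches", 50_000, 110),
    ("more 1GbE nodes", 40_000, 125),
]
for name, cost, jobs_per_day in options:
    value = cost_per_gain(cost, baseline_jobs_per_day, jobs_per_day)
    print(f"{name}: {value:.0f} per extra job/day")
```

The throughput figures for each option would have to come from profiling or a pilot, which is why monitoring at peak load comes first.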
-
Re: Sanity check re: value of 10GbE NICs for Hadoop?Matt Davies 2011-06-29, 04:27
I would say this is quite a difficult choice. I've seen that our cluster could use more bandwidth, but it wasn't bandwidth to the nodes that made the big difference; it was getting better switches with better backplanes. The fabric made the difference.

I've also seen some workloads where job design is critical, i.e. if you are spinning through the data in your mappers you could easily overwhelm the namenode and jobtracker on a big enough cluster. It is probably quite early for you to know such things about your workload. If this becomes a problem you may need to adjust your apps.

Overall, I think good quality Top of Rack switches with good uplinks to distribution switches can make your cluster fly. That is relatively cheap compared to 10G throughout, and I've seen that more CPUs work well for _my_ workload (I always need more mappers and reducers, but it is quite rare that the network is saturated now).

$0.02

-Matt

On Tue, Jun 28, 2011 at 5:13 PM, Russell Jurney <[EMAIL PROTECTED]> wrote:

> Price the cost of 1GbE->10GbE vs. more nodes, using data from monitoring your cluster during peak load. It should be clear which is a better value.
>
> Russ
>
> On Tue, Jun 28, 2011 at 4:05 PM, Mathias Herberts <[EMAIL PROTECTED]> wrote:
>
> > On Wed, Jun 29, 2011 at 01:02, Matei Zaharia <[EMAIL PROTECTED]> wrote:
> > > Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile your target Hadoop workload and see whether it's communication-bound. Hadoop jobs can definitely be communication-bound if you shuffle a lot of data between map and reduce, but I've also seen a lot of clusters that are CPU-bound (due to decompression, running python, or just running expensive user code) or disk-IO-bound. You might be surprised at what your bottleneck is.
> >
> > From my experience, jobs that shuffle lots of data are also very often slowed down by the sort phase; compressing mappers' output is a first step to improve performance. Given the cost of a 10GbE infrastructure with no oversubscription I'd monitor bandwidth usage very closely prior to investing in that kind of network gear.
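To put a number on the uplink and oversubscription points above, here is a minimal sketch of the usual oversubscription ratio for a single rack: total node-facing bandwidth divided by total uplink bandwidth. The function name and the example rack (40 nodes at 1 Gb/s, two 10 Gb/s uplinks) are illustrative assumptions, not figures from this thread.

```python
def oversubscription_ratio(nodes_per_rack, node_gbps, uplinks, uplink_gbps):
    """Ratio of node-facing bandwidth to uplink bandwidth for one rack switch."""
    if uplinks <= 0 or uplink_gbps <= 0:
        raise ValueError("a rack needs at least one uplink with positive speed")
    downstream = nodes_per_rack * node_gbps   # bandwidth the nodes could offer
    upstream = uplinks * uplink_gbps          # bandwidth leaving the rack
    return downstream / upstream


# Hypothetical rack: 40 nodes at 1 Gb/s behind two 10 Gb/s uplinks.
ratio = oversubscription_ratio(40, 1.0, 2, 10.0)
print(f"{ratio:.1f}:1")  # prints "2.0:1"
```

A ratio of 1.0 or below would mean no oversubscription. Moving the same nodes to 10 Gb/s ports without adding uplinks raises the ratio tenfold, which is one way to see why the fabric matters as much as the NICs.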
-
Re: Sanity check re: value of 10GbE NICs for Hadoop?Bharath Mundlapudi 2011-06-29, 06:07
One could argue that it's too early for 10Gb NICs in a Hadoop cluster. Certainly having extra bandwidth is good, but at what price?

Please note that all the points you mentioned can work with 1Gb NICs today, unless you can back the case for 10Gb with price/performance data. Typically, map output is compressed. If the system is hitting peak network utilization, one can select a higher-compression algorithm at the cost of CPU. Most of these machines come with dual NICs, so one could do link bonding to push more bits.

One area that may benefit from 10Gb NICs is high-density systems: 24 cores and 3x12TB disks. This is the trend now and will continue. These systems can saturate 1Gb NICs.

-Bharath
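The claim that high-density nodes can saturate 1Gb NICs comes down to simple arithmetic, sketched below. The disk count and the 100 MB/s per-disk sequential rate are assumed values chosen for illustration only; the 2 Gb/s case stands for two bonded 1GbE links under ideal load balancing.

```python
def nic_capacity_mb_per_s(nic_gbps):
    """Raw NIC capacity in MB/s (decimal units, protocol overhead ignored)."""
    return nic_gbps * 1000 / 8  # 1 Gb/s = 1000 Mb/s = 125 MB/s


def disks_can_saturate_nic(disk_count, mb_per_s_per_disk, nic_gbps):
    """True if the aggregate sequential disk rate exceeds the NIC's raw capacity."""
    return disk_count * mb_per_s_per_disk > nic_capacity_mb_per_s(nic_gbps)


# Assumed for illustration: 12 disks at 100 MB/s each.
for gbps in (1, 2, 10):  # single 1GbE, two bonded 1GbE links, 10GbE
    print(gbps, nic_capacity_mb_per_s(gbps), disks_can_saturate_nic(12, 100, gbps))
```

This only compares raw rates. Whether a real job actually pushes that much data off the node depends on data locality and on how much the shuffle moves.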
-
Re: Sanity check re: value of 10GbE NICs for Hadoop?Michel Segel 2011-06-29, 21:04
I'm not sure which point you are trying to make. To answer your question...

With respect to price, 10GbE is cost effective. You have to consider that 1GbE is not only your port speed; there is also the speed of the uplink or trunk. So if you continue to build out, you run into bandwidth issues between racks. You end up with 1GbE ports and then higher speed, through either port bonding or bigger bandwidth, for uplinks only. These switches are more expensive than simple 1GbE switches, but less than full 10GbE.

Depending on vendor, number of ports, and discount, you can get the switch for approx. 10,000 and up. Think $550 to $600 a port for 10GbE.

With Sandy Bridge, you will start to see 10GbE on the motherboards. If you're following the discussion on the performance gains, you'll start to see the network being the bottleneck.

If you are planning to build a new cluster, you should plan on 10GbE.

Sent from a remote device. Please excuse any typos...

Mike Segel
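One way to frame the price question is to compare the per-port figure quoted above with what the same money would buy in extra nodes, as suggested earlier in the thread. The sketch below uses the $550 to $600 per-port range from this message; the rack size of 40 nodes and the node price of $4,000 are purely hypothetical placeholders, and the port figure is assumed to cover switch ports only, not NICs or cabling.

```python
def port_upgrade_cost(nodes, low_per_port=550, high_per_port=600):
    """Low/high cost in dollars of one 10GbE switch port per node."""
    return nodes * low_per_port, nodes * high_per_port


def nodes_for_same_money(budget, price_per_node):
    """How many whole extra nodes the same budget would buy."""
    if price_per_node <= 0:
        raise ValueError("price_per_node must be positive")
    return int(budget // price_per_node)


low, high = port_upgrade_cost(40)          # hypothetical 40-node rack
print(low, high)                           # prints "22000 24000"
# Hypothetical node price of $4,000, for illustration only.
print(nodes_for_same_money(low, 4000), nodes_for_same_money(high, 4000))  # prints "5 6"
```

Which side wins depends on whether the monitored workload is limited by the network or by CPU and disk, so the numbers are only useful alongside utilization data from the cluster.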
-
Re: Sanity check re: value of 10GbE NICs for Hadoop?Matthew Foley 2011-06-30, 04:04
I agree with Matei. Whether you will get good ROI on 10GigE depends very much on the types of jobs you run.

--Matt

On Jun 28, 2011, at 4:02 PM, Matei Zaharia wrote:

> Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile your target Hadoop workload and see whether it's communication-bound. Hadoop jobs can definitely be communication-bound if you shuffle a lot of data between map and reduce, but I've also seen a lot of clusters that are CPU-bound (due to decompression, running python, or just running expensive user code) or disk-IO-bound. You might be surprised at what your bottleneck is.
>
> Matei
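As a rough illustration of the profiling advice, the sketch below picks the most heavily used resource from peak utilization figures gathered while a representative job runs. The resource names and sample values are made up for the example, and taking the maximum is a deliberate simplification of real profiling.

```python
def likely_bottleneck(utilization):
    """Return the resource with the highest peak utilization (0.0 to 1.0)."""
    if not utilization:
        raise ValueError("need at least one measurement")
    return max(utilization, key=utilization.get)


# Hypothetical peak figures for one job; not measurements from this thread.
peak = {"cpu": 0.92, "disk_io": 0.60, "network": 0.35}
print(likely_bottleneck(peak))  # prints "cpu"
```

If the answer is not "network" for the jobs that matter, faster NICs are unlikely to be the best place to spend money.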
-
Re: Sanity check re: value of 10GbE NICs for Hadoop?Bharath Mundlapudi 2011-06-30, 18:49
However good a benchmark may be, there is a saying in benchmarking: 'Performance improvements depend on the type of workload'. What matters is your workload. Design the network for your workloads.

The links from racks to the uplink or trunk need 10GbE. But the question was: are we there yet for per-node 10GbE? I would plan for it only if your data is showing network saturation.

-Bharath
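Checking whether the data shows network saturation can be as simple as sampling an interface's byte counter twice and comparing the rate with the link speed. The sketch below assumes you already have the two counter readings; on Linux they could come from a file such as /sys/class/net/<iface>/statistics/tx_bytes, but the collection step and the sample numbers here are assumptions, and counter wrap-around is ignored.

```python
def link_utilization(bytes_start, bytes_end, interval_s, link_gbps):
    """Fraction of link capacity used between two byte-counter samples."""
    if interval_s <= 0 or link_gbps <= 0:
        raise ValueError("interval and link speed must be positive")
    bits_sent = (bytes_end - bytes_start) * 8
    return bits_sent / (interval_s * link_gbps * 1e9)


# Hypothetical sample: 6 GB sent in 60 seconds on a 1 Gb/s link.
u = link_utilization(0, 6_000_000_000, 60, 1.0)
print(f"{u:.0%}")  # prints "80%"
```

Sustained readings near 100% during peak load on many nodes would support per-node 10GbE; brief spikes on a few nodes would not.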
-
Re: Sanity check re: value of 10GbE NICs for Hadoop?Geoff Howard 2011-07-01, 11:13
On Wed, Jun 29, 2011 at 12:27 AM, Matt Davies <[EMAIL PROTECTED]> wrote:

> ... I've seen that our cluster could use more bandwidth, but it wasn't to the nodes that made the big difference, it was getting better switches that had better backplanes - the fabric made the difference.

Any recommendations on specific 1Gb switches for top of rack that have better backplanes?

Geoff
-
RE: Sanity check re: value of 10GbE NICs for Hadoop?Jeff.Schmitz@... 2011-07-11, 14:20
Also, there is info on this at Cloudera here:

http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/

-----Original Message-----
From: Saqib Jang -- Margalla Communications [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 28, 2011 5:06 PM
To: [EMAIL PROTECTED]
Subject: RE: Sanity check re: value of 10GbE NICs for Hadoop?

Matt,
Thanks, this is helpful, I was wondering if you may have some thoughts on the list of other potential benefits of 10GbE NICs for Hadoop (listed in my original e-mail to the list)?

regards,
Saqib