MapReduce, mail # user - Re: HDFS using SAN


Re: HDFS using SAN
Pamecha, Abhishek 2012-10-18, 15:18
Yes, I have been reaching the same conclusions here. Tom, would you care to spell out the 'obvious' IO considerations? I would like to see whether there are any that differ from mine.

My three observations have been:
1. For full-table-scan MR jobs, the SAN approach transports the entire dataset over the network to the data nodes. Not good.
2. The shuffle phase actually involves more network transfers when it could have been just an intra-SAN transfer. Disadvantage.
3. SAN controller caches (an additional stop in the data path compared with DAS) may not be used as effectively because they are shared by multiple data nodes (frequent eviction).

So overall my conclusion is that MR is not the best-suited data processing method when data is stored in a SAN.
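
To put rough numbers on the first observation, here is a minimal back-of-the-envelope sketch in Python. Every figure in it (node count, disks per node, per-disk and per-NIC throughput, SAN back-end throughput, dataset size) is a hypothetical assumption chosen for illustration, not a measurement from any cluster mentioned in this thread, and the function names are made up for the example. It models only a single sequential pass over the data and ignores caching, replication and contention.

```python
def scan_seconds(dataset_mb, aggregate_mb_per_s):
    """Time to stream the whole dataset once at the given aggregate rate."""
    return dataset_mb / aggregate_mb_per_s


def das_bandwidth(nodes, disks_per_node, disk_mb_per_s):
    """With local disks, each node reads its own blocks, so the aggregate
    read rate grows with the number of nodes."""
    return nodes * disks_per_node * disk_mb_per_s


def san_bandwidth(nodes, nic_mb_per_s, san_mb_per_s):
    """With a SAN, every byte crosses the network, so the ceiling is the
    slower of the combined node links and the shared SAN back end."""
    return min(nodes * nic_mb_per_s, san_mb_per_s)


# Hypothetical figures, for illustration only.
nodes = 100
dataset_mb = 10_000_000  # roughly 10 TB

das_mb_per_s = das_bandwidth(nodes, disks_per_node=4, disk_mb_per_s=100)
san_mb_per_s = san_bandwidth(nodes, nic_mb_per_s=125, san_mb_per_s=5000)

print("DAS full scan (s):", scan_seconds(dataset_mb, das_mb_per_s))
print("SAN full scan (s):", scan_seconds(dataset_mb, san_mb_per_s))
```

Under these assumed numbers the SAN scan is capped by the shared back end, so adding nodes beyond that point does not shorten it, whereas the local-disk figure keeps growing with node count. Different hardware will shift where that ceiling sits.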

Btw, I thought a SAN would do block-level transfer, with the file system on top being your choice. I was surprised to see GPFS 'as' the SAN. Could you please clarify?

Any way you can share your cluster size?

Thanks
Abhishek
Sent from my iPad with iMstakes

On Oct 18, 2012, at 7:41, "Tom Deutsch" <[EMAIL PROTECTED]> wrote:
Agreed Luca, we do this to support existing customers that have requested it, and it works fine within obvious IO considerations. But it is not a recommended way to do a greenfield deployment.

------------------------------------------------
Tom Deutsch
Program Director
Information Management
Big Data Technologies
IBM
3565 Harbor Blvd
Costa Mesa, CA 92626-1420
[EMAIL PROTECTED]

Twitter: @thomasdeutsch
Data Management Blog: http://ibmdatamag.com/author/tdeutsch/
LinkedIn: http://www.linkedin.com/profile/view?id=833160
Quora: http://www.quora.com/Tom-Deutsch
Smarter Computing Blog: http://www.smartercomputingblog.com/contributorsprofile/?user_id=223
IBM Big Data Hub Blog: http://www.ibmbigdatahub.com/blog/author/tom-deutsch
Big Data for Business Executives Group: http://www.linkedin.com/groups?gid=4455695

From: Luca Pireddu <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Date: 10/18/2012 05:33 AM
Subject: Re: HDFS using SAN

________________________________

On 10/18/2012 02:21 AM, Pamecha, Abhishek wrote:
> Tom
>
> Do you mean you are using GPFS instead of HDFS? Also, if you can share,
> are you deploying it as DAS set up or a SAN?
>
> Thanks,
>
> Abhishek
>
Though I don't think I'd buy a SAN for a new Hadoop cluster, we have a
SAN and are using it *instead of HDFS* with a small/medium Hadoop
MapReduce cluster (up to 100 nodes or so, depending on our needs).  We
still use the local node disks for intermediate data (mapred local
storage).  Although this set-up does limit our ability to scale to a
large number of nodes, that's not a concern for us.  On the plus side,
we gain the flexibility to share our cluster with non-Hadoop users at
our centre.
--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
09010 Pula (CA), Italy
Tel: +39 0709250452