MapReduce >> mail # user >> Re: HDFS using SAN


Re: HDFS using SAN
Yes, I have been reaching the same conclusions here. Tom, would you care to spell out the 'obvious' IO considerations? I would like to see whether you have any that differ from mine.

My 3 observations have been:
1. For full table scan MR jobs, the SAN approach transports the entire dataset over the network to the data nodes. Not good.
2. The shuffle actually involves more network transfers, when it could have been just an intra-SAN transfer. Disadvantage.
3. SAN controller caches (an additional stop in the data transfer, as opposed to DAS) may not be utilized as effectively because they are shared by multiple data nodes (frequent eviction).

So overall my conclusion is that MR is not the best-suited data processing method when the data is stored in a SAN.
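
To make observation 1 concrete, here is a rough back-of-envelope sketch. It is only a toy model: the function name and every number in it (dataset size, node count, per-node disk rate, shared SAN link rate) are made-up assumptions for illustration, not measurements from any cluster mentioned in this thread.

```python
def scan_seconds(dataset_gb, nodes, per_node_mbps, shared_link_mbps=None):
    """Estimate the wall-clock time of a full scan, in seconds.

    Toy model (assumption): with DAS every node reads its own share from
    local disk, so aggregate throughput grows with the node count. With a
    SAN, all reads also cross a shared link, which caps the aggregate.
    """
    aggregate_mbps = nodes * per_node_mbps
    if shared_link_mbps is not None:
        # The shared SAN path is the bottleneck once the nodes outrun it.
        aggregate_mbps = min(aggregate_mbps, shared_link_mbps)
    return dataset_gb * 1024 / aggregate_mbps


# Hypothetical figures: 10 TB dataset, 100 nodes, 100 MB/s per node,
# and a SAN path that sustains 2000 MB/s in total.
das = scan_seconds(10 * 1024, nodes=100, per_node_mbps=100)
san = scan_seconds(10 * 1024, nodes=100, per_node_mbps=100,
                   shared_link_mbps=2000)
print(f"DAS: {das:.0f} s, SAN: {san:.0f} s, slowdown: {san / das:.1f}x")
```

Under these assumed numbers the scan is bound by the shared link rather than by the disks, and adding nodes stops helping once their combined read rate exceeds what the SAN path can deliver.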

Btw, I thought a SAN does block-level transfer and the file system on top is your choice. I was surprised to see GPFS 'as' the SAN. Could you please clarify?

Any way you can share your cluster size?

Thanks
Abhishek
Sent from my iPad with iMstakes

On Oct 18, 2012, at 7:41, "Tom Deutsch" <[EMAIL PROTECTED]> wrote:
Agreed Luca, we do this to support existing customers that have requested it and it works fine within obvious IO considerations. But not a recommended way to do a green field deployment.

------------------------------------------------
Tom Deutsch
Program Director
Information Management
Big Data Technologies
IBM
3565 Harbor Blvd
Costa Mesa, CA 92626-1420
[EMAIL PROTECTED]

Twitter: @thomasdeutsch
Data Management Blog: http://ibmdatamag.com/author/tdeutsch/
LinkedIn: http://www.linkedin.com/profile/view?id=833160
Quora: http://www.quora.com/Tom-Deutsch
Smarter Computing Blog: http://www.smartercomputingblog.com/contributorsprofile/?user_id=223
IBM Big Data Hub Blog: http://www.ibmbigdatahub.com/blog/author/tom-deutsch
Big Data for Business Executives Group: http://www.linkedin.com/groups?gid=4455695

From: Luca Pireddu <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Date: 10/18/2012 05:33 AM
Subject: Re: HDFS using SAN

________________________________

On 10/18/2012 02:21 AM, Pamecha, Abhishek wrote:
> Tom
>
> Do you mean you are using GPFS instead of HDFS? Also, if you can share,
> are you deploying it as DAS set up or a SAN?
>
> Thanks,
>
> Abhishek
>
Though I don't think I'd buy a SAN for a new Hadoop cluster, we have a
SAN and are using it *instead of HDFS* with a small/medium Hadoop
MapReduce cluster (up to 100 nodes or so, depending on our need).  We
still use the local node disks for intermediate data (mapred local
storage).  Although this set-up does limit our ability to scale to a
large number of nodes, that's not a concern for us.  On the plus side, we
gain the flexibility to be able to share our cluster with non-Hadoop
users at our centre.
--
Luca Pireddu
CRS4 - Distributed Computing Group
Loc. Pixina Manna Edificio 1
09010 Pula (CA), Italy
Tel: +39 0709250452