Re: Multiple dfs.data.dir vs RAID0
On Mon, Feb 11, 2013 at 10:54 AM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> Thanks all for your feedback.
>
> I have updated the HDFS config to add another dfs.data.dir entry and
> restarted the node. Hadoop is starting to use the new entry, but it is
> not spreading the existing data over the 2 directories.
>
> Let's say you have a 2TB disk on /hadoop1, almost full. If you add
> another 2TB disk on /hadoop2 and add it to dfs.data.dir, Hadoop will
> start to write into /hadoop1 and /hadoop2, but /hadoop1 will stay
> almost full. It will not balance the already existing data over the 2
> directories.
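>
> For reference, the change is just a second comma-separated path in
> hdfs-site.xml (the paths here are illustrative, not my exact ones):
>
>   <property>
>     <name>dfs.data.dir</name>
>     <value>/hadoop1/dfs/data,/hadoop2/dfs/data</value>
>   </property>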
>
> I have deleted all the content of /hadoop1 and /hadoop2 and restarted
> the node and now the data is spread over the 2. Just need to wait for
> the replication to complete.
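>
> (To watch the re-replication you can run something like "hadoop fsck /"
> and wait for the under-replicated block count to drop to zero.)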
>
> So what I will do instead is add 2 x 2TB drives, mount them as RAID0,
> then move the existing data onto this array and remove the previous
> one. That way Hadoop will still see one directory under /hadoop1, but
> it will be 4TB instead of 2TB...
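>
> If the controller won't cooperate, the same thing with Linux software
> RAID would look roughly like this (device names and the mount point
> are illustrative, and the datanode should be stopped first):
>
>   mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
>   mkfs.ext4 /dev/md0
>   mount /dev/md0 /mnt/md0
>   rsync -a /hadoop1/ /mnt/md0/    # copy the existing blocks
>   umount /mnt/md0 && mount /dev/md0 /hadoop1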
>
> Is there anywhere I can read about Hadoop and the different kinds of
> physical data storage configurations? (Book, web, etc.)
>

"Hadoop Operations" by E. Sammer:
http://shop.oreilly.com/product/0636920025085.do
>
> JM
>
> 2013/2/11, Ted Dunning <[EMAIL PROTECTED]>:
> > Typical best practice is to have a separate file system per spindle.  If
> > you have a RAID-only controller (many are), then you just create one RAID
> > volume per spindle.  The effect is the same.
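> >
> > With one file system per spindle you then point dfs.data.dir at every
> > mount point, along these lines (device names and paths are assumptions):
> >
> >   for d in b c d e; do
> >     mkfs.ext4 /dev/sd$d
> >     mkdir -p /data/sd$d && mount /dev/sd$d /data/sd$d
> >   done
> >   # then set dfs.data.dir to
> >   # /data/sdb/dfs,/data/sdc/dfs,/data/sdd/dfs,/data/sde/dfs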
> >
> > MapR is unusually able to stripe writes over multiple drives organized
> > into a storage pool, but you will not normally be able to achieve that
> > same level of performance with ordinary Hadoop by using LVM over JBOD
> > or controller-level RAID.  The problem is that the Java layer doesn't
> > understand that the storage is striped and the controller doesn't
> > understand what Hadoop is doing.  MapR schedules all of the writes to
> > individual spindles via a very fast state machine embedded in the file
> > system.
> >
> > The comment about striping increasing the impact of a single disk
> > drive failure is exactly correct, and it makes modeling the failure
> > modes of the system considerably more complex.  The net result of the
> > modeling that I and others have done is that moderate to large RAID
> > groups in storage pools for moderate-sized clusters (< 2000 nodes or
> > so) are just fine.  For large clusters of up to 10,000 nodes, you
> > should probably limit RAID groups to 4 drives or less.
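> >
> > (Illustrative numbers: with a 4% annual failure rate per drive, a
> > 4-drive RAID0 group fails with probability 1 - 0.96^4, about 15% per
> > year, and a single failure now invalidates four drives' worth of
> > block replicas at once.)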
> >
> > On Sun, Feb 10, 2013 at 7:39 PM, Marcos Ortiz <[EMAIL PROTECTED]> wrote:
> >
> >> We have seen in several of our Hadoop clusters that LVM degrades the
> >> performance of our M/R jobs. I remember a message where Ted Dunning
> >> explained something about this, and since that time we don't use LVM
> >> for Hadoop data directories.
> >>
> >> As for RAID volumes, the best performance we have achieved is using
> >> RAID 10 for our Hadoop data directories.
> >>
> >>
> >>
> >> On 02/10/2013 09:24 PM, Michael Katzenellenbogen wrote:
> >>
> >> Are you able to create multiple RAID0 volumes? Perhaps you can expose
> >> each disk as its own RAID0 volume...
> >>
> >> Not sure why or where LVM comes into the picture here ... LVM is at
> >> the software layer and (hopefully) the RAID/JBOD stuff is at the
> >> hardware layer (and in the case of HDFS, LVM will only add unneeded
> >> overhead).
> >>
> >> -Michael
> >>
> >> On Feb 10, 2013, at 9:19 PM, Jean-Marc Spaggiari
> >> <[EMAIL PROTECTED]> wrote:
> >>
> >>
> >> The issue is that my motherboard is not doing JBOD :( Only RAID is
> >> possible, and I've been fighting with it for the last 48h and am
> >> still not able to make it work... That's why I'm thinking about
> >> using dfs.data.dir instead.
> >>
> >> I have 1 drive per node so far and need to move to 2 to reduce I/O
> >> wait (WIO).
> >>
> >> What would be better with JBOD compared to multiple dfs.data.dir
> >> entries? I have done some tests of JBOD vs LVM and have not found
> >> any pros for JBOD so far.
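> >>
> >> One crude way to compare the two setups is raw sequential throughput
> >> on each mount, with the datanode idle (GNU dd flags; the test file
> >> path is illustrative):
> >>
> >>   dd if=/dev/zero of=/hadoop1/ddtest bs=1M count=4096 oflag=direct
> >>   dd if=/hadoop1/ddtest of=/dev/null bs=1M iflag=direct
> >>   rm /hadoop1/ddtest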