Re: Multiple dfs.data.dir vs RAID0
You can also rebalance the disks using the steps described in the FAQ:
http://wiki.apache.org/hadoop/FAQ#On_an_individual_data_node.2C_how_do_you_balance_the_blocks_on_the_disk.3F
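
Roughly, the manual approach described there is: stop the datanode, move
block files between the dfs.data.dir directories with plain mv, and restart
the datanode. A sketch, assuming a 1.x directory layout (double-check the FAQ
and your own paths before moving anything):

  # stop the datanode so no blocks are written while you move them
  bin/hadoop-daemon.sh stop datanode

  # move some blocks (each blk_* file together with its .meta file), or a
  # whole subdir* directory, from the full disk to the new one
  mv /hadoop1/dfs/data/current/subdir10 /hadoop2/dfs/data/current/

  # restart the datanode; it will pick up the blocks in the new location
  bin/hadoop-daemon.sh start datanode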

Olivier
On 11 February 2013 15:54, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:

> Thanks, all, for your feedback.
>
> I have updated the HDFS config to add another dfs.data.dir entry and
> restarted the node. Hadoop is starting to use the new directory, but is
> not spreading the existing data over the 2 directories.
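>
> For reference, dfs.data.dir takes a comma-separated list of directories,
> so the entry in hdfs-site.xml looks something like this (paths are
> examples):
>
>   <property>
>     <name>dfs.data.dir</name>
>     <value>/hadoop1/dfs/data,/hadoop2/dfs/data</value>
>   </property>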
>
> Let's say you have a 2TB disk on /hadoop1, almost full. If you add
> another 2TB disk on /hadoop2 and add it to dfs.data.dir, Hadoop will
> start writing to /hadoop1 and /hadoop2, but /hadoop1 will stay
> almost full. It will not balance the already existing data over the 2
> directories.
>
> I have deleted all the content of /hadoop1 and /hadoop2 and restarted
> the node and now the data is spread over the 2. Just need to wait for
> the replication to complete.
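>
> To watch the re-replication progress, something like this shows the count
> of under-replicated blocks:
>
>   hadoop fsck / | grep -i 'under-replicated'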
>
> So what I will do instead is: I will add 2 x 2TB drives, mount them as
> RAID0, then move the existing data onto this new volume and remove the
> previous one. That way Hadoop will still see one directory under
> /hadoop1, but it will be 4TB instead of 2TB...
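>
> If that new volume ends up being software RAID, a minimal mdadm sketch
> would look like this (device names are examples only):
>
>   # create the striped array from the two new drives
>   mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
>   # put a filesystem on it and mount it where dfs.data.dir points
>   mkfs.ext4 /dev/md0
>   mount /dev/md0 /hadoop1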
>
> Is there anywhere I can read about Hadoop and the different kinds of
> physical data storage configurations? (Book, web, etc.)
>
> JM
>
> 2013/2/11, Ted Dunning <[EMAIL PROTECTED]>:
> > Typical best practice is to have a separate file system per spindle.  If
> > you have a RAID-only controller (many are), then you just create one RAID
> > volume per spindle.  The effect is the same.
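> >
> > For example (device names are just assumptions), after exposing each
> > disk as its own single-drive RAID0 volume in the controller, you format
> > and mount them separately and list both directories in dfs.data.dir:
> >
> >   mkfs.ext4 /dev/sdb && mount /dev/sdb /hadoop1
> >   mkfs.ext4 /dev/sdc && mount /dev/sdc /hadoop2
> >   # dfs.data.dir = /hadoop1/dfs/data,/hadoop2/dfs/data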
> >
> > MapR is unusually able to stripe writes over multiple drives organized
> > into a storage pool, but you will not normally be able to achieve that
> > same level of performance with ordinary Hadoop by using LVM over JBOD or
> > controller-level RAID.  The problem is that the Java layer doesn't
> > understand that the storage is striped and the controller doesn't
> > understand what Hadoop is doing.  MapR schedules all of the writes to
> > individual spindles via a very fast state machine embedded in the file
> > system.
> >
> > The comment about striping increasing the impact of a single disk drive
> > failure is exactly correct, and it makes modeling the failure modes of
> > the system considerably more complex.  The net result of the modeling
> > that I and others have done is that moderate to large RAID groups in
> > storage pools for moderate sized clusters (< 2000 nodes or so) are just
> > fine.  For large clusters of up to 10,000 nodes, you should probably
> > limit RAID groups to 4 drives or less.
> >
> > On Sun, Feb 10, 2013 at 7:39 PM, Marcos Ortiz <[EMAIL PROTECTED]> wrote:
> >
> >>  We have seen in several of our Hadoop clusters that LVM degrades the
> >> performance of our M/R jobs, and I remember a message where
> >> Ted Dunning explained something about this; since then, we don't
> >> use LVM for Hadoop data directories.
> >>
> >> As for RAID volumes, the best performance that we have achieved
> >> is with RAID 10 for our Hadoop data directories.
> >>
> >>
> >>
> >> On 02/10/2013 09:24 PM, Michael Katzenellenbogen wrote:
> >>
> >> Are you able to create multiple RAID0 volumes? Perhaps you can expose
> >> each disk as its own RAID0 volume...
> >>
> >> Not sure why or where LVM comes into the picture here ... LVM is on
> >> the software layer and (hopefully) the RAID/JBOD stuff is at the
> >> hardware layer (and in the case of HDFS, LVM will only add unneeded
> >> overhead).
> >>
> >> -Michael
> >>
> >> On Feb 10, 2013, at 9:19 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]> wrote:
> >>
> >>
> >>  The issue is that my motherboard is not doing JBOD :( Only RAID is
> >> possible, and I have been fighting with it for the last 48h and am still
> >> not able to make it work... That's why I'm thinking about using multiple
> >> dfs.data.dir entries instead.
> >>
> >> I have 1 drive per node so far and need to move to 2 to reduce I/O wait (WIO).

Olivier Renault
Solution Engineer - Big Data - Hortonworks, Inc.
+44 7500 933 036
[EMAIL PROTECTED]
www.hortonworks.com