Hadoop >> mail # general >> [ANN] Plasma MapReduce, PlasmaFS, version 0.4


Re: [ANN] Plasma MapReduce, PlasmaFS, version 0.4
On 12/10/11 17:31, Gerd Stolpmann wrote:
> Hi,
>
> This is about the release of Plasma-0.4, an alternate and independent
> implementation of map/reduce with its own dfs. This might also be
> interesting for Hadoop users and developers, because this project
> incorporates a number of new ideas. So far, Plasma has proven to work on
> smaller clusters and shows good signs of being scalable. The design of
> PlasmaFS is certainly superior to that of HDFS - I did not want a
> quick'n'dirty solution, so please have a look how to do it right.
>
> Concerning the features, these two pages compare Plasma and Hadoop:
>
> http://plasma.camlcity.org/plasma/dl/plasma-0.4/doc/html/Plasmafs_and_hdfs.html
>

- Without block checksums, your code contains assumptions about HDD
integrity that do not stand up to the classic works by Pinheiro or
Schroeder. Essentially you appear to be assuming that HDDs don't corrupt
data, yet both HDDs and their interconnects can play up. For a recent
summary of Hadoop integrity, I would point you at [Loughran2011]:

http://www.slideshare.net/steve_l/did-you-reallywantthatdata

- Hadoop NNs benefit from SSDs too.

- Auth and security have improved recently, though I'd still run it in a
private subnet just to be sure.
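
To make the checksum point above concrete, here is a minimal sketch of per-chunk checksumming inside a block. It is not the HDFS or PlasmaFS code; the 512-byte chunk size, the function names, and the use of CRC32 from Python's zlib are assumptions chosen for illustration only.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum (assumed value for this sketch)

def checksum_block(data, chunk=CHUNK):
    """Return one CRC32 per `chunk` bytes of `data` (hypothetical helper)."""
    return [zlib.crc32(data[i:i + chunk]) for i in range(0, len(data), chunk)]

def find_corrupt_chunks(data, expected, chunk=CHUNK):
    """Return the indices of chunks whose CRC32 no longer matches `expected`."""
    actual = checksum_block(data, chunk)
    if len(actual) != len(expected):
        raise ValueError("block length changed since checksums were taken")
    return [i for i, (a, e) in enumerate(zip(actual, expected)) if a != e]

# Write path: compute the checksums and store them alongside the block.
block = bytes(range(256)) * 8      # 2048 bytes, i.e. 4 chunks
sums = checksum_block(block)

# Read path: a single flipped bit, as a bad disk or cable might produce.
damaged = bytearray(block)
damaged[1000] ^= 0x01
bad = find_corrupt_chunks(bytes(damaged), sums)
print(bad)                         # byte 1000 lies in chunk index 1
```

With stored checksums the reader can tell which chunk went bad and fetch another replica; without them the corrupted bytes are returned as if they were valid.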
>
> http://plasma.camlcity.org/plasma/dl/plasma-0.4/doc/html/Plasmamr_and_hadoop.html
>
> I hope you see where the point is.

Again, support for small block sizes is relevant in small clusters. In
larger clusters you will not only have larger block sizes; if you do
work on small blocks, the sheer number of task trackers reporting back to
the JT can overload it.
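
A rough back-of-the-envelope sketch of why block size matters for the JT. It assumes one map task per block, which is only an approximation, and the input size and block sizes are illustrative numbers, not measurements from any cluster.

```python
MiB = 2**20
TiB = 2**40

def map_task_count(input_bytes, block_bytes):
    """Tasks needed if every block becomes one map task (assumption)."""
    return -(-input_bytes // block_bytes)   # integer ceiling division

input_size = 1 * TiB                        # illustrative job input
for block_size in (64 * MiB, 1 * MiB):
    tasks = map_task_count(input_size, block_size)
    print(f"{block_size // MiB:>3} MiB blocks -> {tasks:,} map tasks")
```

Under these assumptions, dropping from 64 MiB to 1 MiB blocks multiplies the number of tasks the JT has to schedule and track by 64 for the same input.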

>
> I have currently only limited resources for testing my implementation.
> If there is anybody interested in testing on bigger clusters, please let
> me know.

That's one of the issues with the Plasma design: I'm not sure how well
things like POSIX semantics, esp. locking and writes with offsets, scale.
That's why the very large filesystems, HDFS included, tend to drop them.
Look at how much effort it took to get Append to work reliably.

Without evidence of working at scale, I'm not sure how the claim "the
design of Plasma is certainly superior to HDFS" is defensible. Sorry.

That said, using SunOS RPC/NFS as an FS protocol is nice as it does make
mounting straightforward. And as locking isn't guaranteed in NFS,
you may be able to get away without it.