Hadoop >> mail # dev >> Re: sailfish


Sriram Rao 2012-05-08, 22:54
Sriram Rao 2012-05-08, 22:48
Otis Gospodnetic 2012-05-09, 16:00
M. C. Srivas 2012-05-11, 05:50
Sriram Rao 2012-05-11, 06:01
Hey Sriram,

We discussed this before, but for the benefit of the wider audience: :)

It seems like the requirements imposed on KFS by Sailfish are in most
ways much simpler than the requirements of a full distributed
filesystem. The one thing we need is atomic record append -- but we
don't need anything else, like filesystem metadata/naming,
replication, corrupt data scanning, etc. All of the data is
transient/short-lived and at replication count 1.
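The key primitive Todd names is atomic record append: concurrent writers each get their record written whole, at a well-defined offset, with no interleaving. A minimal sketch of those semantics (this is an illustration under my own assumptions, not KFS's or Sailfish's actual API; the class name `AtomicAppendLog` is hypothetical):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of atomic record append semantics: each append is
// all-or-nothing, concurrent appenders never interleave bytes, and every
// call returns the offset its record landed at.
public class AtomicAppendLog {
    private final List<byte[]> records = new ArrayList<>();
    private long nextOffset = 0;

    // The lock makes the offset reservation and the write a single step,
    // so records from concurrent writers land contiguously.
    public synchronized long append(byte[] record) {
        long offset = nextOffset;
        records.add(record.clone());
        nextOffset += record.length;
        return offset;
    }

    public synchronized long length() {
        return nextOffset;
    }
}
```

With replication count 1 and short-lived data, this per-node buffer is essentially all the "filesystem" the intermediate data needs, which is why a purpose-built service can skip metadata/naming, replication, and scrubbing.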

So I think building something specific to this use case would be
pretty practical - and my guess is it might even have some benefits
over trying to use a full DFS.

In the MR2 architecture, I'd probably try to build this as a service
plugin in the NodeManager (similar to the way that the ShuffleHandler
in the current implementation works).
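To make the plugin idea concrete: YARN's real extension point is the auxiliary-service class in `org.apache.hadoop.yarn.server.api`, which the ShuffleHandler implements. The interface and class below are hypothetical stand-ins that only mirror that lifecycle shape, to show where a Sailfish-style service managing I-file space could hook in:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface mirroring the NodeManager aux-service lifecycle
// (the real API is YARN's AuxiliaryService; names here are assumptions).
interface NodeManagerPlugin {
    void serviceStart();                       // NodeManager comes up
    void initializeApplication(String appId);  // first container of a job lands here
    void stopApplication(String appId);        // job finished on this node
    void serviceStop();                        // NodeManager shuts down
}

// Sketch of a Sailfish-style plugin: allocate per-job space for transient
// intermediate data on start, reclaim it when the job ends.
class SailfishShuffleService implements NodeManagerPlugin {
    private final Map<String, String> appDirs = new HashMap<>();
    private boolean running = false;

    public void serviceStart() { running = true; }          // e.g. bring up a local chunkserver
    public void initializeApplication(String appId) {
        appDirs.put(appId, "/tmp/sailfish/" + appId);       // assumed path for I-file space
    }
    public void stopApplication(String appId) {
        appDirs.remove(appId);                              // data is transient: just drop it
    }
    public void serviceStop() { running = false; }

    public boolean isRunning() { return running; }
    public int activeApps() { return appDirs.size(); }
}
```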

-Todd

On Thu, May 10, 2012 at 11:01 PM, Sriram Rao <[EMAIL PROTECTED]> wrote:
> Srivas,
>
> Sailfish builds upon record append (a feature not present in HDFS).
>
> The software that is currently released is based on Hadoop-0.20.2.  You use
> the Sailfish version of Hadoop-0.20.2, KFS for the intermediate data, and
> then HDFS (or KFS) for storing the job/input.  Since the changes are all in
> the handling of map output/reduce input, it is transparent to existing jobs.
>
> What is being proposed below is to bolt all the starting/stopping of the
> related daemons into YARN as a first step.  There are other approaches that
> are possible, which have a similar effect.
>
> Hope this helps.
>
> Sriram
>
>
> On Thu, May 10, 2012 at 10:50 PM, M. C. Srivas <[EMAIL PROTECTED]> wrote:
>
>> Sriram,   Sailfish depends on append. I just noticed that HDFS disabled
>> append. How does one use this with Hadoop?
>>
>>
>> On Wed, May 9, 2012 at 9:00 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>>
>> > Hi Sriram,
>> >
>> > >> The I-file concept could possibly be implemented here in a fairly self
>> > contained way. One
>> > >> could even colocate/embed a KFS filesystem with such an alternate
>> > >> shuffle, like how MR task temporary space is usually colocated with
>> > >> HDFS storage.
>> >
>> > >  Exactly.
>> >
>> > >> Does this seem reasonable in any way?
>> >
>> > > Great. Where do we go from here?  How do we get a collaborative effort
>> > going?
>> >
>> >
>> > Sounds like a JIRA issue should be opened, the approach briefly
>> described,
>> > and the first implementation attempt made.  Then iterate.
>> >
>> > I look forward to seeing this! :)
>> >
>> > Otis
>> > --
>> >
>> > Performance Monitoring for Solr / ElasticSearch / HBase -
>> > http://sematext.com/spm
>> >
>> >
>> >
>> > >________________________________
>> > > From: Sriram Rao <[EMAIL PROTECTED]>
>> > >To: [EMAIL PROTECTED]
>> > >Sent: Tuesday, May 8, 2012 6:48 PM
>> > >Subject: Re: Sailfish
>> > >
>> > >Dear Andy,
>> > >
>> > >> From: Andrew Purtell <[EMAIL PROTECTED]>
>> > >> ...
>> > >
>> > >> Do you intend this to be a joint project with the Hadoop community or
>> > >> a technology competitor?
>> > >
>> > >As I said in my email, we are looking for folks to collaborate
>> > >with us to help get us integrated with Hadoop.  So, to be explicitly
>> > >clear, we are intending for this to be a joint project with the
>> > >community.
>> > >
>> > >> Regrettably, KFS is not a "drop in replacement" for HDFS.
>> > >> Hypothetically: I have several petabytes of data in an existing HDFS
>> > >> deployment, which is the norm, and a continuous MapReduce workflow.
>> > >> How do you propose I, practically, migrate to something like Sailfish
>> > >> without a major capital expenditure and/or downtime and/or data loss?
>> > >
>> > >Well, we are not asking for KFS to replace HDFS.  One path you could
>> > >take is to experiment with Sailfish---use KFS just for the
>> > >intermediate data and HDFS for everything else.  There is no major
>> > >capex :).  While you get comfy with pushing intermediate data into a

Todd Lipcon
Software Engineer, Cloudera
Robert Evans 2012-05-11, 14:29