Steve Loughran
2013-02-11, 21:20
Eli Collins
2013-02-11, 21:36
Steve Loughran
2013-02-12, 08:55
Eli Collins
2013-02-12, 21:35
Steve Loughran
2013-02-12, 21:51
Eli Collins
2013-02-12, 22:09
Steve Loughran
2013-02-13, 09:44
Alejandro Abdelnur
2013-02-13, 20:07
Steve Loughran
2013-02-14, 14:05
Eric Baldeschwieler
2013-03-01, 05:02
Steve Loughran
2013-03-08, 14:43
Alejandro Abdelnur
2013-03-08, 16:15
Steve Loughran
2013-03-08, 16:57
Alejandro Abdelnur
2013-03-08, 17:07
Alejandro Abdelnur
2013-03-08, 18:47
Steve Loughran
2013-03-09, 11:36
Alejandro Abdelnur
2013-03-11, 19:15
-
where do side-projects go in trunk now that contrib/ is gone?
Steve Loughran 2013-02-11, 21:20

I posted this to common-dev and got no answer, so I'm moving it to -general, it being a general problem: where to put stuff that is part of Hadoop yet which isn't something you can just add to the src/ tree of one of the existing JAR files?

---------- Forwarded message ----------
From: Steve Loughran
Date: 5 February 2013 19:32
Subject: now that contrib has gone, where do we put code that isn't going into hadoop-common.jar?
To: [EMAIL PROTECTED]

I've got two homeless projects looking to get into Hadoop:

1. the branch-1 HA canary monitor, which can also monitor arbitrary services with an HTTP port, including declared dependencies on HDFS being live (no timeouts reported to vSphere or Linux HA while HDFS is offline or in safe mode)
2. the swiftfs filesystem driver
   https://github.com/hortonworks/Hadoop-and-Swift-integration
   docs: https://github.com/hortonworks/Hadoop-and-Swift-integration/blob/master/swift-file-system/src/site/md/swift-filesystem.md

#1 I can just stick in contrib/ in branch-1, but #2 is homeless. It's a big patch, but it is coupled to the FS contract tests (which I've been extending in HADOOP-9258 to be more rigorous).

Where can things like this go, now that there is no contrib/? hadoop-tools?
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Eli Collins 2013-02-11, 21:36

The idea of removing contrib was that this source would no longer go into the Hadoop project. Where it goes is really up to the contributor. Some people have created separate incubator projects (e.g. MRUnit), some people are using Apache Extras (hosted on Google), and some could bake on GitHub and then get contributed to Hadoop when they're ready to be fully supported in the code base (e.g. Hadoop Auth and HttpFS).
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Steve Loughran 2013-02-12, 08:55

On 11 February 2013 21:36, Eli Collins <[EMAIL PROTECTED]> wrote:

> The idea of removing contrib was that this source would no longer go into
> the Hadoop project. Where it goes is really up to the contributor. Some
> people have created separate incubator projects (eg MRUnit), some people
> are using Apache extras (hosted on Google), some could bake on github and
> then get contributed to Hadoop when they're ready to be fully supported in
> the code base (eg Hadoop auth and Httpfs).

I understand, and I recognise that a limit of contrib/ was its low maintenance.

However, what the "no contrib" policy is saying is "no apache community around any extension code". I can -and do- have stuff in github, but I'm left choreographing merging with others -Rackspace, Mirantis- and as well as having the merge grief that comes from not having a shared repository, we are effectively excluding everyone else from participating.

Similarly: google code hosting != ASF.

When you consider that Hadoop was a spin-off from Lucene, Ant a spin-off from Tomcat, and (I think) Xalan started off in a corner of Xerces, etc., forcing things out of the ASF stops this incremental growth, at least until things are already at the stage where they are considered ready for incubation -which also implies that this will be a long-lived project with independence from everything else.

I can certainly see the value in having independent projects, but making it the starting place for all contributions to the hadoop codebase is too high a barrier.

We need something less formal that is still part of the ASF.

-steve
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Eli Collins 2013-02-12, 21:35

"extension code" is separate from "contrib". There's nothing preventing extension code from being part of the project. Removing contrib is about removing not-yet-baked or unmaintained code, which causes problems for users (who don't know what contrib is, i.e. they can't tell what's supposed to work and what doesn't).

IMO Hadoop extension code is often a better fit for either Apache Extras (apache-extras.org, the community of open source projects related to Apache Software Foundation projects or based on their technology) or the project that's being integrated with (often it will be maintained by those developers, so it is better to have it in a repo where they can more easily monitor/commit). In some cases I could see it living in the Hadoop source as well, but I don't think there's one right (TM) place for extension code.

To answer your original question, there's already precedent for extension code in the project, e.g. file system extensions like s3 live in o.a.h.fs. The directory depends on the type of extension (e.g. 3rd party codec integration lives in o.a.h.io.compress). Probably best to discuss the particular case.
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Steve Loughran 2013-02-12, 21:51

On 12 February 2013 21:35, Eli Collins <[EMAIL PROTECTED]> wrote:

> To answer your original question, there's already precedent for
> extension code in the project, eg file system extensions like s3 live
> in o.a.h.fs. The directory depends on the type of extension (eg 3rd
> party codec integration lives in o.a.h.io.compress). Probably best to
> discuss the particular case.

I know that s3n & S3 are in the main JAR, but I don't think that should be copied without good reason, because it's also added another dependency (jets3t) to everything.

The case of where to put the SwiftFS driver has cropped up in HADOOP-8545, but I don't think a single JIRA is the place to define policy like this -hopefully general is.

-Steve
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Eli Collins 2013-02-12, 22:09

I agree that the current place isn't a good one, both for the reasons you mention on the jira and because the people maintaining this code don't primarily work on Hadoop. IMO the SwiftFS driver should live in the swift source tree (as part of OpenStack).

I'm not -1 on it living in-tree, it's just not my 1st choice. If you want to create a top-level directory for 3rd party (read: non-local, non-hdfs) file systems, go for it. It would be an improvement on the current situation (o.a.h.fs.ftp also brings in dependencies that most people don't need). I don't think we need to come up with a new top-level "kitchen sink" directory to handle all Hadoop extensions. There are a few well-defined extension points that can likely be handled independently, so logically grouping them separately makes sense to me (and perhaps we'll decide some extensions are better in-tree and some not).
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Steve Loughran 2013-02-13, 09:44

On 12 February 2013 22:09, Eli Collins <[EMAIL PROTECTED]> wrote:

> I agree that the current place isn't a good one, for both the reasons
> you mention on the jira (and because the people maintaining this code
> don't primarily work on Hadoop). IMO the SwiftFS driver should live in
> the swift source tree (as part of open stack).

If they could be persuaded to move beyond .py, it'd be tempting -because the FileSystem API is nominally stable.

However, one thing I have noticed during this work is how the behaviour of FileSystem is underspecified -that's not an issue for HDFS, which gets stressed rigorously during the hdfs and mapred test runs, but it does matter for the rest.

There are a lot of assumptions ("files != directories", "mv / anything" fails) and things that aren't tested: "mv self self" returns true if self is a file and false if it is a directory; what exception to raise if readFully goes past the end of a file (and the answer is?).

We even make an implicit assumption that file operations are consistent: you get back what you wrote, which turns out to be an assumption not guaranteed by any of the blobstores in all circumstances.

HADOOP-9258 and HADOOP-9119 tighten the spec a bit, but if you look at what I've been doing for Swift testing, I've created a set of test suites, one per operation ("ls", "read", "rename"), with tests for scale, directory depth and width on my todo list:

https://github.com/hortonworks/Hadoop-and-Swift-integration/tree/master/swift-file-system/src/test/java/org/apache/hadoop/fs/swift

Then I want to extract those into tests that can be applied to all filesystems (say in o.a.h.fs.contract), with some per-FS metadata file providing details on what the FS supports (rename, append, case sensitivity, MAX_PATH, ...), so that we've got better test coverage that can be applied to all the filesystems in the hadoop codebase. (Being JUnit4, you can skip tests in-code by throwing AssumptionViolatedExceptions; these get reported as skips.)

It's this expanded test coverage that will be the tightest coupling to hadoop.

> I'm not -1 on it living in-tree, it's just not my 1st choice. If you
> want to create a top-level directory for 3rd party (read non-local,
> non-hdfs file systems) file systems - go for it. It would be an
> improvement on the current situation (o.a.h.fs.ftp also brings in
> dependencies that most people don't need). I don't think we need to
> come up with a new top-level "kitchen sink" directory to handle all
> Hadoop extensions, there are a few well-defined extension points that
> can likely be handled independently so logically grouping them
> separately makes sense to me (and perhaps we'll decide some extensions
> are better in-tree and some not).

Makes sense. That I will do in a JIRA.
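
The idea in the message above, shared contract tests that consult per-filesystem metadata and report a skip rather than a failure when a feature is unsupported, can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: every name below (ContractSketch, FsContract, SkipException, APPEND_TEST) is hypothetical and not taken from Hadoop, and SkipException merely stands in for JUnit 4's AssumptionViolatedException so that the sketch has no dependencies.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: none of these names come from Hadoop or JUnit.
class ContractSketch {

    /** Stand-in for JUnit 4's AssumptionViolatedException. */
    static class SkipException extends RuntimeException {
        SkipException(String message) {
            super(message);
        }
    }

    /** Per-filesystem metadata: which optional features the store declares. */
    static class FsContract {
        final String name;
        final Set<String> supported;

        FsContract(String name, String... features) {
            this.name = name;
            this.supported = new HashSet<>(Arrays.asList(features));
        }

        /** Abort the current test as a skip when the feature is not declared. */
        void assume(String feature) {
            if (!supported.contains(feature)) {
                throw new SkipException(name + " does not declare " + feature);
            }
        }
    }

    /** A contract test receives the metadata of the filesystem under test. */
    interface ContractTest {
        void run(FsContract fs) throws Exception;
    }

    /** An append test only makes sense on stores that declare append. */
    static final ContractTest APPEND_TEST = fs -> {
        fs.assume("append");
        // Real append checks against the filesystem would go here.
    };

    /** Run one test and classify the outcome the way a test runner would. */
    static String runOne(ContractTest test, FsContract fs) {
        try {
            test.run(fs);
            return "PASS";
        } catch (SkipException e) {
            return "SKIP";
        } catch (Throwable t) {
            return "FAIL";
        }
    }

    public static void main(String[] args) {
        FsContract hdfsLike = new FsContract("hdfs-like", "rename", "append");
        FsContract blobstore = new FsContract("blobstore", "rename");
        System.out.println(hdfsLike.name + ": " + runOne(APPEND_TEST, hdfsLike));
        System.out.println(blobstore.name + ": " + runOne(APPEND_TEST, blobstore));
    }
}
```

The point of the design is that the same test body runs against every filesystem, and only the metadata differs; a store that does not declare a feature shows up as a skip in the report instead of a misleading failure.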
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Alejandro Abdelnur 2013-02-13, 20:07

Steve,

I like the idea of testing all FS for expected behavior. In HttpFS we are already doing something along these lines, testing HttpFS against HDFS and LocalFS, and also testing 2 WebHDFS clients.

Regarding where these 'extensions' would go, well, we could have something like share/hadoop/common/filesystem-ext/s3, and whoever wants to use s3 would have to symlink those JARs into common/lib. Or we could have a way to select, via a HADOOP_COMMON_FS_EXT env var, which extension JARs to pick up. I guess the BigTop guys could help define this magic.

-- Alejandro
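
As a rough illustration of the second option above, the following sketch maps the value of a HADOOP_COMMON_FS_EXT variable onto extension directories under share/hadoop/common/filesystem-ext. The thread does not define the variable's format, so the comma-separated list of extension names is purely an assumption, as are the class and method names (FsExtSketch, extensionDirs).

```java
import java.io.File;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: the comma-separated format is an assumption.
class FsExtSketch {

    /** Map a comma-separated list of extension names onto directories under extRoot. */
    static List<Path> extensionDirs(Path extRoot, String envValue) {
        List<Path> dirs = new ArrayList<>();
        if (envValue == null) {
            return dirs; // variable unset: no extensions are activated
        }
        for (String name : envValue.split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                dirs.add(extRoot.resolve(trimmed));
            }
        }
        return dirs;
    }

    public static void main(String[] args) {
        Path root = Paths.get("share", "hadoop", "common", "filesystem-ext");
        String env = System.getenv("HADOOP_COMMON_FS_EXT");
        for (Path dir : extensionDirs(root, env)) {
            // "dir/*" is the JVM classpath wildcard for every JAR in a directory.
            System.out.println(dir + File.separator + "*");
        }
    }
}
```

A launcher script could append the printed entries to the classpath, which would avoid symlinking JARs into common/lib.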
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Steve Loughran 2013-02-14, 14:05

On 13 February 2013 20:07, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:

> Steve,
>
> I like the idea of testing all FS for expected behavior, in HttpFS we are
> already doing something along these lines testing HttpFS against HDFS and
> LocalFS. Also testing 2 WebHDFS clients.

Excellent. I look forward to your test contributions!

> Regarding where these 'extensions' would go, well, we could have something
> like share/hadoop/common/filesystem-ext/s3 and whoever wants to use s3
> would have to symlink those JARs into common/lib. Or having a way to
> activate via a HADOOP_COMMON_FS_EXT env which extension JARs to pick up. I
> guess the BigTop guys could help defining this magic.

I was thinking less of "where should it go at install time" and more of "where do we keep it in SVN".

At install time you'd need the JAR + any dependencies on the daemon paths -if it is to be everywhere- or uploaded with a job into the distributed cache. Testing that the latter works with filesystem.get() would be something to play with.

& yes, bigtop could help there
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Eric Baldeschwieler 2013-03-01, 05:02

I agree with where this is going.

Swift and S3 are compelling enough that they should be in the source tree IMO. Hadoop needs to play well with common platforms such as the major clouds.

On the other hand, it would be great if we could segregate them enough that each builds as its own JAR and folks have a clean way of opting out of pulling in their dependencies and of building / testing them.
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Steve Loughran 2013-03-08, 14:43

On 1 March 2013 05:02, Eric Baldeschwieler <[EMAIL PROTECTED]> wrote:

> I agree with where this is going.
>
> Swift and S3 are compelling enough that they should be in the source tree
> IMO. Hadoop needs to play well with common platforms such as the major
> clouds.
>
> On the other hand, it would be great if we could segregate them enough
> that each builds is its own JAR and folks have the option of not pulling
> their dependancies in and not building / testing them in a clean way.

I've added a JIRA on setting up a bit of the src tree and subproject(s) for these: https://issues.apache.org/jira/browse/HADOOP-9385

Test plans go into https://issues.apache.org/jira/browse/HADOOP-9361, which can evolve at a different rate.

-Steve
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Alejandro Abdelnur 2013-03-08, 16:15

Jumping a bit late into the discussion.

I'd argue that unless those filesystems are part of hadoop, their clients should not be distributed/built by hadoop.

An analogy to this is not wanting Yarn to be the home for AM implementations.

A key concern is testability and maintainability.

Still, I see bigtop as the integration point and the means of making those jars available to a setup.

thanks

Alejandro
(phone typing)
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Steve Loughran 2013-03-08, 16:57

On 8 March 2013 16:15, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:

> jumping a bit late into the discussion.

Yes. I started it in common-dev first, in the "where does contrib stuff go now" thread, then moved to general, where the conclusion was "except for special cases like FS clients, it isn't".

Now I'm trying to lay down the location for FS stuff, both for openstack and to handle some proposed dependency changes for s3n://

> I'd argue that unless those filesystems are part of hadoop, their clients
> should not be distributed/build by hadoop.
>
> an analogy to this is not wanting Yarn to be the home for AM
> implementations.
>
> a key concern is testability and maintainability.

We are already there with the S3 and Azure blobstores, as well as the FTP filesystem.

The testability is straightforward for blobstores precisely because all you need is some credentials and cluster time; there's no requirement to have some specific filesystem to hand. For any of those, it is very much in the vendor's hands to do their own testing, especially if the "it's a replacement for HDFS" assertion is made.

If you look at HADOOP-9361 you can see that I've been defining more rigorously than before what our FS expectations are, with HADOOP-9371 spelling it out: what happens when you try to readFully() past the end of a file, or call getBlockLocations("/")? HDFS has actions here, and downstream code depends on some things (e.g. getBlockLocations() behaviour on directories).

https://issues.apache.org/jira/secure/attachment/12572328/HadoopFilesystemContract.pdf

So far my initially blobstore-specific tests for the functional parts of the specification (not the consistency, concurrency, atomicity parts) are in github:

https://github.com/hortonworks/Hadoop-and-Swift-integration/tree/master/swift-file-system/src/test/java/org/apache/hadoop/fs/swift

I've also added more tests to the existing FS contract test, and in doing so showed that s3 and s3n have some data-loss risks which need to be fixed -that's an argument in favour of having the (testable, low-maintenance cost) filesystems somewhere where any of us is free to fix them.

While we refine that spec, I want to take those per-operation tests from the SwiftFS support, make them retargetable at other filesystems, and slowly apply them to all the distributed filesystems. Your colleague Andrew Wang is helping there by abstracting FileSystem and FileContext away, so we can test both.

> still, i see bigtop as the integration point and the mean of making those
> jars avail to a setup.

I plan to put the integration tests there -the tests that try to run Pig with arbitrary source and dest filesystems, same for hive- plus some scale tests: can we upload an 8GB file? What do you get back? Can I create > 65536 entries in a single directory, and what happens to ls / performance?

To summarise then:

1. blobstores, ftpfilesystem &c could gradually move to a hadoop-common/hadoop-filesystem-clients
2. A stricter specification of compliance, for the benefit of everyone -us, other FS implementors and users of FS APIs
3. Lots of new functional tests for compliance -abstract in hadoop-common; FS-specific in hadoop-filesystem-clients
4. Integration & scale tests in bigtop
5. Anyone writing a "hadoop compatible FS" can grab the functional and integration tests and see what breaks -fixing their code
6. The combination of (Java API files, specification doc, functional tests, HDFS implementation) defines the expected behavior of a filesystem

-Steve
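
For a sense of what one answer to the "readFully() past the end of a file" question looks like, the JDK's own java.io.DataInputStream.readFully throws EOFException when the stream ends before the buffer is filled. The sketch below demonstrates only that java.io behaviour; it does not show what any Hadoop FileSystem implementation does, and the names ReadFullySketch and readPastEnd are made up for the illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrates java.io only; Hadoop input streams may behave differently.
class ReadFullySketch {

    /** Ask for `wanted` bytes from an in-memory stream holding `available` bytes. */
    static String readPastEnd(int available, int wanted) {
        DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(new byte[available]));
        try {
            in.readFully(new byte[wanted]);
            return "read " + wanted + " bytes";
        } catch (EOFException e) {
            // The stream ended before the buffer could be filled.
            return "EOFException";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("8 of 8: " + readPastEnd(8, 8));
        System.out.println("8 of 4: " + readPastEnd(4, 8));
    }
}
```

A filesystem contract test for this case would assert one specific outcome, so that every implementation is held to the same behaviour rather than whatever its underlying client library happens to do.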
-
Re: where do side-projects go in trunk now that contrib/ is gone?
Alejandro Abdelnur 2013-03-08, 17:07

> We are already there with the S3 and Azure blobstores, as well as the FTP
> filesystem

I think this is not correct and we should plan on moving them out.

This is independent of the effort to straighten up the FS spec, which I think is great.

Thx

Alejandro
-
Re: where do side-projects go in trunk now that contrib/ is gone?Alejandro Abdelnur 2013-03-08, 18:47
I was chatting offline with Roman about this, his point is

1* segregation of the FS impls into different modules makes sense
2* it should be OK if they have mock services for unittests
3* bigtop could do real integration testing
4* by doing this, the diff FileSystem impls would be there out of the box

If we go down this path, I'm OK with it.

Thoughts?

On Fri, Mar 8, 2013 at 9:07 AM, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:

> > We are already there with the S3 and Azure blobstores, as well as the FTP
> > filesystem
>
> I think this is not correct and we should plan moving them out.
>
> This is independent on the effort of straighten up the FS spec, which I
> think is great.
>
> Thx
>
> On Fri, Mar 8, 2013 at 8:57 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
>
>> On 8 March 2013 16:15, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:
>>
>> > jumping a bit late into the discussion.
>>
>> yes. I started it in common-dev first, in the "where does contrib stuff go
>> now", moved to general, where the conclusion was "except for special cases
>> like FS clients, it isn't".
>>
>> Now I'm trying to lay down the location for FS stuff, both for openstack,
>> and to handle so proposed dependency changes for s3n://
>>
>> > I'd argue that unless those filesystems are part of hadoop, their clients
>> > should not be distributed/build by hadoop.
>> >
>> > an analogy to this is not wanting Yarn to be the home for AM
>> > implementations.
>> >
>> > a key concern is testability and maintainability.
>>
>> We are already there with the S3 and Azure blobstores, as well as the FTP
>> filesystem
>>
>> The testability is straightforward for blobstores precisely because all you
>> need is some credentials and cluster time; there's no requirement to have
>> some specific filesystem to hand. Any of those -very much in the vendors
>> hand to do their own testing, especially if the "it's a replacement for
>> HDFS" assertion is made.
>>
>> If you look at HADOOP-9361 you can see that I've been defining more
>> rigorously than before what our FS expectations are, with HADOOP-9371
>> spelling it out "what happens when you try to readFully() past the end of a
>> file, or call getBlockLocations("/")? HDFS has actions here, and downstream
>> code depends on some things (e.g. getBlockLocations() behaviour on
>> directories)
>>
>> https://issues.apache.org/jira/secure/attachment/12572328/HadoopFilesystemContract.pdf
>>
>> So far my initially blobstore-specific tests for the functional parts of
>> the specification (not the consistency, concurrency, atomicity parts) are
>> in github
>>
>> https://github.com/hortonworks/Hadoop-and-Swift-integration/tree/master/swift-file-system/src/test/java/org/apache/hadoop/fs/swift
>>
>> I've also added more tests to the existing FS contract test, and in doing
>> so showed that s3 and s3n have some data-loss risks which need to be fixed
>> -that's an argument in having favour of the (testable, low-maintenance
>> cost) filesystems somewhere where any of us is free to fix.
>>
>> While we refine that spec better, I want to take those per-operation tests
>> from the SwiftFS support, make them retargetable at other filesystems, and
>> slowly apply them to all the distributed filesystems. Your colleague Andrew
>> Wang is helping there by abstracting FileSystem and FileContext away, so we
>> can test both.
>>
>> > still, i see bigtop as the integration point and the mean of making those
>> > jars avail to a setup.
>>
>> I plan to put integration -the tests that try to run Pig with arbitrary
>> source and dest filesystems, same for hive, plus some scale tests -can we
>> upload an 8GB file? What do you get back? can I create > 65536 entries in a
>> single directory, and what happens to ls / performance?
>>
>> To summarise then
>>
>> 1. blobstores, ftpfilesystem & c could gradually move to a
>> hadoop-common/hadoop-filesystem-clients
>> 2. A stricter specification of compliance, for the benefit of everyone

Alejandro
Re: where do side-projects go in trunk now that contrib/ is gone? Steve Loughran 2013-03-09, 11:36
On 8 March 2013 18:47, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:

> I was chatting offline with Roman about this, his point is
>
> 1* segregation of the FS impls into different modules makes sense
> 2* it should be OK if they have mock services for unittests

not so much mock tests as live tests against individual features (rename, delete, mkdirs), but not full tests of MR jobs, Pig jobs, etc -which verify that real code works with it

> 3* bigtop could do real integration testing

exactly -it's at the end of the dependency graph, and the best place to do that

> 4* by doing this, the diff FileSystem impls would be there out of the box
>
> If we go down this path, I'm OK with it.
>
> Thoughts?

This is exactly what I've been thinking
Re: where do side-projects go in trunk now that contrib/ is gone? Alejandro Abdelnur 2013-03-11, 19:15
sounds good, thx

On Sat, Mar 9, 2013 at 3:36 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:

> On 8 March 2013 18:47, Alejandro Abdelnur <[EMAIL PROTECTED]> wrote:
>
> > I was chatting offline with Roman about this, his point is
> >
> > 1* segregation of the FS impls into different modules makes sense
> > 2* it should be OK if they have mock services for unittests
>
> not so much mock tests as live tests against individual features (rename,
> delete, mkdirs), but not full tests of MR jobs, Pig jobs, etc -which verify
> that real code works with it
>
> > 3* bigtop could do real integration testing
>
> exactly -it's at the end of the dependency graph, and the best place to do
> that
>
> > 4* by doing this, the diff FileSystem impls would be there out of the box
> >
> > If we go down this path, I'm OK with it.
> >
> > Thoughts?
>
> This is exactly what I've been thinking

--
Alejandro