Aleksandr Shulman
2013-01-14, 18:32
Ted Yu
2013-01-14, 19:01
Aleksandr Shulman
2013-01-14, 19:15
Andrew Purtell
2013-01-14, 23:15
Jonathan Hsieh
2013-01-15, 01:27
Andrew Purtell
2013-01-15, 02:47
Jonathan Hsieh
2013-01-15, 08:55
Nicolas Liochon
2013-01-15, 09:25
Jean-Marc Spaggiari
2013-01-15, 19:01
-
Let's discuss Snapshots Feature Testing
Aleksandr Shulman 2013-01-14, 18:32
Hi everyone,

I'd like to start a thread about Cloudera's testing efforts on the upcoming snapshots feature. This is a new feature and it's important that we explain our testing efforts and get the community's opinion on what we'd all like to see tested. My hope is that from this discussion, we can get more ideas about what needs to be tested and gain confidence in the testing we have in place.

Before I begin, I'd like to introduce myself. I'm Aleks Shulman. I'm a software engineer at Cloudera, working primarily on HBase. Within HBase, I am focusing on the quality side of things. What this means to me is a conversation unto itself, but in brief, I will be writing tests and test frameworks. I will also be an advocate for the user experience, with particular focus on API compatibility and ease-of-use.

So let's discuss snapshots:
There are two main areas that should be tested, and they correspond nicely to what can be done as unit tests and what is better left to a Jenkins job or some other automation: unit testing and non-unit testing. We've been working on this for a bit, so there is already some progress in these areas:

Unit testing - In progress or completed:

1. HBase Snapshots Repeatability and Idempotency Test
This test class verifies proper behavior with regard to performing restore/clone operations on tables that were themselves created as a clone or restored from a snapshot. This is an interesting set of cases because of the way snapshots work: they work by pointing to the original HFiles. We can use these tests to verify correctness in the file system and test closure under deletion of the original table.

2. HBase Snapshots HTable Descriptor Test
This test class verifies proper behavior with regard to changes to the information about the table itself, before and after snapshotting, in both the 'before' table and the 'after' table.

3. HBase Snapshots HFileLink Test
This test class inspects the correctness of the HFileLink files.
It looks into their permissioning, the naming convention, and how they respond to events. Events may include an HFile being deleted or moved.

4. HBase Snapshots Table Dimensions Test
This test class inspects operations on tables that are empty, have only one row, have one or two CFs, etc. Basically, it covers any edge scenario in what the table looks like that may affect the way it is snapshotted or restored/cloned.

5. HBase Snapshots Independence Test
This test should verify that all aspects of table independence are guaranteed between the original table and the restored snapshot/clone. This includes things like data mutations, compactions, splits, etc. It also includes metadata changes.

6. HBase Snapshots Aborted or Failed Snapshot Cleanup
Verifies that no cruft is left over after an attempt to snapshot a table fails or is aborted. We should be able to account for every file in the file system before and after.

7. HBase Snapshots HFile Archive Test
This test task is to fill in any gaps in testing of archiving as it relates to snapshots. The snapshot feature relies on the HFileArchiver/LogArchiver with two new cleaners (SnapshotHFile/SnapshotLog Cleaners), so we'd need to go through and find out what needs to be tested between them.

8. HBase Snapshots Export Test
This test should verify that export of a snapshot to another cluster works properly.
Implemented as:
    mvn clean test -PlocalTests -Dtest=org.apache.hadoop.hbase.snapshot.TestExportSnapshot
However, we need to add more tests around chmod, chown and checksums.

9. HBase Snapshots Concurrent Snapshots Test
This test class will enforce proper behavior in situations where race conditions can occur. For example, if one process attempts to restore a table and another one tries to do so simultaneously, what happens? We need to know how dangerous this could be and whether it is possible for data to be lost. Covered in HBASE-7536.

Unit testing - Lightly tested so far, or tests we are hoping to write soon:

1. HBase Snapshots File System Correctness Tests
This test class verifies proper behavior with regard to what the file system looks like. What the file system contains should be predictable after certain events, both snapshot-specific and environment-specific. For example, after a snapshot, we should expect there to be files in the /hbase/.snapshot/ folder. Also, after a split occurs on the base table and the underlying HFiles go through flux, we should be able to know beforehand where files move. In particular, this is important to test after repeated deletions and modifications. Also -- we want to make sure no cruft remains after various operations occur.

2. HBase Snapshots (Re)Naming Test
[Note: Renaming snapshots is not supported yet!]
These tests should verify valid/invalid names for snapshots. In particular, they should use the rename_snapshot command to attempt to rename to a table that already exists, or to a snapshot that already exists (or had existed but was deleted). Things like special characters or semantically-meaningful characters are important as well. Another thing that needs to be tested is what happens if a snapshot is created and deleted, the underlying table is modified, and then another snapshot is taken. The snapshot should contain the most recent data.

3. Snapshots logline test
Verifies that the proper loglines are generated for events. Manual testing for this might include making sure that spurious, misleading, or unnecessary log lines are not present.

4. HBase Snapshots Aborted or Failed Clone or Restore
Verifies that no cruft is left over after an attempt to restore or clone a snapshotted table fails or is aborted, and that further snapshots can take place. This may be tricky and could require writing some additional utilities.

Non-unit testing:
This area of testing is less straightforward and more exploratory in nature. It's open-ended but with some direction.
Particularly, we want to test a lot of "what if this happens when we do something related to snapshots". By "this happens",
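
As a rough illustration of two of the unit-test ideas listed above (snapshots that point at the original HFiles and must survive deletion of the original table, and accounting for every file after a failed snapshot), here is a minimal Python sketch. It is a toy in-memory model and not HBase code: the class ToyFileStore, its methods, and the file and snapshot names are all hypothetical, and plain reference counting stands in for whatever the real archiver and cleaners do.

```python
class ToyFileStore:
    """Hypothetical in-memory stand-in for a file system.

    Maps each file name to the set of owners (tables or snapshots)
    that still reference it. This is an assumption made for
    illustration, not how HBase tracks HFile references.
    """

    def __init__(self):
        self.refs = {}  # file name -> set of owner names

    def add_file(self, owner, name):
        self.refs.setdefault(name, set()).add(owner)

    def snapshot(self, table, snapshot_name, fail=False):
        """Reference every file of `table` from `snapshot_name`.

        With fail=True the snapshot aborts halfway through and rolls
        back, which is the cleanup behavior a test would check.
        """
        files = sorted(f for f, owners in self.refs.items() if table in owners)
        linked = []
        try:
            for i, f in enumerate(files):
                if fail and i == len(files) // 2:
                    raise RuntimeError("simulated snapshot failure")
                self.refs[f].add(snapshot_name)
                linked.append(f)
        except RuntimeError:
            for f in linked:  # undo the partial references
                self.refs[f].discard(snapshot_name)
            return False
        return True

    def drop(self, owner):
        """Remove an owner; files with no remaining owner are deleted."""
        for f in list(self.refs):
            self.refs[f].discard(owner)
            if not self.refs[f]:
                del self.refs[f]

    def listing(self):
        """Immutable view of the store for before/after accounting."""
        return {f: frozenset(o) for f, o in self.refs.items()}


store = ToyFileStore()
for name in ("hfile-1", "hfile-2", "hfile-3"):
    store.add_file("t1", name)

# A failed snapshot must leave the listing exactly as it was.
before = store.listing()
assert store.snapshot("t1", "snap-bad", fail=True) is False
assert store.listing() == before

# A successful snapshot keeps the files alive after the table is dropped.
assert store.snapshot("t1", "snap-ok") is True
store.drop("t1")
assert sorted(store.listing()) == ["hfile-1", "hfile-2", "hfile-3"]

# Once the last reference goes away, nothing is left behind.
store.drop("snap-ok")
assert store.listing() == {}
```

The point of comparing complete listings, rather than checking for individual files, is that any unexpected leftover shows up as a difference without the test having to know in advance what the cruft would be called.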
-
Re: Let's discuss Snapshots Feature Testing
Ted Yu 2013-01-14, 19:01
Thanks for the write up.
Would the new tests be sub-tasks of HBASE-7290?

Cheers

On Mon, Jan 14, 2013 at 10:32 AM, Aleksandr Shulman <[EMAIL PROTECTED]> wrote:

> [...]
-
Re: Let's discuss Snapshots Feature Testing
Aleksandr Shulman 2013-01-14, 19:15
Yes, I am planning on filing a JIRA for that shortly.
-Aleks S.

On Mon, Jan 14, 2013 at 11:01 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Thanks for the write up.
>
> Would the new tests be sub-tasks of HBASE-7290?
>
> Cheers
>
> On Mon, Jan 14, 2013 at 10:32 AM, Aleksandr Shulman <[EMAIL PROTECTED]> wrote:
>
> > [...]

Best Regards,
Aleks Shulman
847.814.5804
Cloudera
-
Re: Let's discuss Snapshots Feature Testing
Andrew Purtell 2013-01-14, 23:15
Thanks for the writeup. Looks very comprehensive.
On Mon, Jan 14, 2013 at 10:32 AM, Aleksandr Shulman <[EMAIL PROTECTED]> wrote:

> Hi everyone,
>
> I'd like to start a thread about Cloudera's testing efforts on the upcoming
> snapshots feature. This is a new feature and it's important that we explain
> our testing efforts and get the community's opinion on what we'd all like
> to see tested. My hope is that from this discussion, we can get more ideas
> about what needs to be tested and gain confidence in the testing we have in
> place.
>
> [...]
> Non-unit testing:
> [...]
> Some of the things we have tried:
> -Long running tests: Run repeated snapshots while verifying that all is
> well.
>
> -Meanness tests:
> 1. Killing the master
> 2. Performing a compaction
> 3. Table enable/disable
>

4. Killing regionservers.

5. Killing datanodes.

6. Killing regionservers and datanodes together on a node.

7. During HA NN failover.

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
-
Re: Let's discuss Snapshots Feature Testing
Jonathan Hsieh 2013-01-15, 01:27
I think the killing data nodes and killing HA NN is out of scope from an HBase point of view.

I actually have been doing some system-level testing killing the meta RS and will later add a kill of the root RS.

Jon.

On Mon, Jan 14, 2013 at 3:15 PM, Andrew Purtell <[EMAIL PROTECTED]> wrote:
> Thanks for the writeup. Looks very comprehensive.
>
..
>> Some of the things we have tried:
>> -Long running tests: Run repeated snapshots while verifying that all is
>> well.
>>
>> -Meanness tests:
>> 1. Killing the master
>> 2. Performing a compaction
>> 3. Table enable/disable
>>
>
> 4. Killing regionservers.
>
> 5. Killing datanodes.
>
> 6. Killing regionservers and datanodes together on a node.
>
> 7. During HA NN failover.
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]
-
Re: Let's discuss Snapshots Feature Testing
Andrew Purtell 2013-01-15, 02:47
If a datanode goes down and it has an indirect bad effect on snapshots, this would be useful to know.

For the HA NN item, I threw that in there for completeness' sake. Ideally a client like HBase wouldn't notice.

On Mon, Jan 14, 2013 at 5:27 PM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:

> I think the killing data nodes and killing HA NN is out of scope from
> an HBase point of view.
>

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
-
Re: Let's discuss Snapshots Feature Testing
Jonathan Hsieh 2013-01-15, 08:55
My counter-argument here is that this would be a bug in HDFS as opposed to HBase. It is good to know, but ideally shouldn't be exposed at the HBase level. This test won't really make sense if there was a different FS underneath.

That said, if you insist we can add and report on this (lower priority than the hbase-level problems though).

Jon.

On Mon, Jan 14, 2013 at 6:47 PM, Andrew Purtell <[EMAIL PROTECTED]> wrote:
> If a datanode goes down and it has an indirect bad effect on snapshots,
> this would be useful to know.
>
> For the HA NN item, I threw that in there for completeness sake. Ideally a
> client like HBase wouldn't notice.
>
> On Mon, Jan 14, 2013 at 5:27 PM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
>
>> I think the killing data nodes and killing HA NN is out of scope from
>> an HBase point of view.
>>
>
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)

--
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]
-
Re: Let's discuss Snapshots Feature Testing
Nicolas Liochon 2013-01-15, 09:25
I would be +1 on killing datanodes during the tests. I think we tend to under-analyze the impact of an HDFS error in HBase.
See for example HBASE-6738 <https://issues.apache.org/jira/browse/HBASE-6738>: in the distributed log, we were considering a task as dead if the split was not done in 25s. If you were going to the dead DN to read the WAL, 25s was far from enough, and we were ending up doing the same split on multiple computers.

HDFS is a nice buddy, but it can't hide everything.

On Tue, Jan 15, 2013 at 9:55 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:

> My counter-argument here is that this would be a bug in HDFS as
> opposed to HBase. It is good to know, but ideally shouldn't be exposed
> at the HBase level. This test won't really make sense if there was a
> different FS underneath.
>
> That said, if you insist we can add and report on this (lower
> priority than the hbase-level problems though).
>
> Jon.
>
> On Mon, Jan 14, 2013 at 6:47 PM, Andrew Purtell <[EMAIL PROTECTED]>
> wrote:
> > If a datanode goes down and it has an indirect bad effect on snapshots,
> > this would be useful to know.
> >
> > For the HA NN item, I threw that in there for completeness sake. Ideally
> a
> > client like HBase wouldn't notice.
> >
> > On Mon, Jan 14, 2013 at 5:27 PM, Jonathan Hsieh <[EMAIL PROTECTED]>
> wrote:
> >
> >> I think the killing data nodes and killing HA NN is out of scope from
> >> an HBase point of view.
> >>
> >
> >
> > --
> > Best regards,
> >
> >    - Andy
> >
> > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > (via Tom White)
>
>
>
> --
> // Jonathan Hsieh (shay)
> // Software Engineer, Cloudera
> // [EMAIL PROTECTED]
>
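
To make the HBASE-6738 arithmetic in this message concrete, here is a small Python sketch of why a 25s timeout leads to the same split running on several machines when reading the WAL is slow. It is a toy model under stated assumptions, not the HBase split manager: the function name workers_started is hypothetical, every worker is assumed to need the same time, and the task is assumed to be handed to one more worker each time the timeout expires without a result.

```python
def workers_started(split_duration_s: int, timeout_s: int) -> int:
    """Count workers that start the same split before the first one finishes.

    Toy model (an assumption, not HBase behavior): workers start at
    t = 0, timeout_s, 2 * timeout_s, ... for as long as t is still
    less than split_duration_s, because no result has arrived yet.
    """
    if split_duration_s <= 0 or timeout_s <= 0:
        raise ValueError("durations must be positive")
    # Ceiling division on integers: the number of multiples of
    # timeout_s that are strictly below split_duration_s.
    return -(-split_duration_s // timeout_s)


# With a 25s timeout, a split that takes 60s because of a dead
# datanode is started by three workers in this model.
for duration in (10, 25, 60, 120):
    print(duration, workers_started(duration, timeout_s=25))
```

In this model a split that finishes within the timeout is done once, while one that takes 60s is started three times, which is the kind of duplicated work described for the dead-datanode case.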
-
Re: Let's discuss Snapshots Feature Testing
Jean-Marc Spaggiari 2013-01-15, 19:01
That's a long list of tests. Impressive.
I will add this one:

> -Meanness tests:
> 1. Killing the master
> 2. Performing a compaction
> 3. Table enable/disable

4. Moving a/some region/s while snapshot is running.

Killing the master or an RS will result in some regions being moved, but bulk imports followed by load balancing can also induce massive region moves...

JM

2013/1/15, Nicolas Liochon <[EMAIL PROTECTED]>:
> I would be +1 on killing datanodes during the tests. I think we tend to
> under analyze the impact on an HDFS error in HBase.
> See for example
> HBASE-6738<https://issues.apache.org/jira/browse/HBASE-6738>:
> in the distributed log, we were considering a task as dead if the split was
> not done in 25s. If you were going to the dead DN to read the WAL, 25s was
> far from enough, and we were ending up doing the same split on multiple
> computers.
>
> HDFS is a nice buddy, but it can't hide everything.
>
>
> On Tue, Jan 15, 2013 at 9:55 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
>
>> My counter-argument here is that this would be a bug in HDFS as
>> opposed to HBase. It is good to know, but ideally shouldn't be exposed
>> at the HBase level. This test won't really make sense if there was a
>> different FS underneath.
>>
>> That said, if you insist we can add and report on this (lower
>> priority than the hbase-level problems though).
>>
>> Jon.
>>
>> On Mon, Jan 14, 2013 at 6:47 PM, Andrew Purtell <[EMAIL PROTECTED]>
>> wrote:
>> > If a datanode goes down and it has an indirect bad effect on snapshots,
>> > this would be useful to know.
>> >
>> > For the HA NN item, I threw that in there for completeness sake.
>> > Ideally
>> a
>> > client like HBase wouldn't notice.
>> >
>> > On Mon, Jan 14, 2013 at 5:27 PM, Jonathan Hsieh <[EMAIL PROTECTED]>
>> wrote:
>> >
>> >> I think the killing data nodes and killing HA NN is out of scope from
>> >> an HBase point of view.
>> >>
>> >
>> >
>> > --
>> > Best regards,
>> >
>> >    - Andy
>> >
>> > Problems worthy of attack prove their worth by hitting back. - Piet
>> > Hein
>> > (via Tom White)
>>
>>
>>
>> --
>> // Jonathan Hsieh (shay)
>> // Software Engineer, Cloudera
>> // [EMAIL PROTECTED]
>>
>