We have a pretty huge test bucket for WebSphere eXtreme Scale. It has been growing for a couple of years now, and it currently takes a couple of days just for the automated FVT bucket to run; after that we do the testing that we haven't automated yet. Quality is a big deal in the XTP/DataGrid business, so it's always on my mind, more so than performance or function. Everyone wants perfect software with no bugs. We can strive towards that goal with larger and larger test case buckets written by developers, code reviews and so on, but sometimes bugs still get out. The bugs are harder to come by these days, and I'm finding that the study of epidemics is an interesting way to think about how testing happens.
This reminds me of a TED talk on how the World Health Organization eliminated smallpox. It was too expensive to immunize everybody. The approach they took instead was to look for outbreaks, surround each outbreak and vaccinate everybody within that area. Slowly but surely, the outbreaks faded away and today (hopefully) the world is smallpox-free.
So, you have a large piece of software, and you have developers making test cases by the thousands, plus system test scenarios as well. Despite all of this, every now and then a bug pops up. When this happens, we do forensics on the issue, review the code again and improve the test cases, but we also do scenario-specific testing around the use case where the fault occurred. This is additional testing, done because an 'outbreak' occurred. We build a test and then stress the system in the area where the fault occurred.
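To make the 'surround the outbreak' idea concrete, here is a minimal sketch in Python. It is not how our test bucket actually works; every name in it (`neighbors`, `stress_outbreak`, `toy_system`, and the `clients` and `batch_size` parameters) is made up for illustration. It assumes a failing scenario can be described by a few integer parameters, then runs every nearby combination several times and collects the ones that fail.

```python
import itertools


def neighbors(scenario, spread=1):
    """Yield every scenario within `spread` of the failing one, per parameter."""
    keys = sorted(scenario)
    ranges = [
        range(max(1, scenario[k] - spread), scenario[k] + spread + 1)
        for k in keys
    ]
    for combo in itertools.product(*ranges):
        yield dict(zip(keys, combo))


def stress_outbreak(run_scenario, failing_scenario, spread=1, repeats=3):
    """Run each neighboring scenario `repeats` times; collect the ones that fail."""
    failures = []
    for scenario in neighbors(failing_scenario, spread):
        for _ in range(repeats):
            try:
                run_scenario(**scenario)
            except Exception as exc:
                failures.append((scenario, repr(exc)))
                break  # one failure is enough to flag this scenario
    return failures


def toy_system(clients, batch_size):
    """Hypothetical stand-in for the system under test (assumption, not real code)."""
    if clients == 4 and batch_size >= 8:
        raise RuntimeError("lost update")


# The scenario from the (imaginary) bug report.
reported = {"clients": 4, "batch_size": 8}
failures = stress_outbreak(toy_system, reported)
for scenario, error in failures:
    print(scenario, error)
```

In this toy run the sweep flags the reported scenario and also a neighbor with `batch_size` 9 that nobody had reported. That second hit is the point of testing around the outbreak instead of only re-running the original failure.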
Everyone wants perfect software. In principle we could mathematically prove the software correct so that we know it's perfect, but that's not a realistic goal. Even the space shuttle, with its zero-fail engineering approach, still had failures. I think the approach we're taking is very like the smallpox approach, except that we're starting with a very well tested system. If we still find issues despite this, we make test cases surrounding the use case where the failure occurred, then stress it and tease out anything else that may be lurking. We use tools to closely examine the system while it's being stressed, hoping to find additional opportunities to improve the software.
The aircraft industry also uses epidemic-based testing. They have an amazing quality control system covering the design and operation of aircraft and DESPITE this, they still have crashes. When a crash occurs, they examine the wreckage, look for the fault and then improve the aircraft design around the area that failed. Then, they wait for the next fault...
This epidemic-based test methodology is everywhere now, whether we call it that or not. The cycle of surrounding the event, stressing it, improving it and then waiting for the next event seems to be the normal course of things, even for the most critical systems we create today.