Many years ago, fresh out of college, I landed a junior role as a mechanical engineer at a company that developed and produced clinical laboratory instruments. Shortly after I started, my then-manager told me that less than a year earlier, the company had begun receiving a steady stream of customer calls about failures in the latest product it had launched. The company recognized that these failures were not isolated incidents but symptoms of a more systemic issue. Instead of merely replacing components or subsystems reactively, it understood the need for a proactive approach to this growing concern, which led it to delve into the discipline of reliability analysis and launch a series of stress tests on the subsystems that kept failing in the field.

As it may be for most readers of this post, this whole reliability analysis thing was completely new to me. In college, I had learned about concepts like fatigue and cyclic stressing, but this seemed to go beyond that. So after my manager explained the main concepts behind the experiments they were carrying out, and with the motivation of a young man in his first job, I started digging into every piece of information available to me to learn about this new discipline.

I learned about correlation methods, distribution shapes, failures, suspensions... And shortly after that, I was traveling around Europe to receive training from some of the leading institutions in the world. As you can imagine, getting to know all those new concepts while the experiments were running in the lab, and being able to apply them firsthand, was extremely exciting. The pressure was on, too: after losing hundreds of thousands of euros to field replacements, recalls, and reputational damage, this "reliability thing" had to work.
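To give a flavor of what "failures" and "suspensions" mean in practice, here is a minimal sketch of fitting a two-parameter Weibull distribution to life data where some units failed on test and others were removed while still working (right-censored). The cycle counts, the use of Python/SciPy, and the maximum-likelihood approach are my own illustrative choices, not the actual data or tooling from that program:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical life data, in machine cycles: some units failed on test,
# others were taken off test while still working (suspensions / right-censored).
failures = np.array([152e3, 207e3, 231e3, 268e3, 310e3])
suspensions = np.array([250e3, 250e3, 320e3])

def neg_log_likelihood(log_params):
    # Work in log-space so both Weibull parameters stay positive.
    beta, eta = np.exp(log_params)
    # Failed units contribute the density f(t); suspended units the survival S(t).
    ll_failures = np.sum(np.log(beta / eta) + (beta - 1) * np.log(failures / eta)
                         - (failures / eta) ** beta)
    ll_suspensions = np.sum(-(suspensions / eta) ** beta)
    return -(ll_failures + ll_suspensions)

result = minimize(neg_log_likelihood,
                  x0=np.log([1.5, np.median(failures)]),
                  method="Nelder-Mead")
beta, eta = np.exp(result.x)
print(f"shape (beta) = {beta:.2f}, scale (eta) = {eta:,.0f} cycles")
# A shape parameter well above 1 suggests wear-out; below 1, infant mortality.
```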

Back to the experiments, I continued monitoring the performance of the tested subsystems. We were seeing failure modes of all kinds: breaking belts, failing bearings, faulty temperature sensors... you name it. We then implemented improvements to resolve those early failures: thicker belts, changes of materials, changes of part geometries, or even "simpler" changes like firmware updates, among others. And the beauty of it was being able to make purely objective comparisons between versions, quantifying their reliability and using those figures as one more input when weighing design decisions toward the best possible product.
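As a toy example of what such an "objective comparison" can look like, the snippet below contrasts two hypothetical belt designs through their fitted Weibull parameters, using the B10 life (the cycle count by which 10% of units are expected to have failed) and the reliability at a reference cycle count. The parameter values are invented purely for illustration:

```python
import numpy as np

def b_life(beta, eta, p=0.10):
    """Cycle count by which a fraction p of the population has failed."""
    return eta * (-np.log(1.0 - p)) ** (1.0 / beta)

def reliability(t, beta, eta):
    """Probability that a unit survives beyond t cycles."""
    return np.exp(-(t / eta) ** beta)

# Hypothetical fitted parameters (shape, scale) for two belt designs.
designs = {"original belt": (1.8, 180e3), "thicker belt": (2.4, 310e3)}

for name, (beta, eta) in designs.items():
    print(f"{name}: B10 = {b_life(beta, eta):,.0f} cycles, "
          f"R(200k cycles) = {reliability(200e3, beta, eta):.1%}")
```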

For some subsystems, it was harder to find the direction we had to take to improve their performance. Typically, when a mechanical part keeps breaking, you just assume the design was too flimsy and that it needs to be a bit sturdier, or have some reinforcements here and there. However, not all the design challenges were of this nature, nor could they be tackled along such a straightforward path, and a little more "debugging" and exploring were needed. But one by one, all the individual subsystems started to reach the reliability values we were aiming for. In some cases, that meant increasing their life expectancy beyond the required life of the whole product; in others, it meant ensuring their failure point would fall in a window of use cycles that optimized the design and allowed the component to be replaced during standard servicing of the device without causing any downtime for the customer.
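To illustrate that second strategy, here is a rough numeric check of a planned-replacement scheme: the component should very rarely fail before its scheduled swap at a service visit, even though it would not survive the full product life on its own. The Weibull parameters, service interval, and required product life below are all assumptions made up for this sketch:

```python
import numpy as np

# Assumed numbers for one component and one product, purely for illustration.
beta, eta = 3.0, 900e3      # fitted Weibull shape and scale, in cycles
service_interval = 300e3    # cycles between standard service visits
product_life = 1.2e6        # required life of the whole instrument, in cycles

def unreliability(t, beta, eta):
    """Probability that the component has failed by t cycles."""
    return 1.0 - np.exp(-(t / eta) ** beta)

print(f"P(failure before scheduled swap at {service_interval:,.0f} cycles) = "
      f"{unreliability(service_interval, beta, eta):.1%}")
print(f"P(failure before the full product life, if never replaced) = "
      f"{unreliability(product_life, beta, eta):.1%}")
```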

But after all the effort, the expensive training, the time, and the money that went into assembling the test rigs and the samples of all the studied subsystems, was it really worth it? The answer is "Absolutely yes!".

Internally, those of us who carried out the program were beside ourselves and felt a huge sense of accomplishment after turning around a product that had received such a negative reception from the market due to reliability issues. But as you can imagine, there's always someone in charge of staying on top of the economics of things like this. So an analysis was made, taking into account all the costs involved in the program: training, engineering hours, materials, samples, and so on. On the other side of the equation was only the money spent on servicing and replacing the failed components up until the improvements from the program were implemented. The outcome of the analysis? It had been a great investment! And bear in mind, that figure did not even account for other costs, like the damage the company suffered from letting down its customers and becoming an "unreliable brand," which hurt sales and cost it some clients (potentially forever).

So, as you can imagine, the program continued to run and eventually became a subdivision within the engineering department. Not only was it useful for reactively evaluating and improving failing subsystems that came back through customer claims and servicing inspections, but, more importantly, it became a core part of the development process. It was included at different stages of the design phase, providing great feedback for the optimization of systems as a whole. And needless to say, a debacle like the unreliable system launched just before I joined the company never happened again.