The difference between novices and experts in reliable systems!

Two or three weeks ago I read Michael Nygard's post “Failover: Messy Realities”.
It's very interesting to see that between the normal operating state, the failure state, and the normal state again there is a whole universe. In my own humble experience, I once had to explain a similar problem to my American colleagues. They had implemented a restart of the surgical hardware, to be triggered during an operation, as a workaround for a firmware failure! Each restart resulted in a memory leak and an unloaded dynamic library, and the hardware needed a restart roughly every 10 to 15 minutes. I can't tell you how hard it was to explain to them that we had a memory leak and a problem with instantiating the dynamic libraries that pass data from the hardware to the software application. That application is responsible for deciding whether to signal the surgeon that they are close to a spinal cord nerve! As far as I know, they had already performed one operation with the new system running in parallel with the old one, but I am not sure how safe that combination of old and new systems was. Probably the advice is this: know the dependencies of your system and its behavior during failure.

