Computing Memoir: Bayesian Inference on program crash

A company I was consulting for had been struggling with abnormal program crashes. The log showed that the program crashed just before accessing the I/O board, so some software engineers believed the I/O board was the culprit and asked for a replacement.

But it turned out that there were other steps before the I/O access that wrote nothing to the log, and one of them caused the crash without leaving a trace. When people did not see the I/O log entry, they concluded that the I/O had crashed the program before anything could be logged. It is a classic case of 'correlation is not causation'.

I thought about how to find the fault in a systematic way, read about Bayesian inference, and tried it here. See the Wikipedia article on Bayesian inference.

P( Bad IO | Crash ) = P( Crash | Bad IO ) * P( Bad IO ) / P( Crash )
P( Bad IO | Crash ) : The probability that the IO was bad, given that a crash happened
P( Crash | Bad IO ) : The probability of a crash, given that the IO misbehaves
P( Bad IO ) : The probability that the IO misbehaves
P( Crash ) : The probability of a crash

Say the program runs 100 times and crashes once. During those 100 runs, the IO is accessed perhaps 1000 times and misbehaves once. If the IO misbehaves, the program will definitely crash. Then,
P( Bad IO | Crash ) = 1 * 0.001 / 0.01 = 0.1
This says that P( Bad IO ) is too small to name the IO as the culprit. One way to increase the probability is to narrow the hypothesis, for example to 'Bad IO Writing'. IO writes may happen 500 times during the 100 runs and go wrong once. Then P( Bad IO Writing ) is 0.002 ( = 1 / 500 ), and the probability, or confidence, of 'Bad IO Writing' given a crash becomes 0.2.
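
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name posterior and the variable names are made up for illustration, and the counts are the guessed numbers from the example, not measurements.

```python
def posterior(likelihood, prior, evidence):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    if evidence <= 0:
        raise ValueError("evidence probability must be positive")
    return likelihood * prior / evidence


# Guessed numbers from the example above (assumptions, not measured data).
p_crash = 1 / 100            # 1 crash in 100 runs
p_crash_given_bad_io = 1.0   # a bad IO always crashes the program
p_bad_io = 1 / 1000          # 1 bad access in about 1000 IO accesses
p_bad_io_write = 1 / 500     # 1 bad write in about 500 IO writes

# Broad hypothesis: "Bad IO".
p_bad_io_given_crash = posterior(p_crash_given_bad_io, p_bad_io, p_crash)

# Narrower hypothesis: "Bad IO Writing", same likelihood and evidence.
p_bad_write_given_crash = posterior(p_crash_given_bad_io, p_bad_io_write, p_crash)

print(f"P( Bad IO | Crash )         = {p_bad_io_given_crash:.2f}")
print(f"P( Bad IO Writing | Crash ) = {p_bad_write_given_crash:.2f}")
```

With these inputs the two results come out as 0.10 and 0.20, matching the hand calculation. Only the prior changes between the two calls, which is why narrowing the hypothesis is what moves the result.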

To increase confidence in a hypothesis, the hypothesis has to be very specific; otherwise its prior probability, P( Bad IO ) here, is too small to be meaningful. The process of refining the hypothesis is the troubleshooting itself, I guess. The more specific the hypothesis, the better the chance of landing on a valid cause.

In the end, it took a keen eye to find the bug: code that was not safe to re-enter, running in a multi-threaded environment. Bayesian inference is not straightforward to quantify here, because it is hard to tell what the probability of a given hypothesis is.

Monday, June 3, 2019
