Troubleshooting 101
Posted on January 24th, 2010 (2 comments)
It doesn’t work…
Before you start: what are your primary motivations?
Very often the goals are antagonistic:
- Get it working again as quickly as possible.
- Diagnose the problem and find out exactly what went wrong.
Ideally you want to shut the whole factory down at the exact point of failure and then spend the next few days futzing around with the system, trying to figure out the problem. However, I’ve noticed that this approach causes factory managers to turn funny colours and get all shouty. If you can live with that, sure, go right ahead.
You can usually achieve only one of these goals at a time. With careful planning, however, you can achieve both. This requires upfront time and effort: designing your system to be robust in the face of failure, and to leave enough breadcrumbs to find out what led up to the failure. Convincing your company that this is important I’ll leave as an exercise for the reader.
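As a rough illustration of what "breadcrumbs" might look like in practice, here is a minimal Python sketch. It assumes you are happy to keep the last few log messages in memory and grab them when something blows up. The `BreadcrumbHandler` class, the `factory` logger name and the `run_step` function are all made up for the example; they are not from any particular system.

```python
import logging
from collections import deque


class BreadcrumbHandler(logging.Handler):
    """Keep only the most recent log messages in memory (hypothetical helper)."""

    def __init__(self, capacity=200):
        super().__init__()
        # A bounded deque drops the oldest entry once it is full.
        self.records = deque(maxlen=capacity)

    def emit(self, record):
        self.records.append(self.format(record))

    def dump(self):
        """Return the breadcrumbs leading up to a failure, oldest first."""
        return list(self.records)


def run_step(log, step, value):
    # Stand-in for real work; it fails when value is 0.
    log.debug("step=%s value=%r", step, value)
    return 100 / value


log = logging.getLogger("factory")
log.setLevel(logging.DEBUG)
log.propagate = False  # keep the example quiet on the console
crumbs = BreadcrumbHandler(capacity=3)
log.addHandler(crumbs)

trail = []
try:
    for step, value in enumerate([5, 4, 2, 1, 0]):
        run_step(log, step, value)
except ZeroDivisionError:
    # In a real system you would write this next to the error report
    # before restarting anything.
    trail = crumbs.dump()
```

The point of the bounded buffer is that the system can carry on (or be restarted quickly) while the last few steps before the failure are still available for the slow diagnosis afterwards.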
Some general troubleshooting rules of thumb:
- Test the simple things first
- Select the test that will provide you with the most information (eliminate as many other possibilities as possible)
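One way to picture "the test that gives you the most information" is bisection: if you have an ordered list of suspects and each test rules out half of them, you need far fewer tests than checking them one by one. The sketch below is only an illustration under a strong assumption, namely that everything before the first bad candidate is good and everything from it onwards is bad. The build numbers and the `is_bad` check are hypothetical.

```python
def find_first_bad(candidates, is_bad):
    """Return (index of the first bad candidate, number of tests run).

    Assumes the candidates are ordered good, good, ..., bad, bad, ...
    and that at least one of them is bad.
    """
    lo, hi = 0, len(candidates) - 1
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if is_bad(candidates[mid]):
            hi = mid        # the first bad one is at mid or earlier
        else:
            lo = mid + 1    # the first bad one is after mid
    return lo, tests


builds = list(range(1, 17))  # 16 hypothetical build numbers
first_bad, tests = find_first_bad(builds, lambda build: build >= 11)
```

Here four tests are enough to single out one build among sixteen, where testing each build in turn could take up to sixteen.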
Here are some small things to check:
- Is it plugged in?
- Is it turned on?
- Have you tried rebooting your computer?
- Be aware that rebooting the computer will most probably destroy the evidence of what caused the problem.
- You may have to reboot the computer for expediency, if lives are at risk or dollars will be lost.
- Can you easily reproduce the problem? Once you have reproduced the problem:
- Is it a hardware or a software problem?
- Try it on a different computer. If the problem persists, it’s probably software; otherwise, eliminate the hardware cause as soon as possible.
- Troubleshooting the hardware
- If you have intermittent crashes or blue screens check your memory and power supply.
- A flaky power supply can cause errors that are very hard to pin down, so it’s best to eliminate it as soon as possible.
- If it persists, start swapping out cards methodically
- Finally, replace the motherboard and CPU (with the same model)
- If it is software, then whose software is causing the problem? Is it your program, the operating system, drivers, libraries or the compiler? (It is almost always your program.)
- If it was working and now isn’t, you need to find out what changed.
- Has the user installed a new version of your software?
- Is there a new service pack or hotfix that could be interfering with your software?
- Is the user running some crapware or spyware or even legitimate software that is interfering with yours?
- Has the user changed their work process, so that they are now hitting different parts of your application?
- Once you have ascertained it is your software and you can reproduce the problem reliably, you now need to find it in your software and fix it.
- Apply the standard methods to find the bug in your source and fix it
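For the "what changed?" question in the list above, it helps if you recorded what the machine looked like while it was still working. The sketch below assumes you have two such snapshots as plain dictionaries mapping a component name to its version. The component names and version strings are invented for the example, and how you collect the snapshots is up to you.

```python
def diff_snapshots(before, after):
    """Return {name: (old_version, new_version)} for everything that differs.

    A version of None means the component was absent in that snapshot.
    """
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


# Hypothetical snapshots taken before and after the problem appeared.
working = {"myapp": "2.3.0", "os_service_pack": "SP2", "video_driver": "8.1"}
broken = {"myapp": "2.3.0", "os_service_pack": "SP3", "video_driver": "8.1",
          "toolbar_helper": "1.0"}

changes = diff_snapshots(working, broken)
```

In this made-up case the application itself has not changed, which points the finger at the new service pack or the newly installed helper program.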
Oh well I could write more about this, but I’m tired and I’m going to bed.
Good night.
2 responses to “Troubleshooting 101”
-
..and add heaps of logging!
-
Definitely!
Logging is an interesting topic:
What do you log?
How do you prevent the logging from affecting the system’s performance?
Managing the log files on the system.
Getting the logs back to the developers for analysis.
etc. etc.
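As one possible answer to the performance and log-management questions, here is a small sketch using Python’s standard `logging.handlers` module. The file writing is pushed onto a background thread with `QueueHandler` and `QueueListener`, and `RotatingFileHandler` caps the disk space used. The sizes, the `app` logger name and the temporary directory are assumptions picked for the example, not recommendations.

```python
import logging
import logging.handlers
import os
import queue
import tempfile

log_dir = tempfile.mkdtemp()  # stand-in for a real log directory
log_path = os.path.join(log_dir, "app.log")

# Cap disk usage: roughly 1 MB per file, keeping 5 old files.
file_handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=1_000_000, backupCount=5, encoding="utf-8")
file_handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))

# The application thread only puts records on a queue; a background
# thread owned by the listener does the slow file I/O.
log_queue = queue.Queue(-1)
listener = logging.handlers.QueueListener(log_queue, file_handler)
listener.start()

log = logging.getLogger("app")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(logging.handlers.QueueHandler(log_queue))

log.info("order %s processed", 42)
log.debug("not written, because the logger level is INFO")

listener.stop()       # processes whatever is still queued, then stops
file_handler.close()

with open(log_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
```

Getting the files back to the developers is a separate problem that this sketch does not touch.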