Noticing trouble

The more projects I finish, the more stuff can go wrong each day. At first, sending myself emails when problems arose seemed like a solution … now I get so many error mails that I have to filter them into separate folders.

I also get error mails from projects I have nothing to do with, and I need to discard those regularly or my mailbox screams “overflow” (don’t ask me why I have to get those emails; I believe someone is working on a solution).

I rarely scan those error mails for real trouble because, well, there is so much to do and there are always some errors or other …

Last week one of my applications exited with an OOM and a core dump. I didn’t notice until 24 hours later, when my boss asked why there were no data in those images (the weird thing is that the web app running in the same VM was still up, it seems, though that may have been after the daily app restart via cron; I may be wrong on that part …). Many people look at that application each day, and it seems very weird that nobody noticed it wasn’t running.

The funny thing is that this application is part of the monitoring we have in place for our other applications. Kinda stupid when the monitoring breaks. So now I have to find a simple and stupid way to monitor the monitoring and NOTICE when problems arise.

So my questions are: first, how to monitor this simply (I’ll probably track some logfile activity or some other “I am alive” signal with a cronjob; the server itself is being monitored by the sysadmins), and second, how to cut the mails I receive down to the ones that signal REAL trouble, or find another way to decide which is real trouble and which can be safely ignored.
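To make the logfile idea concrete, here is a minimal sketch of what such a cron-driven check could look like. It only tests whether a logfile has been written to recently. The path, the 15-minute threshold and the function name are placeholders I made up for illustration, not anything from the real setup.

```python
#!/usr/bin/env python3
"""Heartbeat check meant to be run from cron (all names are illustrative)."""
import os
import sys
import time

# Hypothetical values: point these at the real logfile and pick a threshold
# a bit longer than the application's normal gap between log writes.
LOGFILE = "/var/log/myapp/app.log"
MAX_AGE_SECONDS = 15 * 60


def is_stale(path, max_age_seconds, now=None):
    """Return True if the file is missing or was last modified too long ago."""
    if now is None:
        now = time.time()
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # A missing or unreadable logfile counts as "not alive".
        return True
    return (now - mtime) > max_age_seconds


def main():
    if is_stale(LOGFILE, MAX_AGE_SECONDS):
        # Print only on failure, so the check stays silent while all is well.
        print(
            f"ALERT: {LOGFILE} is missing or has not been written to "
            f"for over {MAX_AGE_SECONDS} seconds",
            file=sys.stderr,
        )
        return 1
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

Run from a crontab entry such as `*/5 * * * * /usr/bin/python3 /path/to/check_alive.py` (path again hypothetical), this produces output only when the application has gone quiet. Cron normally mails any output a job produces, assuming mail delivery is set up on the machine, so a healthy application generates no mail at all and this check does not add to the flood.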

So: how do you find out about problems with your projects?