Squashing bugs in SpikeMark (part 2 of 3)

Ok, I’m actually jumping over a more intricate bug description in this post just because I find this one too much fun to pass up.
I discovered a bug in SpikeMark that would spew out erroneous log message, but only after the computer running SpikeMark had been running for 25 days or more without a reboot. Just think about that…I know I’m going to sound like an old-timer here, but can you imagine in the old days trying to keep a Windows 98 machine running for 28 days without a reboot!
Don’t get me wrong, I leave machines running for months at a time too, just not our SpikeMark test machines so I hadn’t caught this bugger until last night when I received a huge, strange, log file from a theatre.
The error occurred in the chunk of code that logs position changes for motors. If a motor hasn’t changed position in 1 second, we log that position and wait for it to move again. Why the heck do we do that? Well, I’m sure it’s hard to imagine just how abstract an experience it is to mentally recreate a show with nothing but a log file, but that’s what we have to do every time a show has some strange behavior and mangles a scene change. The theatre sends us a log file and I study it trying to figure out what happened. Sometimes it was an operator error, sometimes a hardware fault, and sometimes a software or firmware bug. But in order to sift through the information and piece back together the odd things that occurred during last night’s tech rehearsal, it helps to know when the motors moved and where the ended up when they stopped moving.
The method for tracking motor position is pretty simple. Windows keeps a timer running from the moment you boot up your computer and every millisecond it increments the “tick” counter. Each time we ask a motor for its current position, we check to see if 1000 ticks (1 second) has passed since the last update and then check to see if the position is unchanged. If it is unchanged, we write the motor position to the log file, if it has changed the motor is still moving and honestly we don’t care until it stops so we just wait another 1000 ticks and check again.
That’s all fine and dandy accept that after 24.9 days the tick counter rolls over and becomes the MIN_VALUE for an integer. When that happens and we perform a little subtraction on the counter value, suddenly we have a value that is lower than the MIN_VALUE allowed for a number in memory and an error is created. But since this is all happening quickly, thousands of errors are generated and the log file reaches its file size limit and stops recording. So we get a log file of roughly 15 minutes of useless error messages and no interesting information.
Easy to fix, it’s already done and waiting to be rolled out, I just find it funny that we never checked to make sure SpikeMark wouldn’t have any trouble running for a month continuously. Guess I’m just too pessimistic to imagine that scenario…