Monday, July 28, 2008

Doing Something "Just Because We Can!"

Working on a distributed system like Live Mesh is a lot of fun, but tracking down certain types of bugs can be interesting, to say the least. Going from writing code that runs locally on a workstation to writing code that runs across several hundred machines is an extremely refreshing experience. You quickly learn to spend a good share of your time and energy building a solid logging infrastructure and leveraging it to diagnose problems. There are several other lessons too, but those deserve a series of posts of their own.

With a large volume of logs comes a large amount of interesting data, and an even larger amount of potentially interesting data. Good logs can go a long way toward helping you infer the scale, health, bottlenecks, usage, etc. of the system. Much time is spent automating such tasks so that useful reports can be generated from the logs. Every so often, someone (usually a non-technical person) asks for additional data that would take time, effort and money (in terms of resource utilization) to extract. However, almost all such requests omit what the requester hopes to conclude from the retrieved data. My instinct is always to push back on such requests, asking the requester to explain which questions the data will help answer or which conclusions it will help support. If the data does indeed help answer a crucial question, it's worth spending the time and money to mine the logs for it.

Very often, the reason for the request is "just because we can". In that case I believe the time, money and effort are unwarranted, unless you have people or machines sitting around doing nothing, and then you arguably have other problems.

Moral of the story: try not to ask someone else to do work (or spend your own valuable time doing work) without knowing what you intend to get out of it.
