
Regarding "Seek Out Problems and Iterate", the article understates how important this is. I've invested a lot of time helping my coworkers understand the distinction between tasks and problems, with the end goal of tracking only problems in the ticketing system. It's not easy to do this and it takes constant effort, but it pays off very quickly. I've yet to see a real "problem" ticket stay unresolved for a long time, whereas "task" tickets tend to stay around until they're either irrelevant or they get closed after getting kicked between a few people.

A good example of a task ticket is:

- Add worker thread for X to offload Y

When the actual problem is more along the lines of:

- Latency spikes on Tuesdays at 3pm in main thread

Which may be caused by a cronjob kicking off and hogging disk IO for a few minutes.

A good rule of thumb I've found is that task tickets tend to have exactly one way of solving them, whereas problem tickets can be solved in many ways.



Can you explain the 3pm on Tuesdays issue? My sister works for LLS and she said their servers get very slow at a precise time every Tuesday. Not saying it's the same bug, but what was the solution in your specific case?


The next sentence suggested that the cause of the problem in this probably hypothetical situation might be “a cronjob kicking off and hogging disk IO for a few minutes”.

So in that case, I guess either run the job with a lower priority and see if that helps, or execute the job more often so it doesn't have to catch up all at once, once a week, or rewrite it so that it performs I/O in smaller chunks and sleeps for a little while between reading or writing each chunk. Basically, do something so that you no longer have this one huge job consuming all of the IO bandwidth for several minutes every week.
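As a rough illustration of the "smaller chunks plus sleep" option, here is a minimal Python sketch. The function name, chunk size, and pause length are all made up for the example, and the right values depend entirely on the disk and workload in question; it only shows the shape of the idea, not how any real job was fixed.

```python
import time


def throttled_copy(src_path, dst_path, chunk_size=1024 * 1024, pause_seconds=0.05):
    """Copy src_path to dst_path in small chunks, pausing between chunks.

    Hypothetical helper: chunk_size and pause_seconds are illustrative
    defaults, not tuned values.
    """
    copied = 0
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # end of file
            dst.write(chunk)
            copied += len(chunk)
            # Yield the disk to other processes before the next chunk.
            time.sleep(pause_seconds)
    return copied
```

The job takes longer overall, but the IO is spread out instead of arriving as one burst, which is the trade-off being suggested.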


I can't get into too much detail, but there were increased failure rates during a few jobs. In one case, we added ionice. In another it was a matter of adding a missing index to the DB (full table scan instead of looking at records from the last week).

There was one periodic job that we moved from the production server to work off the daily backups instead of the live server.
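For the missing-index case, a sketch of how one might confirm the full table scan, using SQLite purely because it ships with Python. The table, column, and index names are invented for the example, and the actual database involved is not stated; other databases have their own EXPLAIN output.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's EXPLAIN QUERY PLAN detail text for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # Each row is (id, parent, notused, detail); keep the detail column.
    return " | ".join(row[3] for row in rows)


# Hypothetical schema standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT, payload TEXT)"
)

# "Records from the last week" style query, filtered on a timestamp column.
sql = "SELECT id, payload FROM events WHERE created_at >= ?"

plan_before = query_plan(conn, sql, ("2024-01-01",))

# Add the missing index on the column used in the WHERE clause.
conn.execute("CREATE INDEX idx_events_created_at ON events (created_at)")

plan_after = query_plan(conn, sql, ("2024-01-01",))

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index the plan should report a scan of the whole table; afterwards it should report a search using the new index.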


Database doing some housekeeping or backup; virus scan; perhaps an automated check for Windows updates (Patch Tuesday is the second Tuesday of every month, so probably not that); a completely separate task fighting the DB or other application layer your sister uses. Something else. Anything else.

It's not something anyone can diagnose from what you say. It could be anything, even weirdness such as a hardware fault kicked off by something else (office cleaner plugging something in?) causing a power spike and RF interference that affects the network, leading to mass packet drops and retries (OK, unlikely, but not impossible; I've heard of such things).



