Friday, 2 May 2014

Watch 'Crazy House', on Friday, after 3 pm, at your work.

If you are lucky, while working in an open-plan office you may sometimes see people running back and forth, bumping into each other and behaving exactly like headless chickens - especially after 3 pm on Friday ;) The reason for that somewhat peculiar picture is either somebody's birthday (i.e. cookies and doughnuts) or a production issue ;) Check your mailbox, and if there is no email about doughnuts, it has to be an issue on prod!


Usually, managers refrain from communicating a problem to the business before they have some initial knowledge about it. However, do not be misled by the lack of an email. That state will not last long. That is actually the time when managers start to create an Asterix and Obelix 'Crazy House' cartoon atmosphere and harass developers, L3 or any other support team. Of course, they think they are doing their best to behave in a calm and professional way, but subconsciously they do what all people do when they feel threatened - they cluster into harassment groups behind the back of the (un)lucky developer, creating totally unnecessary pressure.

Then comes the usual cannonade of questions: where are we, what do we know, what is the risk, what is happening, what are our options etc. You may have an ironic answer for all of them, but you must keep a poker face.



In fact, calmness is your blessing. The calmer and more methodical your approach, the sooner you will be at home sipping whisky. So calm down and follow the steps below:
  1. Ask for a little bit of context and clarification of what they think all the fuss is about.
  2. Do not easily believe what the 'headless chickens' are saying ;) Check it yourself. If they knew what was going on, they would not harass you ;) Simply ask for the evidence on which they raised the alert. It might be a monitoring screenshot, a DB query, logs, whatever. You may see something they have missed. Also, it is pretty common for people to misinterpret what they see and hit the 'panic alarm' straight away. It happens; you must live with that.
    You may also ask a couple of context-enriching questions like 'who said that' and 'based on what premises do you think so'.
  3. If you really can smell the issue, do an initial investigation and try to estimate:
    - impact
    - risk
    - it would also be marvelous if you could apply some trend checking here.
    If possible, check whether the same or a similar situation has happened in the past. Perhaps it is regular behaviour.
  4. If something similar has happened in the past, try to dig out the knowledge about that issue and its remediation. A JIRA ticket, a conversation with peers in the project, basically anybody or anything might be helpful.
  5. If not, then it is a genuine issue, which has to be investigated.
  6. If you finally come across a solution and it involves a fix on prod, try to mitigate its risk by making the smallest possible change.
    Remember: small change = small impact (according to the definition of a stable system).
    Assess the minimal, optimal and maximal downtime of the system vs. your SLAs. Communicate it clearly to your business stakeholders, prod services and managers.
    Also, do not let them suggest a solution unless it has a reasonable basis.
  7. Test your solution on a prod-like environment (including starting and stopping your app) and write down the full sequence of steps. Do not let anybody add, change or remove any of the steps without agreeing it with devs and prod services and testing the change properly.
    It's better to have a broken system where you know what is going on than a possibly fixed one where nobody is entirely sure if and how it's gonna work.
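
The "downtime vs. SLAs" check from step 6 is simple arithmetic, so it is easy to script before you talk to stakeholders. Below is a minimal sketch in Python, not a definitive tool. Everything in it is an assumption made for illustration: the names `DowntimeEstimate`, `monthly_downtime_budget_min` and `fits_sla`, the idea that your SLA is expressed as a monthly availability percentage, and the 30-day month.

```python
from dataclasses import dataclass


@dataclass
class DowntimeEstimate:
    """Hypothetical container for the three estimates from step 6, in minutes."""
    minimal_min: float
    optimal_min: float
    maximal_min: float


def monthly_downtime_budget_min(sla_availability_pct: float,
                                days_in_month: int = 30) -> float:
    """Minutes of downtime allowed per month by an availability SLA.

    Assumes the SLA is a plain percentage over the whole month,
    e.g. 99.9 means the system may be down 0.1% of the time.
    """
    total_min = days_in_month * 24 * 60
    return total_min * (100.0 - sla_availability_pct) / 100.0


def fits_sla(estimate: DowntimeEstimate,
             sla_availability_pct: float,
             already_used_min: float = 0.0) -> dict:
    """For each scenario, report whether it fits the remaining downtime budget."""
    remaining = monthly_downtime_budget_min(sla_availability_pct) - already_used_min
    return {
        "remaining_budget_min": remaining,
        "minimal_ok": estimate.minimal_min <= remaining,
        "optimal_ok": estimate.optimal_min <= remaining,
        "maximal_ok": estimate.maximal_min <= remaining,
    }


if __name__ == "__main__":
    # Illustrative numbers only: a fix that takes 5, 15 or 60 minutes of
    # downtime, a 99.9% SLA, and 10 minutes already burned this month.
    report = fits_sla(DowntimeEstimate(5, 15, 60), 99.9, already_used_min=10)
    for key, value in report.items():
        print(key, value)
```

If the maximal scenario does not fit the remaining budget, that is exactly the thing to communicate clearly before anybody touches prod.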