Posts

Showing posts from February, 2015

New Relic and Statuspage.io integration

Monitoring tools expose a lot of data, and we use Nagios, Cacti, Graphite, New Relic, Mixpanel, Flurry, Boundary and many more tools.  But one of the asks from the Support and Marketing teams is how they can know internally if something is wrong. We can't expect them to wade through so many systems and so many applications to make sense of what is operational and what is not.  For example, we use a lot of services to serve the cloud filer server solution; this is the first page of the status of our services in New Relic, and it spans 2 pages. The Support team and management rely on the Operations team to notify them if an issue is ongoing. To make this easy I did a proof-of-concept integration of the application responsible for serving the main website with Statuspage.io.  The idea is simple: Create public metrics in Statuspage.io that are human readable. Query New Relic and various systems for application status. Map New Relic green/red/yellow status and other system status to Statuspage.io status. If the
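The mapping step in that idea can be sketched roughly as follows. This is only an illustration, not the code from the proof of concept: the function names, the fallback for unknown colors, and the "worst status wins" roll-up are all assumptions, and the actual calls to New Relic and Statuspage.io are left out. The target values are Statuspage.io component statuses.

```python
from typing import Iterable

# Statuspage.io component statuses, ordered from best to worst.
SEVERITY = ["operational", "degraded_performance", "partial_outage", "major_outage"]

# Assumed mapping from a New Relic health color to a Statuspage.io status.
HEALTH_TO_STATUS = {
    "green": "operational",
    "yellow": "degraded_performance",
    "red": "major_outage",
}


def to_statuspage_status(health: str) -> str:
    """Map one health color to a Statuspage.io component status.

    Unknown colors fall back to "degraded_performance" (an assumption;
    a real integration might prefer to alert the ops team instead).
    """
    return HEALTH_TO_STATUS.get(health.strip().lower(), "degraded_performance")


def rollup(healths: Iterable[str]) -> str:
    """Collapse the colors of several backing services into one status.

    The worst individual status wins; an empty input is treated as
    "operational" (also an assumption).
    """
    worst = "operational"
    for health in healths:
        status = to_statuspage_status(health)
        if SEVERITY.index(status) > SEVERITY.index(worst):
            worst = status
    return worst
```

For example, `rollup(["green", "yellow", "green"])` gives `"degraded_performance"`, so a single human-readable component can stand in for a whole page of service statuses.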

Move fast break things but with monitoring

We run a complex system with multiple services, and every 2 or 3 weeks we update the Java applications.  I want to do it every week, as most applications are stateless and can be patched anytime, but the application serving the main website uses sticky sessions. We are working to make it fail over sessions; once we do that, we can do midweek deployments, and that will allow us to go faster than every 3 weeks.  This week I pushed a huge infrastructure change related to user ID generation. I had asked the ops team to check the status of New Relic after the midnight deployment and it looked like this, so everyone was happy. I woke up and checked the New Relic mobile app and things looked OK to me. After finishing my morning routine I ran my daily exception report, and one thing that caught my eye was 90K exceptions in the last 12 hours in one of the files I had changed.  To gauge the impact I went into New Relic and it showed me an error rate of 0.07 in one of the apps. I then checked New Relic and I

Email slavery

It seems I have become an EmailSlave. The first half of the day is spent just answering emails. There are so many emails where I am copied but need not be. There are many emails that run 1-2 pages, and somewhere down the thread someone says @KP please answer this.  So it seems my daily work schedule is: Sign in to New Relic and check anomalies for 15 min.  Check emails related to the production exception report, and yes, there are a ton of these reports daily. I need a better tool here, as this model is not scalable. I need to reduce the data coming at me so that I only see relevant data, like what New Relic does. Maybe I need to create a webapp out of these emails. Check emails for the next few minutes before team calls. Do team calls. Then go back to checking emails until I have taken my best shot at answering everyone waiting for my reply. Attend team meetings on Tue/Thu. Being an architect and coder at heart, I don't feel satisfied at the end of the day if there is nothing tangible getting d