
Posts

Showing posts from 2014

Webhooks and integrating Assembla svn commits with Pivotal Tracker stories

I am intrigued by the webhook concept; it seems a very nice mechanism for B2B communication. Webhooks are powerful: they eliminate polling when integrating with third parties. All you need is to register some REST API that gets called when an event occurs. A good example of a webhook for a cloud storage provider would be "automatically print this document on registered printers when a file is dropped in this folder". Let's say all the customer needs to do is register a webhook like "http://xyz.foo.com/printer/xxd344/print?token=authTokenXXX", and the cloud storage provider can then call this URL and POST the body of the document as input. Recently I had a chance to play with webhooks when I was moving Jenkins to EC2 for a friend, and as part of that I moved his svn to assembla.com. I saw webhooks there and thought I could integrate commits to the svn hosted at assembla.com with pivotaltracker.com tickets. It took just 1 hour and it was fun; apparently there are post c…
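Roughly, the receiver can look like the minimal sketch below. It assumes Assembla POSTs the raw commit payload to the URL you register and that commit messages reference stories as "#12345"; the Tracker endpoint path, project id and token are hypothetical placeholders, so treat this as an illustration rather than the exact code:

    import com.sun.net.httpserver.HttpServer;
    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class CommitWebhook {
        static final String TRACKER_TOKEN = "authTokenXXX";          // hypothetical token
        static final String PROJECT_ID = "1234567";                  // hypothetical project
        static final Pattern STORY_REF = Pattern.compile("#(\\d+)"); // "#12345" in a commit message

        public static void main(String[] args) throws IOException {
            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            // Assembla calls this URL on every commit (this is what you register as the webhook)
            server.createContext("/svn-commit", exchange -> {
                String body = new String(exchange.getRequestBody().readAllBytes(),
                        StandardCharsets.UTF_8);
                // for every story id mentioned in the payload, add a Tracker comment
                Matcher m = STORY_REF.matcher(body);
                while (m.find()) {
                    postComment(m.group(1), "svn commit: " + body);
                }
                exchange.sendResponseHeaders(204, -1); // acknowledge with no response body
                exchange.close();
            });
            server.start();
        }

        static void postComment(String storyId, String text) {
            try {
                String json = "{\"text\":\"" + text.replace("\\", "\\\\")
                        .replace("\"", "\\\"").replace("\n", "\\n") + "\"}";
                HttpRequest req = HttpRequest.newBuilder()
                        .uri(URI.create("https://www.pivotaltracker.com/services/v5/projects/"
                                + PROJECT_ID + "/stories/" + storyId + "/comments"))
                        .header("X-TrackerToken", TRACKER_TOKEN)
                        .header("Content-Type", "application/json")
                        .POST(HttpRequest.BodyPublishers.ofString(json))
                        .build();
                HttpClient.newHttpClient().send(req, HttpResponse.BodyHandlers.discarding());
            } catch (Exception e) {
                e.printStackTrace(); // a real integration would retry or queue the failure
            }
        }
    }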

Data-driven performance issues and New Relic

New Relic really shines at discovering these data-driven performance issues. Earlier we would find them late, or they would stay buried, but now they seem obvious if the engineer is paying attention. I was casually trawling New Relic and sorted all apps by average time per API call, and one of our core applications in one DC was taking twice the average time per call compared to all other DCs. I immediately compared that DC with the others, and the DC1 graph versus the DC2 graph made it clear that DC1 was spending an abnormal amount of time in the database. So I went to the database view, and again DC1 looked nothing like DC2; clearly something was weird in DC1 even though it runs the same codebase. 309K queries per minute seemed abnormal. Within 5 minutes I found out it was an N+1 query problem. Apparently some customer has 4,000 users and has created 3,000 groups, so the group_member table has 40K rows for this customer. Normally our customers create 10-50 groups and there is a…
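To make the shape of the problem concrete, here is a sketch in plain JDBC; group_member is from the incident above, everything else (table layout, method names) is hypothetical. The first method is the N+1 pattern New Relic surfaced, the second is the set-based fix:

    import java.sql.*;
    import java.util.*;

    public class GroupMemberQueries {
        // The N+1 shape: one query per group. With 3,000 groups that is
        // 3,000 extra round-trips per request, which is what showed up as
        // 309K queries per minute.
        static Map<Long, List<Long>> membersNPlusOne(Connection c, List<Long> groupIds)
                throws SQLException {
            Map<Long, List<Long>> membersByGroup = new HashMap<>();
            try (PreparedStatement ps = c.prepareStatement(
                    "SELECT user_id FROM group_member WHERE group_id = ?")) {
                for (long groupId : groupIds) {
                    ps.setLong(1, groupId);
                    try (ResultSet rs = ps.executeQuery()) {
                        List<Long> users = new ArrayList<>();
                        while (rs.next()) users.add(rs.getLong(1));
                        membersByGroup.put(groupId, users);
                    }
                }
            }
            return membersByGroup;
        }

        // The fix: one set-based query (an IN list, or equivalently a JOIN),
        // so the whole lookup is a single round-trip.
        static Map<Long, List<Long>> membersBatched(Connection c, List<Long> groupIds)
                throws SQLException {
            String placeholders = String.join(",", Collections.nCopies(groupIds.size(), "?"));
            Map<Long, List<Long>> membersByGroup = new HashMap<>();
            try (PreparedStatement ps = c.prepareStatement(
                    "SELECT group_id, user_id FROM group_member WHERE group_id IN ("
                    + placeholders + ")")) {
                for (int i = 0; i < groupIds.size(); i++) ps.setLong(i + 1, groupIds.get(i));
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        membersByGroup.computeIfAbsent(rs.getLong(1), k -> new ArrayList<>())
                                .add(rs.getLong(2));
                    }
                }
            }
            return membersByGroup;
        }
    }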

AWS and the rise of devops

I always used to wonder how Snapchat, Pinterest and Instagram were able to scale to millions of users with just 10-15 engineers. I am a Java architect, but when it comes to networking, operations and other such stuff I am a noob beyond basic skills. Recently our ops team did some subnet changes and some IP changes and added a 10G network between some services. All of this is a grey area to me, and my reaction was "you really need to hire operations people for this", so how did those other startups do it without so many people? One of my friends had been after me for months to help him move his Jenkins servers from Ukraine to EC2, as Ukraine is in turmoil. I have no ops expertise, so this was tricky, but here is how I got it done over 2 weekends, while Dallas was freezing due to a cold front and I don't have a driver's license due to an immigration fiasco by USCIS; my friend really benefited because I had nothing else to do on weekends. I took a vanilla CentOS AMI and launched an instance in EC2. But when launching it aske…
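For reference, launching a similar instance from the AWS CLI looks roughly like this; the AMI, key pair, security group and subnet IDs below are hypothetical placeholders for whatever values apply to your account:

    aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type m3.medium \
        --key-name my-keypair --security-group-ids sg-xxxxxxxx --subnet-id subnet-xxxxxxxx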

A well-intentioned public API can bring down a server

APIs are powerful creatures and people can use them to do tons of weird things. We had exposed a public API to create a link, and since our UI had the capability to select multiple files and generate links for them in bulk in one call, our public API mimicked this behavior. Today a server was running hot with full GCs; I took a heap dump and restarted it. Upon analysing the heap dump in Eclipse Memory Analyzer I found that the syslog appender was choked, with a queue of 10K messages, each 2MB in size. I copied a value out and found the class name in the log message. Apparently whenever a link was created, a log line was written in a loop over each file, but there was a bug: each iteration logged the entire event instead of the individual file.

    for (target in linkRequest.getTargets()) {
        logger.info("queuing preview generation for {}", event);
    }

QA and developers can't easily detect this, and most people in code review pay little attention to logger messages.
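Once spotted, the fix is a one-word change; a sketch of the corrected loop (same hypothetical names as above), or better still, one summary line per request so a bulk call can't flood the appender at all:

    for (target in linkRequest.getTargets()) {
        logger.info("queuing preview generation for {}", target); // log the item being iterated
    }
    // or collapse the loop into a single line:
    logger.info("queuing preview generation for {} targets", linkRequest.getTargets().size());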

Paralysis by Analysis

A lot of times we run into paralysis by analysis. We use Lucene for indexing cloud storage data; it's a very old home-grown system and we are soon going to retire it and replace it with Elasticsearch. Recently we ran into a paralysis-by-analysis situation where we were waiting for the Elasticsearch project to go live to implement Folder Search, and we debated it for months while customers were cribbing for it. The move to Elasticsearch requires migrating billions of files and petabytes of content, which can take months. Finally I got approval that instead of waiting for Elasticsearch, we would do a simple canonical-prefix-based search directly out of the database. It took 2 days to implement the server API and 1 week for UI, plumbing work and QA. As we use MySQL sharding, the solution scaled fine; all I needed to ensure was that the proper indexes were there. Two days later I got an email from Support that customers are in love with Folder Search :). Of course this is not a perfect solutio…
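The server side really is as boring as it sounds; a minimal sketch in JDBC, assuming a hypothetical files table that stores each file's canonical folder path. The point is that a left-anchored LIKE can walk an ordinary B-tree index, so with the tenant key leading the index it stays fast on every shard:

    import java.sql.*;
    import java.util.*;

    public class FolderSearch {
        // Hypothetical index backing the search:
        //   ALTER TABLE files ADD INDEX idx_customer_path (customer_id, canonical_path);
        // Note: escape % and _ in user-supplied prefixes before building the pattern.
        static List<String> search(Connection c, long customerId, String folderPrefix)
                throws SQLException {
            List<String> paths = new ArrayList<>();
            try (PreparedStatement ps = c.prepareStatement(
                    "SELECT canonical_path FROM files "
                    + "WHERE customer_id = ? AND canonical_path LIKE ? LIMIT 100")) {
                ps.setLong(1, customerId);            // shard/tenant key first
                ps.setString(2, folderPrefix + "/%"); // left-anchored only; a leading wildcard kills the index
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) paths.add(rs.getString(1));
                }
            }
            return paths;
        }
    }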

Thinking of moving the blog to Quora or starting a new public blog

I am reading a book on inbound marketing, http://www.amazon.com/Inbound-Marketing-Attract-Delight-Customers/dp/1118896653, by the HubSpot founders, and it clarified some of my concepts. I intend to finish the book; it's just that in the age of Quora, Twitter and TechCrunch you get easily distracted, so I am reading at a turtle's pace. Even Netflix and cable TV have lost their charm for me. One reason I ordered a hard copy instead of the Kindle edition was so I don't get distracted and can read it with focus, but nowadays I feel like Quora is much more interesting than reading books. But books have their own importance: Quora and blogs are random reading with no pattern where you pick up some pointers, whereas books are more structured reading, so on any day I would prefer a book over blogs if I have to study a concept in detail, and call me old school but I still prefer a paper copy over an ebook. Anyway, coming back to the topic: I still get close to 50K hits per year on this blog, so I am thinking how I can apply conce…

Using email to manage work queue can be deceptive

I loathe email threads that have been replied to more than 5-6 times, and I am sometimes copied on emails with 20 replies. Today I have an email about post-commit reviews which has 20 replies. Grrr. I procrastinated the whole day over reading it, and it looks like I am not going to read it today. The problem with a 20+ reply email is that by the time you finish reading all 20 of them, you have lost the context of the final picture. I saw this infographic from 2012, http://mashable.com/2012/11/27/email-stats-infographic/, claiming that 28% of work time is spent reading emails. For me it may be 50% or more, as I work from home and most communication happens over email, except the very few things that happen over Skype or phone calls. I would love to explore tools to save time. I have found 2 new tools in Gmail so far, "Canned Responses" and the "Mark as Read" button; let's see if they help. For example, I would be copied on an email "23.7.3 CL- 120157 on all Appnodes and EOS nodes in SJC, AVL and AM2…

How to discover JS errors in production code in 5 min

Today I was pulled into a customer call where the customer complained that the browser froze for him and the UI kept showing "Processing...". As usual I was asked to hunt down whether it was a server issue or whether this customer had hit any errors. I checked in New Relic and in the haproxy logs but didn't find any. Then it occurred to me: what if it's a JS error? There was no way to find that out. Then I remembered that New Relic had a beta product to detect JS errors. I went and enabled it under Browser->AppName->Settings, and within 10 minutes the errors started flowing. The good news is that there aren't that many errors, though over an entire day there were almost 150 of them. For a Java guy this is good info, and now I can hunt down the UI team to fix them.

Subclipse 1.6 and svn 1.8 upgrade

I use Subclipse in addition to the Subversion command-line client to work on the same repo. Subclipse used to be great for doing moves/renames in Eclipse and adding new files without dropping to the command line. Recently operations upgraded svn to version 1.8 and somehow Subclipse broke; it kept giving org.tigris.subversion.javahl.ClientException: svn: URL 'svn+ssh://svn@svn.xxxx.com/repos/trunk' non-existent in revision '120,109'. At first I thought it was an Eclipse issue, because I was able to connect fine to some other public svn repo. Then I found out that the public repo was on svn 1.7 while our svn server had been upgraded to 1.8, and I had the Subclipse 1.6 client connector. I upgraded Eclipse and reinstalled Subclipse, but no luck. Finally I gave up on Subclipse and installed Subversive with the 1.8 connector. Subversive worked fine except that it wouldn't show Team->Share Project for one repo; restarting Eclipse and other tricks didn't help. Finally I figur…

New Relic speed index comparison

If you are a SaaS company then it's good to know how you fare against the market in terms of SLA, speed index, error rate and other attributes. It seems New Relic has a way to share your speed data anonymously with others so you can see how you compare. It turns out our startup is in the 95th percentile in terms of server speed, yet even though we had 99.979% availability we are only in the 54th percentile there, with others in the SaaS B2B world doing far better than us. The error rate is high because of an almost obsolete backup product that skews the numbers, so I would ignore it for the time being. What stands out, though, is that web page response time for our single-page app is in the 11th percentile. That sucks; I will chase down what is going on in the coming days.

transposing a video from horizontal to vertical

My wife got the ALS challenge and I took a video of her getting drenched in ice using an iPhone. The only issue was that somehow I tilted the phone while shooting, so the video came out horizontal. If it were an image, many image editors would transpose it for you; the challenge is that I can't ask her to get drenched in ice again, and I don't want her mad at me for ruining her video (yeah, I know, the Facebook generation needs everything to be perfect if it's going on Facebook). So now I need to transpose the video, and it seems you can do it using ffmpeg on Ubuntu. The first try was

    ffmpeg -i ~/Desktop/IMG_0888.MOV -vf "transpose=1" -r 30 -sameq -acodec copy ~/Desktop/test.MOV

but a 90MB file became 350MB, which would be impossible to post on FB within an hour. So I gave it a second try using

    ffmpeg -i ~/Desktop/IMG_0888.MOV -vf "transpose=1" -b 25000000 -sameq -acodec copy ~/Desktop/test1.MOV

where some site said you take 1G and divide it by the no of seconds in your video, and that would be the bit rate y…
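For what it's worth, -sameq was removed in later ffmpeg releases, so on a current build the usual way to keep the size sane without computing a bitrate by hand is a constant-quality encode; a sketch (output name hypothetical):

    ffmpeg -i ~/Desktop/IMG_0888.MOV -vf "transpose=1" -c:v libx264 -crf 23 -c:a copy ~/Desktop/fixed.MOV

Lower -crf means higher quality and a bigger file; 23 is the libx264 default.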

Do we scale?

For sure we scale: I got a report from New Relic covering 2 data centres showing that even when traffic increases 5 times, response time only increases about 40%. Having this kind of visibility is great for keeping a 99.99% SLA.

Proactive production support by trusting data

Five years ago, when I joined my employer, we were in reactive mode when it came to production performance. Nodes were going down daily, and we would jump to fix that fire and then move on to the next one. The problem with this approach was that we were always busy, with even nightly calls for customers in Europe. A similar pattern held for production support: many issues became known only when a customer reported them. Again this meant we would get calls on weekends and had to fix issues ASAP, which sometimes meant putting on a band-aid. Sometimes issues would linger in production unnoticed for 1-2 weeks, and when a customer finally reported one, it was a lengthy and arcane process to hunt for it in the existing logs. More than debugging skill, this required a lot of patience and persistence to look for trends and build a timeline of events to find the root cause; only a few engineers were willing to do so. Over the course of the last few years we became better at production s…