Life with New Relic

Improving performance is awesome. As a web developer, there is a unique satisfaction in knowing a page now loads in half the time, or spends less time accessing databases. However, improving a page takes time that could have been spent making new features or wowing the user in more obvious ways. In addition, optimizing pages can be tricky. Sometimes you get to be the hero, but then there are those days… You know the kind: you fix a particularly slow page only to learn that it now loads in 6.0 seconds instead of 6.1. Oops. Then, it turns out there were only 5 visits to the page in the last month. Double oops. Finally, in order to “fix” the page, the code mutated from something straightforward into a creation that is strikingly similar to spaghetti.

This was our world as of last year. We had home-brewed approaches to gathering basic performance information. We knew which pages saw the most use, and we knew whether they were getting slower or faster. However, we didn’t know why they were getting slower or faster. We could improve our performance, but it was an involved process.

Enter New Relic!

About this time we started looking into New Relic, which monitors application performance. Their code is easy to set up and reports on a number of metrics. Our favorites were the slowest requests on the servers, the slowest database queries (and tables), traces of particularly slow web requests, and the aggregate errors reported across our servers. There are plenty more features beyond these.


The overview of our site’s performance in New Relic

 

Initially, we had a couple of reservations about using New Relic. Mainly, we were concerned with how it would affect our website’s performance. After all, New Relic adds a wrapper around our code and sends performance data from our servers to the New Relic server. We investigated and found that New Relic should add less than 5% to our memory and CPU usage. In addition, the data passed to New Relic is sent asynchronously. If the New Relic server is down, the local New Relic monitor stores data for up to 15 minutes and then starts dropping data. With these answers in hand, we were willing to take New Relic for a spin.
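To make the "store for up to 15 minutes, then drop" behavior concrete, here is a minimal sketch of a time-bounded buffer. This is not New Relic's implementation: the class name MetricBuffer, its methods, and the choice to drop the oldest samples first are all assumptions made for illustration.

```python
from collections import deque


class MetricBuffer:
    """Hypothetical local buffer for samples while the remote server is down.

    Samples older than max_age_seconds are dropped, oldest first
    (an assumption; the real agent may behave differently).
    """

    def __init__(self, max_age_seconds=15 * 60):
        self.max_age_seconds = max_age_seconds
        self._samples = deque()  # (timestamp, payload), oldest first

    def record(self, timestamp, payload):
        """Store a sample and discard anything that has aged out."""
        self._samples.append((timestamp, payload))
        self._drop_expired(timestamp)

    def _drop_expired(self, now):
        while self._samples and now - self._samples[0][0] > self.max_age_seconds:
            self._samples.popleft()

    def flush(self, send, now):
        """Send buffered payloads in order; stop at the first failure.

        `send` is a callable that returns True on success. Returns the
        number of payloads sent. Unsent payloads stay in the buffer.
        """
        self._drop_expired(now)
        sent = 0
        while self._samples:
            _, payload = self._samples[0]
            if not send(payload):
                break  # server still unreachable; retry on the next flush
            self._samples.popleft()
            sent += 1
        return sent

    def __len__(self):
        return len(self._samples)
```

In a real agent the flush would run on a background thread so that request handling never waits on the network, which is the point of sending the data asynchronously.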

 

New Relic FTW!

New Relic won us over when it helped us catch an error we were not even aware of. We have some assistant servers that pre-render some of our pages. In some cases, these assistant servers were throwing an error when they tried to write out a response. However, these errors were being caught (incorrectly) before we could log them, so according to our logs there were no errors. Meanwhile, New Relic kept recording a disturbingly large number of errors on these servers. We investigated and fixed the root cause, but it was a tricky error. The fix could have taken a lot more time if we hadn’t known the exact URLs that were causing the errors and seen the stack trace of the issue.


The New Relic error that helped us track down our silent server error

 

Is New Relic the Silver Bullet™?

New Relic is an extremely useful tool, but at the end of the day it’s a tool. What you get out of it often depends on how well you understand the inner workings of your application. For example, we have a database table, logins, which sees a lot of use. If our database hits high loads, then queries against logins tend to take up a lot of time. So when the database backs up, it tends to look like the logins table is holding everything back. However, the slow logins queries are merely a symptom of a high database load and not the cause.
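One rough way to tell a symptom from a cause is to compare a table's share of total database time before and during a slowdown, rather than its absolute time. The sketch below assumes you can export per-table query time yourself. The function names, the 5% tolerance, the orders and sessions tables, and all of the numbers are invented for illustration.

```python
def time_share(table_seconds):
    """Return each table's fraction of the total database time."""
    total = sum(table_seconds.values())
    if total == 0:
        return {table: 0.0 for table in table_seconds}
    return {table: seconds / total for table, seconds in table_seconds.items()}


def looks_like_symptom(normal, slowdown, table, tolerance=0.05):
    """True if the table's share of DB time barely moved during the slowdown.

    A table whose absolute time spikes while its share stays flat is
    probably slowing down along with everything else.
    """
    change = time_share(slowdown)[table] - time_share(normal)[table]
    return abs(change) <= tolerance


# Invented numbers: seconds of database time per minute, by table.
normal = {"logins": 12.0, "orders": 6.0, "sessions": 2.0}
slowdown = {"logins": 62.0, "orders": 29.0, "sessions": 9.0}

# logins went from 12s to 62s, but its share only moved from 60% to 62%.
print(looks_like_symptom(normal, slowdown, "logins"))
```

This is only a heuristic. A table that really is the cause can also drag other queries down with it, so a flat share is a hint to look at overall load, not proof.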


Misleading Login spike during DB slowdown

 

In addition, New Relic is built as a website monitoring tool. It has the ability to monitor jobs, but it takes some digging to find out how to configure the New Relic agent for a job. When we set up jobs we had to change the newrelic.yml file to have the following lines (these changes may be Java specific, and there must be exactly two spaces to the left of each line in the config):

  sync_startup: true
  send_data_on_exit: true
  send_data_on_exit_delay: 20
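Because the indentation matters, it can be handy to check the file before launching a job. The following is a small sketch, not part of New Relic: the function name missing_job_settings is hypothetical, and it only checks for the three settings above at exactly two spaces of indentation, without parsing the rest of the YAML.

```python
# The three job settings from this post, with the values we used.
REQUIRED = {
    "sync_startup": "true",
    "send_data_on_exit": "true",
    "send_data_on_exit_delay": "20",
}


def missing_job_settings(config_text):
    """Return the required keys that are absent, have a different value,
    or are not indented by exactly two spaces."""
    found = {}
    for line in config_text.splitlines():
        exactly_two_spaces = line.startswith("  ") and not line.startswith("   ")
        if exactly_two_spaces and ":" in line:
            key, _, value = line[2:].partition(":")
            # Ignore any trailing "# comment" after the value.
            found[key.strip()] = value.split("#")[0].strip()
    return [key for key, value in REQUIRED.items() if found.get(key) != value]
```

Calling missing_job_settings(open("newrelic.yml").read()) returns an empty list when all three lines are in place, and the names of the offending settings otherwise.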

Stack traces on jobs also leave something to be desired. New Relic only traces the first ~2000 method calls. This is fine in a web transaction, but it only scratches the surface of a long-running job. Nevertheless, it’s nice to get some insight into how our jobs are influencing our database loads.

 

Final Thoughts

New Relic helped us dive into our code in greater detail than ever before. Using their application, we could find out not only which pages were slow but, crucially, precisely which queries and functions were slow. Now it is easier than ever to hunt down performance issues, which has led to an unanticipated benefit. Prior to New Relic, we had one dev who took on the mantle of performance guru. He let everyone know when pages were slowing down, and encouraged the teams in charge of those pages to fix them up. Some days it seemed he had a Sisyphean task. He could harp on the performance of a page, and it would slowly improve, but in the meantime other pages would slow down. Now that New Relic is in place, there has been a noticeable shift. Teams have started to jump in and fix their own pages spontaneously. Even teams that don’t own code have been helping improve database performance.

 

We have had a great experience with New Relic. If you want to give it a spin, check out www.newrelic.com!

 

Note: the screenshots in this post have been slightly altered to exclude some internal information.

 
