The last few days at work were rough. My team was intensely focused on preparing the production environment to be stable. We’ve been having some serious issues on the production environment. One of the most serious production issues was a nasty memory leak.
The graph above displays the memory usage of 12 node applications on our production environment. The y-axis shows the memory usage in percentage and the x-axis shows the time span of 7 days. This narrow sawtooth pattern on memory usage is extremely bad. As you can see, all 12 servers are accumulating usage in memory rapidly. As a remedy for the memory leak, we had to periodically restart our servers. This was less than ideal, but because we had so many other really high priority items last few weeks, we just had to bite our tongue and suck it up. However, the frequency at which we had to restart our server started to increase. We had to restart our servers every few days, then every day, then it came to a point where we were restarting our servers every 4 to 5 hours. Thank god we have a globally distributed team (Croatia, and Argentina), it could’ve been a lot tougher without having team members in other timezones. I can’t emphasize this point enough, and I will take another opportunity to praise having globally distributed software development team in another post in the future.
After some pitfalls, we fixed the memory leak. As you can see below, the overall memory usage stays flat after 13:30 PM after our memory leak fix was applied.
We came up with several different strategies to fix the memory issue, but the working solution came down to comparing two memory dumps at different times and comparing their content. I used gcore on a production server to gather a memory dump soon after the server restart when the memory usage is around 30%. After around 3 hours when the memory usage for the server starts to hover around 60%, I took another memory dump. I used mdb on a local VM running Solaris 11 to analyze the two memory dumps.
Here is the memory dump of the production server at around 30% memory usage.
And here is the memory dump of the production server at around 80% around 4 hours after the server restart.
The most notable cause of the memory increase is
TraceSegment: name, transaction, ....
You can see that the number of
object count increased by 10 folds going
from 19917 to 209896. You can also see some
other object such as
increasing. This is because those are properties
Contrast this number with the genre-tags object
that stayed exactly the same
2 Object: id, genreIds at 8775. That object
is a json file parsed into memory at startup
of the app and it does not increase. This
suggests to me that the memory dump is indeed
accurate, and Node retains memory steady on
correctly coded objects.
It turns out
TraceSegment objects are created
by NewRelic. The fact that
I wasn’t able to recreate the severity of
memory leak convincingly on the development
environment from previous attempts also tells
me that the memory leak was caused by something
happening on the production environment.
We get some benefits of having NewRelic monitoring, but disabling NewRelic on production was no-brainer at this point. We had to do it. We all acknowledged that NewRelic is not the initial cause of the memory leak. There is still something in our code that is behaving naughtily with NewRelic, but we got to put our the damn fire on production first. So we did it. We disabled NewRelic on production and the memory usage stopped climbing and it remains very steady for now.
I want to thank Yunong Xiao for sharing Netflix’s experience on debugging Node.js in production. It informed me about useful tools in the wild and taught me how I can use those tools to find the cause of the memory leak. I also want to thank Bryce Neal for writing a detailed tutorial on setting up Solaris to use MDB.