The last few days at work were rough. My team was intensely focused on stabilizing our production environment, which had been plagued by serious issues. One of the worst was a nasty memory leak.

The graph above displays the memory usage of 12 Node applications in our production environment. The y-axis shows memory usage as a percentage, and the x-axis spans 7 days. This narrow sawtooth pattern is extremely bad: all 12 servers are accumulating memory rapidly. As a stopgap, we periodically restarted our servers. This was less than ideal, but with so many other high-priority items over the last few weeks, we just had to bite the bullet. However, the restarts grew more frequent. First we restarted every few days, then every day, until it came to a point where we were restarting every 4 to 5 hours. Thank god we have a globally distributed team (Croatia and Argentina); it could’ve been a lot tougher without team members in other time zones. I can’t emphasize this point enough, and I will take another opportunity to praise having a globally distributed software development team in a future post.

After some false starts, we fixed the memory leak. As you can see below, overall memory usage stays flat after 13:30, when our fix was applied.

We came up with several different strategies to fix the memory issue, but the working solution came down to taking two memory dumps at different times and comparing their contents. I used gcore on a production server to capture a dump soon after a restart, when memory usage was around 30%. About 3 hours later, when memory usage started to hover around 60%, I took another dump. I then analyzed both dumps with mdb on a local VM running Solaris 11.
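The workflow looked roughly like this. This is a sketch of a session, not a copy-paste recipe: the process id and core file names are placeholders, and `::load v8` assumes the V8 debugger module has been made available to mdb on the analysis machine.

```
# On the production server: capture cores of the running node process.
pgrep -f node                 # find the node process id (e.g. 12345)
gcore -o node-30pct 12345     # first dump, taken at ~30% memory usage
# ...a few hours later, once memory usage has climbed...
gcore -o node-60pct 12345     # second dump for comparison

# On the Solaris 11 VM: inspect each core with mdb.
mdb node-30pct.12345
> ::load v8                   # load V8 debugging support
> ::findjsobjects             # summarize JS heap objects by constructor
```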

Here is the memory dump of the production server at around 30% memory usage.

     OBJECT  #OBJECTS  #PROPS  CONSTRUCTOR: PROPS
73ce2207ea9      4643       2  Arguments: 0, length
101b5ab4f71      5519       4  Object: albumId, albumNames, ...
 9719767799      6070       2  Object: localeCode, value
 971977c371      6966       1  Object: entry
1102bfc1dc31     8775       2  Object: id, genreIds
 971977b021      9086       2  Object: ids, href
 971973e051      9430       3  Array
    3015e01     16203       0  Object
 971974c9c9     19854       1  Object: nr_exclusive_duration_millis
 9719710199     19917      13  TraceSegment: name, transaction, ...
 971971cb11     20186       7  Timer: state, touched, duration, ...
    342d8d9     49769       0  Array
 971971cbb9     59489       1  Array
 971970f1c9     71743       2  Array

And here is the memory dump of the same production server at around 80% memory usage, about 4 hours after the restart.

     OBJECT  #OBJECTS  #PROPS  CONSTRUCTOR: PROPS
1102bfc1dc31     8775       2  Object: id, genreIds
 110879a2c9     14099       6  ExponentialBackoffStrategy: ...
 1108779561     15593       3  Array
 45357fdf31     17059       2  Object: ids, href
    3015db1     34680       0  Object
 110875b221     35287       4  Object: albumId, albumNames, ...
 110875b2c9     38401       1  Object: entry
 110875aff1     38481       2  Object: localeCode, value
 1108708d51    198161       0  Array
 1108706ae1    202984       1  Array
 1108706a81    206039       1  Object: nr_exclusive_duration_millis
 11087046d9    209896      13  TraceSegment: name, transaction, ...
 1108704801    223481       7  Timer: state, touched, duration, ...
 1108706aa1    507224       2  Array

The most notable cause of the memory growth is TraceSegment: name, transaction, .... The TraceSegment object count increased more than tenfold, from 19,917 to 209,896. You can also see other objects such as Timer and Array increasing; that is because they are properties of TraceSegment objects. Contrast this with the genre-tags object, which stayed at exactly 8,775 (2 Object: id, genreIds). That object is a JSON file parsed into memory at app startup, and it never grows. This suggests to me that the memory dumps are indeed accurate, and that Node keeps memory steady for correctly managed objects.
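The comparison itself boils down to diffing per-constructor object counts between the two dumps. A minimal sketch in Node.js (the helper names are mine; the counts are taken from the two dumps above):

```javascript
// Parse `::findjsobjects`-style lines (OBJECT #OBJECTS #PROPS CONSTRUCTOR)
// into a { constructor: totalCount } map.
function parseCounts(dump) {
  const counts = {};
  for (const line of dump.trim().split("\n")) {
    const m = line.trim().match(/^\S+\s+(\d+)\s+\d+\s+([^:\s]+)/);
    if (m) counts[m[2]] = (counts[m[2]] || 0) + Number(m[1]);
  }
  return counts;
}

// Per-constructor growth between the first and second dump.
function diffCounts(before, after) {
  const deltas = {};
  for (const key of Object.keys(after)) {
    deltas[key] = (after[key] || 0) - (before[key] || 0);
  }
  return deltas;
}

const before = parseCounts(`
 9719710199    19917      13 TraceSegment: name, transaction, ...
1102bfc1dc31    8775       2 Object: id, genreIds
`);
const after = parseCounts(`
 11087046d9   209896      13 TraceSegment: name, transaction, ...
1102bfc1dc31    8775       2 Object: id, genreIds
`);

console.log(diffCounts(before, after));
// TraceSegment grew by 189979, while the parsed-JSON Object stayed flat at 0.
```

Sorting the deltas makes the leaking constructor jump out immediately, even in a heap with millions of objects.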

It turns out TraceSegment objects are created by NewRelic. The fact that my previous attempts had failed to convincingly reproduce the severity of the memory leak in the development environment also tells me the leak was caused by something specific to the production environment.

We get some benefit from NewRelic monitoring, but disabling NewRelic in production was a no-brainer at this point. We had to do it. We all acknowledged that NewRelic is not the root cause of the memory leak; something in our code is still behaving naughtily with NewRelic. But we had to put out the damn fire in production first. So we did it. We disabled NewRelic in production, and memory usage stopped climbing and has remained very steady since.

I want to thank Yunong Xiao for sharing Netflix’s experience on debugging Node.js in production. It informed me about useful tools in the wild and taught me how I can use those tools to find the cause of the memory leak. I also want to thank Bryce Neal for writing a detailed tutorial on setting up Solaris to use MDB.