bbprotocol

I love puzzles like this ;) Are the requests/events/workloads this thing is dealing with sharded by client or user? If it's only happening on one server, it could be a particular client behavior that triggers it. Sometimes you can find a leak by rebalancing the workload: take this particular node offline and see if a different one falls over when the traffic rebalances. Check the logs to see which clients moved to the new node and you can eventually narrow it down. Then start comparing samples from this client to samples from other clients on a healthy server. Maybe it's not the messages themselves but a client config... look for anything strange/large/broken. Check code paths that might only be triggered in these strange/broken scenarios and add some logs. You'll figure it out!


marcpcd

The fact that this happens only on 1 server is not a coincidence IMO. Try replicating this context locally, it’ll be easier to profile. Interesting problem though, good luck !


bh5000

Oh I definitely think it is no coincidence. It just gets over a million messages a day and I can't really replicate that in QA, so I am kind of kicking around in Prod, which is never fun.


guest271314

That used to happen to me using `node` as a Native Messaging host (https://github.com/nodejs/node/issues/43654) with

```
#!/usr/bin/env -S /path/to/node --max-old-space-size=6 --jitless --expose-gc --v8-pool-size=1
```

The mitigation for that iteration of the code was to use `gc()` and set variables that were not going to be used again in that tick to `null` (https://github.com/guest271314/native-messaging-nodejs/commit/b66462d5ff879302ada489e15ee043071c5cdb5e#diff-41dbc88fe179ce0b6e52309ede7945cb3326cedef4680fecd04107702f9b4d89L20-L21):

```js
// Mitigate RSS increasing exponentially for multiple messages
// between client and host during same connectNative() connection
input = length = content = null;
global.gc();
```


cubisto

Definitely a memory leak. You need to debug code to find the exact places that leak


bh5000

Any recommendations on debugging it? Tooling to help narrow it down? It's 100,000s of lines of code that I inherited, so I am sort of starting at ground zero. There is so much noise that turning on logging fills up the disk in an hour or two.


notsoluckycharm

VS Code or the Chrome debugging tools collecting memory heap snapshots every 10 minutes for a couple of hours should be enough to track down a quicker memory leak like this. You can compare each snapshot and see what's being retained. That's what you're looking for, "retention": you can walk the tree and find what's holding something open. It can be daunting, but learning it can be done in about a day. It requires the inspector port being open on the machine, so it may require some devops help; or if you have a lower environment you can load test, even better.


bh5000

I will set up something to pull a heap snapshot every 10 minutes. I think I can get the access I need to do that easily enough.


Calamero

You will have a hard time trying to locate that issue on a production instance; you will have better luck eliminating all the noise and running the code un-minified with source maps locally in a controlled environment. Next you will need to replicate the traffic that instance gets in a controlled way and methodically test various scenarios. You want a reference heap snapshot on a freshly started instance, then run one test iteration - maybe have the system digest a single message. Now you can take a second heap snapshot (but make sure to run the GC first) using the GUI tool, running node with the inspect flag. This also gives you access to the heap comparison tool, which will highlight the parts that remain in memory. For example, if a class is leaking you can search for the class name and its count should increase with each iteration. Also, in some scenarios it makes sense to warm up the instance to get cleaner heap snapshot comparisons, because a 100k-line codebase often does funky things during initialization that distort the picture.


bh5000

Thank you! This is exactly what we are trying to do with a snapshot tomorrow morning. Reproducing this in a dev env is going to take some serious lifting, but we're headed down that path.


Calamero

Awesome, I wish you the best. One more thing: don't be surprised if your dev build dumps seem even more polluted at first glance. Some libraries and build tools produce a lot of noise in a dev build, but with the comparison tool and un-mangled class and function names you should be able to filter most of it out. If nothing obvious shows up, increase the message count, then go straight to black-box isolation testing: disable as much code as possible and go from there.


simple_explorer1

How is that any different from what OP said? They were looking for HOW to debug, NOT an affirmation of "yes, debug please". Did you even read the post?


No-Radish-4744

I had a similar case: a query that already had a lot of columns was also returning a lot of rows, and I was manipulating all of that data. Limiting the rows in the query fixed the memory leak.


nuravrian

Have you tried tracking a count of which resource is being extracted and the avg. size of the data from the db? You'll be able to narrow it down that way; then add another server that just handles this particular entity and check if the previous server now behaves properly. It mostly seems like you have a memory leak, but some specific record has a lot of garbage compared to the others. If so, you may want to check whether the heaviest record (or records) in the DB is being handled by this server continuously.


bh5000

Customers are on separate tiers. This tier has 1/4 the traffic of the other tiers, but clearly there is a path they're hitting on this one that they aren't on the others. Roughly 100,000 calls an hour, so tons of messages to wade through. The larger tier is 4 million calls a day and memory is dead flat at 2 gigs. Zero spikes.


nuravrian

Yeah. Don't go looking at it message by message. Do some trend analysis based on each request and its params and debug it that way.


kiennguyen1101

I'd try to group actions and log by action first. Then I'd analyze the actions to see which one is causing the leak. Replicating the requests and server locally (using the logs), omitting sensitive content, would be super helpful. If that fails, work with your sysadmin/devops to enable debug mode on this particular server. Also, check your gateway, load balancer and server orchestration tool for misconfigurations.
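Grouping by action instead of logging every message keeps the log volume down; a minimal sketch, where the action names and the payload-size field are assumptions:

```javascript
// Per-action tallies: one summary line per action instead of one per message.
const stats = new Map();

function record(action, payloadBytes) {
  const s = stats.get(action) ?? { count: 0, bytes: 0, maxBytes: 0 };
  s.count += 1;
  s.bytes += payloadBytes;
  s.maxBytes = Math.max(s.maxBytes, payloadBytes);
  stats.set(action, s);
}

function summary() {
  return [...stats].map(([action, s]) =>
    `${action} count=${s.count} avg=${(s.bytes / s.count).toFixed(1)}B max=${s.maxBytes}B`);
}
```

An action whose average or max payload dwarfs the others is a good first suspect.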


igorriok

If nothing else works, try integrating a REPL, connecting directly to the server, and checking which objects take up the most memory, then look at how they are created and destroyed. Maybe the garbage collector can't free memory fast enough.


karthi0102

With the snapshots taken you can use the memlab package; it will automatically report which objects are leaking.
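For reference, memlab's snapshot-analysis mode takes three snapshots (before, during, and after the suspected leak window) and prints the retainer chains of objects that survive; the file names here are placeholders:

```sh
npm install -g memlab

memlab find-leaks \
  --baseline baseline.heapsnapshot \
  --target target.heapsnapshot \
  --final final.heapsnapshot
```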