This repository has been archived by the owner on Jul 22, 2024. It is now read-only.

Infinite "adding to backlog" #27

Open
so0k opened this issue Aug 16, 2019 · 0 comments

so0k commented Aug 16, 2019

We are running node-drainer (sha-309d7dc) together with palantir/bouncer in canary mode. Our ASG has 6 desired nodes, 3 of which were still on an old launch template.

So palantir/bouncer set the ASG size to 9 (launching 3 new instances) and then sent autoscaling terminate-instance-in-auto-scaling-group (with should-decrement-desired-capacity) for the 3 instances on the old launch template.

Some of them are drained properly and node-drainer sends the LCH completion, but others seem to go into an infinite loop. I checked manually and confirmed that the Pods remaining on those nodes were all part of DaemonSets (some of those Pods have taint tolerations and only run on certain nodes, so they aren't rescheduled).
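
To illustrate the condition described above, here is a minimal sketch of a "node is effectively drained" check that ignores DaemonSet-managed Pods, similar in spirit to `kubectl drain --ignore-daemonsets`. This is not node-drainer's actual code. It assumes Pods are plain dicts shaped like the Kubernetes Pod JSON (`metadata.ownerReferences[].kind`), and the function names and sample Pods are hypothetical.

```python
def is_daemonset_pod(pod):
    """Return True if any owner reference of the pod is a DaemonSet."""
    owners = pod.get("metadata", {}).get("ownerReferences") or []
    return any(ref.get("kind") == "DaemonSet" for ref in owners)


def blocking_pods(pods):
    """Pods that still need eviction: everything not managed by a DaemonSet."""
    return [pod for pod in pods if not is_daemonset_pod(pod)]


def node_is_drained(pods):
    """A node counts as drained once only DaemonSet pods (or none) remain."""
    return not blocking_pods(pods)


# Hypothetical pods left on a node after eviction.
remaining = [
    {"metadata": {"name": "log-agent-abc",
                  "ownerReferences": [{"kind": "DaemonSet", "name": "log-agent"}]}},
    {"metadata": {"name": "kube-proxy-xyz",
                  "ownerReferences": [{"kind": "DaemonSet", "name": "kube-proxy"}]}},
]
print(node_is_drained(remaining))
```

With a check like this, a node whose only remaining Pods belong to DaemonSets would be reported as drained instead of being re-queued.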

time="2019-08-16T09:19:03Z" level=info msg="Resolved Instance ID i-0c22f8c656c62a282 to Node Name ip-10-51-61-168.ap-southeast-1.compute.internal"
time="2019-08-16T09:19:03Z" level=info msg="Sending ASG heartbeat for instance i-0c22f8c656c62a282"
time="2019-08-16T09:19:03Z" level=info msg="Adding node ip-10-51-61-168.ap-southeast-1.compute.internal to the backlog"
... 
# forever (waited 1 hour)
....
# manually ran:
aws autoscaling complete-lifecycle-action --instance-id i-0c22f8c656c62a282 --lifecycle-hook-name swat-stage-bohr-compute-workers-nodedrainerLCH --auto-scaling-group-name swat-stage-bohr-compute-workers --lifecycle-action-result CONTINUE
...
time="2019-08-16T09:25:42Z" level=info msg="Draining next node ip-10-51-57-12.ap-southeast-1.compute.internal from backlog"
time="2019-08-16T09:25:42Z" level=warning msg="nodes \"ip-10-51-57-12.ap-southeast-1.compute.internal\" not found"
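
To spot nodes stuck in this state, a small script can scan the node-drainer log for nodes that were added to the backlog but never picked up for draining. The sketch below is only a diagnostic aid, assuming the log format shown in the excerpt above (`time=... level=... msg="..."`); the function name `backlog_summary` is hypothetical.

```python
import re
from collections import Counter

# Matches the level and the (possibly escaped) message of one log line.
LINE_RE = re.compile(r'level=(\w+) msg="((?:[^"\\]|\\.)*)"')
ADDED_RE = re.compile(r"Adding node (\S+) to the backlog")
DRAIN_RE = re.compile(r"Draining next node (\S+) from backlog")


def backlog_summary(lines):
    """Map each node that was queued but never drained to its queue count."""
    added = Counter()
    drained = set()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not in the expected log format
        msg = match.group(2)
        if (m := ADDED_RE.search(msg)):
            added[m.group(1)] += 1
        if (m := DRAIN_RE.search(msg)):
            drained.add(m.group(1))
    return {node: count for node, count in added.items() if node not in drained}


# Lines copied from the log excerpt in this issue.
sample = [
    r'time="2019-08-16T09:19:03Z" level=info msg="Sending ASG heartbeat for instance i-0c22f8c656c62a282"',
    r'time="2019-08-16T09:19:03Z" level=info msg="Adding node ip-10-51-61-168.ap-southeast-1.compute.internal to the backlog"',
    r'time="2019-08-16T09:25:42Z" level=info msg="Draining next node ip-10-51-57-12.ap-southeast-1.compute.internal from backlog"',
    r'time="2019-08-16T09:25:42Z" level=warning msg="nodes \"ip-10-51-57-12.ap-southeast-1.compute.internal\" not found"',
]
print(backlog_summary(sample))
```

A node that appears in the result with a steadily growing count is a candidate for the manual `complete-lifecycle-action` workaround shown in the log.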