[#1631] fix(server): ShuffleTaskInfo may leak when app is removed #1632
Merged
zuston pushed a commit to zuston/incubator-uniffle that referenced this pull request on May 27, 2024
…ed. (apache#1632)

### What changes were proposed in this pull request?

In our cluster, pod deletion was denied by a webhook even though all applications had been deleted for a long time. When I ran `curl http://host:ip/metrics/server`, I found that `app_num_with_node` was 1, which means some application had leaked.

The server log contained many duplicates of `[INFO] ShuffleTaskManager.checkResourceStatus - Detect expired appId[appattempt_xxx_xx_xx] according to rss.server.app.expired.withoutHeartbeat`. Repeated jstack dumps of the server showed clearResourceThread stuck forever with the following call stack:

```
"clearResourceThread" #40 daemon prio=5 os_prio=0 cpu=3767.63ms elapsed=5393.50s tid=0x00007f24fe92e800 nid=0x8f waiting on condition [0x00007f24f7b33000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.22/Native Method)
	- parking to wait for <0x00007f28d5e29f20> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.22/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.22/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.22/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.22/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base@11.0.22/ReentrantReadWriteLock.java:959)
	at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:756)
	at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
	at org.apache.uniffle.server.ShuffleTaskManager$$Lambda$216/0x00007f24f824cc40.run(Unknown Source)
	at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
```

The thread is parked in `WriteLock.lock`, so apparently a lock on the same `ReentrantReadWriteLock` is not being released. Looking at the code, it is easy to see that the read lock in `flushBuffer` is not released correctly. The log `ShuffleBufferManager.flushBuffer - Shuffle[3066071] for app[appattempt_xxx] has already been removed, no need to flush the buffer` confirms it.

Fix: apache#1631

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No tests; the mistake is obvious.
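The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not the actual Uniffle code: the class `ReadLockLeakSketch`, its methods, and the `removed` flag are hypothetical names invented for illustration. It assumes the leak has the common shape of an early return between `lock()` and `unlock()`, and shows how a `try`/`finally` block guards against it.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a component guarded by a read/write lock.
class ReadLockLeakSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    // Assumption: the shuffle has already been removed, so flushing is skipped.
    private final boolean removed = true;

    // Leaky shape: the early return skips unlock(), so the read hold is never released.
    void flushLeaky() {
        lock.readLock().lock();
        if (removed) {
            return; // "no need to flush", but the read lock stays held
        }
        lock.readLock().unlock();
    }

    // Safe shape: finally releases the read lock on every exit path.
    void flushSafe() {
        lock.readLock().lock();
        try {
            if (removed) {
                return;
            }
            // ... flush work would go here ...
        } finally {
            lock.readLock().unlock();
        }
    }

    int readLockCount() {
        return lock.getReadLockCount();
    }

    // Simulates a cleanup thread that needs the write lock. A timed tryLock is
    // used so the sketch terminates instead of parking forever.
    boolean writerCanAcquire() {
        AtomicBoolean acquired = new AtomicBoolean(false);
        Thread cleaner = new Thread(() -> {
            try {
                if (lock.writeLock().tryLock(200, TimeUnit.MILLISECONDS)) {
                    acquired.set(true);
                    lock.writeLock().unlock();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        cleaner.start();
        try {
            cleaner.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return acquired.get();
    }

    public static void main(String[] args) {
        ReadLockLeakSketch safe = new ReadLockLeakSketch();
        safe.flushSafe();
        System.out.println("safe:  writer acquired = " + safe.writerCanAcquire());   // true

        ReadLockLeakSketch leaky = new ReadLockLeakSketch();
        leaky.flushLeaky();
        System.out.println("leaky: writer acquired = " + leaky.writerCanAcquire());  // false
    }
}
```

With an untimed `writeLock().lock()`, as in the stack trace above, the cleanup thread in the leaky case would park indefinitely, which matches the `WAITING (parking)` state reported by jstack.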

What changes were proposed in this pull request?
Fix: #1631
Does this PR introduce any user-facing change?
No.
How was this patch tested?
No tests; the mistake is obvious.