[#1631] fix(server): ShuffleTaskInfo may leak when app is removed by zhengchenyu · Pull Request #1632 · apache/uniffle · GitHub
Skip to content

[#1631] fix(server): ShuffleTaskInfo may leak when app is removed#1632

Merged
zuston merged 1 commit into
apache:masterfrom
zhengchenyu:1631
Apr 10, 2024
Merged

[#1631] fix(server): ShuffleTaskInfo may leak when app is removed#1632
zuston merged 1 commit into
apache:masterfrom
zhengchenyu:1631

Conversation

@zhengchenyu

Copy link
Copy Markdown
Collaborator

What changes were proposed in this pull request?

Fix: #1631

Does this PR introduce any user-facing change?

No.

How was this patch tested?

no test, obvious mistake

@github-actions

github-actions Bot commented Apr 9, 2024

Copy link
Copy Markdown

@zuston zuston changed the title [#1631] fix: ShuffleTaskInfo may leak when app is removed. [#1631] fix(server): ShuffleTaskInfo may leak when app is removed Apr 10, 2024

@zuston zuston left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Could you help review this? @rickyma

@rickyma rickyma left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@zuston zuston merged commit 1431c8a into apache:master Apr 10, 2024
zuston pushed a commit to zuston/incubator-uniffle that referenced this pull request May 27, 2024
…ed. (apache#1632)

### What changes were proposed in this pull request?

In our cluster, delete pod is denied by web hook, even though all application is deleted for long time.
When I curl http://host:ip/metrics/server, I found app_num_with_node is 1. 
The problem is some application is leaked. I see many duplicated logs `[INFO] ShuffleTaskManager.checkResourceStatus - Detect expired appId[appattempt_xxx_xx_xx] according to rss.server.app.expired.withoutHeartbeat`.
When I jstack the server many times, clearResourceThread will be stuck forever, here is the call stack.
```
"clearResourceThread" apache#40 daemon prio=5 os_prio=0 cpu=3767.63ms elapsed=5393.50s tid=0x00007f24fe92e800 nid=0x8f waiting on condition  [0x00007f24f7b33000]
   java.lang.Thread.State: WAITING (parking)
	at jdk.internal.misc.Unsafe.park(java.base@11.0.22/Native Method)
	- parking to wait for  <0x00007f28d5e29f20> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(java.base@11.0.22/LockSupport.java:194)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@11.0.22/AbstractQueuedSynchronizer.java:885)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(java.base@11.0.22/AbstractQueuedSynchronizer.java:917)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(java.base@11.0.22/AbstractQueuedSynchronizer.java:1240)
	at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lock(java.base@11.0.22/ReentrantReadWriteLock.java:959)
	at org.apache.uniffle.server.ShuffleTaskManager.removeResources(ShuffleTaskManager.java:756)
	at org.apache.uniffle.server.ShuffleTaskManager.lambda$new$0(ShuffleTaskManager.java:183)
	at org.apache.uniffle.server.ShuffleTaskManager$$Lambda$216/0x00007f24f824cc40.run(Unknown Source)
	at java.lang.Thread.run(java.base@11.0.22/Thread.java:829)
```

Apparently there's a lock that's not being released. Looking at the code, it's easy to see that the read lock in the flushBuffer is not released correctly.  The log ` ShuffleBufferManager.flushBuffer - Shuffle[3066071] for app[appattempt_xxx] has already been removed, no need to flush the buffer` proved it.

Fix: apache#1631 

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

no test, obvious mistake
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] ShuffleTaskInfo may leak when app is removed.

3 participants