rickyma · 2024-02-12T07:30:42Z

What changes were proposed in this pull request?

Reuse ByteBuf when decoding shuffle blocks instead of reallocating it

Why are the changes needed?

A sub PR for: #1519

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

codecov-commenter · 2024-02-12T07:49:48Z

github-actions · 2024-02-12T08:19:00Z

Test Results

2 287 files ±0 2 287 suites ±0 4h 30m 33s ⏱️ + 1m 33s
819 tests ±0 818 ✅ ±0 1 💤 ±0 0 ❌ ±0
9 086 runs ±0 9 073 ✅ ±0 13 💤 ±0 0 ❌ ±0

Results for commit f85291a. ± Comparison against base commit d87dc90.

♻️ This comment has been updated with latest results.

roryqi · 2024-02-13T02:04:31Z

    int dataLength = byteBuf.readInt();
-    ByteBuf data = NettyUtils.getNettyBufferAllocator().directBuffer(dataLength);
-    data.writeBytes(byteBuf, dataLength);
+    ByteBuf data = byteBuf.retain().readSlice(dataLength);


Will byteBuf be split into multiple parts? Will every part be released multiple times? Will that cause errors?

rickyma · 2024-02-13 (edited)

The ByteBuf will not be split into multiple parts. It is used by a SendShuffleDataRequest as a whole.
It will not cause errors, because we retain the ByteBuf (refCnt++) every time we do a readSlice.

    public static SendShuffleDataRequest decode(ByteBuf byteBuf) {
      long requestId = byteBuf.readLong();
      String appId = ByteBufUtils.readLengthAndString(byteBuf);
      int shuffleId = byteBuf.readInt();
      long requireId = byteBuf.readLong();
      Map<Integer, List<ShuffleBlockInfo>> partitionToBlocks = decodePartitionData(byteBuf);
      long timestamp = byteBuf.readLong();
      return new SendShuffleDataRequest(
          requestId, appId, shuffleId, requireId, partitionToBlocks, timestamp);
    }

But it might slow down the flushing process, because the buffer is not actually released until all the ShufflePartitionedData has been flushed (refCnt decreased to 0):

    List<ShufflePartitionedData> shufflePartitionedData = toPartitionedData(sendShuffleDataRequest);
    ...
    for (ShufflePartitionedData spd : shufflePartitionedData) {
      ...
      ret = manager.cacheShuffleData(appId, shuffleId, isPreAllocated, spd);
      ...
    }

The ByteBuf cannot be split; once it is split, we have to allocate new ByteBufs.
~~So maybe we can hold this PR and find a better way to do this.~~
On the other hand, it will speed up the decoding process.
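
The trade-off in this reply can be illustrated with a toy reference-counting model. This is a minimal sketch using only the JDK, not Netty's implementation: `SharedFrame`, `BlockSlice`, and `RefCountSketch` are hypothetical names, and the model assumes the decoder drops its own reference once decoding is done. It only mirrors the idea that each retain-then-readSlice bumps a shared count, so the backing memory is freed when the last slice is released and not before.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a reference-counted frame; NOT Netty's implementation.
class SharedFrame {
  private final byte[] bytes;
  private int refCnt = 1; // the decoder holds the initial reference
  private int readerIndex = 0;
  private boolean freed = false;

  SharedFrame(byte[] bytes) {
    this.bytes = bytes;
  }

  // Mirrors the idea of byteBuf.retain().readSlice(length):
  // bump the shared count, then hand out a view without copying.
  BlockSlice retainAndReadSlice(int length) {
    if (freed) {
      throw new IllegalStateException("frame already freed");
    }
    if (length < 0 || readerIndex + length > bytes.length) {
      throw new IndexOutOfBoundsException("not enough readable bytes");
    }
    refCnt++;
    BlockSlice slice = new BlockSlice(this, readerIndex, length);
    readerIndex += length;
    return slice;
  }

  // Drops one reference; the frame is freed only when the count hits zero.
  void release() {
    if (freed) {
      throw new IllegalStateException("released too many times");
    }
    if (--refCnt == 0) {
      freed = true;
    }
  }

  byte byteAt(int index) {
    if (freed) {
      throw new IllegalStateException("frame already freed");
    }
    return bytes[index];
  }

  int refCnt() {
    return refCnt;
  }

  boolean isFreed() {
    return freed;
  }
}

// A view into the parent frame that shares the parent's reference count.
class BlockSlice {
  private final SharedFrame parent;
  private final int offset;
  private final int length;
  private boolean released = false;

  BlockSlice(SharedFrame parent, int offset, int length) {
    this.parent = parent;
    this.offset = offset;
    this.length = length;
  }

  byte get(int index) {
    if (index < 0 || index >= length) {
      throw new IndexOutOfBoundsException("index " + index);
    }
    return parent.byteAt(offset + index);
  }

  // Called once the block has been flushed.
  void release() {
    if (released) {
      throw new IllegalStateException("slice already released");
    }
    released = true;
    parent.release();
  }
}

class RefCountSketch {
  // Slices every block out of one frame, then drops the decoder's own reference.
  static List<BlockSlice> decode(SharedFrame frame, int[] blockLengths) {
    List<BlockSlice> blocks = new ArrayList<>();
    for (int length : blockLengths) {
      blocks.add(frame.retainAndReadSlice(length));
    }
    frame.release();
    return blocks;
  }

  public static void main(String[] args) {
    SharedFrame frame = new SharedFrame(new byte[] {1, 2, 3, 4, 5, 6});
    List<BlockSlice> blocks = decode(frame, new int[] {2, 4});
    blocks.get(0).release();
    System.out.println("freed after first block flushed: " + frame.isFreed());
    blocks.get(1).release();
    System.out.println("freed after last block flushed: " + frame.isFreed());
  }
}
```

In this model no block is ever copied, which is the decoding speed-up, but one unflushed block keeps the whole frame alive, which is the flushing concern raised above.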

rickyma · 2024-02-13T18:28:22Z

I reopened this PR.

When stress testing the shuffle server without this PR, we easily encounter OutOfDirectMemoryError, which means this PR is necessary.

[epollEventLoopGroup-3-45] [WARN] TransportChannelHandler.exceptionCaught - Exception in connection from /127.0.0.1:58767
io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 4194304 byte(s) of direct memory (used: 161061273600, max: 161061273600)
at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:843)
at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:772)
at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:710)
at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:685)
at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:212)
at io.netty.buffer.PoolArena.tcacheAllocateNormal(PoolArena.java:194)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:136)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:126)
at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:397)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
at org.apache.uniffle.common.netty.protocol.Decoders.decodeShuffleBlockInfo(Decoders.java:50)
at org.apache.uniffle.common.netty.protocol.SendShuffleDataRequest.decodePartitionData(SendShuffleDataRequest.java:95)
at org.apache.uniffle.common.netty.protocol.SendShuffleDataRequest.decode(SendShuffleDataRequest.java:107)
at org.apache.uniffle.common.netty.protocol.Message.decode(Message.java:145)
at org.apache.uniffle.common.netty.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:77)

We can see that each time an out-of-direct-memory error occurs, it is raised at org.apache.uniffle.common.netty.protocol.Decoders.decodeShuffleBlockInfo(Decoders.java:50), which is the line ByteBuf data = NettyUtils.getNettyBufferAllocator().directBuffer(dataLength). This allocation is the most direct trigger for running out of direct memory.

When a large number of requests arrive simultaneously, there may be a brief period (before the TransportFrameDecoder has a chance to release the ByteBuf) during which the shuffle server holds twice as many ByteBufs. For that short time, direct memory usage is doubled, which is very hard to control.
That is why it is so easy to hit an out-of-direct-memory error without this PR.
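
The doubling can be put into rough numbers. The sketch below is a simplified model under stated assumptions, not a measurement: `PeakMemorySketch` and its methods are hypothetical, the request count is invented for illustration, the 4 MiB size merely echoes the allocation size in the log above, and the block payload is treated as the whole frame (headers are ignored).

```java
// Back-of-the-envelope model of peak direct memory; all inputs are hypothetical.
class PeakMemorySketch {
  // Copy path: each in-flight request holds the original frame plus
  // a freshly allocated copy of its block data until the frame is released.
  static long peakWithCopy(int inFlightRequests, long frameBytes) {
    return Math.multiplyExact(2L, peakWithSlice(inFlightRequests, frameBytes));
  }

  // Slice path: blocks are views into the original frame, so only the frame is held.
  static long peakWithSlice(int inFlightRequests, long frameBytes) {
    return Math.multiplyExact((long) inFlightRequests, frameBytes);
  }

  public static void main(String[] args) {
    int inFlightRequests = 1000;         // invented for illustration
    long frameBytes = 4L * 1024 * 1024;  // 4 MiB, echoing the log above
    System.out.println("copy path peak bytes:  " + peakWithCopy(inFlightRequests, frameBytes));
    System.out.println("slice path peak bytes: " + peakWithSlice(inFlightRequests, frameBytes));
  }
}
```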

So we need this PR regardless. It might slow down the flushing process a little, but the shuffle server will at least remain available throughout the stress test.

In my stress tests there was no apparent impact on performance. It may even be faster, since decoding no longer allocates new ByteBufs.
There were no anomalies or performance issues caused by the slower flushing. Eventually all these buffers are flushed and all ByteBufs are released, with no memory leaks.

PTAL @jerqi

roryqi · 2024-02-14T02:22:42Z

Maybe we should modify our flush strategy, too. Currently we flush the larger reduce partitions. But if the map partition also contains a smaller reduce partition, the memory won't be released either.

roryqi · 2024-02-14T02:34:49Z

I prefer adding a config option for this improvement.

rickyma · 2024-02-14T06:51:03Z

The flushing strategy will be changed in the final PR.

rickyma · 2024-02-14T06:52:30Z

roryqi

LGTM, thanks @rickyma

…mory issue causing OOM (#1534)

What changes were proposed in this pull request?

When we use `UnpooledByteBufAllocator` to allocate off-heap `ByteBuf`, Netty directly requests off-heap memory from the operating system instead of allocating it according to `pageSize` and `chunkSize`. This way, we can obtain the exact `ByteBuf` size during the pre-allocation of memory, avoiding distortion of metrics such as `usedMemory`.

Moreover, we have restored the code submission of PR #1521. We ensure that there is sufficient direct memory for the Netty server during decoding `sendShuffleDataRequest` by taking into account the `encodedLength` of `ByteBuf` in advance during the pre-allocation of memory, thus avoiding OOM during decoding `sendShuffleDataRequest`.

Since we are not using `PooledByteBufAllocator`, PR #1524 is no longer needed.

Why are the changes needed?

A sub PR for: #1519

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing UTs.

[apache#1472][part-2] fix(server): Reuse ByteBuf when decoding shuffl…

f85291a

…e blocks instead of reallocating it

rickyma force-pushed the issue-1472-part-2 branch from 45ace37 to f85291a Compare February 12, 2024 07:38

roryqi reviewed Feb 13, 2024

rickyma closed this Feb 13, 2024

rickyma reopened this Feb 13, 2024

rickyma requested a review from roryqi February 14, 2024 08:48

roryqi approved these changes Feb 14, 2024

roryqi merged commit 7fbe7c9 into apache:master Feb 15, 2024

roryqi mentioned this pull request Feb 16, 2024

[#1472][part-5] Inaccurate flow control leads to Shuffle server OOM when enabling Netty #1531

Closed

This was referenced Feb 18, 2024

[#1472] fix(server): Inaccurate flow control leads to Shuffle server OOM when enabling Netty #1519

Closed

[#1472][part-5] Use UnpooledByteBufAllocator to obtain accurate ByteBuf sizes to fix inaccurate usedMemory issue causing OOM #1534

Merged

rickyma deleted the issue-1472-part-2 branch May 5, 2024 08:34

[#1472][part-2] fix(server): Reuse ByteBuf when decoding shuffle blocks instead of reallocating it #1521

roryqi merged 1 commit into apache:master from rickyma:issue-1472-part-2
