apache-spark - Spark: DecoderException: java.lang.OutOfMemoryError

Question

I am running a Spark Streaming application on a cluster with 3 worker nodes. Every now and then the job fails with the following exception:

Job aborted due to stage failure: Task 0 in stage 4508517.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4508517.0 (TID 1376191, 172.31.47.126): io.netty.handler.codec.DecoderException: java.lang.OutOfMemoryError
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:333)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:319)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:787)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:130)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError
at sun.misc.Unsafe.allocateMemory(Native Method)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:127)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:306)
at io.netty.buffer.PoolArena$DirectArena.newUnpooledChunk(PoolArena.java:440)
at io.netty.buffer.PoolArena.allocateHuge(PoolArena.java:187)
at io.netty.buffer.PoolArena.allocate(PoolArena.java:165)
at io.netty.buffer.PoolArena.reallocate(PoolArena.java:277)
at io.netty.buffer.PooledByteBuf.capacity(PooledByteBuf.java:108)
at io.netty.buffer.AbstractByteBuf.ensureWritable(AbstractByteBuf.java:251)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:849)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:841)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:831)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:146)
... 10 more

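I submit the job in client mode without any special parameters. The master and the workers each have 15 GB of memory. The Spark version is 1.4.0.

Can this be solved by tuning the configuration?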

score 1 · Accepted Answer

I ran into the same problem and found that it is probably caused by a memory leak in netty 4.0.23.Final, the version used by Spark 1.4 (see https://github.com/netty/netty/issues/3837).

It is fixed at least in Spark 1.5.0, which uses netty 4.0.29.Final (see https://issues.apache.org/jira/browse/SPARK-8101).

So upgrading to the latest Spark version should solve the problem. I will try it over the next few days.

In addition, the current version of Spark Jobserver forces netty 4.0.23.Final, so it needs to be fixed as well.
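Since more than one component can pull in its own netty, it may help to confirm which jar actually wins on the classpath. The following is only a minimal sketch using the standard library: `NettyJarLocator` and `locate` are hypothetical names made up for this illustration, and the version would have to be read off the printed jar file name.

```java
import java.net.URL;
import java.security.CodeSource;

class NettyJarLocator {
    // Hypothetical helper: reports where the named class was loaded from.
    static String locate(String className) {
        try {
            // Load without initializing the class; only its origin is needed.
            Class<?> cls = Class.forName(className, false,
                    NettyJarLocator.class.getClassLoader());
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            URL location = (source == null) ? null : source.getLocation();
            return (location == null) ? "(bootstrap or unknown source)"
                                      : location.toString();
        } catch (ClassNotFoundException e) {
            return "(not on classpath)";
        }
    }

    public static void main(String[] args) {
        // io.netty.buffer.ByteBuf is a public netty 4 class; the jar name in
        // the printed location usually carries the version.
        System.out.println(locate("io.netty.buffer.ByteBuf"));
    }
}
```

Run it with the same classpath as the driver or executor; a location pointing at a 4.0.23.Final jar would suggest the older netty is still the one being loaded.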

Edit: I upgraded to Spark 1.6 with netty 4.0.29.Final, but I still get the direct buffer OOM when using Spark Jobserver.
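To see whether direct memory really keeps growing, the JVM's own accounting of the "direct" buffer pool can be sampled. This is a rough sketch, not a fix: `DirectBufferUsage` and `directBytesUsed` are hypothetical names, and it assumes the allocations go through `ByteBuffer.allocateDirect`, as they do in the stack trace in the question. Memory that a library obtains by other means would not show up here.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

class DirectBufferUsage {
    // Returns the bytes currently held by the JVM's "direct" buffer pool,
    // or -1 if the pool is not reported by this JVM.
    static long directBytesUsed() {
        for (BufferPoolMXBean pool :
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        // Allocate 1 MiB off-heap so the change is visible.
        ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20);
        long after = directBytesUsed();
        System.out.println("direct bytes before: " + before);
        System.out.println("direct bytes after:  " + after
                + " (buffer capacity " + buffer.capacity() + ")");
    }
}
```

Logging `directBytesUsed()` periodically from the affected process would show whether usage climbs steadily between batches, which is what a leak would look like, or only spikes on large messages.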
