-
Notifications
You must be signed in to change notification settings - Fork 4k
Zstd compression for thrift serialization (storm cluster state) #8653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 2 commits
9a96de5
bd92192
75860c8
dbd1401
5f2d6cf
9efe44a
4f31c87
d78539c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -82,6 +82,7 @@ | |
|
|
||
| <!-- dependency versions --> | ||
| <commons-compress.version>1.28.0</commons-compress.version> | ||
| <zstd-jni.version>1.5.7-8</zstd-jni.version> | ||
| <commons-io.version>2.22.0</commons-io.version> | ||
| <commons-lang3.version>3.20.0</commons-lang3.version> | ||
| <commons-exec.version>1.6.0</commons-exec.version> | ||
|
|
@@ -514,6 +515,12 @@ | |
| <artifactId>commons-compress</artifactId> | ||
| <version>${commons-compress.version}</version> | ||
| </dependency> | ||
| <dependency> | ||
| <groupId>com.github.luben</groupId> | ||
| <artifactId>zstd-jni</artifactId> | ||
| <version>${zstd-jni.version}</version> | ||
| <scope>compile</scope> | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Also worth checking whether the explicit
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have removed the redundant compile tags as suggested. Regarding the zstd-jni dependency: I have confirmed that it is marked as |
||
| </dependency> | ||
| <dependency> | ||
| <groupId>org.apache.commons</groupId> | ||
| <artifactId>commons-exec</artifactId> | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -135,6 +135,15 @@ | |
| <artifactId>curator-test</artifactId> | ||
| <scope>test</scope> | ||
| </dependency> | ||
| <dependency> | ||
| <groupId>org.apache.commons</groupId> | ||
| <artifactId>commons-compress</artifactId> | ||
| </dependency> | ||
| <dependency> | ||
| <groupId>com.github.luben</groupId> | ||
| <artifactId>zstd-jni</artifactId> | ||
| <scope>compile</scope> | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same as in the root pom — |
||
| </dependency> | ||
| </dependencies> | ||
|
|
||
| <build> | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,65 @@ | ||
| /** | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Minor: rest of this package uses |
||
| * Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with | ||
| * this work for additional information regarding copyright ownership. The ASF licenses this file to you under the Apache License, Version | ||
| * 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions | ||
| * and limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.storm.serialization; | ||
|
|
||
| import java.util.Map; | ||
| import java.util.zip.GZIPInputStream; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Unused import. Its presence suggests the original intent was to fall back to Gzip rather than raw Thrift — see comment on the
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am sorry, I have done copy/paste from the gzip bridge delegate. |
||
|
|
||
| /** | ||
| * Always writes Zstd out, but tests incoming bytes to determine the format. | ||
| * If Zstd magic is found, it uses {@link ZstdThriftSerializationDelegate}. | ||
| * If not, it falls back to {@link ThriftSerializationDelegate} for raw Thrift. | ||
| */ | ||
| public class ZstdBridgeThriftSerializationDelegate implements SerializationDelegate { | ||
|
|
||
| /** | ||
| * Zstandard magic number 0xFD2FB52. In a byte array (little-endian format): [0x28, 0xB5, 0x2F, 0xFD] | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The Zstandard magic number is |
||
| */ | ||
| private static final byte[] ZSTD_MAGIC = new byte[]{ 0x28, (byte) 0xB5, 0x2F, (byte) 0xFD }; | ||
|
|
||
| private final ThriftSerializationDelegate defaultDelegate = new ThriftSerializationDelegate(); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This needs to be With |
||
| private final ZstdThriftSerializationDelegate zstdDelegate = new ZstdThriftSerializationDelegate(); | ||
|
|
||
| @Override | ||
| public void prepare(Map<String, Object> topoConf) { | ||
| defaultDelegate.prepare(topoConf); | ||
| zstdDelegate.prepare(topoConf); | ||
| } | ||
|
|
||
| @Override | ||
| public byte[] serialize(Object object) { | ||
| // Always compress new data with Zstd | ||
| return zstdDelegate.serialize(object); | ||
| } | ||
|
|
||
| @Override | ||
| public <T> T deserialize(byte[] bytes, Class<T> clazz) { | ||
| if (isZstd(bytes)) { | ||
| return zstdDelegate.deserialize(bytes, clazz); | ||
| } else { | ||
| // Fallback for old data or non-compressed data | ||
| return defaultDelegate.deserialize(bytes, clazz); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Checks the first 4 bytes of the array against the Zstd Magic Number. | ||
| */ | ||
| private boolean isZstd(byte[] bytes) { | ||
| if (bytes == null || bytes.length < 4) { | ||
| return false; | ||
| } | ||
|
|
||
| return bytes[0] == ZSTD_MAGIC[0] && bytes[1] == ZSTD_MAGIC[1] && bytes[2] == ZSTD_MAGIC[2] && bytes[3] == ZSTD_MAGIC[3]; | ||
| } | ||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,81 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, software | ||
| * distributed under the License is distributed on an "AS IS" BASIS, | ||
| * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| * See the License for the specific language governing permissions and | ||
| * limitations under the License. | ||
| */ | ||
|
|
||
| package org.apache.storm.serialization; | ||
|
|
||
| import java.util.Map; | ||
| import org.apache.storm.thrift.TBase; | ||
| import org.apache.storm.thrift.TDeserializer; | ||
| import org.apache.storm.thrift.TException; | ||
| import org.apache.storm.thrift.TSerializer; | ||
| import org.apache.storm.thrift.transport.TTransportException; | ||
| import org.apache.storm.utils.Utils; | ||
|
|
||
| /** | ||
| * Note, this assumes it's deserializing a gzip byte stream, and will err if it encounters any other serialization. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Stale copy from |
||
| */ | ||
| public class ZstdThriftSerializationDelegate implements SerializationDelegate { | ||
|
|
||
| // ThreadLocal with explicit exception handling for checked TTransportException | ||
| private static final ThreadLocal<TSerializer> SERIALIZER = ThreadLocal.withInitial(() -> { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Caching Suggest following the existing per-call pattern;
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for the detailed explanation. I didn't account for the |
||
| try { | ||
| return new TSerializer(); | ||
| } catch (TTransportException e) { | ||
| throw new RuntimeException("Failed to initialize Thrift Serializer", e); | ||
| } | ||
| }); | ||
|
|
||
| private static final ThreadLocal<TDeserializer> DESERIALIZER = ThreadLocal.withInitial(() -> { | ||
| try { | ||
| return new TDeserializer(); | ||
| } catch (TTransportException e) { | ||
| throw new RuntimeException("Failed to initialize Thrift Deserializer", e); | ||
| } | ||
| }); | ||
|
|
||
| @Override | ||
| public void prepare(Map<String, Object> topoConf) { | ||
| // No-op: Initialization happens lazily per thread | ||
| } | ||
|
|
||
| @Override | ||
| public byte[] serialize(Object object) { | ||
| if (!(object instanceof TBase)) { | ||
| throw new IllegalArgumentException("Object must be an instance of TBase"); | ||
| } | ||
| try { | ||
| byte[] thriftData = SERIALIZER.get().serialize((TBase<?, ?>) object); | ||
| return Utils.ZstdUtils.compress(thriftData); | ||
| } catch (TException e) { | ||
| throw new RuntimeException("Failed to serialize Thrift object", e); | ||
| } | ||
| } | ||
|
|
||
| @Override | ||
| public <T> T deserialize(byte[] bytes, Class<T> clazz) { | ||
| try { | ||
| byte[] decompressed = Utils.ZstdUtils.decompress(bytes); | ||
| TBase<?, ?> instance = (TBase<?, ?>) clazz.getDeclaredConstructor().newInstance(); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nit / defense-in-depth: |
||
| DESERIALIZER.get().deserialize(instance, decompressed); | ||
| return (T) instance; | ||
| } catch (Exception e) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
| throw new RuntimeException("Failed to deserialize bytes to " + clazz.getName(), e); | ||
| } | ||
| } | ||
|
|
||
| } | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -72,6 +72,9 @@ | |
| import java.util.zip.ZipEntry; | ||
| import java.util.zip.ZipFile; | ||
| import javax.security.auth.Subject; | ||
| import org.apache.commons.compress.compressors.zstandard.ZstdCompressorInputStream; | ||
| import org.apache.commons.compress.compressors.zstandard.ZstdCompressorOutputStream; | ||
| import org.apache.commons.io.IOUtils; | ||
| import org.apache.storm.Config; | ||
| import org.apache.storm.blobstore.BlobStore; | ||
| import org.apache.storm.blobstore.ClientBlobStore; | ||
|
|
@@ -960,6 +963,75 @@ public static byte[] gunzip(byte[] data) { | |
| } | ||
| } | ||
|
|
||
| /** | ||
| * Static utility class for Zstandard (Zstd) compression and decompression. | ||
| */ | ||
| public static final class ZstdUtils { | ||
|
|
||
| private static final int BUFFER_SIZE = 64 * 1024; | ||
|
|
||
| /** | ||
| * Private constructor to prevent instantiation. | ||
| * @throws UnsupportedOperationException if an attempt is made to instantiate this class. | ||
| */ | ||
| private ZstdUtils() { | ||
| throw new UnsupportedOperationException("Utility class should not be instantiated."); | ||
| } | ||
|
|
||
| /** | ||
| * Compresses the provided byte array using Zstandard. | ||
| * | ||
| * <p>The output includes the standard Zstandard frame header, making it | ||
| * self-describing for the decompression phase.</p> | ||
| * | ||
| * @param data the raw byte array to compress. | ||
| * @return a compressed byte array, or the original array if null/empty. | ||
| * @throws RuntimeException wrapping an {@link IOException} if the compression fails. | ||
| */ | ||
| public static byte[] compress(byte[] data) { | ||
| if (data == null || data.length == 0) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Returning the raw input for empty data means the bridge's
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you. I am proposing an alternative to this method (now moved to the utility class). I found in the documentation a safe method to do the same comparison better: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/invoke/MethodHandles.html#byteArrayViewVarHandle(java.lang.Class,java.nio.ByteOrder) |
||
| return data; | ||
| } | ||
|
|
||
| try (ByteArrayOutputStream bos = new ByteArrayOutputStream(data.length)) { | ||
| try (ZstdCompressorOutputStream zstdOut = ZstdCompressorOutputStream.builder() | ||
| .setOutputStream(bos) | ||
| .setBufferSize(BUFFER_SIZE) // impacts on compression ratio | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reaching into the static
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you. I am fixing. |
||
| .setLevel(ConfigUtils.zstdCompressionLevel(localConf)) | ||
| .get()) { | ||
| zstdOut.write(data); | ||
| zstdOut.finish(); | ||
| } | ||
| return bos.toByteArray(); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Narrow this to |
||
| } catch (Exception e) { | ||
| throw new RuntimeException("Zstd compression failed", e); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Decompresses a Zstandard-compressed byte array. | ||
| * | ||
| * @param data the compressed byte array (Zstd frame). | ||
| * @return the original decompressed byte array, or the input if null/empty. | ||
| * @throws RuntimeException wrapping an {@link IOException} if the decompression fails | ||
| * or if the data is not a valid Zstd frame. | ||
| */ | ||
| public static byte[] decompress(byte[] data) { | ||
| if (data == null || data.length == 0) { | ||
| return data; | ||
| } | ||
|
|
||
| try (ByteArrayInputStream bis = new ByteArrayInputStream(data); | ||
| ZstdCompressorInputStream zstdIn = new ZstdCompressorInputStream(bis); | ||
| ByteArrayOutputStream bos = new ByteArrayOutputStream()) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hardening suggestion — not a vulnerability under Storm's threat model since ZK sits inside the trusted boundary, but worth doing as defense-in-depth: this
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you. I am fixing. |
||
| IOUtils.copy(zstdIn, bos); | ||
| return bos.toByteArray(); | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same as the compress path: narrow to |
||
| } catch (Exception e) { | ||
| throw new RuntimeException("Zstd decompression failed. Make sure the data is a valid Zstd frame.", e); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| public static List<String> getRepeat(List<String> list) { | ||
| List<String> rtn = new ArrayList<String>(); | ||
| Set<String> idSet = new HashSet<String>(); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -803,6 +803,26 @@ public void validateField(String name, Object o) { | |
| } | ||
| } | ||
|
|
||
| public static class ZstdLevelValidator extends Validator { | ||
| private static final int MIN_LEVEL = 1; | ||
| private static final int MAX_LEVEL = 22; | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Levels 20–22 are Zstd "ultra" mode and require dramatically more working memory per call — rarely worth it for metadata-sized payloads, and they make the cluster easy to footgun with a single |
||
|
|
||
| @Override | ||
| public void validateField(String name, Object o) { | ||
| if (o == null) { | ||
| return; | ||
| } | ||
| SimpleTypeValidator.validateField(name, Integer.class, o); | ||
| int level = (Integer) o; | ||
| if (level < MIN_LEVEL || level > MAX_LEVEL) { | ||
| throw new IllegalArgumentException( | ||
| String.format("Field '%s' is invalid: %d. Zstd compression level must be between %d and %d.", | ||
| name, level, MIN_LEVEL, MAX_LEVEL) | ||
| ); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| public static class EventLoggerRegistryValidator extends Validator { | ||
|
|
||
| @Override | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Switching the default to the pure Zstd delegate breaks rolling upgrades: every existing cluster has Gzip-compressed Thrift sitting in ZooKeeper, and
ZstdThriftSerializationDelegatecannot read it. The new bridge class is presumably designed for exactly this case — once its fallback is fixed (see comment on that file), please point the default here atZstdBridgeThriftSerializationDelegateinstead. Operators can opt into the pure Zstd delegate once they are sure no legacy Gzip state remains in ZK.