perf: Add `append_with` to string builders, use in `replace` by neilconway · Pull Request #22029 · apache/datafusion

neilconway · 2026-05-05T20:48:03Z

Which issue does this PR close?

Closes #21997 (Introduce StringViewArrayBuilder::map to avoid duplication), potentially.

Rationale for this change

This PR adds two new APIs to GenericStringArrayBuilder and StringViewArrayBuilder:

append_with appends a row whose bytes are produced by invoking a closure that is passed a StringWriter.
append_byte_map appends a row whose bytes are produced by mapping each byte of the input through a byte-to-byte map closure.

For StringViewArrayBuilder, StringWriter is an append-only string writer that automatically switches between writing to a new inline view (for short strings) and writing to the in-progress data block. For GenericStringArrayBuilder, StringWriter just appends to the value buffer directly.
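To make the inline-versus-block switch concrete, here is a minimal sketch of such a writer. It is not the code in this PR: SketchViewWriter, SketchView, and their methods are hypothetical names, and a plain Vec<u8> stands in for the in-progress data block. The only fact taken from Arrow is that a string view stores strings of at most 12 bytes inline.

```rust
/// Arrow string views keep strings of at most 12 bytes inline in the view itself.
const MAX_INLINE: usize = 12;

/// Where a finished row ended up (simplified; a real view is a packed 16-byte value).
#[derive(Debug, PartialEq)]
pub enum SketchView {
    Inline { len: usize, bytes: [u8; MAX_INLINE] },
    InBlock { offset: usize, len: usize },
}

/// Hypothetical append-only writer: it starts inline and spills to the data
/// block once the row outgrows the inline capacity.
pub struct SketchViewWriter<'a> {
    block: &'a mut Vec<u8>,
    start: usize, // block offset where a spilled row begins
    inline: [u8; MAX_INLINE],
    len: usize, // total bytes written for this row so far
    spilled: bool,
}

impl<'a> SketchViewWriter<'a> {
    pub fn new(block: &'a mut Vec<u8>) -> Self {
        let start = block.len();
        Self { block, start, inline: [0; MAX_INLINE], len: 0, spilled: false }
    }

    pub fn write_str(&mut self, s: &str) {
        let bytes = s.as_bytes();
        if !self.spilled && self.len + bytes.len() <= MAX_INLINE {
            // Still fits: keep accumulating in the inline buffer.
            self.inline[self.len..self.len + bytes.len()].copy_from_slice(bytes);
        } else {
            if !self.spilled {
                // First overflow: move the inline prefix into the block.
                self.block.extend_from_slice(&self.inline[..self.len]);
                self.spilled = true;
            }
            self.block.extend_from_slice(bytes);
        }
        self.len += bytes.len();
    }

    pub fn finish(self) -> SketchView {
        if self.spilled {
            SketchView::InBlock { offset: self.start, len: self.len }
        } else {
            SketchView::Inline { len: self.len, bytes: self.inline }
        }
    }
}
```

In this sketch, a row that stays within 12 bytes never touches the block, while a longer row ends up contiguous in the block without a separate staging String.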

(We need two new APIs because append_byte_map vectorizes a lot better than append_with, so callers that fit the byte-to-byte map pattern should prefer it.)

Both of these new APIs allow string UDFs to avoid creating an intermediate data copy in many cases. To illustrate this, this PR adopts the new APIs in replace.
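As a rough illustration of the shape of the two APIs, the sketch below models an offset-based builder with a plain Vec<u8> value buffer. SketchStringBuilder and SketchWriter are hypothetical stand-ins, the signatures are assumptions rather than the ones added in this PR, and null handling is omitted.

```rust
/// Hypothetical stand-in for the writer on the offset-based builder:
/// it appends straight into the builder's value buffer.
pub struct SketchWriter<'a> {
    values: &'a mut Vec<u8>,
}

impl SketchWriter<'_> {
    pub fn write_str(&mut self, s: &str) {
        self.values.extend_from_slice(s.as_bytes());
    }

    pub fn write_char(&mut self, c: char) {
        let mut buf = [0u8; 4];
        self.values.extend_from_slice(c.encode_utf8(&mut buf).as_bytes());
    }
}

/// Simplified offsets + values layout (no null buffer), for illustration only.
pub struct SketchStringBuilder {
    values: Vec<u8>,
    offsets: Vec<usize>,
}

impl SketchStringBuilder {
    pub fn new() -> Self {
        Self { values: Vec::new(), offsets: vec![0] }
    }

    /// Append one row whose bytes are written by `f` directly into the value
    /// buffer, so no intermediate String is built and copied afterwards.
    pub fn append_with<F: FnOnce(&mut SketchWriter<'_>)>(&mut self, f: F) {
        let mut w = SketchWriter { values: &mut self.values };
        f(&mut w);
        self.offsets.push(self.values.len());
    }

    /// Append one row by mapping each input byte. The caller must ensure the
    /// mapping keeps the row valid UTF-8 (for example, ASCII to ASCII).
    pub fn append_byte_map<F: Fn(u8) -> u8>(&mut self, input: &[u8], map: F) {
        self.values.extend(input.iter().map(|&b| map(b)));
        self.offsets.push(self.values.len());
    }

    /// Read back row `i` (panics if the row is not valid UTF-8).
    pub fn row(&self, i: usize) -> &str {
        std::str::from_utf8(&self.values[self.offsets[i]..self.offsets[i + 1]]).unwrap()
    }
}
```

The byte-map variant is a simple loop over a slice with no per-write branching, which is the kind of code a compiler can usually vectorize more readily than arbitrary closure-driven writes.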

Benchmarks (Arm64):

Group 1: ASCII single-byte fast path (StringArray)

size=1024 str_len=32 nulls=0.0 : 16.27 µs -> 12.83 µs (−21.1%)
size=1024 str_len=32 nulls=0.2 : 14.23 µs -> 12.10 µs (−15.0%)
size=1024 str_len=128 nulls=0.0 : 11.28 µs -> 8.21 µs (−27.3%)
size=1024 str_len=128 nulls=0.2 : 10.37 µs -> 7.79 µs (−24.9%)
size=4096 str_len=32 nulls=0.0 : 62.48 µs -> 49.50 µs (−20.8%)
size=4096 str_len=32 nulls=0.2 : 55.74 µs -> 46.66 µs (−16.3%)
size=4096 str_len=128 nulls=0.0 : 42.26 µs -> 29.06 µs (−31.2%)
size=4096 str_len=128 nulls=0.2 : 39.17 µs -> 28.52 µs (−27.2%)

Group 2: Multi-byte StringArray — general writer path

size=1024 str_len=32 nulls=0.0 : 23.58 µs -> 21.75 µs (−7.8%)
size=1024 str_len=32 nulls=0.2 : 18.92 µs -> 17.41 µs (−8.0%)
size=1024 str_len=128 nulls=0.0 : 37.56 µs -> 35.33 µs (−5.9%)
size=1024 str_len=128 nulls=0.2 : 29.62 µs -> 28.71 µs (−3.1%)
size=4096 str_len=32 nulls=0.0 : 97.15 µs -> 88.92 µs (−8.5%)
size=4096 str_len=32 nulls=0.2 : 77.03 µs -> 71.43 µs (−7.3%)
size=4096 str_len=128 nulls=0.0 : 173.66 µs -> 163.68 µs (−5.7%)
size=4096 str_len=128 nulls=0.2 : 134.98 µs -> 128.56 µs (−4.8%)

Group 3: Multi-byte StringViewArray — general writer path

size=1024 str_len=32 nulls=0.0 : 24.46 µs -> 22.18 µs (−9.3%)
size=1024 str_len=32 nulls=0.2 : 20.04 µs -> 17.71 µs (−11.7%)
size=1024 str_len=128 nulls=0.0 : 36.43 µs -> 35.79 µs (−1.8%)
size=1024 str_len=128 nulls=0.2 : 29.73 µs -> 28.70 µs (−3.5%)
size=4096 str_len=32 nulls=0.0 : 99.07 µs -> 89.68 µs (−9.5%)
size=4096 str_len=32 nulls=0.2 : 84.38 µs -> 72.46 µs (−14.1%)
size=4096 str_len=128 nulls=0.0 : 169.27 µs -> 164.80 µs (−2.6%, n.s.)
size=4096 str_len=128 nulls=0.2 : 133.79 µs -> 130.20 µs (−2.7%, n.s.)

Group 4: Empty-from StringArray

size=1024 str_len=32 : 87.75 µs -> 50.64 µs (−42.3%)
size=1024 str_len=128 : 313.00 µs -> 187.77 µs (−40.0%)

Group 5: Empty-from StringViewArray

size=1024 str_len=32 : 87.01 µs -> 50.10 µs (−42.4%)
size=1024 str_len=128 : 313.99 µs -> 190.17 µs (−39.4%)

What changes are included in this PR?

Add append_byte_map and append_with to both of the bulk-NULL string builders
Add unit tests
Adopt the new APIs in replace

Are these changes tested?

Yes; new tests added.

Are there any user-facing changes?

No.

neilconway · 2026-05-05T20:55:04Z

Other places where these APIs should be useful:

initcap
lower, upper: at least for the Unicode code path; for ASCII, we might not beat the hand-optimized code added in perf: Optimize lower, upper for ASCII inputs #21980
translate
reverse (might need a slightly different API)
to_char (might need a small API extension)
lpad, rpad (needs a closer look)

If we make the builders accessible outside the current crate, some of the Spark functions could use these APIs, as well as || for Utf8View values.

neilconway · 2026-05-05T20:56:58Z

My initial plan was to have an API where the closure is passed a caller-sized byte slice. That has two shortcomings:

The caller needs to size the byte slice in advance.
For efficiency, we can't initialize the contents of the slice, so (a) this needs unsafe code, and (b) the closure must be careful to write EXACTLY the specified number of bytes, no more and no less.

That seemed like a footgun, so I started with these safer APIs instead.
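For comparison, here is a sketch of what the caller-sized variant might look like. append_sized is a hypothetical name, and this version deliberately zero-fills the slice so that it stays in safe Rust; the design described above would skip that initialization, which is where the unsafe code and the exact-length contract come from.

```rust
/// Hypothetical "caller-sized slice" API, shown in a safe but slower form:
/// the slice is zero-filled before the closure runs. Skipping the zero-fill
/// is what would require `unsafe` and an exact-length guarantee from `fill`.
pub fn append_sized<F: FnOnce(&mut [u8])>(values: &mut Vec<u8>, len: usize, fill: F) {
    let start = values.len();
    values.resize(start + len, 0); // up-front initialization the fast version would avoid
    fill(&mut values[start..]);
}
```

Even in this safe form, a closure that writes fewer bytes than it asked for silently leaves filler bytes in the row; with uninitialized memory the same mistake would expose garbage instead.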

lyne7-sc · 2026-05-07T06:51:15Z

+        builder.append_with(|w| {
+            w.write_str(to);
+            for ch in string.chars() {
+                w.write_char(ch);


Wondering about the from.is_empty() branch — the per-write trade-off looks unusually unfavorable here. Is it worth adding a from_empty benchmark case?

FYI, I ran the benchmarks locally for the empty_from cases and it does look like this branch causes a regression.

group                                                                                 main                            pr-22029-bench
-----                                                                                 ----                            --------------
replace size=1024/replace_string_empty_from [size=1024, str_len=128, nulls=0.2]       1.00    280.0±1.26µs ? ?/sec    1.41    394.3±2.06µs ? ?/sec
replace size=1024/replace_string_empty_from [size=1024, str_len=128, nulls=0]         1.00    339.0±2.10µs ? ?/sec    1.48   502.4±11.19µs ? ?/sec
replace size=1024/replace_string_empty_from [size=1024, str_len=32, nulls=0.2]        1.00     78.9±0.71µs ? ?/sec    1.27    100.1±0.64µs ? ?/sec
replace size=1024/replace_string_empty_from [size=1024, str_len=32, nulls=0]          1.00     97.4±0.59µs ? ?/sec    1.33    129.8±2.01µs ? ?/sec
replace size=1024/replace_string_view_empty_from [size=1024, str_len=128, nulls=0.2]  1.00    281.4±3.73µs ? ?/sec    1.41    396.2±6.00µs ? ?/sec
replace size=1024/replace_string_view_empty_from [size=1024, str_len=128, nulls=0]    1.00    338.2±2.78µs ? ?/sec    1.47    498.1±4.54µs ? ?/sec
replace size=1024/replace_string_view_empty_from [size=1024, str_len=32, nulls=0.2]   1.00     78.3±0.18µs ? ?/sec    1.28    100.5±0.28µs ? ?/sec
replace size=1024/replace_string_view_empty_from [size=1024, str_len=32, nulls=0]     1.00     97.8±0.70µs ? ?/sec    1.31    128.0±1.11µs ? ?/sec
replace size=4096/replace_string_empty_from [size=4096, str_len=128, nulls=0.2]       1.00   1140.6±6.91µs ? ?/sec    1.43  1625.6±11.65µs ? ?/sec
replace size=4096/replace_string_empty_from [size=4096, str_len=128, nulls=0]         1.00  1411.9±17.41µs ? ?/sec    1.50      2.1±0.01ms ? ?/sec
replace size=4096/replace_string_empty_from [size=4096, str_len=32, nulls=0.2]        1.00    317.6±2.31µs ? ?/sec    1.28    405.6±1.73µs ? ?/sec
replace size=4096/replace_string_empty_from [size=4096, str_len=32, nulls=0]          1.00    396.4±3.03µs ? ?/sec    1.29    511.2±5.86µs ? ?/sec
replace size=4096/replace_string_view_empty_from [size=4096, str_len=128, nulls=0.2]  1.00   1147.6±9.87µs ? ?/sec    1.42  1624.8±13.37µs ? ?/sec
replace size=4096/replace_string_view_empty_from [size=4096, str_len=128, nulls=0]    1.00  1433.2±23.33µs ? ?/sec    1.46      2.1±0.01ms ? ?/sec
replace size=4096/replace_string_view_empty_from [size=4096, str_len=32, nulls=0.2]   1.00    318.2±1.08µs ? ?/sec    1.29    409.3±2.62µs ? ?/sec
replace size=4096/replace_string_view_empty_from [size=4096, str_len=32, nulls=0]     1.00    397.6±5.38µs ? ?/sec    1.30    517.0±6.47µs ? ?/sec

neilconway · 2026-05-08

Really interesting! Thanks for raising this.

I didn't quite follow your comment about the per-write tradeoff: I don't think there's anything fundamental about append_with that should make it slower for the many-small-writes case (and we are still able to avoid a final memcpy). However, I can reproduce the significant slowdown you reported for the "empty from" case. I dug into why, and it seems that repeatedly extending an Arrow MutableBuffer is relatively slow: StringWriter::write_str() does MutableBuffer::extend_from_slice(&[byte]), which results in a libc memcpy for every call, which is obviously slower than the per-char String::push(c) that the original code does. I don't think MutableBuffer::extend_from_slice(&[byte]) is inherently slow, but there were enough helper functions / abstractions here that LLVM didn't inline them, which led to the per-call memcpy.
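To show why this workload is dominated by tiny appends, here is a sketch of the empty-from write pattern, with a Vec<u8> standing in for the Arrow buffer. replace_empty_from is a hypothetical helper, and the trailing write of to is an assumption based on how str::replace treats an empty pattern; the diff excerpt above only shows the start of the closure.

```rust
/// Empty-`from` replacement: `to` is emitted before every character and once
/// more at the end, matching `str::replace("", to)`. Every iteration performs
/// two small appends, so per-call overhead dominates the cost.
pub fn replace_empty_from(out: &mut Vec<u8>, string: &str, to: &str) {
    out.extend_from_slice(to.as_bytes());
    let mut buf = [0u8; 4];
    for ch in string.chars() {
        out.extend_from_slice(ch.encode_utf8(&mut buf).as_bytes());
        out.extend_from_slice(to.as_bytes());
    }
}
```

For a row of n characters this issues 2n + 1 appends, most of them only a few bytes long, so any fixed cost per append (such as a non-inlined memcpy call) is multiplied accordingly.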

A few options:

1. Ignore it, because replace with empty-from is a corner case. But it's unfortunate to leave append_with with a performance footgun like this.

2. Fix it by reverting to a mut String buffer for the empty-from case. This fixes this specific workload but doesn't fix the underlying issue in StringWriter.

3. Optimize StringWriter for small-string writes.

I've implemented (3). A combination of special-casing the string length and marking functions as #[inline(always)] appears to convince LLVM to skip memcpy for small `to` values, which is a nice win (30-40% faster than main for the empty-from benchmark with a 3-byte `to` value). All the inlining might in theory cause code bloat for other callers, but it doesn't appear to regress any other replace benchmarks, at least.
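The general idea behind option (3) can be sketched as follows. This is not the code in the PR: write_small_or_bulk is a hypothetical function, the threshold is arbitrary, and a Vec<u8> stands in for the Arrow buffer. Whether the short path actually avoids a memcpy call depends on the compiler and its inlining decisions.

```rust
/// Hypothetical small-write fast path: short inputs are pushed byte by byte,
/// which may let the optimizer emit plain stores once everything is inlined;
/// longer inputs fall back to a bulk copy.
#[inline(always)]
pub fn write_small_or_bulk(out: &mut Vec<u8>, bytes: &[u8]) {
    const SMALL: usize = 8; // threshold chosen arbitrarily for this sketch
    if bytes.len() <= SMALL {
        for &b in bytes {
            out.push(b);
        }
    } else {
        out.extend_from_slice(bytes);
    }
}
```

Both branches produce identical output, so the special case is purely a performance choice and should be validated with benchmarks like the ones in this thread.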

.

77525bb

github-actions Bot added the functions Changes to functions implementation label May 5, 2026

lyne7-sc reviewed May 7, 2026

neilconway added 4 commits May 7, 2026 16:09

.

7b84ffe

Merge remote-tracking branch 'origin/main' into neilc/perf-builder-ap…

389e5c2

…pend-writer # Conflicts: # datafusion/functions/src/string/replace.rs

Comment cleanup

4dc4c57

Minor cleanup

78f2360

neilconway wants to merge 5 commits into apache:main from neilconway:neilc/perf-builder-append-writer
