perf: Add append_with to string builders, use in replace#22029
perf: Add append_with to string builders, use in replace#22029neilconway wants to merge 5 commits intoapache:mainfrom
append_with to string builders, use in replace#22029Conversation
|
Other places where these APIs should be useful:
If we make the builders accessible outside the current crate, some of the Spark functions could use these APIs, as well as |
|
My initial plan was to have an API where the closure is passed a caller-sized byte slice. That has two shortcomings:
That seemed like a footgun, so I started with these safer APIs instead. |
| builder.append_with(|w| { | ||
| w.write_str(to); | ||
| for ch in string.chars() { | ||
| w.write_char(ch); |
There was a problem hiding this comment.
Wondering about the from.is_empty() branch — the per-write trade-off looks unusually unfavorable here. Is it worth adding a from_empty benchmark case?
There was a problem hiding this comment.
FYI, I ran the benchmarks locally for the empty_from cases and it does look like this branch causes a regression.
group main pr-22029-bench
----- ---- --------------
replace size=1024/replace_string_empty_from [size=1024, str_len=128, nulls=0.2] 1.00 280.0±1.26µs ? ?/sec 1.41 394.3±2.06µs ? ?/sec
replace size=1024/replace_string_empty_from [size=1024, str_len=128, nulls=0] 1.00 339.0±2.10µs ? ?/sec 1.48 502.4±11.19µs ? ?/sec
replace size=1024/replace_string_empty_from [size=1024, str_len=32, nulls=0.2] 1.00 78.9±0.71µs ? ?/sec 1.27 100.1±0.64µs ? ?/sec
replace size=1024/replace_string_empty_from [size=1024, str_len=32, nulls=0] 1.00 97.4±0.59µs ? ?/sec 1.33 129.8±2.01µs ? ?/sec
replace size=1024/replace_string_view_empty_from [size=1024, str_len=128, nulls=0.2] 1.00 281.4±3.73µs ? ?/sec 1.41 396.2±6.00µs ? ?/sec
replace size=1024/replace_string_view_empty_from [size=1024, str_len=128, nulls=0] 1.00 338.2±2.78µs ? ?/sec 1.47 498.1±4.54µs ? ?/sec
replace size=1024/replace_string_view_empty_from [size=1024, str_len=32, nulls=0.2] 1.00 78.3±0.18µs ? ?/sec 1.28 100.5±0.28µs ? ?/sec
replace size=1024/replace_string_view_empty_from [size=1024, str_len=32, nulls=0] 1.00 97.8±0.70µs ? ?/sec 1.31 128.0±1.11µs ? ?/sec
replace size=4096/replace_string_empty_from [size=4096, str_len=128, nulls=0.2] 1.00 1140.6±6.91µs ? ?/sec 1.43 1625.6±11.65µs ? ?/sec
replace size=4096/replace_string_empty_from [size=4096, str_len=128, nulls=0] 1.00 1411.9±17.41µs ? ?/sec 1.50 2.1±0.01ms ? ?/sec
replace size=4096/replace_string_empty_from [size=4096, str_len=32, nulls=0.2] 1.00 317.6±2.31µs ? ?/sec 1.28 405.6±1.73µs ? ?/sec
replace size=4096/replace_string_empty_from [size=4096, str_len=32, nulls=0] 1.00 396.4±3.03µs ? ?/sec 1.29 511.2±5.86µs ? ?/sec
replace size=4096/replace_string_view_empty_from [size=4096, str_len=128, nulls=0.2] 1.00 1147.6±9.87µs ? ?/sec 1.42 1624.8±13.37µs ? ?/sec
replace size=4096/replace_string_view_empty_from [size=4096, str_len=128, nulls=0] 1.00 1433.2±23.33µs ? ?/sec 1.46 2.1±0.01ms ? ?/sec
replace size=4096/replace_string_view_empty_from [size=4096, str_len=32, nulls=0.2] 1.00 318.2±1.08µs ? ?/sec 1.29 409.3±2.62µs ? ?/sec
replace size=4096/replace_string_view_empty_from [size=4096, str_len=32, nulls=0] 1.00 397.6±5.38µs ? ?/sec 1.30 517.0±6.47µs ? ?/sec
There was a problem hiding this comment.
Really interesting! Thanks for raising this.
I didn't quite follow your comment about the per-write tradeoff: I don't think there's anything fundamental about append_with that should be slower for the many-small-writes case (and we still are able to avoid a final memcpy). However, I can reproduce the significant slowdown you suggested for the "empty from" case. I dug into why, and it seems that repeatedly extending an Arrow MutableBuffer is relatively slow: StringWriter::write_str() does MutableBuffer::extend_from_slice(&[byte]), which results in a libc memcpy for every call, which is obviously slower than per-char String::push(c), as the original code does. I don't think MutableBuffer::extend_from_slice(&[byte]) is inherently slow, but there was enough helper functions / abstractions here that LLVM didn't inline, which lead to the per-call memcpy.
A few options:
- Ignore it, because
replacewith empty-from is a corner-case. But it's unfortunate to leaveappend_withwith a performance footgun like this. - Fix it by reverting to a
mut Stringbuffer for the empty-from case. Fixes this specific workload but doesn't fix the underlyling issue inStringWriter. - Optimize
StringWriterfor small-string writes.
I've implemented (3). A combination of special-casing the string length and marking functions as #[inline(always)] appears to convince LLVM to skip memcpy for small to values, which is a nice win (30-40% faster than main for the empty-from benchmark with a 3-byte to value). All the inlining might in theory cause code block for other callers but it doesn't appear to regress any other replace benchmarks, at least.
…pend-writer # Conflicts: # datafusion/functions/src/string/replace.rs
Which issue does this PR close?
StringViewArrayBuilder::mapto avoid duplication #21997 (potentially).Rationale for this change
This PR adds two new APIs to
GenericStringArrayBuilderandStringViewArrayBuilder:append_withappends a row whose bytes are produced by invoking a closure that is passed aStringWriterappend_byte_mapappends a row whose bytes are produced by mapping each byte of the input with a byte-to-byte map closure.For
StringViewArrayBuilder,StringWriteris an append-only string writer that switches between writing to a new inline view (for short strings) or to the in-progress data block automatically. ForGenericStringArrayBuilder,StringWriterjust appends to the value buffer directly.(We need two new APIs because
append_byte_mapvectorizes a lot better thanappend_with, so callers that fit the byte-to-byte map pattern should prefer it.)Both of these new APIs allow string UDFs to avoid creating an intermediate data copy in many cases. To illustrate this, this PR adopts the new APIs in
replace.Benchmarks (Arm64):
Group 1: ASCII single-byte fast path (StringArray)
Group 2: Multi-byte StringArray — general writer path
Group 3: Multi-byte StringViewArray — general writer path
Group 4: Empty-from StringArray
Group 5: Empty-from StringViewArray
What changes are included in this PR?
append_byte_mapandappend_withto both of the bulk-NULL string buildersreplaceAre these changes tested?
Yes; new tests added.
Are there any user-facing changes?
No.