feat(experimental): ScalaUDF and Java UDF support via Janino codegen #4267
Draft
mbutrovich wants to merge 70 commits into apache:main from mbutrovich:codegen_scala_udf
Changes from 38 commits
Commits
1746bcc feat: Arrow-direct codegen dispatcher for Spark expressions and Scala… (mbutrovich)
08d6b78 prettier, add new suites to CI checks. (mbutrovich)
557752e make format, fix shims for 4.0+ (mbutrovich)
896f61f make format, fix shims for 4.0+ (mbutrovich)
a82e160 Merge branch 'main' into codegen_scala_udf (mbutrovich)
2a158f4 strengthen tests for composed expressions (mbutrovich)
654bbad make format, again. (mbutrovich)
10df7e0 fix pr_benchmark_check.yml (mbutrovich)
7afe69f fix arrow shading issue in CI. (mbutrovich)
0dc5855 fix Spark 4.0 collation expression shim (mbutrovich)
43a7b0c apply common subexpression elimination, add tests for subqueries in UDFs (mbutrovich)
9640897 make format (mbutrovich)
f0c8296 decimal fast path. document 64KB limitation right now (mbutrovich)
2173f40 pass through task context to get around tokio worker pool calling ove… (mbutrovich)
2f9585b fix compilation on scala 2.12, fix format issue (mbutrovich)
582cd17 Merge branch 'main' into codegen_scala_udf (mbutrovich)
22f3256 decimal output, utf8 output, non-nullable output optimizations (mbutrovich)
7666715 optimization menu (mbutrovich)
0a34636 estimate binaryview and binary size (mbutrovich)
e94b6db fix "CSE collapses a repeated subtree to one evaluation in the genera… (mbutrovich)
d0f1f27 Merge remote-tracking branch 'origin/codegen_scala_udf' into codegen_… (mbutrovich)
07e37ea add some complex type support, remove #4239 code. update docs. (mbutrovich)
ebf77c4 split codegen input and output, basic struct WIP (mbutrovich)
6836c30 split massive codegen file, handle recursive nested types (mbutrovich)
5d91a8f map input (mbutrovich)
2a28aaf more struct support (mbutrovich)
0c6586a revert some benchmark changes (mbutrovich)
8497fe7 cleanup part 1 (mbutrovich)
8d703c3 cleanup part 2 (mbutrovich)
5ec0e3f cleanup part 3 (mbutrovich)
a22051e remove view support, it's dead code right now (mbutrovich)
421c60c use cometplainvector part 1 (mbutrovich)
0705dff use cometplainvector part 2 (mbutrovich)
9a00874 make generated class final (mbutrovich)
d7b43fc clean up test names (mbutrovich)
034e1f5 fix format (mbutrovich)
317feaf Merge branch 'main' into codegen_scala_udf (mbutrovich)
db1f1f2 Merge branch 'main' into codegen_scala_udf (mbutrovich)
caffed9 fix 2.12 mapvalues usage (mbutrovich)
4be8144 Remove code related to #4239. (mbutrovich)
6fcd81c Merge remote-tracking branch 'apache/main' into codegen_scala_udf (mbutrovich)
9f8aa07 fix after merging in upstream/main. (mbutrovich)
17b2714 switch to taskid-keyed state for CometUDFs. (mbutrovich)
ff8ee79 Merge branch 'main' into codegen_scala_udf (mbutrovich)
7ed806a reduce the scope to just ScalaUDF instead of general spark expression… (mbutrovich)
6ff5aa0 update docs (mbutrovich)
935aec6 reorg codegen (mbutrovich)
cbf96df more tests (mbutrovich)
5966055 cleanup (mbutrovich)
748f943 document optimizations (mbutrovich)
f9318d8 fix tests (mbutrovich)
19ac9f6 try to trim comments a bit (mbutrovich)
13270bf update two tests (mbutrovich)
1111c6f revert unintended diff from main (mbutrovich)
61ae5b7 add Java UDF test (mbutrovich)
6643208 update stale TODO references (mbutrovich)
965c2ba better input fuzz coverage (mbutrovich)
948f3b9 better input fuzz coverage (mbutrovich)
41fc046 better input fuzz coverage (mbutrovich)
25c2511 simplify input logic (mbutrovich)
a057687 fix format (mbutrovich)
650f619 add fallback for too many args and a test, clean up printing code (mbutrovich)
b1e1c55 stronger tests (mbutrovich)
0f6f68c Merge branch 'main' into codegen_scala_udf (mbutrovich)
d967143 fix(udf): scope the dispatcher's compile cache per task to isolate bo… (mbutrovich)
10da742 update docs (mbutrovich)
23df354 add missing suite (mbutrovich)
b161169 synchronize per-task UDF evaluation (mbutrovich)
f86e70b Merge branch 'main' into codegen_scala_udf (mbutrovich)
dca8b22 update spark diffs (mbutrovich)
68 changes: 68 additions & 0 deletions
common/src/main/java/org/apache/comet/udf/CometBatchKernel.java
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| /* | ||
| * Licensed to the Apache Software Foundation (ASF) under one | ||
| * or more contributor license agreements. See the NOTICE file | ||
| * distributed with this work for additional information | ||
| * regarding copyright ownership. The ASF licenses this file | ||
| * to you under the Apache License, Version 2.0 (the | ||
| * "License"); you may not use this file except in compliance | ||
| * with the License. You may obtain a copy of the License at | ||
| * | ||
| * http://www.apache.org/licenses/LICENSE-2.0 | ||
| * | ||
| * Unless required by applicable law or agreed to in writing, | ||
| * software distributed under the License is distributed on an | ||
| * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY | ||
| * KIND, either express or implied. See the License for the | ||
| * specific language governing permissions and limitations | ||
| * under the License. | ||
| */ | ||
| | ||
| package org.apache.comet.udf; | ||
| | ||
| import org.apache.arrow.vector.FieldVector; | ||
| import org.apache.arrow.vector.ValueVector; | ||
| | ||
| /** | ||
| * Abstract base extended by the Janino-compiled batch kernel emitted by {@code | ||
| * CometBatchKernelCodegen}. The generated subclass extends {@code CometInternalRow} (so Spark's | ||
| * {@code BoundReference.genCode} can call {@code this.getUTF8String(ord)} directly) and carries | ||
| * typed input fields baked at codegen time, one per input column. Expression evaluation plus Arrow | ||
| * read/write fuse into one method per expression tree. | ||
| * | ||
| * <p>Input scope: any {@code ValueVector[]}; the generated subclass casts each slot to the concrete | ||
| * Arrow type the compile-time schema specified. Output is a generic {@code FieldVector}; the | ||
| * generated subclass casts to the concrete type matching the bound expression's {@code dataType}. | ||
| * Widen input support by adding vector classes to the getter switch in {@code | ||
| * CometBatchKernelCodegen.emitTypedGetters}; widen output support by adding cases in {@code | ||
| * CometBatchKernelCodegen.allocateOutput} and {@code emitOutputWriter}. | ||
| */ | ||
| public abstract class CometBatchKernel extends CometInternalRow { | ||
| | ||
| protected final Object[] references; | ||
| | ||
| protected CometBatchKernel(Object[] references) { | ||
| this.references = references; | ||
| } | ||
| | ||
| /** | ||
| * Process one batch. | ||
| * | ||
| * @param inputs Arrow input vectors; length and concrete classes must match the schema the kernel | ||
| * was compiled against | ||
| * @param output Arrow output vector; caller allocates to the expression's {@code dataType} | ||
| * @param numRows number of rows in this batch | ||
| */ | ||
| public abstract void process(ValueVector[] inputs, FieldVector output, int numRows); | ||
| | ||
| /** | ||
| * Run partition-dependent initialization. The generated subclass overrides this to execute | ||
| * statements collected via {@code CodegenContext.addPartitionInitializationStatement}, for | ||
| * example reseeding {@code Rand}'s {@code XORShiftRandom} from {@code seed + partitionIndex}. | ||
| * Deterministic expressions leave this as a no-op. | ||
| * | ||
| * <p>The caller must invoke this before the first {@code process} call of each partition. The | ||
| * generated subclass is not thread-safe across concurrent {@code process} calls, so kernels are | ||
| * allocated per dispatcher invocation and init is run once on the fresh instance. | ||
| */ | ||
| public void init(int partitionIndex) {} | ||
| } |
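
For readers of the Javadoc in this file, the caller-side contract (fresh kernel per invocation, `init` once before the first `process`, generic input slots cast to the types the kernel was compiled against, caller-allocated output) might be sketched as below. This is only an illustration under stated assumptions: `KernelSketch`, `LongColumn`, `BatchKernel`, `AddPartitionOffsetKernel` and `runPartition` are hypothetical stand-ins invented for the example, plain arrays replace Arrow vectors, and none of this is the code that `CometBatchKernelCodegen` emits.

```java
import java.util.Arrays;

/** Hypothetical stand-ins for illustration only; not the Arrow or Comet API. */
public class KernelSketch {

  /** Stand-in for an Arrow vector of 64-bit integers (assumption: a plain array). */
  static final class LongColumn {
    final long[] values;

    LongColumn(long[] values) {
      this.values = values;
    }
  }

  /** Mirrors the shape of the abstract base in the diff: references, process, init. */
  abstract static class BatchKernel {
    protected final Object[] references;

    protected BatchKernel(Object[] references) {
      this.references = references;
    }

    abstract void process(Object[] inputs, LongColumn output, int numRows);

    /** No-op by default, as for deterministic expressions. */
    void init(int partitionIndex) {}
  }

  /** A hand-written analogue of a generated subclass: output = col0 + (seed + partitionIndex). */
  static final class AddPartitionOffsetKernel extends BatchKernel {
    private long offset;

    AddPartitionOffsetKernel(Object[] references) {
      super(references);
    }

    @Override
    void init(int partitionIndex) {
      // references[0] plays the role of a seed captured when the kernel was built.
      offset = ((Long) references[0]) + partitionIndex;
    }

    @Override
    void process(Object[] inputs, LongColumn output, int numRows) {
      // Cast the generic slot to the concrete type fixed by the compile-time schema.
      LongColumn in0 = (LongColumn) inputs[0];
      for (int i = 0; i < numRows; i++) {
        output.values[i] = in0.values[i] + offset;
      }
    }
  }

  /** Caller-side lifecycle: allocate a fresh kernel, init once, then process each batch. */
  public static long[][] runPartition(long seed, int partitionIndex, long[][] batches) {
    BatchKernel kernel = new AddPartitionOffsetKernel(new Object[] {seed});
    kernel.init(partitionIndex); // must precede the first process call of the partition
    long[][] results = new long[batches.length][];
    for (int b = 0; b < batches.length; b++) {
      int numRows = batches[b].length;
      LongColumn output = new LongColumn(new long[numRows]); // caller allocates the output
      kernel.process(new Object[] {new LongColumn(batches[b])}, output, numRows);
      results[b] = output.values;
    }
    return results;
  }

  public static void main(String[] args) {
    long[][] out = runPartition(10L, 2, new long[][] {{1, 2, 3}, {4}});
    System.out.println(Arrays.deepToString(out)); // prints [[13, 14, 15], [16]]
  }
}
```

The kernel instance is never shared between callers in this sketch, which matches the note in the Javadoc that the generated subclass is not thread-safe across concurrent `process` calls.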