feat(datetime): prototype JVM UDF path for Hour/Minute/Second (engine=java) #4321
Draft
andygrove wants to merge 15 commits into
Conversation
Tests for comet-common code live in the spark module by convention; the spark module already has ScalaTest wiring, while the common module previously had only JUnit. This reverts the scalatest plumbing added to common/pom.xml and relocates DateTimeFieldUDFSuite next to the other Scala test suites.
DateTimeFieldUDFSuite has been relocated to the spark module (which already has ScalaTest), so common no longer needs scalatest dependencies or the scalatest-maven-plugin.
When engine=java, getSupportLevel previously returned Compatible unconditionally, relying on the Spark analyzer to reject non-timestamp input types for Hour/Minute/Second. If a future Spark version accepts a new timestamp-like input type, DateTimeFieldUDF would throw IllegalArgumentException at runtime instead of falling back cleanly. Now the java branch matches on TimestampType | TimestampNTZType and returns Unsupported otherwise, mirroring how the rust branch already gates on TimestampNTZType. Also: align Compatible() call form across both branches (was Compatible(None) in the java branch, Compatible() in the rust branch). [skip ci]
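A minimal, self-contained sketch of the gating this commit describes. The `SupportLevel` and `InputType` enums are hypothetical stand-ins for illustration only, not Comet's actual Scala types:

```java
public class JavaEngineGateSketch {
    // Hypothetical stand-ins; Comet's real getSupportLevel works on Spark's
    // DataType hierarchy and the project's own SupportLevel ADT.
    public enum SupportLevel { COMPATIBLE, UNSUPPORTED }
    public enum InputType { TIMESTAMP, TIMESTAMP_NTZ, DATE, STRING }

    // engine=java branch: accept exactly the two timestamp types and report
    // everything else as Unsupported, so an unexpected input type makes Comet
    // fall back to Spark instead of crashing the UDF at runtime.
    public static SupportLevel supportLevel(InputType input) {
        switch (input) {
            case TIMESTAMP:
            case TIMESTAMP_NTZ:
                return SupportLevel.COMPATIBLE;
            default:
                return SupportLevel.UNSUPPORTED;
        }
    }
}
```

The point of the explicit match is that the default arm, not an exception, handles anything new Spark might send.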
Which issue does this PR close?
Part of #4311.
Rationale for this change
Comet's native Rust implementations of `hour`, `minute`, and `second` incorrectly apply timezone conversion to `TimestampNTZType` inputs (#3180). Today this means Comet falls back to Spark for the NTZ case, losing native execution for the rest of the stage.

This PR prototypes the approach outlined in #4311: add an opt-in JVM UDF path that produces bit-exact Spark results for the affected expressions, in exchange for a JNI round-trip per native batch. It mirrors the engine-selection idiom proposed in the open regex PR (#4239), keeping the framing consistent across families. The default stays `rust`; users opt into `java` for full Spark compatibility.

The scope is intentionally narrow: three closely related expressions sharing a serde shape and a UDF base class. Once this engine-config model is reviewed, it extends naturally to `TruncTimestamp`, `DateFormat`, and the field-extraction expressions in a follow-up.

What changes are included in this PR?
- `spark.comet.exec.datetime.engine` with values `rust` (default) and `java`. Lives in `CometConf.scala`.
- `DateTimeFieldUDF` abstract base in `common/src/main/scala/org/apache/comet/udf/`, implementing the Arrow vector iteration, null preservation, and TZ/NTZ branching using `java.time`. Concrete subclasses `HourUDF`, `MinuteUDF`, `SecondUDF` each supply a one-line `extract(LocalDateTime): Int`.
- `CometHour`/`CometMinute`/`CometSecond` (in `spark/src/main/scala/org/apache/comet/serde/datetime.scala`): when `engine=java`, the serde builds a `JvmScalarUdf` proto pointing at the matching UDF class; when `engine=rust`, the existing native path is unchanged. A shared private `DateTimeFieldUdfHelper.buildProto` factors out the proto construction.
- The `engine=java` branch in `getSupportLevel` matches explicitly on `TimestampType | TimestampNTZType` and returns `Unsupported` for anything else, so a future input-type change in Spark falls back cleanly instead of crashing the UDF at runtime.
- `docs/source/user-guide/latest/compatibility/expressions/datetime.md`.
- No native Rust changes; the `JvmScalarUdf` proto and `JvmScalarUdfExpr` planner integration were already merged in #4232.

This PR was scaffolded with the project's `brainstorming` and `writing-plans` skills.

How are these changes tested?
Three test suites, totaling 17 new tests:
- `DateTimeFieldUDFSuite` (Scala unit, in the spark module): 8 tests exercising each UDF directly on Arrow vectors. Covers TZ + NTZ, multiple session timezones, and null preservation.
- `CometDatetimeJvmSuite` (Spark integration): 8 tests with `engine=java` set in `sparkConf`. Each of `hour`/`minute`/`second` is checked across UTC, `America/Los_Angeles`, and `Asia/Tokyo` on both `TimestampType` and `TimestampNTZ`, plus a DST fall-back boundary test in LA and a plan-inspection test asserting native execution. All use `checkSparkAnswerAndOperator` to verify both correctness against Spark and that the operator ran natively.
- `CometDatetimeEngineDefaultSuite` (Spark integration): 1 test asserting the config defaults to `rust` and that the rust path still runs natively under defaults. Regression guard for the existing path.

Also verified locally: `spotless:apply` is a no-op, `cargo clippy --all-targets --workspace -- -D warnings` is clean, and `CometTemporalExpressionSuite` (27 tests) still passes unchanged, confirming the rust default path is untouched.
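To illustrate the TZ/NTZ distinction these suites exercise, here is a hedged, self-contained `java.time` sketch. The class and method names are hypothetical, and Comet's real `DateTimeFieldUDF` iterates Arrow vectors rather than single values; this only shows the per-value branching:

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

public class HourExtractSketch {
    // TimestampType: microseconds since the epoch name an instant, so the
    // wall-clock hour depends on the session time zone.
    public static int hourTz(long micros, ZoneId sessionTz) {
        Instant instant = Instant.ofEpochSecond(
            Math.floorDiv(micros, 1_000_000L),
            Math.floorMod(micros, 1_000_000L) * 1_000L);
        return LocalDateTime.ofInstant(instant, sessionTz).getHour();
    }

    // TimestampNTZType: the microseconds already encode a wall-clock time, so
    // no time-zone conversion may be applied (applying one is the bug in #3180).
    public static int hourNtz(long micros) {
        return LocalDateTime.ofEpochSecond(
            Math.floorDiv(micros, 1_000_000L),
            (int) (Math.floorMod(micros, 1_000_000L) * 1_000L),
            ZoneOffset.UTC).getHour();
    }
}
```

For example, `hourNtz` gives the same answer in every session time zone, while `hourTz` for the epoch shifts from 0 in UTC to 9 in `Asia/Tokyo`.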