
[WIP][SPARK-56661] Introducing logical and physical planning nodes for language-agnostic Spark UDFs#55768

Draft
sven-weber-db wants to merge 2 commits into apache:master from sven-weber-db:sven-weber_data/spark-56661-catalyst-and-udf

Conversation

@sven-weber-db
Contributor

What changes were proposed in this pull request?

This PR introduces new logical and physical Catalyst nodes for language-agnostic User Defined Functions (UDFs), as part of the language-agnostic UDF SPIP (SPARK-55278).

As a first step towards language-agnostic UDFs, we want to target map-partition UDFs such as pyspark.sql.DataFrame.mapInArrow or pyspark.RDD.mapPartitions. The overarching goal is to deprecate the current, language-specific Catalyst nodes (like mapInArrow). For now, however, the new nodes will exist alongside the old ones until the new framework has reached maturity.

In summary, this PR introduces:

  • A new Catalyst Expression, ExternalUDFExpression, which captures language-agnostic UDF properties (payload, name, etc.)
  • A new Catalyst logical node, ExternalUDF, which serves as a base class for all language-agnostic UDF nodes
  • A new Catalyst logical node, MapPartitionsExternalUDF, which is the new, language-agnostic map-partition node
  • Catalyst physical nodes for both logical nodes
  • WorkerDispatcherManager, a class that manages UDF dispatchers based on the target UDFWorkerSpecification

None of the changes introduced above are currently consumed in Spark.
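For illustration, a rough sketch of how the pieces listed above might relate to each other. The class names follow the PR summary; all fields, signatures, and inheritance details are assumptions, not the actual implementation:

```scala
// Hypothetical sketch only: class names come from the PR summary,
// everything else (fields, types, hierarchy) is assumed for illustration.

// Expression capturing the language-agnostic UDF properties.
case class ExternalUDFExpression(
    name: String,          // display name of the UDF
    payload: Array[Byte])  // serialized, language-agnostic UDF body
  extends Expression

// Base class for all language-agnostic UDF logical nodes.
abstract class ExternalUDF extends LogicalPlan {
  def workerSpec: UDFWorkerSpecification
  def udf: ExternalUDFExpression
}

// The language-agnostic map-partition logical node; planned into a
// corresponding physical node that talks to a worker via a dispatcher
// obtained from WorkerDispatcherManager.
case class MapPartitionsExternalUDF(
    workerSpec: UDFWorkerSpecification,
    udf: ExternalUDFExpression,
    output: Seq[Attribute],
    child: LogicalPlan) extends ExternalUDF
```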

Why are the changes needed?

This is the first step toward language-agnostic UDF execution in Spark. The existing logical and physical planning nodes make language-specific assumptions and eventually need to be replaced to achieve this goal.

Does this PR introduce any user-facing change?

No

How was this patch tested?

New unit tests were added.

Was this patch authored or co-authored using generative AI tooling?

Partially. However, the code was manually reviewed and adjusted.

session.close()
// TODO [SPARK-55278]: Stream rows to/from the worker
// via session.process().
rows
Contributor


Maybe throw an exception here to be correct?


/**
* :: Experimental ::
* Builds a [[UDFWorkerSpecification]] for Python UDFs from a
Contributor


Do we have tests that this converted worker spec can run with the direct worker dispatcher?

Contributor


No need to run the UDF, just check that the worker starts and stops.

* `[className]`. Useful for identifying which class produced a
* log line.
*/
def forClass(clazz: Class[_]): WorkerLogger = {
Contributor


This looks a little weird to me. Why don't we make this a static function on a singleton object and make the prefix a required field of WorkerLogger?
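A minimal sketch of the suggested alternative. This is hypothetical; WorkerLogger's actual fields and logging backend are not shown in this diff, so the prefix field and the println body below are assumptions:

```scala
// Hypothetical sketch of the suggestion: the prefix is a required
// constructor field, and forClass lives on the companion (singleton)
// object instead of an instance method.
class WorkerLogger(val prefix: String) {
  // Placeholder body; the real class would delegate to Spark's logging.
  def info(msg: String): Unit = println(s"[$prefix] $msg")
}

object WorkerLogger {
  // Static-style factory: the prefix is always derived from the class
  // name, making it easy to see which class produced a log line.
  def forClass(clazz: Class[_]): WorkerLogger =
    new WorkerLogger(clazz.getSimpleName)
}
```

For example, `WorkerLogger.forClass(classOf[String]).info("started")` would log with a `[String]` prefix.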

Dataset.ofRows(
sparkSession,
MapPartitionsExternalUDF(
workerSpec, udf, output, logicalPlan))
Contributor


We would need isBarrier here, too.

