[SPARK-34591][ML] Add decision tree pruning as a parameter by WeichenXu123 · Pull Request #55763 · apache/spark

WeichenXu123 · 2026-05-08T11:16:38Z

What changes were proposed in this pull request?

This PR adds a parameter to enable or disable the feature in which LearningNodes are merged after an RF model is trained.

This PR takes over #32813

Why are the changes needed?

Two reasons:

1. In addition to basic classification, another use case for decision trees is the probabilities associated with predictions. Once the tree is pruned, these probabilities are lost, which makes the trees and their predictions challenging to work with, if not unusable.
2. Unconditional pruning is not in line with the default behavior in sklearn, where trees are left unpruned by default.
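The first reason can be illustrated with a toy model. The sketch below is not Spark's implementation: `Node`, `prune`, and `predict_proba` are hypothetical names, and the merge rule (collapse a split when both children are leaves with the same predicted class) is an assumption about how the post-training merge behaves. It only shows why merging leaves that agree on the class can still change the reported probabilities.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Toy tree node; `counts` holds per-class training counts reaching the node."""
    counts: List[float]
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

    @property
    def prediction(self) -> int:
        # Predicted class is the one with the largest count.
        return max(range(len(self.counts)), key=lambda i: self.counts[i])

    @property
    def probabilities(self) -> List[float]:
        total = sum(self.counts)
        return [c / total for c in self.counts]


def prune(node: Node) -> Node:
    """Bottom-up merge (assumed rule): collapse a split whose two
    children are leaves that predict the same class."""
    if node.is_leaf:
        return node
    left, right = prune(node.left), prune(node.right)
    if left.is_leaf and right.is_leaf and left.prediction == right.prediction:
        return Node(counts=node.counts)  # parent becomes a leaf
    return Node(node.counts, node.feature, node.threshold, left, right)


def predict_proba(node: Node, x: List[float]) -> List[float]:
    """Walk from the root to a leaf and return that leaf's class probabilities."""
    while not node.is_leaf:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.probabilities


# Both leaves predict class 1, but with different confidence.
tree = Node(
    counts=[25.0, 75.0], feature=0, threshold=0.5,
    left=Node(counts=[5.0, 45.0]),
    right=Node(counts=[20.0, 30.0]),
)
pruned = prune(tree)

print(predict_proba(tree, [0.2]))    # [0.1, 0.9]
print(predict_proba(tree, [0.8]))    # [0.4, 0.6]
print(predict_proba(pruned, [0.2]))  # [0.25, 0.75]
print(predict_proba(pruned, [0.8]))  # [0.25, 0.75]
```

In this toy case the predicted class is unchanged by pruning, but the two distinct leaf probabilities collapse into the parent's single distribution, which is the loss that reason 1 describes.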

Please see Jira ticket for more explanation.

Does this PR introduce any user-facing change?

New params:
Adds a `pruneTree` parameter that is exposed on the tree-based classifiers. Tests will be added to ensure the parameter is exposed correctly.

How was this patch tested?

Unit tests.

### What changes were proposed in this pull request?

This PR disables a feature created in SPARK-3159 where LearningNodes are merged after a RF model is trained.

### Why are the changes needed?

2 Reasons:

1. In addition to basic classification, another use case for decision trees are the probabilities associated with predictions. Once pruned, these predictions are lost and it makes the trees/predictions challenging to work with if not unusable.
2. It is not in line with the default behavior in sklearn. In sklearn, the trees are left unpruned by default.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only.

### How was this patch tested?

Locally ran `./build/mvn -pl mllib package` and verified tests passed.

Additionally, running through git workflow as described here: https://spark.apache.org/developer-tools.html#github-workflow-tests

This PR disables a feature created in SPARK-3159 where LearningNodes are merged after a RF model is trained.

2 Reasons:

1. In addition to basic classification, another use case for decision trees are the probabilities associated with predictions. Once pruned, these predictions are lost and it makes the trees/predictions challenging to work with if not unusable.
2. It is not in line with the default behavior in sklearn. In sklearn, the trees are left unpruned by default.

Please see Jira ticket for more explanation.

No, it's dev-only.

I modified the two tests introduced with this change to verify positive/negative use of the feature. I also added assertions for default behavior.

Locally ran `./build/mvn -pl mllib package` and verified tests passed.

Additionally, running through git workflow as described here: https://spark.apache.org/developer-tools.html#github-workflow-tests

…are merged after a RF model is trained.

2 Reasons:

1. In addition to basic classification, another use case for decision trees are the probabilities associated with predictions. Once pruned, these predictions are lost and it makes the trees/predictions challenging to work with if not unusable.
2. It is not in line with the default behavior in sklearn. In sklearn, the trees are left unpruned by default.

Please see Jira ticket for more explanation.

No, it's dev-only.

I modified the two tests introduced with this change to verify positive/negative use of the feature. I also added assertions for default behavior.

Locally ran `./build/mvn -pl mllib package` and verified tests passed.

Locally ran `./dev/scalafmt`, which resulted in some minor cosmetic changes.

Additionally, running through git workflow as described here: https://spark.apache.org/developer-tools.html#github-workflow-tests

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Copilot

Pull request overview

Adds a new pruneTree parameter to Spark ML tree-based classification estimators to control post-training pruning (merging redundant subtrees), wiring the flag through the Scala training pipeline and exposing it in PySpark.

Changes:

- Introduce `pruneTree` as an ML param (default true) and propagate it into the underlying old-API `Strategy` used by training.
- Expose `pruneTree` in PySpark `DecisionTreeClassifier` / `RandomForestClassifier` (constructor, `setParams`, setter/getter).
- Update Scala unit tests to toggle pruning via `Strategy.pruneTree` (instead of a testing-only `prune` argument).

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| python/pyspark/ml/tree.py | Adds `pruneTree` Param + getter in shared tree classifier param mixin. |
| python/pyspark/ml/classification.py | Wires `pruneTree` through PySpark DT/RF classifier defaults, signatures, and setters. |
| mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala | Updates pruning-related tests to use `Strategy.pruneTree`. |
| mllib/src/main/scala/org/apache/spark/mllib/tree/configuration/Strategy.scala | Adds `pruneTree` to the old-API `Strategy` configuration. |
| mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala | Adds the ML `pruneTree` param to `TreeClassifierParams`. |
| mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala | Removes test-only `prune` arg; uses `strategy.pruneTree` for model materialization. |
| mllib/src/main/scala/org/apache/spark/ml/classification/RandomForestClassifier.scala | Exposes `setPruneTree` and propagates param to old `Strategy`. |
| mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala | Exposes `setPruneTree` and propagates param to old `Strategy`. |

bribiescas-carlos and others added 21 commits June 8, 2021 12:10

Merge branch 'master' into SPARK-34591

36a3527

Merge branch 'master' into SPARK-34591

b53c0e4

Merge branch 'master' into SPARK-34591

2fc33ec

Exposed pruning parameter accessible in Scala WIP

fb835db

Merge branch 'master' into SPARK-34591

a471d5e

Added to decision tree classifier and to python

43ee852

Merge branch 'master' into SPARK-34591

dcec830

Finished a TODO for comments in Strategy.scala

4bb58f6

Merge branch 'master' into SPARK-34591

ea028d4

Merge branch 'master' into SPARK-34591

f51bbae

merge master

86aa0c2

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

address conflicts

3ab5b2e

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

default pruneTree false

85f3da4

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

update test

e5a7896

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

address comments

84498d2

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

default prune true

6bb95f2

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

since 4.3.0

3a7945f

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

Merge branch 'master' into SPARK-34591

d48462a

Copilot AI review requested due to automatic review settings May 8, 2026 11:16

WeichenXu123 mentioned this pull request May 8, 2026

[SPARK-34591][ML] Add decision tree pruning as a parameter #55728

Closed

Copilot started reviewing on behalf of WeichenXu123 May 8, 2026 11:17

zhengruifeng previously approved these changes May 8, 2026

Copilot AI reviewed May 8, 2026

WeichenXu123 added 3 commits May 11, 2026 16:55

Merge branch 'master' into SPARK-34591

f38aed0

address comments

f03eba1

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

fix

a0d86fd

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 added 2 commits May 11, 2026 19:39

fix binary compatibility

10fb923

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

mima

a7088de

Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

WeichenXu123 wants to merge 26 commits into apache:master from WeichenXu123:SPARK-34591

4 participants
