What feature you would like to be added?
Add a new Prometheus metric, spark_application_submit_latency_seconds, to track SparkApplication submission latency: the time from application creation to the Submitted state. This gives visibility into the operator's submission performance and helps identify bottlenecks in the job submission pipeline.
Why is this needed?
The existing spark_application_start_latency_seconds metric includes external factors (K8s scheduling, resource availability, Yunikorn queues, image pulls, pod initialization).
Problem: When start latency is high, we can't tell if the operator is slow or if cluster resources are constrained.
Solution: Compare both metrics; the gap between them is time spent outside the operator:
- submit_latency = 2s, start_latency = 5min → Infrastructure issue (scale the cluster)
- submit_latency = 4min, start_latency = 5min → Operator issue (tune the operator)
This enables:
- Accurate operator SLA monitoring (separate from infrastructure)
- Root cause analysis (operator vs. K8s vs. queue saturation)
- Better capacity planning
Describe the solution you would like
Add spark_application_submit_latency_seconds metric:
- Measures: Creation → Submitted state (operator work only)
- Includes: Summary (percentiles) + Histogram (distribution)
- Buckets: [0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256] seconds (exponential; typical range 0.5-8s)
- Flag: --metrics-job-submit-latency-buckets (configurable)
- Records on first submission only (SubmissionAttempts == 1)
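Below is a minimal sketch of how this metric pair could be wired up with prometheus/client_golang. The variable names, the namespace label, the _histogram name suffix, and the RecordSubmitLatency hook are illustrative assumptions, not the operator's actual internals:

```go
// Sketch only: metric/variable names, the namespace label, and the
// *_histogram suffix are assumptions for illustration.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	// Summary: cheap client-side percentiles.
	submitLatency = prometheus.NewSummaryVec(
		prometheus.SummaryOpts{
			Name:       "spark_application_submit_latency_seconds",
			Help:       "Time from SparkApplication creation to the Submitted state.",
			Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
		},
		[]string{"namespace"},
	)

	// Histogram: supports server-side aggregation via histogram_quantile().
	// A distinct name avoids a registration collision with the summary above;
	// the default buckets would be overridable by
	// --metrics-job-submit-latency-buckets.
	submitLatencyHist = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "spark_application_submit_latency_seconds_histogram",
			Help:    "Distribution of time from SparkApplication creation to Submitted.",
			Buckets: []float64{0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256},
		},
		[]string{"namespace"},
	)
)

func init() {
	prometheus.MustRegister(submitLatency, submitLatencyHist)
}

// RecordSubmitLatency would be called on the transition to the Submitted
// state. Only the first attempt is recorded, so retries of failed
// submissions do not skew the metric.
func RecordSubmitLatency(namespace string, created time.Time, submissionAttempts int32) {
	if submissionAttempts != 1 {
		return
	}
	seconds := time.Since(created).Seconds()
	submitLatency.WithLabelValues(namespace).Observe(seconds)
	submitLatencyHist.WithLabelValues(namespace).Observe(seconds)
}
```

With both series exported, a dashboard or alert can subtract submit latency from start latency to chart the infrastructure share of the delay directly.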
Describe alternatives you have considered
- Use existing metrics only → Can't isolate operator performance
- Parse operator logs for timestamps → Not suitable for dashboards/alerts
Additional context
No response
Love this feature?
Give it a 👍 We prioritize the features with the most 👍