Add spark_application_submit_latency_seconds metric to measure operator submission performance #2911

@venkomirisetti

What feature you would like to be added?

Add a new Prometheus metric spark_application_submit_latency_seconds to track SparkApplication submission latency, measuring the time from application creation to the submitted state. This provides visibility into operator submission performance and helps identify bottlenecks in the job submission pipeline.

Why is this needed?

The existing spark_application_start_latency_seconds metric includes delays outside the operator's control (Kubernetes scheduling, resource availability, YuniKorn queues, image pulls, pod initialization).

Problem: When start latency is high, we can't tell if the operator is slow or if cluster resources are constrained.

Solution: By comparing both metrics:

  • submit_latency = 2s, start_latency = 5min → Infrastructure issue (scale cluster)
  • submit_latency = 4min, start_latency = 5min → Operator issue (tune operator)

This enables:

  • Accurate operator SLA monitoring (separate from infrastructure)
  • Root cause analysis (operator vs. K8s vs. queue saturation)
  • Better capacity planning
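
To make the comparison above concrete, here is a minimal sketch of the triage logic in Python. It is illustrative only: the `LatencySample` type, the `diagnose` function, and the 0.5 threshold are hypothetical and not part of the operator. It assumes both latencies are measured from application creation, so the submit latency is a portion of the start latency.

```python
from dataclasses import dataclass


@dataclass
class LatencySample:
    """One application's latencies in seconds (hypothetical helper type)."""
    submit_latency_s: float  # creation -> submitted (proposed metric)
    start_latency_s: float   # creation -> running (existing metric)


def diagnose(sample: LatencySample, operator_share_threshold: float = 0.5) -> str:
    """Attribute a slow start to the operator or to the infrastructure.

    The threshold is an assumption for illustration: if the operator's
    share of the total start latency reaches it, blame the operator.
    """
    if sample.start_latency_s <= 0:
        return "unknown"
    operator_share = sample.submit_latency_s / sample.start_latency_s
    if operator_share >= operator_share_threshold:
        return "operator"
    return "infrastructure"


# The two scenarios from the list above.
print(diagnose(LatencySample(submit_latency_s=2, start_latency_s=300)))
print(diagnose(LatencySample(submit_latency_s=240, start_latency_s=300)))
```

In practice the same comparison would more likely be written as an alerting or dashboard query over the two metrics; the sketch only shows the decision being made.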

Describe the solution you would like

Add spark_application_submit_latency_seconds metric:

  • Measures: time from SparkApplication creation to the Submitted state (operator work only)
  • Includes: Summary (percentiles) + Histogram (distribution)
  • Buckets: [0.5, 1, 2, 4, 8, 16, 32, 64, 128, 256] seconds (exponential with factor 2; typical range 0.5-8s)
  • Flag: --metrics-job-submit-latency-buckets (makes the histogram buckets configurable)
  • Recorded on the first submission only (SubmissionAttempts == 1)
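
The bucket layout and the first-submission rule above can be sketched as follows. This is a stdlib-only Python illustration, not the operator's implementation (the operator is written in Go, where the Prometheus client's `ExponentialBuckets(start, factor, count)` helper would produce the same boundaries). The names `SubmitLatencyHistogram` and `record_submit_latency` are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime, timedelta, timezone


def exponential_buckets(start: float, factor: float, count: int) -> list:
    """Return `count` upper bounds, each `factor` times the previous one."""
    return [start * factor ** i for i in range(count)]


# Proposed default: 0.5s doubling ten times, up to 256s.
SUBMIT_LATENCY_BUCKETS = exponential_buckets(0.5, 2, 10)


class SubmitLatencyHistogram:
    """Tiny stand-in for a Prometheus histogram (hypothetical).

    Counts are kept per bucket here; Prometheus exposes them cumulatively.
    """

    def __init__(self, buckets):
        self.buckets = sorted(buckets)
        self.counts = [0] * (len(self.buckets) + 1)  # last slot is +Inf
        self.total = 0.0

    def observe(self, seconds: float) -> None:
        # bisect_left finds the first upper bound >= seconds ("le" semantics).
        self.counts[bisect_left(self.buckets, seconds)] += 1
        self.total += seconds


def record_submit_latency(hist, created_at, submitted_at, submission_attempts):
    """Observe creation -> submitted latency on the first attempt only."""
    if submission_attempts != 1:
        return False  # resubmissions are not recorded
    hist.observe((submitted_at - created_at).total_seconds())
    return True


hist = SubmitLatencyHistogram(SUBMIT_LATENCY_BUCKETS)
created = datetime(2024, 1, 1, tzinfo=timezone.utc)
record_submit_latency(hist, created, created + timedelta(seconds=3), 1)
record_submit_latency(hist, created, created + timedelta(seconds=90), 2)  # skipped
print(SUBMIT_LATENCY_BUCKETS)
print(hist.counts)
```

Here the 3-second observation lands in the bucket with upper bound 4, and the second call is ignored because it is not the first submission attempt.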

Describe alternatives you have considered

  • Use existing metrics only → Can't isolate operator performance
  • Parse operator logs for timestamps → Not suitable for dashboards/alerts

Additional context

No response

Love this feature?

Give it a 👍 We prioritize the features with most 👍
