
HDDS-14927. Add Quasi-Closed Container Tracking in Recon. #10198

Draft
ArafatKhan2198 wants to merge 2 commits into apache:master from ArafatKhan2198:HDDS-14927-new

Conversation

@ArafatKhan2198
Contributor

@ArafatKhan2198 ArafatKhan2198 commented May 6, 2026

What changes were proposed in this pull request?

This PR introduces a new feature in Apache Ozone Recon to track and display containers that are in the QUASI_CLOSED lifecycle state.

A container enters the QUASI_CLOSED state when it is locally closed by a DataNode (often due to a pipeline failure or interruption) but SCM has not yet finalized it as CLOSED due to a lack of quorum. While these containers may not always trigger existing unhealthy replication alerts (like missing or under-replicated), they are critical for debugging because a container stuck in QUASI_CLOSED for an extended period indicates a stalled pipeline finalization or lifecycle issue.

Previously, Recon only tracked replication health issues. This PR adds a dedicated, in-memory tracking path for QUASI_CLOSED containers, separating lifecycle state tracking from replication health tracking.

Approach used to solve the issue: To ensure minimal overhead and avoid unnecessary database writes, this feature is implemented entirely in-memory without introducing new Derby DB tables or background persistence tasks:

  1. Backend API (ContainerEndpoint.java): Added a new GET /api/v1/containers/quasiClosed endpoint. This endpoint utilizes cursor-based pagination and fetches QUASI_CLOSED containers directly from the in-memory ReconContainerManager (ContainerStateMap), which is an efficient O(limit) lookup.
  2. DTO (QuasiClosedContainersResponse.java): Created a lightweight response object that maps the in-memory ContainerInfo to UnhealthyContainerMetadata, allowing seamless integration with the existing frontend components.
  3. Frontend UI: Integrated a new "Quasi Closed" tab into the existing Unhealthy Containers page. It reuses the existing ContainerTable component, adds the quasi-closed count to the Highlights card, and overrides the column title to display "State Enter Time".
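The cursor-based pagination described in step 1 can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the `QuasiClosedPager` class and its `page` method are hypothetical, and a sorted `TreeSet` of container IDs stands in for the in-memory container state map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical stand-in for the in-memory set of QUASI_CLOSED container IDs.
class QuasiClosedPager {

  // Returns up to `limit` container IDs strictly greater than `prevKey`,
  // in ascending order. Only the returned entries are visited.
  static List<Long> page(NavigableSet<Long> ids, long prevKey, int limit) {
    List<Long> out = new ArrayList<>();
    if (limit <= 0) {
      return out;
    }
    for (Long id : ids.tailSet(prevKey, false)) {
      out.add(id);
      if (out.size() == limit) {
        break;
      }
    }
    return out;
  }

  public static void main(String[] args) {
    NavigableSet<Long> ids = new TreeSet<>(List.of(3L, 7L, 9L, 12L, 20L));
    List<Long> first = page(ids, 0L, 2);
    // The last ID of a page becomes the cursor for the next request.
    long cursor = first.get(first.size() - 1);
    List<Long> second = page(ids, cursor, 2);
    System.out.println(first + " " + second); // prints "[3, 7] [9, 12]"
  }
}
```

The client passes the last container ID it received as the next `prevKey`, so no offset has to be recomputed on the server side.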

What is the link to the Apache JIRA

How was this patch tested?

@devmadhuu devmadhuu self-requested a review May 7, 2026 07:57

@devmadhuu devmadhuu left a comment


Thanks @ArafatKhan2198 for the patch. Please see comments in line.

*/
@GET
@Path("/quasiClosed")
public Response getQuasiClosedContainers(

Can we not use the existing /unhealthy API endpoint? Why do we need a new API endpoint?

@Path("/quasiClosed")
public Response getQuasiClosedContainers(
@DefaultValue(DEFAULT_FETCH_COUNT) @QueryParam(RECON_QUERY_LIMIT) int limit,
@DefaultValue(PREV_CONTAINER_ID_DEFAULT_VALUE) @QueryParam(RECON_QUERY_PREVKEY) long prevKey) {

Also, this new API endpoint takes prevKey as its query parameter, but the frontend sends minContainerId, so the API always falls back to the default prevKey value of 0 and pagination is completely broken. Every "next page" click will return the first page again.
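To illustrate the failure mode, here is a small self-contained sketch of how a missing query parameter silently resolves to its default. The `QueryParamMismatchDemo` class and `resolvePrevKey` method are hypothetical; they only mimic the default-value fallback and are not the JAX-RS machinery used by the endpoint.

```java
import java.util.Map;

// Hypothetical demo: mimics a query parameter with a default value.
class QueryParamMismatchDemo {
  static final long DEFAULT_PREV_KEY = 0L;

  // Reads "prevKey" from the query string; any other name is ignored,
  // so the default is returned when the caller uses a different name.
  static long resolvePrevKey(Map<String, String> query) {
    String raw = query.get("prevKey");
    return raw == null ? DEFAULT_PREV_KEY : Long.parseLong(raw);
  }

  public static void main(String[] args) {
    // Frontend sends minContainerId: the cursor is lost.
    System.out.println(resolvePrevKey(Map.of("minContainerId", "42"))); // prints 0
    // Frontend sends prevKey: the cursor is honoured.
    System.out.println(resolvePrevKey(Map.of("prevKey", "42")));        // prints 42
  }
}
```

Either side can be changed to fix this, as long as the frontend and the backend agree on one parameter name.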

long lastKey = metaList.isEmpty() ? prevKey : metaList.get(metaList.size() - 1).getContainerID();
long total = containerManager.getContainerStateCount(HddsProtos.LifeCycleState.QUASI_CLOSED);

return Response.ok(new QuasiClosedContainersResponse(total, firstKey, lastKey, metaList)).build();

Here, the QuasiClosedContainersResponse object wraps the response sent to the frontend, but the frontend uses ContainersPaginationResponse, where two new fields, quasiClosedCount and totalCount, were added. I don't see a totalCount field added to the backend response object, QuasiClosedContainersResponse.

}
List<ContainerHistory> replicas = containerManager.getLatestContainerHistory(containerID, requiredNodes);

UnhealthyContainerMetadata metadata = new UnhealthyContainerMetadata(

I would recommend creating a dedicated QuasiClosedContainerMetadata with fields like stateEnterTime, replicaCount, and replicaDetails. Otherwise, reusing the existing UnhealthyContainerMetadata fields is semantically incorrect and confusing.
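A dedicated DTO along these lines might look like the sketch below. Only the field names stateEnterTime, replicaCount, and replicaDetails come from the suggestion above; the containerId field, the constructor shape, and the use of plain strings for replica details are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical DTO; replica details are plain strings here for illustration.
final class QuasiClosedContainerMetadata {
  private final long containerId;
  private final long stateEnterTime;
  private final List<String> replicaDetails;

  QuasiClosedContainerMetadata(long containerId, long stateEnterTime,
                               List<String> replicaDetails) {
    this.containerId = containerId;
    this.stateEnterTime = stateEnterTime;
    // Defensive copy so later changes to the caller's list are not visible.
    this.replicaDetails =
        Collections.unmodifiableList(new ArrayList<>(replicaDetails));
  }

  long getContainerId() {
    return containerId;
  }

  long getStateEnterTime() {
    return stateEnterTime;
  }

  // Derived from the details so the two values cannot disagree.
  int getReplicaCount() {
    return replicaDetails.size();
  }

  List<String> getReplicaDetails() {
    return replicaDetails;
  }
}
```

Naming the field stateEnterTime directly would also remove the need for the frontend to override a column title that belongs to a different concept.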

} catch (Exception e) {
LOG.warn("Could not get required nodes for container {}", containerID, e);
}
List<ContainerHistory> replicas = containerManager.getLatestContainerHistory(containerID, requiredNodes);

We are calling this method inside a stream lambda with no exception handling. If it throws a runtime exception, the entire response will fail. It would be better to extract the whole logic into a helper method (like toQuasiClosedMetadata) with proper try/catch and WebApplicationException wrapping. See how it is done below:

private UnhealthyContainerMetadata toUnhealthyMetadata(
    ContainerHealthSchemaManager.UnhealthyContainerRecord record) {
  try {
    ...
  } catch (IOException ioEx) {
    throw new UncheckedIOException(ioEx);
  }
}
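Applied to the quasi-closed path, the same pattern could look roughly like the sketch below. It is self-contained and therefore uses stand-ins: the `QuasiClosedMapper` and `Meta` types and the `replicaLookup` function are hypothetical, and `IllegalStateException` takes the place of the WebApplicationException that the real endpoint would throw.

```java
import java.util.List;
import java.util.function.LongFunction;
import java.util.stream.Collectors;

// Hypothetical helper showing per-container exception handling.
class QuasiClosedMapper {

  // Minimal stand-in for the metadata returned to the frontend.
  static final class Meta {
    final long containerId;
    final int replicaCount;

    Meta(long containerId, int replicaCount) {
      this.containerId = containerId;
      this.replicaCount = replicaCount;
    }
  }

  // Wraps the replica lookup so a failure carries the container ID.
  static Meta toQuasiClosedMetadata(long containerId,
                                    LongFunction<List<String>> replicaLookup) {
    try {
      List<String> replicas = replicaLookup.apply(containerId);
      return new Meta(containerId, replicas.size());
    } catch (RuntimeException e) {
      // Stand-in for WebApplicationException in the real endpoint.
      throw new IllegalStateException(
          "Failed to build quasi-closed metadata for container " + containerId, e);
    }
  }

  // The stream lambda stays a one-liner; error handling lives in the helper.
  static List<Meta> mapAll(List<Long> ids,
                           LongFunction<List<String>> replicaLookup) {
    return ids.stream()
        .map(id -> toQuasiClosedMetadata(id, replicaLookup))
        .collect(Collectors.toList());
  }
}
```

The request still fails when a lookup throws, but the error now identifies the container and can be mapped to a well-defined HTTP response instead of an unhandled exception.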

}

// Also include the quasi-closed count in the summary for the frontend Highlights tab
long quasiClosedCount = containerManager.getContainerStateCount(HddsProtos.LifeCycleState.QUASI_CLOSED);

This is not a good place to inject the count: all the calls above fetch data from the DB, whereas this value comes from in-memory data. If a user only wants to see QUASI_CLOSED containers, we still make the DB calls and retrieve all the other unhealthy containers.
