HDDS-14927. Add Quasi-Closed Container Tracking in Recon. #10198
ArafatKhan2198 wants to merge 2 commits into apache:master
devmadhuu left a comment
Thanks @ArafatKhan2198 for the patch. Please see the inline comments.
     */
    @GET
    @Path("/quasiClosed")
    public Response getQuasiClosedContainers(
Can we not use the existing /unhealthy API endpoint? Why do we need a new API endpoint?
    @Path("/quasiClosed")
    public Response getQuasiClosedContainers(
        @DefaultValue(DEFAULT_FETCH_COUNT) @QueryParam(RECON_QUERY_LIMIT) int limit,
        @DefaultValue(PREV_CONTAINER_ID_DEFAULT_VALUE) @QueryParam(RECON_QUERY_PREVKEY) long prevKey) {
This new endpoint also takes prevKey as its query parameter, but the frontend sends minContainerId, so the API always falls back to the default prevKey value of 0 and pagination is completely broken: every "next page" click returns the first page again.
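A minimal sketch of the mismatch, written without JAX-RS so it runs on its own. The class name, the in-memory TreeMap, and the getPage helper are hypothetical stand-ins, not the Recon code; the only things taken from this thread are the parameter names prevKey and minContainerId and the default value of 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the endpoint; not the actual Recon implementation.
final class PrevKeyMismatchSketch {
  // Simulated quasi-closed containers, keyed and sorted by container ID.
  private static final TreeMap<Long, String> CONTAINERS = new TreeMap<>();

  static {
    for (long id = 1; id <= 6; id++) {
      CONTAINERS.put(id, "container-" + id);
    }
  }

  // Mimics @DefaultValue("0") @QueryParam("prevKey"):
  // a missing parameter silently becomes 0 instead of failing.
  static List<Long> getPage(Map<String, String> queryParams, int limit) {
    long prevKey = Long.parseLong(queryParams.getOrDefault("prevKey", "0"));
    List<Long> page = new ArrayList<>();
    // Cursor-based paging: start strictly after prevKey and stop at limit.
    for (Long id : CONTAINERS.tailMap(prevKey, false).keySet()) {
      if (page.size() == limit) {
        break;
      }
      page.add(id);
    }
    return page;
  }

  public static void main(String[] args) {
    // Frontend sends minContainerId: the cursor is ignored, first page again.
    System.out.println(getPage(Map.of("minContainerId", "3"), 3));
    // Frontend sends prevKey: the cursor is honoured, second page.
    System.out.println(getPage(Map.of("prevKey", "3"), 3));
  }
}
```

Either side can be changed to fix this, as long as the frontend and the endpoint agree on one parameter name.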
    long lastKey = metaList.isEmpty() ? prevKey : metaList.get(metaList.size() - 1).getContainerID();
    long total = containerManager.getContainerStateCount(HddsProtos.LifeCycleState.QUASI_CLOSED);

    return Response.ok(new QuasiClosedContainersResponse(total, firstKey, lastKey, metaList)).build();
Here, QuasiClosedContainersResponse wraps the response sent to the frontend, but the frontend uses ContainersPaginationResponse, which gained two new fields, quasiClosedCount and totalCount. I don't see a totalCount field added to the backend response object, QuasiClosedContainersResponse.
    }
    List<ContainerHistory> replicas = containerManager.getLatestContainerHistory(containerID, requiredNodes);

    UnhealthyContainerMetadata metadata = new UnhealthyContainerMetadata(
I would recommend creating a dedicated QuasiClosedContainerMetadata with fields like stateEnterTime, replicaCount, and replicaDetails. Reusing the existing UnhealthyContainerMetadata fields is not semantically correct and is confusing.
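One possible shape for such a class, as a sketch only. The field names come from the suggestion above; the types are assumptions (a plain String stands in for whatever replica type the real code would carry, and stateEnterTime is assumed to be epoch milliseconds).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical DTO sketch; field names follow the review comment, types are assumed.
final class QuasiClosedContainerMetadata {
  private final long containerID;
  // Time at which the container entered QUASI_CLOSED (assumed epoch millis).
  private final long stateEnterTime;
  // One entry per known replica; String is a placeholder for the real replica type.
  private final List<String> replicaDetails;

  QuasiClosedContainerMetadata(long containerID, long stateEnterTime,
      List<String> replicaDetails) {
    this.containerID = containerID;
    this.stateEnterTime = stateEnterTime;
    // Defensive copy so later changes to the caller's list do not leak in.
    this.replicaDetails =
        Collections.unmodifiableList(new ArrayList<>(replicaDetails));
  }

  long getContainerID() {
    return containerID;
  }

  long getStateEnterTime() {
    return stateEnterTime;
  }

  // Derived from the details list, so count and details cannot disagree.
  int getReplicaCount() {
    return replicaDetails.size();
  }

  List<String> getReplicaDetails() {
    return replicaDetails;
  }
}
```

With a dedicated type, the frontend would no longer need to override a column title to reinterpret an unrelated field.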
    } catch (Exception e) {
      LOG.warn("Could not get required nodes for container {}", containerID, e);
    }
    List<ContainerHistory> replicas = containerManager.getLatestContainerHistory(containerID, requiredNodes);
This method is called inside a stream lambda with no exception handling. If it throws a runtime exception, the entire response will fail. Better to extract the whole logic into a helper method (like toQuasiClosedMetadata) with proper try/catch and WebApplicationException wrapping. See how the existing helper below does it:
    private UnhealthyContainerMetadata toUnhealthyMetadata(
        ContainerHealthSchemaManager.UnhealthyContainerRecord record) {
      try {
        ...
      } catch (IOException ioEx) {
        throw new UncheckedIOException(ioEx);
      }
    }
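A self-contained sketch of the same pattern applied to the quasi-closed path. Everything here is hypothetical: loadReplicas stands in for the replica lookup, String stands in for the metadata type, and the WebApplicationException wrapping is left out so the example has no JAX-RS dependency. The point is only that the lambda body shrinks to a method reference and the failure handling lives in one place.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical sketch of the helper-method pattern; not the Recon code.
final class StreamHelperSketch {
  // Stand-in for a lookup that can fail with a checked exception.
  static String loadReplicas(long containerId) throws IOException {
    if (containerId < 0) {
      throw new IOException("unknown container " + containerId);
    }
    return "replicas-of-" + containerId;
  }

  // All conversion logic and its error handling live here,
  // so the stream pipeline itself stays free of try/catch.
  static String toQuasiClosedMetadata(long containerId) {
    try {
      return containerId + ":" + loadReplicas(containerId);
    } catch (IOException ioEx) {
      // Wrap with context; an endpoint could translate this into an HTTP error.
      throw new UncheckedIOException(
          "Failed to build metadata for container " + containerId, ioEx);
    }
  }

  static List<String> toMetadataList(List<Long> containerIds) {
    return containerIds.stream()
        .map(StreamHelperSketch::toQuasiClosedMetadata)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    System.out.println(toMetadataList(List.of(1L, 2L)));
  }
}
```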
    }

    // Also include the quasi-closed count in the summary for the frontend Highlights tab
    long quasiClosedCount = containerManager.getContainerStateCount(HddsProtos.LifeCycleState.QUASI_CLOSED);
Injecting this here is not a good idea. All the calls above fetch their data from the DB, whereas this count comes from in-memory state. If a user only wants to see QUASI_CLOSED containers, we still make the DB calls and retrieve all the other unhealthy containers as well.
What changes were proposed in this pull request?
This PR introduces a new feature in Apache Ozone Recon to track and display containers that are in the QUASI_CLOSED lifecycle state.

A container enters the QUASI_CLOSED state when it is locally closed by a DataNode (often due to a pipeline failure or interruption) but SCM has not yet finalized it as CLOSED due to a lack of quorum. While these containers may not always trigger existing unhealthy replication alerts (like missing or under-replicated), they are critical for debugging because a container stuck in QUASI_CLOSED for an extended period indicates a stalled pipeline finalization or lifecycle issue.

Previously, Recon only tracked replication health issues. This PR adds a dedicated, in-memory tracking path for QUASI_CLOSED containers, separating lifecycle state tracking from replication health tracking.

Approach used to solve the issue: To ensure minimal overhead and avoid unnecessary database writes, this feature is implemented entirely in-memory without introducing new Derby DB tables or background persistence tasks:
- (ContainerEndpoint.java): Added a new GET /api/v1/containers/quasiClosed endpoint. This endpoint utilizes cursor-based pagination and fetches QUASI_CLOSED containers directly from the in-memory ReconContainerManager (ContainerStateMap), which is an efficient O(limit) lookup.
- (QuasiClosedContainersResponse.java): Created a lightweight response object that maps the in-memory ContainerInfo to UnhealthyContainerMetadata, allowing seamless integration with the existing frontend components.
- ... ContainerTable component, adds the quasi-closed count to the Highlights card, and overrides the column title to display "State Enter Time".

What is the link to the Apache JIRA
How was this patch tested?