feat: Telemetry resiliency - prevent host app crashes, corruption handling, bounded retry, multi-process support#22
Conversation
- Add internal constants for TTL values (30 days for .trn, 7 days for .corrupt, 2 hours max retry) - Catch all exceptions in Delete (not just IOException) to prevent UnauthorizedAccessException crashes - Guard GetSize/CalculateSize against IO exceptions - Wrap FlushManager.Flush with exception handling to prevent serialization exceptions from escaping - Add inner exception handling in SendLoop to ensure it never terminates unexpectedly - Implement corrupted file quarantine: rename .trn to .corrupt on load failure - Add TTL-based cleanup for .trn files (30 days) and .corrupt files (7 days) - Implement bounded retry policy: drop transmissions after 2 hours of retries - Treat unknown exceptions as retryable (with bounded retry) instead of dropping immediately Co-authored-by: carldebilly <4174207+carldebilly@users.noreply.github.com>
|
|
- Add InternalsVisibleTo for test project (net8.0/net9.0 only, not netstandard2.0) - Add Microsoft.Extensions.TimeProvider.Testing package - Test corrupted .trn file handling (rename to .corrupt) - Test TTL cleanup for .trn files (30 days) - Test TTL cleanup for .corrupt files (7 days) - Test exception handling for Delete operations (cross-platform) - Test FlushManager exception handling - Test GetSize/CalculateSize exception resilience - Test bounded retry (2 hours) for old transmissions - Test SendLoop continues despite exceptions - Test mixed file type cleanup (tmp/trn/corrupt with different ages) All 13 tests passing on net8.0 Co-authored-by: carldebilly <4174207+carldebilly@users.noreply.github.com>
- Remove Microsoft.Extensions.TimeProvider.Testing package (not used) - Add documentation explaining Thread.Sleep usage in integration tests - Note future improvement opportunity for TimeProvider injection Co-authored-by: carldebilly <4174207+carldebilly@users.noreply.github.com>
- Completely remove Microsoft.Extensions.TimeProvider.Testing package reference - Package was added but not used in final implementation Co-authored-by: carldebilly <4174207+carldebilly@users.noreply.github.com>
|
@copilot There is a reported case where the access denied error were reported on a .Delete() call when the file was not existent on disk (maybe because the delete was asked before it has been actually created... is it possible ?). Please check how this is possible and ensure we're addressing it. |
- Add File.Exists check in Delete() method before attempting file operations - Prevents UnauthorizedAccessException when file doesn't exist (race condition) - Add File.Exists checks in DeleteObsoleteFiles for tmp/trn/corrupt cleanup - Add test for non-existent file deletion scenario - Addresses case where Delete is called before file is created or after already deleted Fixes issue where UnauthorizedAccessException was thrown on Delete when file didn't exist, which can happen in race conditions between file creation/deletion. Co-authored-by: carldebilly <4174207+carldebilly@users.noreply.github.com>
Fixed in commit b9daaa0. The issue was that
Changes made:
The code now gracefully skips deletion when files don't exist, preventing the exception while maintaining proper exception handling for other IO errors. |
|
@copilot There could be multiple instances of the same executable running, so access to those files could be done concurrently. Please make sure it's not a problem. |
There was a problem hiding this comment.
Pull request overview
This PR implements telemetry resiliency improvements to prevent host application crashes and ensure robust handling of telemetry operations. The changes add exception safety, corruption handling with quarantine, bounded retry logic, and TTL-based cleanup for old files.
Key Changes
- Exception handling expanded to catch all exceptions (not just IOException) in critical telemetry operations
- Corrupted
.trnfiles are quarantined by renaming to.corruptwith 7-day TTL cleanup - Bounded retry policy: transmissions older than 2 hours are dropped to prevent indefinite retries
- TTL cleanup: 30 days for
.trnfiles, 7 days for.corruptfiles, 5 minutes for.tmpfiles - Race condition mitigation with
File.Exists()checks before file operations
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/Uno.DevTools.Telemetry/Uno.DevTools.Telemetry.csproj | Adds InternalsVisibleTo for test assembly to access internal constants and methods |
| src/Uno.DevTools.Telemetry/PersistenceChannel/StorageService.cs | Implements TTL constants, comprehensive exception handling, file existence checks, corruption quarantine, and expanded cleanup logic |
| src/Uno.DevTools.Telemetry/PersistenceChannel/Sender.cs | Adds bounded retry with 2-hour maximum, transmission age tracking, unknown exception retry logic, and SendLoop exception safety |
| src/Uno.DevTools.Telemetry/PersistenceChannel/FlushManager.cs | Wraps Flush operation in try-catch to prevent serialization exceptions from escaping |
| src/Uno.DevTools.Telemetry.Tests/PersistenceChannelResiliencyTests.cs | Adds 14 integration tests covering corruption handling, TTL cleanup, exception resilience, bounded retry, and race condition scenarios |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Add retry logic (3 attempts, 50ms delay) for Delete operations to handle transient file locks - Update File.Open calls with proper FileShare modes for concurrent access: - SaveTransmissionToFileAsync: FileShare.Read (allows other processes to read while writing) - LoadTransmissionFromFileAsync: FileShare.ReadWrite | FileShare.Delete (full concurrent access) - GetSize: FileShare.ReadWrite | FileShare.Delete (full concurrent access) - Add DeleteFileWithRetry helper method to handle IOException and UnauthorizedAccessException - Update file headers with change log entries (carldebilly/copilot) This ensures the design intent of supporting multiple processes accessing the same storage folder works correctly even under concurrent access scenarios. Co-authored-by: carldebilly <4174207+carldebilly@users.noreply.github.com>
Fixed in commit 2a00a28. The code now properly handles concurrent access from multiple process instances: Changes made:
This aligns with the PersistenceChannel design which explicitly states (line 35-37 in PersistenceChannel.cs) that multiple processes can share the same storage folder. The FileShare modes and retry logic ensure operations work correctly even when files are being accessed by multiple instances simultaneously. |
|
@copilot Please address PR reviews. |
- GetTransmissionAge: Return TimeSpan.MaxValue instead of Zero for unknown age (sentinel value for expired) - RenameToCorrupted: Delete existing .corrupt file before Move to handle concurrent calls - Delete: Move GetSize call before File.Exists check to avoid TOCTOU race condition - Test: Remove unused remainingFiles variable These fixes ensure: 1. Transmissions with unknown age are treated as expired and dropped 2. Multiple concurrent RenameToCorrupted calls don't cause IOException 3. File size tracking remains accurate even with TOCTOU race conditions 4. No unused variables in test code Co-authored-by: carldebilly <4174207+carldebilly@users.noreply.github.com>
Fixed in commit ae7e1a9. Changes made:
All 14 tests passing. |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 14 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@copilot New bunch of reviews against yourself ;-) They seems pertinent. But those about delay in tests are less important since the tests are passing well on the CI. This PR introduce a lot of new code. If you can refactor a little bit, reuse, create helpers, make the code more elegant... it would be cool. Even for tests. |
- Fix GetFiles to return only filenames using Path.GetFileName - Fix DeleteFileWithRetry to properly throw exception after all retries exhausted - Update GetFiles documentation to clarify return value This fixes a critical bug where Directory.GetFiles returns full paths but the code was combining them with StorageFolder again, creating invalid paths like: 'StorageFolder\C:\FullPath\file.tmp' which caused all cleanup operations to fail silently. The retry logic now properly throws the last exception after all retries are exhausted instead of swallowing it. Co-authored-by: carldebilly <4174207+carldebilly@users.noreply.github.com>
…bility - Replace Thread.Sleep with immediate retry in DeleteFileWithRetry - Thread.Sleep is not supported in WebAssembly context - Immediate retry is actually more efficient and works on all platforms - Thread.Sleep can remain in tests (not production code) This ensures the telemetry library can run in WebAssembly context where Thread.Sleep throws PlatformNotSupportedException. Immediate retries are sufficient for typical file lock scenarios and provide better performance. Co-authored-by: carldebilly <4174207+carldebilly@users.noreply.github.com>
Addressed in commits 2d18314 and 4319289. Critical fixes:
Why immediate retry? All 14 tests passing. The most critical issues (file path bug and Wasm compatibility) are now fixed. |
- Use exception filter 'when' clause instead of capturing and rethrowing - Exceptions naturally propagate on last retry attempt - Cleaner code: no need for lastException variable or explicit throw - Same behavior: retries on IOException/UnauthorizedAccessException, throws on last attempt This follows the best practice of letting exceptions propagate naturally rather than capturing and explicitly rethrowing them. Co-authored-by: carldebilly <4174207+carldebilly@users.noreply.github.com>
…nup, and refactor helpers - Replaced `Thread.Sleep` with a polling mechanism in tests to improve determinism and reduce flakiness. - Improved file cleanup by consolidating logic into helper methods (e.g., `WaitUntilAsync`). - Refactored file handling in `StorageService` to streamline deletion and improve exception safety. - Introduced helper classes (`ThrowingTelemetry`, `ThrowingPeekStorageService`) for more robust test cases. - Reduced redundancy by reusing logic for transmission creation and file operations.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated 9 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…onAge, improve file age handling - Replace `GetTransmissionAge` method with `TryGetTransmissionAge` for error handling and clarity. - Add fallback logic to determine file age using timestamps from file names. - Improve logging for cases where transmission age cannot be determined. - Remove unused `UnknownTransmissionAge` constant to clean up code.
Telemetry Resiliency: Prevent App Impact, Corruption Handling, Bounded Retry
Summary
This PR implements comprehensive resiliency improvements to prevent telemetry operations from crashing host applications. All changes follow the principle that telemetry must never impact the host app.
Critical Bug Fixes
Directory.GetFilesreturns full paths but code was treating them as filenamesKey Changes
1. Safety / Non-Impact (Priority 1)
IOExceptiononly to all exceptions2. Corruption Handling (Priority 2)
3. Cleanup + Disk Safety (Priority 2)
TransmissionFileTtl(30d),CorruptedFileTtl(7d),MaxRetryDuration(2h)4. Bounded Retry Policy (Priority 3)
GetTransmissionAgehelper tracks transmission age5. Concurrent Access Handling (Multi-Process Support)
6. Tests and Observability
Task.Delay(noThread.Sleep)Design Decisions
Why immediate retry instead of delayed retry?
Thread.Sleep is not supported in WebAssembly (throws PlatformNotSupportedException). Immediate retries are actually more efficient and sufficient for typical file lock scenarios. The retry logic handles transient failures from concurrent access without blocking the thread.
Why GetFiles returns filenames only?
Critical bug fix: Directory.GetFiles returns full paths, but the code was combining them with StorageFolder again, creating invalid paths that caused all cleanup operations to fail silently.
Why exception filters instead of capture/rethrow?
Cleaner code pattern that lets exceptions propagate naturally on the last attempt rather than explicitly capturing and rethrowing them.
Testing
14 integration tests (all passing):
Files Changed
PersistenceChannel/StorageService.cs: Critical bug fixes, Wasm compatibility, TTL constants, exception guards, retry logic, FileShare modesPersistenceChannel/Sender.cs: Bounded retry, unknown exception retry, SendLoop resiliencePersistenceChannel/FlushManager.cs: Exception handling for serializationUno.DevTools.Telemetry.csproj: InternalsVisibleTo for testsPersistenceChannelResiliencyTests.cs: 14 integration testsOriginal prompt
💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.