Skip to content

Replace usage of Fatale with context cancellation#1639

Open
ggilder wants to merge 14 commits intogithub:masterfrom
ggilder:remove-fatale-usage
Open

Replace usage of Fatale with context cancellation#1639
ggilder wants to merge 14 commits intogithub:masterfrom
ggilder:remove-fatale-usage

Conversation

@ggilder
Copy link
Contributor

@ggilder ggilder commented Mar 5, 2026

Related issue: #1635

Description

This PR replaces log.Fatale error handling in listenOnPanicAbort() with context-based cancellation and proper error propagation, enabling gh-ost to be used as a Go library while maintaining the same CLI behavior.

The previous implementation called Fatale in a separate goroutine when fatal errors occurred. This had several issues:

  • Could not be recovered by the caller
  • Library users had no way to handle errors gracefully
  • Process would crash instead of returning errors
  • Violates Go best practices ("libraries should return errors, not panic")

I've implemented a context-based cancellation system with proper error propagation:

  1. Error Storage: Added AbortError field to MigrationContext with thread-safe accessor methods that implement "first error wins" semantics
  2. Context Cancellation: Added context.Context to MigrationContext for coordinated shutdown signaling across all goroutines
  3. Graceful Shutdown: Updated listenOnPanicAbort() to store errors and cancel context instead of exiting the process
  4. Abort Checks: Added checkAbort() helper that checks for stored errors at key points in Migrate() and Revert()
  5. Goroutine Integration: Updated all long-running goroutines (heartbeat, streaming, throttling, row iteration) to respect context cancellation

This also makes abort scenarios more testable.

  • contributed code is using same conventions as original code
  • script/cibuild returns with no formatting errors, build errors or unit test errors.

This patch modifies gh-ost to use a cancellable context instead of
log.Fatale() in listenOnPanicAbort. When using gh-ost as a library, this
allows the calling application to recover from aborts (e.g. log the
failure reason) instead of having the entire process terminate via
os.Exit(). Now we store the error and cancel a context to signal all
goroutines to stop gracefully.
@ggilder ggilder force-pushed the remove-fatale-usage branch from c35276d to 81ec000 Compare March 5, 2026 22:56
@ggilder ggilder marked this pull request as ready for review March 5, 2026 22:58
Copilot AI review requested due to automatic review settings March 5, 2026 22:58
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces abort handling that previously exited the process (log.Fatale) with context cancellation + “first error wins” error propagation, improving gh-ost’s usability as a library while preserving CLI behavior.

Changes:

  • Added context.Context + abort error storage to MigrationContext, and cancelled context on abort.
  • Updated long-running loops and abort send sites to respect context cancellation.
  • Added unit tests for abort propagation and first-error-wins behavior.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
go/base/context.go Adds cancellable context + abort error storage APIs to MigrationContext.
go/base/context_test.go Tests for first-error-wins abort error storage and thread-safety.
go/logic/migrator.go Replaces Fatale with error storage + context cancellation; adds checkAbort() and integrates it into migration flow.
go/logic/migrator_test.go New tests validating abort propagation and checkAbort() behavior.
go/logic/applier.go Heartbeat loop now exits on context cancellation; abort send is context-aware.
go/logic/throttler.go Throttle checks now exit on context cancellation; abort sends are context-aware.
go/logic/streamer.go Streaming loop now checks for context cancellation.
go/logic/server.go “panic” command abort send is context-aware.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

@meiji163
Copy link
Contributor

meiji163 commented Mar 6, 2026

This is great!

meiji163
meiji163 previously approved these changes Mar 6, 2026
@meiji163
Copy link
Contributor

meiji163 commented Mar 6, 2026

@GGlider Looks like the migration tests are hanging for some reason. I'll take a look later if I get the chance

@ggilder
Copy link
Contributor Author

ggilder commented Mar 7, 2026

@meiji163 hmm, interesting. I did see some tests hanging locally before, and I thought I had resolved that. However I just tried running locally and it's hanging at the same spot. I'll take another look on Monday.

@ggilder
Copy link
Contributor Author

ggilder commented Mar 9, 2026

@meiji163 think I found the spot that was deadlocking, and I added a helper to encapsulate the non-blocking channel send pattern. I believe the replica tests should pass now

@ggilder
Copy link
Contributor Author

ggilder commented Mar 9, 2026

Looks like it's still hanging somewhere 😢 I'll keep testing, maybe I can run the tests in a container to more closely approximate CI.

@ggilder
Copy link
Contributor Author

ggilder commented Mar 10, 2026

@meiji163 could you approve the workflow run again? I think this latest commit will fix it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants