Refactor notification-sending logic and add asynchronous notification delivery

Description

Background / problem

Notifications (approval call-for-action, reminders, reassignment, watcher, expiration; across email, Slack, Jira native, and Confluence comment-mention) were sent synchronously, fire-and-forget, inside the approval transaction. That meant:

  • external send latency and flakiness were on the critical path of the approval action;

  • a failed/slow provider could block or slow the transaction;

  • no retry and no visibility — if delivery failed, it was silently lost;

  • delivery logic was spread across many ad-hoc senders/dispatchers/orchestrators per product and channel, hard to reason about and extend.

What changed

The synchronous send path is replaced with an asynchronous, durable delivery pipeline on the transactional-outbox pattern, and the whole notification subsystem is restructured behind it. As part of that restructure, the per-product Slack stack is unified into a single channel-agnostic chat layer, and the code is prepared for MS Teams as a drop-in second chat channel — new beans only, no orchestration edits (see §2, §6).

1. Asynchronous delivery via a transactional outbox

  • Content is rendered on the request thread (so it reflects the approval state at decision time) and the fully-rendered rows are written to the notification_outbox table in the same transaction as the approval action; nothing external happens on the critical path.

  • A background NotificationDispatchWorker claims due PENDING rows in fair, per-tenant batches (FOR UPDATE SKIP LOCKED) and performs the external send off the request thread.

  • New DB schema (Liquibase, jOOQ-generated): notification_outbox + Postgres enums:

    • outbox_status (PENDING, CLAIMED, SENT, FAILED, SENT_WITH_FALLBACK, SUPERSEDED)

    • notification_mechanism_type (DIRECT_EMAIL, JIRA_NOTIFICATION, CONFLUENCE_COMMENT_MENTION, SLACK, EXTERNAL_APPROVER_EMAIL)

    • notification_recipient_type (USER, GROUP, EXTERNAL_EMAIL, ISSUE_WATCHERS)

    • notification_template_type (22 values, mirrors EmailTemplateType)

    • atlassian_product (JIRA, CONFLUENCE)

  • The migration also amends other tables: webhook_call_log (pre-call outbox), instance_error (approval/step binding + storm batching), approval_step (error_message), approval (last_reminded_at).

  • The full NotificationMessage is serialized into a payload jsonb (source of truth on read); queryable columns (product, mechanism, template_type, recipient_type + recipient ids, dispatch_id/dispatch_group_id, host/approval/step, critical_for_step, fallback_chain) are denormalized so the claim/query path and the recipient CHECK constraint work without parsing the payload.
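The claim step above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: it reads "fair, per-tenant batches" as round-robin across tenants, and OutboxClaimSketch, Row and claimFair are hypothetical names, not the DAO in this branch. The real claim runs as SQL with FOR UPDATE SKIP LOCKED; the model only shows which rows one batch would pick.

```java
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory model of a fair, per-tenant outbox claim (not the real DAO). */
final class OutboxClaimSketch {

    enum Status { PENDING, CLAIMED }

    /** Minimal stand-in for a notification_outbox row; field names are illustrative. */
    static final class Row {
        final long id;
        final String tenant;
        final Instant nextAttemptAt;
        Status status = Status.PENDING;
        String claimedBy;

        Row(long id, String tenant, Instant nextAttemptAt) {
            this.id = id;
            this.tenant = tenant;
            this.nextAttemptAt = nextAttemptAt;
        }
    }

    /** Claims up to batchSize due PENDING rows, taking one row per tenant per round. */
    static List<Row> claimFair(List<Row> table, Instant now, int batchSize, String claimToken) {
        // Group due PENDING rows by tenant, keeping first-seen tenant order.
        Map<String, Deque<Row>> dueByTenant = new LinkedHashMap<>();
        for (Row row : table) {
            boolean due = !row.nextAttemptAt.isAfter(now);
            if (row.status == Status.PENDING && due) {
                dueByTenant.computeIfAbsent(row.tenant, t -> new ArrayDeque<>()).add(row);
            }
        }
        // Round-robin so one busy tenant cannot fill the whole batch.
        List<Row> claimed = new ArrayList<>();
        while (claimed.size() < batchSize && !dueByTenant.isEmpty()) {
            Iterator<Deque<Row>> tenants = dueByTenant.values().iterator();
            while (tenants.hasNext() && claimed.size() < batchSize) {
                Deque<Row> queue = tenants.next();
                Row row = queue.poll();
                row.status = Status.CLAIMED;
                row.claimedBy = claimToken; // per-claim token, as described in §3
                claimed.add(row);
                if (queue.isEmpty()) {
                    tenants.remove();
                }
            }
        }
        return claimed;
    }
}
```

With two due rows for tenant A, one due row for tenant B and a batch size of 2, the sketch claims one row from each tenant rather than both of A's.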

2. Delivery mechanisms + fallback

Pluggable NotificationMechanism strategies, selected per recipient by MechanismSelector / MechanismRegistry (keyed by (product, type)), wired in NotificationMechanismConfig. Five concrete mechanisms:

  • DirectEmailMechanism — wraps an internal template fragment and emails any resolved address (incl. reminder/reassignment external approvers).

  • ExternalApproverEmailMechanism — the email-step external-approver send: registers the external contact, then emails already-wrapped external-template content; critical, no fallback.

  • JiraNotificationMechanism — Jira native notification.

  • ConfluenceCommentMentionMechanism — Confluence footer-comment @-mention.

  • ChatMechanism — a single channel-agnostic mechanism for chat delivery. It delegates the actual post to a ChatSendService SPI; Slack is the one concrete channel today (SlackSendService), and MS Teams is designed to slot in additively (see §6). Registered once per product in NotificationMechanismConfig.

Fallback chains: e.g. a Slack channel post with unresolved @-mentions falls back to direct-email / native follow-ups (NativeFollowUpStrategy: JiraNativeFollowUp / ConfluenceCommentFollowUp), recorded as SENT_WITH_FALLBACK.
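A minimal sketch of the registry lookup and fallback chain, for orientation only. It simplifies the trigger: here a fallback runs when the primary send returns false, whereas in this PR the Slack case is driven by unresolved @-mentions. Registry, Mechanism, deliver and Outcome are hypothetical names and do not reflect the signatures of MechanismRegistry or NotificationMechanism.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: (product, type) registry plus an ordered fallback chain. */
final class MechanismFallbackSketch {

    enum Product { JIRA, CONFLUENCE }

    enum MechanismType {
        DIRECT_EMAIL, JIRA_NOTIFICATION, CONFLUENCE_COMMENT_MENTION, SLACK, EXTERNAL_APPROVER_EMAIL
    }

    enum Outcome { SENT, SENT_WITH_FALLBACK, FAILED }

    /** Illustrative strategy interface; returns true when the send succeeded. */
    interface Mechanism {
        boolean send(String recipient, String content);
    }

    /** Mechanisms keyed by (product, type), as the description states. */
    static final class Registry {
        private final Map<Product, Map<MechanismType, Mechanism>> byKey = new EnumMap<>(Product.class);

        void register(Product product, MechanismType type, Mechanism mechanism) {
            byKey.computeIfAbsent(product, p -> new EnumMap<>(MechanismType.class)).put(type, mechanism);
        }

        Mechanism find(Product product, MechanismType type) {
            Map<MechanismType, Mechanism> forProduct = byKey.get(product);
            return forProduct == null ? null : forProduct.get(type);
        }
    }

    /** Tries the primary mechanism, then each fallback in order; the first success wins. */
    static Outcome deliver(Registry registry, Product product, MechanismType primary,
                           List<MechanismType> fallbackChain, String recipient, String content) {
        if (trySend(registry.find(product, primary), recipient, content)) {
            return Outcome.SENT;
        }
        for (MechanismType fallback : fallbackChain) {
            if (trySend(registry.find(product, fallback), recipient, content)) {
                return Outcome.SENT_WITH_FALLBACK;
            }
        }
        return Outcome.FAILED;
    }

    // An unregistered mechanism counts as a failed attempt rather than an error.
    private static boolean trySend(Mechanism mechanism, String recipient, String content) {
        return mechanism != null && mechanism.send(recipient, content);
    }
}
```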

3. Reliability

  • Retry with backoff: transient-vs-permanent classification (TransientFailures); transient failures are rescheduled to PENDING with exponential backoff via OutboxRetryPolicy + a next_attempt_at gate (claim skips not-yet-due rows). Permanent failures are recorded as a host instance_error via NotificationFailureRecorder.

  • At-least-once + crash recovery: each claim stamps a per-claim claimed_by token (guards every state transition so a row reclaimed mid-flight can't be double-written) and a picked_at timestamp. OutboxRecoverySweep reclaims rows stuck in CLAIMED past a threshold (crashed worker) and abandons after the retry budget; the budget boundary is shared with the worker's retry path. No transaction spans external I/O, so delivery is explicitly at-least-once (rare duplicate possible; no channel exposes a cheap idempotency key).

  • Retention: OutboxRetentionJob prunes old terminal (SENT/SENT_WITH_FALLBACK/SUPERSEDED/FAILED) rows.

  • Storm-collapse: a newer dispatch for the same step supersedes still-PENDING older rows (SUPERSEDED) to avoid duplicate/stale sends.

  • Step-error stamping: critical failures are reconciled onto the step (declarative step_error_stamped + partial index) so each failed critical row stamps its step exactly once, regardless of which failure path produced it.

  • Mixed-version-deploy safety (payload_schema_version): additive payload changes are tolerated (lenient deserialization, no bump); a backward-incompatible change bumps the version, and an old pod defers (never abandons/misdelivers) a too-new row to a newer pod.

  • instance_error storm batching: repeated identical errors collapse into one row (count + last_seen, 5-minute sliding window, advisory-lock-serialized) instead of flooding the admin Instance Errors view.
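The retry and claim-token rules above can be sketched as follows. The base delay, cap and retry budget are assumed values chosen for illustration (the description does not state them), and OutboxRetrySketch with its methods is hypothetical rather than the actual OutboxRetryPolicy API. The point is the shape: a transient failure goes back to PENDING with a later next_attempt_at, and a state transition is rejected when the caller's token no longer owns the row.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical sketch of backoff scheduling and claim-token-guarded transitions. */
final class OutboxRetrySketch {

    // Assumed values; the real policy's numbers are not given in this description.
    static final Duration BASE_DELAY = Duration.ofSeconds(30);
    static final Duration MAX_DELAY = Duration.ofMinutes(30);
    static final int MAX_ATTEMPTS = 5;

    enum Status { PENDING, CLAIMED, SENT, FAILED }

    /** Minimal stand-in for an outbox row. */
    static final class Row {
        Status status = Status.PENDING;
        String claimedBy;
        int attempts;
        Instant nextAttemptAt = Instant.EPOCH;
    }

    /** Exponential backoff: BASE_DELAY doubled per failed attempt (1-based), capped at MAX_DELAY. */
    static Duration backoff(int attempt) {
        int shift = Math.min(Math.max(attempt, 1) - 1, 20);
        Duration delay = BASE_DELAY.multipliedBy(1L << shift);
        return delay.compareTo(MAX_DELAY) > 0 ? MAX_DELAY : delay;
    }

    /** Claims a row only if it is PENDING and due (the next_attempt_at gate). */
    static boolean claim(Row row, String token, Instant now) {
        if (row.status != Status.PENDING || row.nextAttemptAt.isAfter(now)) {
            return false;
        }
        row.status = Status.CLAIMED;
        row.claimedBy = token;
        return true;
    }

    /** Records a failed send; returns false when the token is stale (row was reclaimed). */
    static boolean recordFailure(Row row, String token, boolean transientFailure, Instant now) {
        if (row.status != Status.CLAIMED || !token.equals(row.claimedBy)) {
            return false; // another worker owns the row now; do not overwrite its state
        }
        row.attempts++;
        row.claimedBy = null;
        if (transientFailure && row.attempts < MAX_ATTEMPTS) {
            row.status = Status.PENDING;
            row.nextAttemptAt = now.plus(backoff(row.attempts));
        } else {
            row.status = Status.FAILED; // permanent failure or retry budget exhausted
        }
        return true;
    }
}
```

In the database the same guard would typically be a conditional UPDATE that matches on both status and claimed_by, so a stale worker's write affects zero rows.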

4. Delivery-status visibility (frontend)

  • After an approval action the UI polls dispatch status (useDispatchStatus / usePostActionPolling) and shows a NotificationDispatchIndicator (pending / sent / failed); a bulk variant covers group actions (dispatch_group_id).

  • NotificationFailuresBanner + per-step badges surface delivery failures instead of losing them silently (useNotificationFailures per-approval and useNotificationFailuresBatch for the list — one batched fetch, no per-row N+1).

  • Backed by new status/ REST controllers, access-gated through a single NotificationVisibilityHelper (admin-or-originator + per-reference access checks, JSM lockout).

5. Structural refactor of the build side (behaviour-preserving, snapshot-gated)

  • The two product-adapter monoliths were split into per-trigger-family builders (CallForAction, Reminder, Reassignment, Notification, + Watcher for Jira) over a per-product *NotificationSupport engine.

  • Shared bases extracted: AbstractProductNotificationAdapter (build pipeline + lifted EmailStepDispatch handling), AbstractNotificationSupport (assemble + single-recipient plan), AbstractNotificationDataFetcher (async fetch skeleton: mergeAccountIds + resolveCompleteEmailMap), AbstractTriggerBuilder (the shared support / content-builder / chat-plan-builder / EmailStepRowBuilder quartet for reminder/reassignment builders).

  • Chat delivery unified + made channel-agnostic: the per-product Slack stack (orchestrators, notifiers, the slack/dispatch/** dispatchers, JiraSlack…/ConfluenceSlack…) collapsed into one ChatMechanism behind a ChatSendService SPI, plus channel-neutral SPIs ChatChannel / ChannelBinding / ChatChannelRowAssembler / Chat{Lifecycle,Decision}UpdateService. Slack is now just the first concrete implementation; the engine no longer references "Slack" specifically. This is what turns "add MS Teams" into an additive change (see §6).

  • Channel-neutral plan building: a shared ChatPlanBuilder base owns the recipient-graph shaping (DM fan-out, delegate intents, group/channel recipients); MessageAssembler turns RecipientIntents into outbox rows via the ChatChannelRowAssembler SPI (one ChannelPlan per plan). Delegate emission is consistent across Jira/Confluence (both filter to resolved delegates).

  • Decision-update model extracted to a decision/ package (DecisionUpdateAssembler, StepDecisionUpdate, DecisionEntry, Vote*, ProgressInfo, …), channel-neutral so any chat channel renders it.

  • Self-contained collaborators pulled out: EmailStepRowBuilder, JsmNotificationLinkModeResolver, MailNotification / ExternalEmailLinks (email links), CompleteEmailMap.

  • Package layout: model/path/notification/{adapter, chat, mechanism, outbox, status, decision, email, reminder, slack, jira, confluence}.

  • DB-changeset hygiene: the incremental development changesets are consolidated (column/index folded into the create-table; the instance-error changes merged); a redundant pending index dropped; discriminator enums/columns follow the schema's _type convention (recipient_type, template_type) and the product enum is atlassian_product.

  • Targeted de-duplication, reduced null-passing (no-channel assemble overloads), and comment/Javadoc accuracy fixes. This half is behaviour-preserving (snapshot-gated).

6. MS-Teams readiness (channel-agnostic by construction)

The chat path is built so a second chat channel is purely additive — new beans, no orchestration edits:

  • ChatMechanism + ChatSendService SPI: a TeamsSendService + a Teams ChatChannel/ChannelBinding + 2 mechanism beans is the whole delivery surface.

  • ChatLifecycleUpdateDispatcher / ChatDecisionUpdateDispatcher — registry-style composites (mirroring ChatChannels / MechanismRegistry) that fan out to all registered Chat*UpdateService impls, so a second impl doesn't break the engine's by-type injection. Engine call sites (ApprovalProcess, ExpirationScheduler, DynamicStepUpdateService) inject the dispatchers.

  • The channel-neutral plan/render/targeting types (ChannelPlan, ChatRenderInputs, ChatTargeting, ChannelBinding) carry no Slack-specific types; the Slack specifics live behind the SPI implementations.
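The registry-style composite can be illustrated with a short sketch. The interface and method names here (LifecycleUpdateService, approvalClosed) are hypothetical stand-ins, not the real Chat*UpdateService signatures. The dispatcher deliberately does not implement the service interface: call sites inject the dispatcher type, so registering a second implementation does not make by-type injection ambiguous.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a composite dispatcher that fans out to every registered impl. */
final class ChatDispatcherSketch {

    /** Illustrative per-channel SPI; one implementation per chat channel. */
    interface LifecycleUpdateService {
        void approvalClosed(String approvalId);
    }

    /** Engine call sites depend on this type, never on a concrete channel. */
    static final class LifecycleUpdateDispatcher {
        private final List<LifecycleUpdateService> services;

        LifecycleUpdateDispatcher(List<LifecycleUpdateService> services) {
            this.services = new ArrayList<>(services); // defensive copy
        }

        /** Forwards the call to all registered channel implementations, in order. */
        void approvalClosed(String approvalId) {
            for (LifecycleUpdateService service : services) {
                service.approvalClosed(approvalId);
            }
        }
    }
}
```

Under this shape, adding a Teams implementation means contributing one more element to the injected list; the dispatcher and its callers stay untouched.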

7. Automatic reminders (rewritten for cross-instance correctness)

AutomaticReminderScheduler (5-minute cron) anchors on a new approval.last_reminded_at column via a compare-and-set claim (claimAutomaticReminder): exactly one reminder per interval across multiple app instances, with catch-up on a missed/late tick — replacing stateless wall-clock-grid firing that could duplicate (multi-pod) or skip reminders.
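A minimal in-memory sketch of the compare-and-set claim, assuming last_reminded_at starts at a non-null anchor such as the approval's creation time. An AtomicReference stands in for the column; in the database the equivalent would be a conditional UPDATE that only matches the previously read value. ReminderClaimSketch and its method signature are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of a compare-and-set reminder claim across app instances. */
final class ReminderClaimSketch {

    /**
     * Returns true for exactly one caller per due interval.
     * lastRemindedAt stands in for approval.last_reminded_at (assumed non-null).
     */
    static boolean claimAutomaticReminder(AtomicReference<Instant> lastRemindedAt,
                                          Duration interval, Instant now) {
        Instant seen = lastRemindedAt.get();
        if (seen.plus(interval).isAfter(now)) {
            return false; // interval has not elapsed since the last reminder
        }
        // Only the instance that still sees the old value wins; a late tick
        // (now well past seen + interval) still claims, which gives catch-up.
        return lastRemindedAt.compareAndSet(seen, now);
    }
}
```

Two instances ticking at the same moment both read the same old value, but only one compare-and-set succeeds, so only one reminder is sent.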

8. Removed

The synchronous senders/dispatchers: the sync NotificationService, Slack(Notification)Helper / Orchestrator, the slack/dispatch/** package, JiraSlackDispatcher / Notifier, ConfluenceSlackDispatcher / Notifier, SlackMessageSender, etc. (SlackNotificationService was slimmed + renamed SlackLifecycleUpdateService — the Slack ChatLifecycleUpdateService impl.)

Architecture

  • Async delivery flow — request thread renders + enqueues in-transaction → worker claims, sends, retries/falls-back, and recovers.

  • Build-side decomposition — per product, mirrored Jira + Confluence, over shared bases.

  • InternalSystemNotification (admin workflow-failure signals) is the one trigger that bypasses this pipeline — delivered synchronously by a thin Jira-only sender (no fan-out, no outbox, best-effort).

Scope

  • Whole-branch: the entire model/path/notification/** package rewritten + reorganized; old synchronous senders removed; the channel-agnostic chat layer and the FE status/failures UI added.

  • The build-side decomposition shrank the adapter monoliths from ~1200–1460 lines each (Jira / Confluence *NotificationAdapter) to the low hundreds, with the logic moved into focused per-trigger builders + shared bases.

Verification

  • Snapshot oracles (NotificationSnapshotJiraTest + NotificationSnapshotConfluenceTest) assert exact NotificationMessage output for every trigger × host-configuration; a zero diff confirms the refactored build side produces byte-identical output.

  • Enum-parity tests guard the manually-synced enums: every EmailTemplateType value ↔ notification_template_type, and the model mechanism type ↔ jOOQ enum (so adding a value without the matching ALTER TYPE fails a unit test, not production).

  • Dispatcher delegation test proves Chat*UpdateDispatcher fans a call to every registered impl (the Teams-additive contract).

  • 150+ notification test files (mechanisms, outbox DAO / worker / recovery / retention, snapshot oracles, adapters, status). Public constructors / build contract unchanged → no Spring/test wiring churn. Full Java suite green.
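For readers unfamiliar with the enum-parity idea, a sketch of the check such a test might perform. The enums below are local stand-ins (DbMechanismType plays the role of the jOOQ-generated enum and StaleDbMechanismType shows a drifted one); the helper name mismatches is hypothetical, and the actual tests in this branch may compare differently.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Hypothetical sketch of an enum-parity check between a model enum and a DB enum. */
final class EnumParitySketch {

    enum ModelMechanismType {
        DIRECT_EMAIL, JIRA_NOTIFICATION, CONFLUENCE_COMMENT_MENTION, SLACK, EXTERNAL_APPROVER_EMAIL
    }

    /** Stand-in for the generated enum that mirrors notification_mechanism_type. */
    enum DbMechanismType {
        DIRECT_EMAIL, JIRA_NOTIFICATION, CONFLUENCE_COMMENT_MENTION, SLACK, EXTERNAL_APPROVER_EMAIL
    }

    /** Stand-in for a DB enum that was not updated after a model value was added. */
    enum StaleDbMechanismType {
        DIRECT_EMAIL, JIRA_NOTIFICATION, CONFLUENCE_COMMENT_MENTION, EXTERNAL_APPROVER_EMAIL
    }

    /** Returns the constant names present in exactly one of the two enums. */
    static <A extends Enum<A>, B extends Enum<B>> Set<String> mismatches(Class<A> left, Class<B> right) {
        Set<String> onlyLeft = names(left);
        onlyLeft.removeAll(names(right));
        Set<String> onlyRight = names(right);
        onlyRight.removeAll(names(left));
        onlyLeft.addAll(onlyRight);
        return onlyLeft;
    }

    private static <E extends Enum<E>> Set<String> names(Class<E> type) {
        return Arrays.stream(type.getEnumConstants())
                .map(Enum::name)
                .collect(Collectors.toCollection(TreeSet::new));
    }
}
```

A parity test would assert that mismatches(...) is empty, so a value added on one side only shows up by name in the failure message.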

Risk & rollout

  • Intentional behavioural change: notifications are delivered asynchronously (with retry + failure recording) instead of synchronously inside the transaction. The content users receive is unchanged; when/how reliably it arrives changes (off the critical path, retried, no longer silently lost). Automatic reminders are de-duplicated across instances.

  • Requires the DB migration — the outbox table + enums, plus the amendments to webhook_call_log, instance_error, approval_step, and approval — and the background workers running: NotificationDispatchWorker, OutboxRecoverySweep, OutboxRetentionJob, and AutomaticReminderScheduler.

  • The refactor half is low-risk (snapshot-gated, constructors unchanged). The async half is the substantive feature and should get the QA focus below.

QA notes

Exercise across both products and all host configs (Slack-enabled, direct-email, comment-mention / Jira-native):

  • All trigger types: call-for-action (user / moderator / group / vote, with delegates), reminders (automatic / custom / creator), step reassignment (added / removed, incl. email steps), plain notifications, approval expiration, watcher notifications (Jira).

  • Async behaviour: notification arrives shortly after the action (not blocking it); dispatch indicator transitions pending → sent; the failures banner appears when a provider rejects a send; retries occur on transient failures (with backoff); a crashed/abandoned delivery is recovered by the sweep; old outbox rows are pruned; a Slack channel post with unresolved @-mentions falls back (SENT_WITH_FALLBACK).

  • Automatic reminders: fire at most once per interval even with multiple app instances; a missed tick catches up; no duplicate reminders.

  • Admin visibility: repeated identical failures collapse into one instance_error row (count + last seen), not a flood.