Refactor notification-sending logic and add asynchronous notification delivery
Description
Background / problem
Notifications (approval call-for-action, reminders, reassignment, watcher, expiration; across email, Slack, Jira native, and Confluence comment-mention) were sent synchronously, fire-and-forget, inside the approval transaction. That meant:
- external send latency and flakiness were on the critical path of the approval action;
- a failed/slow provider could block or slow the transaction;
- there was no retry and no visibility: if delivery failed, it was silently lost;
- delivery logic was spread across many ad-hoc senders/dispatchers/orchestrators per product and channel, making it hard to reason about and extend.
What changed
The synchronous send path is replaced with an asynchronous, durable delivery pipeline on the transactional-outbox pattern, and the whole notification subsystem is restructured behind it. As part of that restructure, the per-product Slack stack is unified into a single channel-agnostic chat layer, and the code is prepared for MS Teams as a drop-in second chat channel — new beans only, no orchestration edits (see §2, §6).
1. Asynchronous delivery via a transactional outbox
- Content is rendered on the request thread (so it reflects the approval state at decision time) and the fully-rendered rows are written to the notification_outbox table in the same transaction as the approval action; nothing external happens on the critical path.
- A background NotificationDispatchWorker claims due PENDING rows in fair, per-tenant batches (FOR UPDATE SKIP LOCKED) and performs the external send off the request thread.
- New DB (Liquibase, jOOQ-generated): notification_outbox + Postgres enums:
  - outbox_status (PENDING, CLAIMED, SENT, FAILED, SENT_WITH_FALLBACK, SUPERSEDED)
  - notification_mechanism_type (DIRECT_EMAIL, JIRA_NOTIFICATION, CONFLUENCE_COMMENT_MENTION, SLACK, EXTERNAL_APPROVER_EMAIL)
  - notification_recipient_type (USER, GROUP, EXTERNAL_EMAIL, ISSUE_WATCHERS)
  - notification_template_type (22 values, mirrors EmailTemplateType)
  - atlassian_product (JIRA, CONFLUENCE)
  - The migration also amends other tables: webhook_call_log (pre-call outbox), instance_error (approval/step binding + storm batching), approval_step (error_message), approval (last_reminded_at).
- The full NotificationMessage is serialized into a payload jsonb (source of truth on read); queryable columns (product, mechanism, template_type, recipient_type + recipient ids, dispatch_id / dispatch_group_id, host/approval/step, critical_for_step, fallback_chain) are denormalized so the claim/query path and the recipient CHECK constraint work without parsing the payload.
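The claim step can be pictured with a small in-memory sketch. It is illustrative only: the real worker claims rows in SQL with FOR UPDATE SKIP LOCKED, and the `OutboxRowSketch` record, the `claimDue` method, and the round-robin reading of "fair, per-tenant" are assumptions made for this example, not code from this PR.

```java
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, heavily simplified view of an outbox row (the real table has many more columns).
record OutboxRowSketch(long id, String tenant, String status, Instant nextAttemptAt) {}

final class FairClaimSketch {

    // Picks up to batchSize due PENDING rows, taking one row per tenant per round
    // so that a single busy tenant cannot fill the whole batch.
    static List<OutboxRowSketch> claimDue(List<OutboxRowSketch> rows, Instant now, int batchSize) {
        Map<String, Deque<OutboxRowSketch>> byTenant = new LinkedHashMap<>();
        for (OutboxRowSketch row : rows) {
            boolean due = !row.nextAttemptAt().isAfter(now); // the next_attempt_at gate
            if ("PENDING".equals(row.status()) && due) {
                byTenant.computeIfAbsent(row.tenant(), t -> new ArrayDeque<>()).add(row);
            }
        }
        List<OutboxRowSketch> claimed = new ArrayList<>();
        while (claimed.size() < batchSize && !byTenant.isEmpty()) {
            Iterator<Deque<OutboxRowSketch>> it = byTenant.values().iterator();
            while (it.hasNext() && claimed.size() < batchSize) {
                Deque<OutboxRowSketch> queue = it.next();
                claimed.add(queue.poll());
                if (queue.isEmpty()) {
                    it.remove(); // tenant has no more due rows
                }
            }
        }
        return claimed;
    }
}
```

With three due rows for tenant A and one for tenant B, a batch of three takes A's first row, then B's row, then A's second row, rather than all three from A.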
2. Delivery mechanisms + fallback
Pluggable NotificationMechanism strategies, selected per recipient by MechanismSelector / MechanismRegistry (keyed by (product, type)), wired in NotificationMechanismConfig. Five concrete mechanisms:
- DirectEmailMechanism: wraps an internal template fragment and emails any resolved address (incl. reminder/reassignment external approvers).
- ExternalApproverEmailMechanism: the email-step external-approver send. It registers the external contact, then emails already-wrapped external-template content; critical, no fallback.
- JiraNotificationMechanism: Jira native notification.
- ConfluenceCommentMentionMechanism: Confluence footer-comment @-mention.
- ChatMechanism: a single channel-agnostic mechanism for chat delivery. It delegates the actual post to a ChatSendService SPI; Slack is the one concrete channel today (SlackSendService), and MS Teams is designed to slot in additively (see §6). Registered once per product in NotificationMechanismConfig.
Fallback chains: e.g. a Slack channel post with unresolved @-mentions falls back to direct-email / native follow-ups (NativeFollowUpStrategy → JiraNativeFollowUp / ConfluenceCommentFollowUp), recorded as SENT_WITH_FALLBACK.
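As a rough sketch of the registry lookup and chain walk described in this section: the types below are stand-ins invented for the example (the real NotificationMechanism / MechanismRegistry signatures are not shown in this description), and the sketch simplifies by treating the primary mechanism as failing outright, whereas the Slack case above is a follow-up for unresolved @-mentions.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

enum ProductSketch { JIRA, CONFLUENCE }
enum MechanismTypeSketch { SLACK, DIRECT_EMAIL, JIRA_NOTIFICATION }
enum DeliveryOutcome { SENT, SENT_WITH_FALLBACK, FAILED }

// Hypothetical registry key: one mechanism per (product, type) pair.
record MechanismKey(ProductSketch product, MechanismTypeSketch type) {}

final class FallbackChainSketch {
    // Each mechanism is modelled as a predicate that returns true when the send succeeded.
    private final Map<MechanismKey, Predicate<String>> registry = new HashMap<>();

    void register(ProductSketch product, MechanismTypeSketch type, Predicate<String> send) {
        registry.put(new MechanismKey(product, type), send);
    }

    // Tries the first mechanism in the chain, then each fallback in order.
    DeliveryOutcome deliver(ProductSketch product, List<MechanismTypeSketch> chain, String message) {
        for (int i = 0; i < chain.size(); i++) {
            Predicate<String> send = registry.get(new MechanismKey(product, chain.get(i)));
            if (send != null && send.test(message)) {
                return i == 0 ? DeliveryOutcome.SENT : DeliveryOutcome.SENT_WITH_FALLBACK;
            }
        }
        return DeliveryOutcome.FAILED; // nothing in the chain could deliver
    }
}
```

Keying on the (product, type) pair lets the same mechanism type resolve to a different implementation per product without the caller branching on product.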
3. Reliability
- Retry with backoff: transient-vs-permanent classification (TransientFailures); transient failures are rescheduled to PENDING with exponential backoff via OutboxRetryPolicy + a next_attempt_at gate (claim skips not-yet-due rows). Permanent failures are recorded as a host instance_error via NotificationFailureRecorder.
- At-least-once + crash recovery: each claim stamps a per-claim claimed_by token (guards every state transition so a row reclaimed mid-flight can't be double-written) and a picked_at timestamp. OutboxRecoverySweep reclaims rows stuck in CLAIMED past a threshold (crashed worker) and abandons after the retry budget; the budget boundary is shared with the worker's retry path. No transaction spans external I/O, so delivery is explicitly at-least-once (rare duplicate possible; no channel exposes a cheap idempotency key).
- Retention: OutboxRetentionJob prunes old terminal (SENT / SENT_WITH_FALLBACK / SUPERSEDED / FAILED) rows.
- Storm-collapse: a newer dispatch for the same step supersedes still-PENDING older rows (SUPERSEDED) to avoid duplicate/stale sends.
- Step-error stamping: critical failures are reconciled onto the step (declarative step_error_stamped + partial index) so each failed critical row stamps its step exactly once, regardless of which failure path produced it.
- Mixed-version-deploy safety (payload_schema_version): additive payload changes are tolerated (lenient deserialization, no bump); a backward-incompatible change bumps the version, and an old pod defers (never abandons/misdelivers) a too-new row to a newer pod.
- instance_error storm batching: repeated identical errors collapse into one row (count + last_seen, 5-minute sliding window, advisory-lock-serialized) instead of flooding the admin Instance Errors view.
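The retry-with-backoff rule can be sketched as below. This is not the OutboxRetryPolicy implementation: the base delay, cap, attempt budget, and the choice of FAILED as the status once the budget is spent are all illustrative assumptions, since this description does not state the real values.

```java
import java.time.Duration;
import java.time.Instant;

final class RetryPolicySketch {
    // All three values are illustrative; the PR description does not state the real ones.
    static final Duration BASE_DELAY = Duration.ofSeconds(30);
    static final Duration MAX_DELAY = Duration.ofMinutes(30);
    static final int MAX_ATTEMPTS = 5;

    // Outcome for a row whose send attempt just failed.
    record Decision(String status, Instant nextAttemptAt) {}

    // attempt is 1-based: the number of sends tried so far, including the one that failed.
    static Decision onFailure(int attempt, boolean transientFailure, Instant now) {
        if (!transientFailure || attempt >= MAX_ATTEMPTS) {
            return new Decision("FAILED", null); // permanent, or retry budget spent
        }
        // Back to PENDING; the claim query skips the row until nextAttemptAt has passed.
        return new Decision("PENDING", now.plus(delayFor(attempt)));
    }

    // BASE_DELAY * 2^(attempt - 1), capped at MAX_DELAY.
    static Duration delayFor(int attempt) {
        long factor = 1L << Math.min(Math.max(attempt - 1, 0), 20);
        Duration delay = BASE_DELAY.multipliedBy(factor);
        return delay.compareTo(MAX_DELAY) > 0 ? MAX_DELAY : delay;
    }
}
```

Under these assumed values the first three transient failures would be retried after 30 s, 60 s, and 120 s.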
4. Delivery-status visibility (frontend)
- After an approval action the UI polls dispatch status (useDispatchStatus / usePostActionPolling) and shows a NotificationDispatchIndicator (pending / sent / failed); a bulk variant covers group actions (dispatch_group_id).
- NotificationFailuresBanner + per-step badges surface delivery failures instead of losing them silently (useNotificationFailures per-approval and useNotificationFailuresBatch for the list: one batched fetch, no per-row N+1).
- Backed by new status/ REST controllers, access-gated through a single NotificationVisibilityHelper (admin-or-originator + per-reference access checks, JSM lockout).
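One way a status endpoint could roll the rows of a dispatch up into the three indicator states is sketched below. The roll-up rule (in-flight rows win, then failures, otherwise sent) is an assumption for illustration; this description does not spell out the actual rule, and the enum and method names are hypothetical.

```java
import java.util.List;

enum RowStatusSketch { PENDING, CLAIMED, SENT, FAILED, SENT_WITH_FALLBACK, SUPERSEDED }
enum IndicatorStateSketch { PENDING, SENT, FAILED }

final class DispatchStatusSketch {
    // Assumed rule: any in-flight row keeps the indicator pending; otherwise any
    // FAILED row marks it failed; otherwise (including fallback sends and
    // superseded rows) it is shown as sent.
    static IndicatorStateSketch summarize(List<RowStatusSketch> rows) {
        boolean anyFailed = false;
        for (RowStatusSketch status : rows) {
            if (status == RowStatusSketch.PENDING || status == RowStatusSketch.CLAIMED) {
                return IndicatorStateSketch.PENDING;
            }
            if (status == RowStatusSketch.FAILED) {
                anyFailed = true;
            }
        }
        return anyFailed ? IndicatorStateSketch.FAILED : IndicatorStateSketch.SENT;
    }
}
```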
5. Structural refactor of the build side (behaviour-preserving, snapshot-gated)
- The two product-adapter monoliths were split into per-trigger-family builders (CallForAction, Reminder, Reassignment, Notification, + Watcher for Jira) over a per-product *NotificationSupport engine.
- Shared bases extracted: AbstractProductNotificationAdapter (build pipeline + lifted EmailStepDispatch handling), AbstractNotificationSupport (assemble + single-recipient plan), AbstractNotificationDataFetcher (async fetch skeleton: mergeAccountIds + resolveCompleteEmailMap), AbstractTriggerBuilder (the shared support / content-builder / chat-plan-builder / EmailStepRowBuilder quartet for reminder/reassignment builders).
- Chat delivery unified + made channel-agnostic: the per-product Slack stack (orchestrators, notifiers, the slack/dispatch/** dispatchers, JiraSlack… / ConfluenceSlack…) collapsed into one ChatMechanism behind a ChatSendService SPI, plus channel-neutral SPIs ChatChannel / ChannelBinding / ChatChannelRowAssembler / Chat{Lifecycle,Decision}UpdateService. Slack is now just the first concrete implementation; the engine no longer references "Slack" specifically. This is what turns "add MS Teams" into an additive change (see §6).
- Channel-neutral plan building: a shared ChatPlanBuilder base owns the recipient-graph shaping (DM fan-out, delegate intents, group/channel recipients); MessageAssembler turns RecipientIntents into outbox rows via the ChatChannelRowAssembler SPI (one ChannelPlan per plan). Delegate emission is consistent across Jira/Confluence (both filter to resolved delegates).
- Decision-update model extracted to a decision/ package (DecisionUpdateAssembler, StepDecisionUpdate, DecisionEntry, Vote*, ProgressInfo, …), channel-neutral so any chat channel renders it.
- Self-contained collaborators pulled out: EmailStepRowBuilder, JsmNotificationLinkModeResolver, MailNotification / ExternalEmailLinks (email links), CompleteEmailMap.
- Package layout: model/path/notification/{adapter, chat, mechanism, outbox, status, decision, email, reminder, slack, jira, confluence}.
- DB-changeset hygiene: the incremental development changesets are consolidated (column/index folded into the create-table; the instance-error changes merged); a redundant pending index dropped; discriminator enums/columns follow the schema's _type convention (recipient_type, template_type) and the product enum is atlassian_product.
- Targeted de-duplication, reduced null-passing (no-channel assemble overloads), and comment/Javadoc accuracy fixes. This half is behaviour-preserving (snapshot-gated).
6. MS-Teams readiness (channel-agnostic by construction)
The chat path is built so a second chat channel is purely additive — new beans, no orchestration edits:
- ChatMechanism + ChatSendService SPI: a TeamsSendService + a Teams ChatChannel / ChannelBinding + 2 mechanism beans is the whole delivery surface.
- ChatLifecycleUpdateDispatcher / ChatDecisionUpdateDispatcher: registry-style composites (mirroring ChatChannels / MechanismRegistry) that fan out to all registered Chat*UpdateService impls, so a second impl doesn't break the engine's by-type injection. Engine call sites (ApprovalProcess, ExpirationScheduler, DynamicStepUpdateService) inject the dispatchers.
- The channel-neutral plan/render/targeting types (ChannelPlan, ChatRenderInputs, ChatTargeting, ChannelBinding) carry no Slack-specific types; the Slack specifics live behind the SPI implementations.
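The composite-dispatcher idea can be sketched as follows. The interface and its single method are hypothetical (the real Chat*UpdateService methods are not shown in this description); the point is only that the engine depends on one dispatcher type, which fans each call out to however many channel implementations are registered.

```java
import java.util.List;

// Hypothetical shape of a per-channel update SPI; the method name is invented for this sketch.
interface LifecycleUpdateServiceSketch {
    void onStepCompleted(String approvalId);
}

// Composite that engine code would inject instead of a single channel implementation.
// It deliberately does not implement the SPI itself, so injecting "the" dispatcher
// by type stays unambiguous when a second channel implementation is added.
final class LifecycleUpdateDispatcherSketch {
    private final List<LifecycleUpdateServiceSketch> delegates;

    LifecycleUpdateDispatcherSketch(List<LifecycleUpdateServiceSketch> delegates) {
        this.delegates = List.copyOf(delegates);
    }

    // Forwards the call to every registered channel implementation, in order.
    void onStepCompleted(String approvalId) {
        for (LifecycleUpdateServiceSketch delegate : delegates) {
            delegate.onStepCompleted(approvalId);
        }
    }
}
```

In a Spring context the delegate list would typically be injected as all beans of the SPI type, so registering a Teams implementation needs no change at the call sites.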
7. Automatic reminders (rewritten for cross-instance correctness)
AutomaticReminderScheduler (5-minute cron) anchors on a new approval.last_reminded_at column via a compare-and-set claim (claimAutomaticReminder): exactly one reminder per interval across multiple app instances, with catch-up on a missed/late tick — replacing stateless wall-clock-grid firing that could duplicate (multi-pod) or skip reminders.
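The compare-and-set claim can be illustrated in memory, with an AtomicReference standing in for the approval.last_reminded_at column. The method name is borrowed from the text, but its signature, and the choice to advance the anchor to the current time rather than by exactly one interval, are assumptions of this sketch.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

final class ReminderClaimSketch {
    // Stands in for the approval.last_reminded_at column of a single approval.
    private final AtomicReference<Instant> lastRemindedAt;
    private final Duration interval;

    ReminderClaimSketch(Instant initialAnchor, Duration interval) {
        this.lastRemindedAt = new AtomicReference<>(initialAnchor);
        this.interval = interval;
    }

    // Returns true for at most one caller per elapsed interval. The database analogue is
    // an UPDATE that sets last_reminded_at only WHERE it still equals the value read earlier.
    boolean claimAutomaticReminder(Instant now) {
        Instant seen = lastRemindedAt.get();
        if (now.isBefore(seen.plus(interval))) {
            return false; // not due yet
        }
        // Succeeds only if no other instance moved the anchor since it was read.
        return lastRemindedAt.compareAndSet(seen, now);
    }
}
```

Because the anchor is stored state rather than a wall-clock grid, a late tick still finds the reminder due and sends it once, and a second instance running the same tick sees the advanced anchor and does nothing.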
8. Removed
The synchronous senders/dispatchers: the sync NotificationService, Slack(Notification)Helper / Orchestrator, the slack/dispatch/** package, JiraSlackDispatcher / Notifier, ConfluenceSlackDispatcher / Notifier, SlackMessageSender, etc. (SlackNotificationService was slimmed + renamed SlackLifecycleUpdateService — the Slack ChatLifecycleUpdateService impl.)
Architecture
- Async delivery flow: request thread renders + enqueues in-transaction → worker claims, sends, retries/falls-back, and recovers.
- Build-side decomposition: per product, mirrored Jira + Confluence, over shared bases.
- InternalSystemNotification (admin workflow-failure signals) is the one trigger that bypasses this pipeline: it is delivered synchronously by a thin Jira-only sender (no fan-out, no outbox, best-effort).
Scope
- Whole-branch: the entire model/path/notification/** package rewritten + reorganized; old synchronous senders removed; the channel-agnostic chat layer and the FE status/failures UI added.
- The build-side decomposition shrank the adapter monoliths from ~1200–1460 lines each (Jira / Confluence *NotificationAdapter) to the low hundreds, with the logic moved into focused per-trigger builders + shared bases.
Verification
- Snapshot oracles (NotificationSnapshotJiraTest + NotificationSnapshotConfluenceTest) assert exact NotificationMessage output for every trigger × host-configuration; zero-diff confirms the build-side refactor is byte-identical.
- Enum-parity tests guard the manually-synced enums: every EmailTemplateType ↔ notification_template_type, and the model mechanism type ↔ jOOQ enum (so adding a value without the matching ALTER TYPE fails a unit test, not production).
- Dispatcher delegation test proves Chat*UpdateDispatcher fans a call to every registered impl (the Teams-additive contract).
- ~150+ notification test files (mechanisms, outbox DAO / worker / recovery / retention, snapshot oracles, adapters, status). Public constructors / build contract unchanged → no Spring/test wiring churn. Full Java suite green.
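The enum-parity check boils down to comparing constant names between two enums. The sketch below uses stand-in enums declared inline (the real tests compare the model enums against the jOOQ-generated ones), and the helper names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

final class EnumParitySketch {
    // Stand-ins for the model enum and the jOOQ-generated enum.
    enum ModelMechanism { DIRECT_EMAIL, JIRA_NOTIFICATION, CONFLUENCE_COMMENT_MENTION, SLACK, EXTERNAL_APPROVER_EMAIL }
    enum DbMechanism { DIRECT_EMAIL, JIRA_NOTIFICATION, CONFLUENCE_COMMENT_MENTION, SLACK, EXTERNAL_APPROVER_EMAIL }
    // A deliberately stale enum, as if the ALTER TYPE had been forgotten.
    enum StaleDbMechanism { DIRECT_EMAIL, SLACK }

    // Sorted constant names of an enum type.
    static Set<String> names(Class<? extends Enum<?>> type) {
        Set<String> names = new TreeSet<>();
        for (Enum<?> constant : type.getEnumConstants()) {
            names.add(constant.name());
        }
        return names;
    }

    // Names present in exactly one of the two enums; an empty set means they are in sync.
    static Set<String> mismatch(Class<? extends Enum<?>> a, Class<? extends Enum<?>> b) {
        Set<String> onlyA = names(a);
        onlyA.removeAll(names(b));
        Set<String> onlyB = names(b);
        onlyB.removeAll(names(a));
        onlyA.addAll(onlyB);
        return onlyA;
    }
}
```

A unit test would then assert that `mismatch(...)` is empty for each manually synced pair, and report the returned names when it is not.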
Risk & rollout
- Behavioural change of intent: notifications are delivered asynchronously (with retry + failure recording) instead of synchronously inside the transaction. The content users receive is unchanged; when/how reliably it arrives changes (off the critical path, retried, no longer silently lost). Automatic reminders are de-duplicated across instances.
- Requires the DB migration (the outbox table + enums, plus the amendments to webhook_call_log, instance_error, approval_step, and approval) and the background workers running: NotificationDispatchWorker, OutboxRecoverySweep, OutboxRetentionJob, and AutomaticReminderScheduler.
- The refactor half is low-risk (snapshot-gated, constructors unchanged). The async half is the substantive feature and should get the QA focus below.
QA notes
Exercise across both products and all host configs (Slack-enabled, direct-email, comment-mention / Jira-native):
- All trigger types: call-for-action (user / moderator / group / vote, with delegates), reminders (automatic / custom / creator), step reassignment (added / removed, incl. email steps), plain notifications, approval expiration, watcher notifications (Jira).
- Async behaviour: notification arrives shortly after the action (not blocking it); dispatch indicator transitions pending → sent; the failures banner appears when a provider rejects a send; retries occur on transient failures (with backoff); a crashed/abandoned delivery is recovered by the sweep; old outbox rows are pruned; a Slack channel post with unresolved @-mentions falls back (SENT_WITH_FALLBACK).
- Automatic reminders: fire at most once per interval even with multiple app instances; a missed tick catches up; no duplicate reminders.
- Admin visibility: repeated identical failures collapse into one instance_error row (count + last seen), not a flood.