Refactor notification-sending logic and add asynchronous notification delivery

Description

Background / problem

Notifications (approval call-for-action, reminders, reassignment, watcher, expiration; across email, Slack, Jira native, and Confluence comment-mention) were sent synchronously, fire-and-forget, inside the approval transaction. That meant:

- external send latency and flakiness were on the critical path of the approval action;
- a failed/slow provider could block or slow the transaction;
- no retry and no visibility: if delivery failed, it was silently lost;
- delivery logic was spread across many ad-hoc senders/dispatchers/orchestrators per product and channel, making it hard to reason about and extend.

What changed

The synchronous send path is replaced with an asynchronous, durable delivery pipeline built on the transactional-outbox pattern, and the whole notification subsystem is restructured behind it. As part of that restructure, the per-product Slack stack is unified into a single channel-agnostic chat layer, and the code is prepared for MS Teams as a drop-in second chat channel: new beans only, no orchestration edits (see §2, §6).

1. Asynchronous delivery via a transactional outbox

Content is rendered on the request thread (so it reflects the approval state at decision time) and the fully rendered rows are written to the notification_outbox table in the same transaction as the approval action, so nothing external happens on the critical path.
A background NotificationDispatchWorker claims due PENDING rows in fair, per-tenant batches (FOR UPDATE SKIP LOCKED) and performs the external send off the request thread.
New DB (Liquibase, jOOQ-generated) — notification_outbox + Postgres enums:
- outbox_status (PENDING, CLAIMED, SENT, FAILED, SENT_WITH_FALLBACK, SUPERSEDED)
- notification_mechanism_type (DIRECT_EMAIL, JIRA_NOTIFICATION, CONFLUENCE_COMMENT_MENTION, SLACK, EXTERNAL_APPROVER_EMAIL)
- notification_recipient_type (USER, GROUP, EXTERNAL_EMAIL, ISSUE_WATCHERS)
- notification_template_type (22 values, mirrors EmailTemplateType)
- atlassian_product (JIRA, CONFLUENCE)
- The migration also amends other tables: webhook_call_log (pre-call outbox), instance_error (approval/step binding + storm batching), approval_step (error_message), approval (last_reminded_at).
The full NotificationMessage is serialized into a payload jsonb (source of truth on read); queryable columns (product, mechanism, template_type, recipient_type + recipient ids, dispatch_id/dispatch_group_id, host/approval/step, critical_for_step, fallback_chain) are denormalized so the claim/query path and the recipient CHECK constraint work without parsing the payload.
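
The fair, per-tenant claim described above can be modelled in memory as follows. This is a minimal sketch, not the worker's actual code: the real claim runs as SQL with FOR UPDATE SKIP LOCKED, and the OutboxRow record, its field names, and the perTenantLimit parameter are assumptions made for illustration.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Status values as listed for the outbox_status enum.
enum OutboxStatus { PENDING, CLAIMED, SENT, FAILED, SENT_WITH_FALLBACK, SUPERSEDED }

// Hypothetical, simplified view of a notification_outbox row.
record OutboxRow(long id, String tenant, OutboxStatus status, Instant nextAttemptAt) {}

final class FairClaimSketch {
    // Picks due PENDING rows, oldest first, taking at most perTenantLimit
    // rows per tenant so one busy tenant cannot starve the others.
    static List<OutboxRow> claimDue(List<OutboxRow> rows, Instant now, int perTenantLimit) {
        List<OutboxRow> due = new ArrayList<>();
        for (OutboxRow row : rows) {
            boolean pending = row.status() == OutboxStatus.PENDING;
            boolean isDue = !row.nextAttemptAt().isAfter(now);
            if (pending && isDue) {
                due.add(row);
            }
        }
        due.sort(Comparator.comparing(OutboxRow::nextAttemptAt)
                .thenComparingLong(OutboxRow::id));

        Map<String, Integer> takenPerTenant = new HashMap<>();
        List<OutboxRow> claimed = new ArrayList<>();
        for (OutboxRow row : due) {
            int taken = takenPerTenant.getOrDefault(row.tenant(), 0);
            if (taken < perTenantLimit) {
                takenPerTenant.put(row.tenant(), taken + 1);
                claimed.add(row);
            }
        }
        return claimed;
    }
}
```

In the database version, SKIP LOCKED is what lets several workers run this selection concurrently without waiting on rows another worker has already locked; the in-memory model only illustrates the due-row filter and the per-tenant cap.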

2. Delivery mechanisms + fallback

Pluggable NotificationMechanism strategies, selected per recipient by MechanismSelector / MechanismRegistry (keyed by (product, type)), wired in NotificationMechanismConfig. Five concrete mechanisms:

DirectEmailMechanism — wraps an internal template fragment and emails any resolved address (incl. reminder/reassignment external approvers).
ExternalApproverEmailMechanism — the email-step external-approver send: registers the external contact, then emails already-wrapped external-template content; critical, no fallback.
JiraNotificationMechanism — Jira native notification.
ConfluenceCommentMentionMechanism — Confluence footer-comment @-mention.
ChatMechanism — a single channel-agnostic mechanism for chat delivery. It delegates the actual post to a ChatSendService SPI; Slack is the one concrete channel today (SlackSendService), and MS Teams is designed to slot in additively (see §6). Registered once per product in NotificationMechanismConfig.

Fallback chains: e.g. a Slack channel post with unresolved @-mentions falls back to direct-email / native follow-ups (NativeFollowUpStrategy → JiraNativeFollowUp / ConfluenceCommentFollowUp), recorded as SENT_WITH_FALLBACK.
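
A rough sketch of the registry lookup and fallback walk, under stated assumptions: the Mechanism interface, its boolean return value, and the FallbackSketch class are hypothetical, and the sketch simplifies fallback to "try the next mechanism when the previous one reports failure". The actual triggers (for example unresolved @-mentions) are richer than this.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

enum Product { JIRA, CONFLUENCE }

enum MechanismType {
    DIRECT_EMAIL, JIRA_NOTIFICATION, CONFLUENCE_COMMENT_MENTION, SLACK, EXTERNAL_APPROVER_EMAIL
}

enum Outcome { SENT, SENT_WITH_FALLBACK, FAILED }

// Hypothetical strategy interface: returns true when the send succeeded.
interface Mechanism {
    boolean send(String recipient, String body);
}

final class FallbackSketch {
    // Registry keyed by (product, type), flattened to a string key for brevity.
    private final Map<String, Mechanism> registry = new HashMap<>();

    private static String key(Product product, MechanismType type) {
        return product + ":" + type;
    }

    void register(Product product, MechanismType type, Mechanism mechanism) {
        registry.put(key(product, type), mechanism);
    }

    // Walks the chain in order; the first entry is the primary mechanism,
    // every later entry counts as a fallback.
    Outcome deliver(Product product, List<MechanismType> chain, String recipient, String body) {
        for (int i = 0; i < chain.size(); i++) {
            Mechanism mechanism = registry.get(key(product, chain.get(i)));
            if (mechanism != null && mechanism.send(recipient, body)) {
                return i == 0 ? Outcome.SENT : Outcome.SENT_WITH_FALLBACK;
            }
        }
        return Outcome.FAILED;
    }
}
```

Keying by (product, type) is what allows one mechanism class, such as ChatMechanism, to be registered once per product.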

3. Reliability

Retry with backoff: transient-vs-permanent classification (TransientFailures); transient failures are rescheduled to PENDING with exponential backoff via OutboxRetryPolicy + a next_attempt_at gate (claim skips not-yet-due rows). Permanent failures are recorded as a host instance_error via NotificationFailureRecorder.
At-least-once + crash recovery: each claim stamps a per-claim claimed_by token (guards every state transition so a row reclaimed mid-flight can't be double-written) and a picked_at timestamp. OutboxRecoverySweep reclaims rows stuck in CLAIMED past a threshold (crashed worker) and abandons after the retry budget; the budget boundary is shared with the worker's retry path. No transaction spans external I/O, so delivery is explicitly at-least-once (rare duplicate possible; no channel exposes a cheap idempotency key).
Retention: OutboxRetentionJob prunes old terminal (SENT/SENT_WITH_FALLBACK/SUPERSEDED/FAILED) rows.
Storm-collapse: a newer dispatch for the same step supersedes still-PENDING older rows (SUPERSEDED) to avoid duplicate/stale sends.
Step-error stamping: critical failures are reconciled onto the step (declarative step_error_stamped + partial index) so each failed critical row stamps its step exactly once, regardless of which failure path produced it.
Mixed-version-deploy safety (payload_schema_version): additive payload changes are tolerated (lenient deserialization, no bump); a backward-incompatible change bumps the version, and an old pod defers (never abandons/misdelivers) a too-new row to a newer pod.
instance_error storm batching: repeated identical errors collapse into one row (count + last_seen, 5-minute sliding window, advisory-lock-serialized) instead of flooding the admin Instance Errors view.
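
The retry-with-backoff item above could look roughly like the sketch below. The base delay, cap, and attempt budget are illustrative assumptions rather than the project's settings, and RetryBackoffSketch is a hypothetical stand-in for OutboxRetryPolicy.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Optional;

final class RetryBackoffSketch {
    // All three values are assumptions chosen for the example.
    static final Duration BASE_DELAY = Duration.ofSeconds(30);
    static final Duration MAX_DELAY = Duration.ofMinutes(30);
    static final int MAX_ATTEMPTS = 5;

    // Delay before retry number `attempt` (1-based): BASE_DELAY * 2^(attempt - 1), capped.
    static Duration delayFor(int attempt) {
        int exponent = Math.min(Math.max(attempt - 1, 0), 20); // keep the shift bounded
        Duration delay = BASE_DELAY.multipliedBy(1L << exponent);
        return delay.compareTo(MAX_DELAY) > 0 ? MAX_DELAY : delay;
    }

    // next_attempt_at for a transient failure, or empty once the retry budget
    // is spent and the row should be recorded as a permanent failure.
    static Optional<Instant> nextAttemptAt(int attemptsSoFar, Instant now) {
        if (attemptsSoFar >= MAX_ATTEMPTS) {
            return Optional.empty();
        }
        return Optional.of(now.plus(delayFor(attemptsSoFar)));
    }

    // The claim gate: a row is eligible only once its next_attempt_at has passed.
    static boolean isDue(Instant nextAttemptAt, Instant now) {
        return !nextAttemptAt.isAfter(now);
    }
}
```

With these example values the delays run 30 s, 60 s, 120 s, and so on, until the cap or the attempt budget is reached.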

4. Delivery-status visibility (frontend)

After an approval action the UI polls dispatch status (useDispatchStatus / usePostActionPolling) and shows a NotificationDispatchIndicator (surfaces failures; success/in-flight stays quiet); a bulk variant covers group actions (dispatch_group_id). A shared useDecisionDispatchHolder hook holds the post-action dispatch id across the reload remount (manage page + reference tab).

Delivery failures are surfaced as per-step inline badges (via NotificationFailuresContext + StepNode) instead of being lost silently. useNotificationFailures loads them per approval; useNotificationFailuresBatch loads them for the list in one batched fetch, avoiding a per-row N+1.

Backed by new status/ REST controllers, access-gated through a single NotificationVisibilityService (admin-or-originator + per-reference access checks, JSM lockout; the batch endpoint is bounded and access-checks every item).

5. Structural refactor of the build side (behaviour-preserving, snapshot-gated)

The two product-adapter monoliths were split into per-trigger-family builders (CallForAction, Reminder, Reassignment, Notification, + Watcher for Jira) over a per-product *NotificationSupport engine.
Shared bases extracted: AbstractProductNotificationAdapter (build pipeline + lifted EmailStepDispatch handling), AbstractNotificationSupport (assemble + single-recipient plan), AbstractNotificationDataFetcher (async fetch skeleton: mergeAccountIds + resolveCompleteEmailMap), AbstractTriggerBuilder (the shared support / content-builder / chat-plan-builder / EmailStepRowBuilder quartet for reminder/reassignment builders).
Chat delivery unified + made channel-agnostic: the per-product Slack stack (orchestrators, notifiers, the slack/dispatch/** dispatchers, JiraSlack…/ConfluenceSlack…) collapsed into one ChatMechanism behind a ChatSendService SPI, plus channel-neutral SPIs ChatChannel / ChannelBinding / ChatChannelRowAssembler / Chat{Lifecycle,Decision}UpdateService. Slack is now just the first concrete implementation; the engine no longer references "Slack" specifically. This is what turns "add MS Teams" into an additive change (see §6).
Channel-neutral plan building: a shared ChatPlanBuilder base owns the recipient-graph shaping (DM fan-out, delegate intents, group/channel recipients); MessageAssembler turns RecipientIntents into outbox rows via the ChatChannelRowAssembler SPI (one ChannelPlan per plan). Delegate emission is consistent across Jira/Confluence (both filter to resolved delegates).
Decision-update model extracted to a decision/ package (DecisionUpdateAssembler, StepDecisionUpdate, DecisionEntry, Vote*, ProgressInfo, …), channel-neutral so any chat channel renders it.
Self-contained collaborators pulled out: EmailStepRowBuilder, JsmNotificationLinkModeResolver, MailNotification / ExternalEmailLinks (email links), CompleteEmailMap.
Package layout: model/path/notification/{adapter, chat, mechanism, outbox, status, decision, email, reminder, slack, jira, confluence}.
DB-changeset hygiene: the incremental development changesets are consolidated (column/index folded into the create-table; the instance-error changes merged); a redundant pending index dropped; discriminator enums/columns follow the schema's _type convention (recipient_type, template_type) and the product enum is atlassian_product.
Targeted de-duplication, reduced null-passing (no-channel assemble overloads), and comment/Javadoc accuracy fixes. The build-side refactor as a whole is behaviour-preserving (snapshot-gated).

6. MS-Teams readiness (channel-agnostic delivery by construction)

The chat delivery/update path is built so a second chat channel is purely additive — new beans, no orchestration edits:

ChatMechanism + ChatSendService SPI: a TeamsSendService + a Teams ChatChannel/ChannelBinding + 2 mechanism beans is the whole delivery surface.
ChatLifecycleUpdateDispatcher / ChatDecisionUpdateDispatcher — registry-style composites (mirroring ChatChannels / MechanismRegistry) that fan out to all registered Chat*UpdateService impls, so a second impl doesn't break the engine's by-type injection. Engine call sites (ApprovalProcess, ExpirationScheduler, DynamicStepUpdateService) inject the dispatchers.
The channel-neutral plan/render/targeting types (ChannelPlan, ChatRenderInputs, ChatTargeting, ChannelBinding) carry no Slack-specific types; the Slack specifics live behind the SPI implementations.
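
The registry-style composite can be pictured as below. This is only a sketch: LifecycleUpdateService, its onApprovalExpired method, and the recording fake are hypothetical names standing in for the real Chat*UpdateService SPI, whose actual methods are not shown in this description.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of a channel's lifecycle-update SPI.
interface LifecycleUpdateService {
    void onApprovalExpired(long approvalId);
}

// Composite that engine call sites would inject instead of a single
// implementation; it deliberately does not implement the SPI itself,
// so it never ends up in its own list of delegates.
final class LifecycleUpdateDispatcherSketch {
    private final List<LifecycleUpdateService> services;

    LifecycleUpdateDispatcherSketch(List<LifecycleUpdateService> services) {
        this.services = List.copyOf(services);
    }

    // Fans the call out to every registered channel implementation.
    void onApprovalExpired(long approvalId) {
        for (LifecycleUpdateService service : services) {
            service.onApprovalExpired(approvalId);
        }
    }
}

// Test double that records which channel saw which approval.
final class RecordingService implements LifecycleUpdateService {
    private final String channel;
    private final List<String> log;

    RecordingService(String channel, List<String> log) {
        this.channel = channel;
        this.log = log;
    }

    @Override
    public void onApprovalExpired(long approvalId) {
        log.add(channel + ":" + approvalId);
    }
}
```

Because call sites depend on the dispatcher rather than on one SPI implementation, registering a second implementation leaves by-type injection at those call sites unambiguous.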

7. Automatic reminders (rewritten for cross-instance correctness)

AutomaticReminderScheduler (5-minute cron) anchors on a new approval.last_reminded_at column via a compare-and-set claim (claimAutomaticReminder): exactly one reminder per interval across multiple app instances, with catch-up on a missed/late tick — replacing stateless wall-clock-grid firing that could duplicate (multi-pod) or skip reminders.
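
The compare-and-set claim can be illustrated with an in-memory stand-in for the approval.last_reminded_at column. This is a sketch under assumptions: ReminderClaimSketch is hypothetical, the anchor is assumed to be non-null, and the real claimAutomaticReminder presumably performs the comparison in SQL rather than on an AtomicReference.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;

// In-memory stand-in for one approval's last_reminded_at value.
final class ReminderClaimSketch {
    private final AtomicReference<Instant> lastRemindedAt;
    private final Duration interval;

    ReminderClaimSketch(Instant initialAnchor, Duration interval) {
        this.lastRemindedAt = new AtomicReference<>(initialAnchor);
        this.interval = interval;
    }

    // What a scheduler instance reads before deciding whether to remind.
    Instant read() {
        return lastRemindedAt.get();
    }

    // Due once a full interval has elapsed since the anchor, however late the
    // tick is; this is what gives catch-up after a missed tick.
    boolean isDue(Instant anchor, Instant now) {
        return !now.isBefore(anchor.plus(interval));
    }

    // Succeeds only if the anchor is still the value the caller observed.
    // AtomicReference compares by identity, so callers must pass the instance
    // returned by read(); a SQL version would compare column values instead.
    boolean claim(Instant observedAnchor, Instant now) {
        if (!isDue(observedAnchor, now)) {
            return false;
        }
        return lastRemindedAt.compareAndSet(observedAnchor, now);
    }
}
```

If two instances observe the same anchor on the same tick, only the first claim succeeds; the second sees a changed anchor and sends nothing.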

Architecture

Async delivery flow — request thread renders + enqueues in-transaction → worker claims, sends, retries/falls-back, and recovers.

Build-side decomposition — per product, mirrored Jira + Confluence, over shared bases.

InternalSystemNotification (admin workflow-failure signals) is the one trigger that bypasses this pipeline — delivered synchronously by a thin Jira-only sender (no fan-out, no outbox, best-effort).

Scope

Whole-branch: the entire model/path/notification/** package rewritten + reorganized; old synchronous senders removed; the channel-agnostic chat layer and the FE status/failures UI added.
The build-side decomposition shrank the adapter monoliths from ~1200–1460 lines each (Jira / Confluence *NotificationAdapter) to the low hundreds, with the logic moved into focused per-trigger builders + shared bases.

Verification

Snapshot oracles (NotificationSnapshotJiraTest + NotificationSnapshotConfluenceTest) assert exact NotificationMessage output for every trigger × host-configuration; a zero diff confirms that the build-side refactor produces byte-identical output.
Enum-parity tests guard the manually-synced enums: every EmailTemplateType ↔ notification_template_type, and the model mechanism type ↔ jOOQ enum (so adding a value without the matching ALTER TYPE fails a unit test, not production).
Dispatcher delegation test proves Chat*UpdateDispatcher fans a call to every registered impl (the Teams-additive contract).
150+ notification test files (mechanisms, outbox DAO / worker / recovery / retention, snapshot oracles, adapters, status). Public constructors / build contract unchanged → no Spring/test wiring churn. Full Java suite green.
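
An enum-parity check of the kind listed above might be written along these lines. The helper and the three small enums are hypothetical stand-ins; the real tests compare EmailTemplateType and the model mechanism type against their jOOQ-generated counterparts.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class EnumParitySketch {
    // Collects the constant names of an enum type.
    static <E extends Enum<E>> Set<String> names(Class<E> type) {
        return Arrays.stream(type.getEnumConstants())
                .map(Enum::name)
                .collect(Collectors.toCollection(TreeSet::new));
    }

    // Names present in exactly one of the two enums; empty means parity.
    static <A extends Enum<A>, B extends Enum<B>> Set<String> mismatches(Class<A> a, Class<B> b) {
        Set<String> left = names(a);
        Set<String> right = names(b);
        Set<String> common = new TreeSet<>(left);
        common.retainAll(right);
        Set<String> diff = new TreeSet<>(left);
        diff.addAll(right);
        diff.removeAll(common);
        return diff;
    }
}

// Stand-ins for a model enum and its database-generated twin.
enum ModelMechanism { DIRECT_EMAIL, SLACK }

enum DbMechanism { DIRECT_EMAIL, SLACK }

// A database enum that is missing a value, as after a forgotten ALTER TYPE.
enum DbMechanismStale { DIRECT_EMAIL }
```

A unit test would assert that mismatches(...) is empty for each synced pair, so a missing value shows up by name in the failure message.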

QA notes

Exercise across both products and all host configs (Slack-enabled, direct-email, comment-mention / Jira-native):

All trigger types: call-for-action (user / moderator / group / vote, with delegates), reminders (automatic / custom / creator), step reassignment (added / removed, incl. email steps), plain notifications, approval expiration, watcher notifications (Jira).
Async behaviour: notification arrives shortly after the action (not blocking it); the dispatch indicator surfaces failures; per-step failure badges appear when a provider rejects a send; retries occur on transient failures (with backoff); a crashed/abandoned delivery is recovered by the sweep; a long send isn't reclaimed mid-flight; old outbox rows are pruned; a Slack channel post with unresolved @-mentions falls back (SENT_WITH_FALLBACK).
Admin visibility: repeated identical failures collapse into one instance_error row (count + last seen), not a flood.

Ready to deploy

Details

Priority

Assignee

Michał Błaszczykowski

Reporter

Kamil Zarychta

Labels

design_improvement, docs_update_required, roadmap

Time tracking

3w 1h logged
