[SQL Processing] Identify different trigger sources of a SQL Processing task

Description

CAPL - Story default text according to the team DoR (Definition of Ready)

01 - PERSON OF CONTACT (PERSON THAT CAN ANSWER QUESTIONS ABOUT THE PROBLEM):

@Renan Schroeder @Geny Isam Hamud Herrera

02 - PROBLEM (WHAT'S THE CURRENT PROBLEM SCENARIO OR PAIN TO BE RESOLVED?):

Today the task entity has a flag indicating whether it was triggered by a pipeline scheduler (mdmTaskFromSchedule). This field is a boolean with no default value assigned, so every task triggered by Carol Platform has the flag set to false, except tasks triggered by a scheduled pipeline (ScheduledTask.getTaskToSubmit).

The orchestrator also triggers SQL processing tasks independently through HTTP requests to the following endpoints:

  • /bigQuery/processQuery
  • carolApps/pipelines/process
  • carolApps/{carolAppId}/pipelines/process/{pipelineName}
  • tenantApps/pipelines/process
  • tenantApps/{carolAppId}/pipelines/process/{pipelineName}

03 - GOAL (DESCRIBE THE PROPOSED SOLUTION):

We need to identify the different sources that trigger a SQL Processing task on Carol Platform, so that we can control which sources are considered when fetching the datetime of the last successful task.

One option is to create an enum on the Task class that identifies the trigger source, with the following possible values:

  • TaskTriggerSource (Enum):
    • SCHEDULER:
      • All SQL Processing tasks scheduled by pipelines in Carol Apps.
    • USER:
      • All SQL Processing tasks that were triggered directly by a user in the UI.
    • ORCHESTRATOR:
      • All SQL Processing tasks that were triggered through the /processQuery endpoint by the orchestrator application.
    • PYCAROL:
      • All SQL Processing tasks that were triggered through the /processQuery endpoint by the pyCarol Python library.
    • PLATFORM:
      • Any other tasks that aren't SQL Processing tasks; or
      • Any HTTP request to the endpoints mentioned above that does not carry a known User-Agent.
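
The proposed enum could be sketched roughly as follows. This is only an illustration in Java, not the actual platform code: the method name fromUserAgent and the User-Agent substrings it matches ("pycarol", "orchestrator", "mozilla") are assumptions made for the example, and the real header values would need to be confirmed. Unknown or missing User-Agents fall back to PLATFORM, as described in the list above.

```java
import java.util.Locale;

/** Hypothetical enum sketching the trigger sources proposed in this story. */
enum TaskTriggerSource {
    SCHEDULER, USER, ORCHESTRATOR, PYCAROL, PLATFORM;

    /**
     * Resolves the source of an HTTP-triggered SQL Processing task from the
     * User-Agent header. The matched substrings are placeholders (assumptions),
     * not the real User-Agent values sent by each client.
     */
    static TaskTriggerSource fromUserAgent(String userAgent) {
        if (userAgent == null || userAgent.trim().isEmpty()) {
            return PLATFORM; // no User-Agent at all: unknown caller
        }
        String ua = userAgent.toLowerCase(Locale.ROOT);
        if (ua.contains("pycarol")) {
            return PYCAROL;
        }
        if (ua.contains("orchestrator")) {
            return ORCHESTRATOR;
        }
        if (ua.contains("mozilla")) {
            return USER; // assumption: browser requests come from the UI
        }
        return PLATFORM; // User-Agent present but not recognized
    }
}
```

SCHEDULER is never returned here because scheduled tasks are not created through an HTTP request; they would be tagged where the scheduler submits the task.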

Example:

If the last execution of a SQL Processing task was triggered by a scheduled pipeline, the next execution of the same task may be triggered by an orchestrator request (because of an architectural deficiency or for any other reason), and a later execution may be triggered by the scheduler again.

The only exception to this rule is when the user triggers the SQL Processing task directly from the UI: in that case, the task execution efficiency control should ignore the last successful task execution and proceed with the execution.

| Datetime | Trigger Source | Last Trigger Source | checkExistsDataToProcess | Rule |
|---|---|---|---|---|
| 2023-01-11 00:05:00 | SCHEDULER | SCHEDULER | True | Get last successful datetime from task WHERE task_trigger_source IN ('SCHEDULER', 'ORCHESTRATOR') to check if there is data to be processed |
| 2023-01-11 00:17:00 | ORCHESTRATOR | SCHEDULER | True | Get last successful datetime from task WHERE task_trigger_source IN ('SCHEDULER', 'ORCHESTRATOR') to check if there is data to be processed |
| 2023-01-11 00:18:00 | PLATFORM | ORCHESTRATOR | False | Proceed with task execution without checking if there is data to be processed |
| 2023-01-11 00:18:30 | PLATFORM | ORCHESTRATOR | True | Get last successful datetime from task WHERE task_trigger_source IN ('SCHEDULER', 'ORCHESTRATOR') to check if there is data to be processed |
| 2023-01-11 00:25:00 | SCHEDULER | PLATFORM | True | Get last successful datetime from task WHERE task_trigger_source IN ('SCHEDULER', 'ORCHESTRATOR') to check if there is data to be processed |
| 2023-01-11 00:27:00 | PYCAROL | SCHEDULER | False | Proceed with task execution without checking if there is data to be processed |
| 2023-01-11 00:28:00 | PYCAROL | PYCAROL | True | Get last successful datetime from task WHERE task_trigger_source IN ('SCHEDULER', 'ORCHESTRATOR') to check if there is data to be processed |
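
The rule in the table could be sketched as below. This is a minimal illustration under assumptions, not the platform implementation: the FinishedTask and LastSuccessLookup classes, their fields, and the in-memory history list are hypothetical stand-ins for the real task entity and its query. The sketch only shows the decision: skip the lookup for USER triggers or when checkExistsDataToProcess is false, otherwise take the latest successful execution whose source is SCHEDULER or ORCHESTRATOR.

```java
import java.time.LocalDateTime;
import java.util.EnumSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Repeated here so the sketch compiles on its own.
enum TaskTriggerSource { SCHEDULER, USER, ORCHESTRATOR, PYCAROL, PLATFORM }

/** Hypothetical, simplified view of a finished task. */
final class FinishedTask {
    final TaskTriggerSource source;
    final LocalDateTime finishedAt;
    final boolean successful;

    FinishedTask(TaskTriggerSource source, LocalDateTime finishedAt, boolean successful) {
        this.source = source;
        this.finishedAt = finishedAt;
        this.successful = successful;
    }
}

final class LastSuccessLookup {
    // Sources whose successful runs count as the reference, per the Rule column.
    static final Set<TaskTriggerSource> REFERENCE_SOURCES =
            EnumSet.of(TaskTriggerSource.SCHEDULER, TaskTriggerSource.ORCHESTRATOR);

    /**
     * Returns the datetime to compare against when checking for data to process,
     * or empty when the task should simply be executed.
     */
    static Optional<LocalDateTime> referenceDatetime(
            TaskTriggerSource current,
            boolean checkExistsDataToProcess,
            List<FinishedTask> history) {
        // USER triggers and disabled checks always proceed with execution.
        if (current == TaskTriggerSource.USER || !checkExistsDataToProcess) {
            return Optional.empty();
        }
        LocalDateTime latest = null;
        for (FinishedTask task : history) {
            boolean counts = task.successful && REFERENCE_SOURCES.contains(task.source);
            if (counts && (latest == null || task.finishedAt.isAfter(latest))) {
                latest = task.finishedAt;
            }
        }
        return Optional.ofNullable(latest);
    }
}
```

In the real service the loop would presumably be a query filtered on the trigger source column; the list is used here only to keep the example self-contained.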

04 - WHO CAN USE THIS FEATURE (USER ROLES):
05 - ASSETS (FIGMA LINKS, RELEVANT DOCUMENTATION LINKS, JSON EXAMPLES, ETC):
06 - ACCEPTANCE CRITERIA:

  • Every task should be assigned one of the sources available in TaskTriggerSource. The default value for any task should be "PLATFORM", but SQL processing tasks should never keep this default: they must use "PYCAROL", "ORCHESTRATOR", "SCHEDULER" or "USER".
  • Every request to any of the endpoints mentioned above that can create SQL processing tasks should create tasks whose source is USER, ORCHESTRATOR or PYCAROL, nothing else.
    • This depends on the User-Agent sent in the request header.
  • Every scheduled task should be created with source SCHEDULER, unless the user triggers the task reprocessing from the UI (in which case it must be USER).
  • All other tasks created by the Platform should have source PLATFORM.