Record did not land in staging (part2)

Description

Contact

  • @Bruno Furtado

Problem

We found a case where client sent a request to intake endpoint and receive 200 (ok) but the record did not land in staging table.

In a universe of 1,204,865 records, 454 were sent via intake but did not land in staging (0.037%). It seems like a small percentage, but the customer felt and is reporting it.

Context (not necessary but nice to now)

Product, customer and environment

  • We are talking about the product Clockin and their big customer: Telemont.
  • We are studyng a case where records were sent by the mobile app (clockinmobile connector).
  • The records should be landing in clockinrecords staging table.
  • These 3 columns can be used as a PKs: deviceCode, employeePersonId and clockinDatetimeStr.
  • This customer is sending 1 record per request, so the body is an array with 1 item.
  • This is the staging we are talking: link.

Carol, records landing and steps

  • We have our audit table where we can see all the records sent by intake (90 days of lifetime).
SELECT * FROM `labs-app-mdm-production.intake`.records_landing LIMIT 1
  • We have our audit table where we can see all the steps of the records (90 days of lifetime).
SELECT * FROM `labs-app-mdm-production.intake`.records_steps LIMIT 1

Investigation steps

1. Created table with all the records sent by intake

The main go here is filtering all the records sent by intake to Telemont, connector clockinmobile, staging clockinrecords.

CREATE TABLE `labs-app-mdm-production.clockin.records_landing_telemont_20231006` AS (
  SELECT
    publish_time,
    message_id,
    auditId,
    tenantId,
    connectorId,
    stagingType,
    step,
    responseCode,
    JSON_VALUE(payload, '$[0].deviceCode') AS deviceCode,
    JSON_VALUE(payload, '$[0].employeePersonId') AS employeePersonId,
    JSON_VALUE(payload, '$[0].clockinDatetime') AS clockinDatetime,
    JSON_VALUE(payload, '$[0].clockinDatetimeStr') AS clockinDatetimeStr
  FROM
    `labs-app-mdm-production.intake`.records_landing
  WHERE
    tenantId = 'e62a09d1e6ae4e9cab42be68d2a1006e'
    AND connectorId = '9b827450be5945369b77bc76d5d30a30'
    AND stagingType = 'clockinrecords'
)

2. Created a table with the records sent by intake but not on staging

Here we have the goal to store the records sent by intake but not land on clockinmobile connector and clockinrecords staging.

CREATE TABLE `labs-app-mdm-production.clockin.records_not_landing_telemont_20231006` AS (
  SELECT
    rlt.*
  FROM
    `labs-app-mdm-production.clockin.records_landing_telemont_20231006` AS rlt
  LEFT OUTER JOIN
    `carol-e62a09d1e6ae4e9cab42.e62a09d1e6ae4e9cab42be68d2a1006e`.stg_clockinmobile_clockinrecords AS scc
      USING (deviceCode, employeePersonId, clockinDatetimeStr)
  WHERE
    rlt.message_id IS NOT NULL
    AND scc.mdmId IS NULL
)

3. Created a table with the records and steps that are not on staging

In addition to knowing which records were sent by intake that were not on staging, we also should know what are the steps of these records.

CREATE TABLE `labs-app-mdm-production.clockin.records_not_landing_steps_telemont_20231006` AS (
  SELECT
    rnlt.auditId,
    rnlt.tenantId,
    rnlt.connectorId,
    rnlt.stagingType,
    rnlt.publish_time AS publish_time_landing,
    rst.publish_time AS publish_time_step,
    rnlt.message_id AS message_id_landing,
    rst.message_id AS message_id_step,
    rst.step,
    rst.responseCode,
    rnlt.deviceCode,
    rnlt.employeePersonId,
    rnlt.clockinDatetimeStr,
  FROM
    `labs-app-mdm-production.clockin.records_not_landing_telemont_20231006` AS rnlt
  LEFT JOIN
    `labs-app-mdm-production.clockin.records_steps_telemont_20231006`AS rst USING (auditId)
  ORDER BY
    deviceCode,
    employeePersonId,
    clockinDatetimeStr,
    step
)

4. Isolating a case

This is a case we can study:

SELECT
  *
FROM
  `labs-app-mdm-production.clockin.records_not_landing_steps_telemont_20231006`
WHERE
  auditId = 'c9200ea61b506a3b'
ORDER BY
  message_id_step

These are the primary keys for the c9200ea61b506a3b audit id:

  • deviceCode: 00c83e1f3c492558
  • employeePersonId: 07398779682
  • clockinDatetimeStr: 2023-10-04T17:25:19.000-03:00

This records is not on the staging:

Just to confirm, also, it is not on the data model:

5. Double check if the request was sent

As we can see on our logs, the request was sent and the response was 200.

https://cloudlogging.app.goo.gl/n7DYa41nDEkt77h4A

6. Record was not found on unified environment also

Telemont is linked to unified tenant but the record was not there also:

SELECT
  *
FROM
  `carol-10d04fe8037e4c5a98a4.10d04fe8037e4c5a98a44e254db092ce`.stg_clockinmobile_clockinrecords
WHERE
  mdmTenantId = 'e62a09d1e6ae4e9cab42be68d2a1006e'
  AND deviceCode = '00c83e1f3c492558'
  AND employeePersonId = '07398779682'
  AND clockinDatetimeStr = '2023-10-04T17:25:19.000-03:00'

6. Record are on CDS

We export CDS to BQ using this script (just the 2023' records):

#!/bin/bash


CAROL_TENANT_ID="e62a09d1e6ae4e9cab42be68d2a1006e"
CAROL_CONNECTOR_ID="9b827450be5945369b77bc76d5d30a30"
CAROL_CONNECTOR_NAME="clockinmobile"
GCP_PROJECT_ID="carol-e62a09d1e6ae4e9cab42"
BQ_DATASET_ID="${CAROL_TENANT_ID}"

gcloud config set project ${GCP_PROJECT_ID}

while read table
do

    CAROL_STAGING_ID="${table}"

    bq mkdef \
        --source_format=PARQUET \
        "gs://prod-mdm-1-carol-internal-${CAROL_TENANT_ID}/staging-output/parquet/${CAROL_CONNECTOR_ID}_${CAROL_STAGING_ID}/2023-*.parquet" \
    > external_${CAROL_CONNECTOR_ID}_${CAROL_STAGING_ID}.json

    bq mk \
        --external_table_definition=external_${CAROL_CONNECTOR_ID}_${CAROL_STAGING_ID}.json \
        ${BQ_DATASET_ID}.external_${CAROL_CONNECTOR_NAME}_${CAROL_STAGING_ID}

done <<EOF
clockinrecords
EOF

Created a temporary table to improve the queries:

CREATE TABLE `labs-app-mdm-production.clockin.records_cds_telemont_20231008` AS (
  SELECT
    mdmId,
    mdmTenantId,
    mdmCreated,
    mdmLastUpdated,
    deviceCode,
    employeePersonId,
    clockinDatetime,
    clockinDatetimeStr
  FROM
    `carol-e62a09d1e6ae4e9cab42.e62a09d1e6ae4e9cab42be68d2a1006e`.external_clockinmobile_clockinrecords
)

And it looks like most part of the records are on CDS (just 2 are not and 1 have not PKs):

SELECT
  rnlt.*
FROM
  `labs-app-mdm-production.clockin.records_not_landing_telemont_20231006` AS rnlt
LEFT JOIN
  `labs-app-mdm-production.clockin.records_cds_telemont_20231008` AS rct
    USING (deviceCode, employeePersonId, clockinDatetimeStr)
WHERE
  rnlt.auditId IS NOT NULL
  AND rct.mdmId IS NULL

More details

  • Spreadsheet with all the records that not land on Telemont: link

AC

  • The parquet file should match the same audit id logged on the table steps on BigQuery
  • Create two new steps BIGQUERY_CUSTOMER_WRITER AND BIGQUERY_UNIFIED_WRITER to be written on table steps when receiving records from intake.

Activity

Automation for Jira 6 November 2023, 18:23 Jira Internal Users

@Robson Thanael Poffo ,
@André Pereira de Oliveira ,

You have just sent your prior comment as an information to the customer in JIRA TOTVS issue https://jiraproducao.totvs.com.br/browse/DLABSCAPL-2381.

Automation for Jira 6 November 2023, 18:23 Jira Internal Users

This issue was automatically transitioned to DONE after being deployed to production on Github.

https://github.com/totvslabs/mdm/releases/tag/v4.81.2

Automation for Jira 6 November 2023, 18:23

#inform
Informamos que houve uma movimentação interna aqui no TOTVS LABS e que esta ação pode ter solucionado total ou parcialmente o item reportado nesta issue. Pedimos que valide novamente e que, em caso de dúvidas, alinhe com o seu ponto focal aqui no TOTVS LABS (Epic Owner: Eduardo Jacometti De Souza).

Automation for Jira 4 November 2023, 13:05 Jira Internal Users

This issue was automatically transitioned to WAITING DEPLOY, as its PR was just merged into master branch in Github.

Automation for Jira 4 November 2023, 13:05 Jira Internal Users

This issue was automatically transitioned to WAITING DEPLOY, as its PR was just merged into master branch in Github.

Douglas Coimbra Lopes 3 November 2023, 17:56 Jira Internal Users

@André Pereira de Oliveira Regression is OK for this branch.

Manual tests

  • The second bunch of records during STREAMING

Automation for Jira 3 November 2023, 13:07 Jira Internal Users

This issue was automatically transitioned to QA REVIEW, as its PR was just approved in Github.

Douglas Coimbra Lopes 2 November 2023, 12:59 Jira Internal Users

@André Pereira de Oliveira TEST FAILED

  1. Regression tests: Test Case: TestSuite Carol Apps

2. Manual tests

INTAKE DATA SENT

TRIGGERING COPY DATA

Automation for Jira 1 November 2023, 22:03 Jira Internal Users

This issue was automatically transitioned to REVIEW, as its PR (not DRAFT and not WIP) was just created in Github.

PR Title: fix: https://totvslabs.atlassian.net/browse/CAPL-4859#icft=CAPL-4859 save auditId on step STAGING_PARQUET_WRITER

Automation for Jira 31 October 2023, 13:57 Jira Internal Users

@Robson Thanael Poffo ,
@André Pereira de Oliveira , @Cindy de Araujo Soares Moore , @Geny Isam Hamud Herrera

Flag was removed since you have just transitioned the issue status/column.

Automation for Jira 30 October 2023, 20:09 Jira Internal Users

@Robson Thanael Poffo ,
@André Pereira de Oliveira , @Cindy de Araujo Soares Moore , @Geny Isam Hamud Herrera

Comment sent to JIRA TOTVS Brazil - https://jiraproducao.totvs.com.br/browse/DLABSCAPL-2381:

Informamos que esta issue foi incluída em uma sprint com previsão de entrega até 2023-11-20 (podendo ocorrer retorno antes). Não é necessário retornar esta interação, caso não tenha dúvidas. Qualquer alteração de planejamento será novamente notificada nesta issue.

Atenciosamente,
Equipe TOTVS LABS.

Automation for Jira 30 October 2023, 20:04 Jira Internal Users

@Robson Thanael Poffo ,
@Geny Isam Hamud Herrera , @André Pereira de Oliveira , @Cindy de Araujo Soares Moore , @Geny Isam Hamud Herrera

This issue was planned to be delivered until 2023-11-20. You can check that by consulting the issue in the Due Date field.

Dates already planned for this issue: 2023-10-30, 2023-11-20

If External Issue Link field is filled, customer was also informed on JIRA TOTVS.

Automation for Jira 30 October 2023, 19:33 Jira Internal Users

This issue was automatically transitioned to QA REVIEW, as its PR was just approved in Github.

Automation for Jira 30 October 2023, 17:21 Jira Internal Users

• Uma duvida tecnica voltado ao AC do , referente aos steps `BIGQUERY_CUSTOMER_WRITER` & `BIGQUERY_UNIFIED_WRITER`
• Ao consultar a tabela intake_step no BQ, so podemos visualizar os steps voltado a tenant customer, mesmo apos um copy data ter sido executado na unified.
• Nos prints abaixo podemos visualizar as buscas na tenant customer vs unified
• Duvida: Afim de aprimorar o data tracking, principalmente voltado ao ponto de que eh a unified quem processa os registros via SQL Processing, nao seria valido incluirmos o step `BIGQUERY_UNIFIED_WRITER` na tenant unified tambem?
• Caso seja esse um topico para outro card/correcao necessaria, peco que por favor confirmem nessa thread. headsup: nao fica claro no aonde deveria ser gravado os steps
• Por ser um topico tecnico, voltado ao objetivo do card que eh tracking de dataloss, inicialmente nao envolvemos produto. Caso seja um topico a envolveh-los, favor avisar aqui.
• Unified: tenantId: `80ebb92328694a42ac96b21daeb05f74`
• Customer : `31486b4a09dc4fb8beabd63b970458d9`
• cc

Edited on Slack

Douglas Coimbra Lopes 30 October 2023, 17:20 Jira Internal Users

@André Pereira de Oliveira Topic shared with the BE team. Awaiting confirmation: https://totvslabs.slack.com/archives/G0189HHUK18/p1698682402529919

cc @Gabriel DAmore Marciano

Automation for Jira 30 October 2023, 17:17 Jira Internal Users

• Uma duvida tecnica voltado ao AC do , referente aos steps `BIGQUERY_CUSTOMER_WRITER` & `BIGQUERY_UNIFIED_WRITER`
• Ao consultar a tabela intake_step no BQ, so podemos visualizar os steps voltado a tenant customer, mesmo apos um copy data ter sido executado na unified.
• Nos prints abaixo podemos visualizar as buscas na tenant customer vs unified
• Duvida: Afim de aprimorar o data tracking, principalmente voltado ao ponto de que eh a unified quem processa os registros via SQL Processing, incluirmos o step `BIGQUERY_UNIFIED_WRITER` na tenant unified tambem?
• Caso seja esse um topico para outro card/correcao necessaria, peco que por favor confirmem nessa thread. headsup: nao fica claro no aonde deveria ser gravado os steps
• Por ser um topico tecnico, voltado ao objetivo do card que eh tracking de dataloss, inicialmente nao envolvemos produto. Caso seja um topico a envolveh-los, favor avisar aqui.
• Unified: tenantId: `80ebb92328694a42ac96b21daeb05f74`
• Customer : `31486b4a09dc4fb8beabd63b970458d9`
• cc

Edited on Slack

Automation for Jira 30 October 2023, 17:13 Jira Internal Users

• Uma duvida tecnica voltado ao AC do , referente aos steps `BIGQUERY_CUSTOMER_WRITER` &amp; `BIGQUERY_UNIFIED_WRITER`
• Ao consultar a tabela intake_step no BQ, so podemos visualizar os steps voltado a tenant customer, mesmo apos um copy data ter sido executado na unified.
• Nos prints abaixo podemos visualizar as buscas na tenant customer vs unified
• Duvida: Afim de aprimorar o data tracking, principalmente voltado ao ponto de que eh a unified quem processa os registros via SQL Processing, incluirmos o step `BIGQUERY_UNIFIED_WRITER` na tenant unified tambem?
• Caso seja esse um topico para outro card/correcao necessaria, peco que por favor confirmem nessa thread. headsup: nao fica claro no aonde deveria ser gravado os steps
• Unified: tenantId: `80ebb92328694a42ac96b21daeb05f74`
• Customer : `31486b4a09dc4fb8beabd63b970458d9`
• cc

Sent by Slack

Automation for Jira 27 October 2023, 15:38 Jira Internal Users

Github user olivandre has just commited and issue was sent back to the REVIEW column.

Automation for Jira 27 October 2023, 14:45 Jira Internal Users

This issue was automatically transitioned to QA REVIEW, as its PR was just approved in Github.

Automation for Jira 26 October 2023, 16:30 Jira Internal Users

@Robson Thanael Poffo ,
@André Pereira de Oliveira , @Geny Isam Hamud Herrera

Comment sent to JIRA TOTVS Brazil - https://jiraproducao.totvs.com.br/browse/DLABSCAPL-2381:

Iniciamos a trabalhar em sua issue e manteremos você informado!

Atenciosamente,
Equipe TOTVS LABS.

Automation for Jira 25 October 2023, 21:05 Jira Internal Users

This issue was automatically transitioned to REVIEW, as its PR (not DRAFT and not WIP) was just created in Github.

PR Title: fix: https://totvslabs.atlassian.net/browse/CAPL-4859#icft=CAPL-4859 save auditId on step STAGING_PARQUET_WRITER

Automation for Jira 23 October 2023, 14:49 Jira Internal Users

This issue was automatically transitioned to IN PROGRESS, as its branch was just created in Github.

Branch Name: CAPL-4859-fix-use-auditid-to-step-STAGING_PARQUET_WRITER

Robson Thanael Poffo 20 October 2023, 17:32 Jira Internal Users

@André Pereira de Oliveira , @Geny Isam Hamud Herrera , @henrique.cavarsan

Segue uma analise de um caso recente:

Peguei um caso do relatório:

1

2

3

4

5

6

7

8

9

10

 
  
 
  
--select clockinsStruct.cloockinDateTime, count(*) total
select *
from `labs-app-mdm-production.a_poffo.divergences_clockinrecords` c
left join `labs-app-mdm-production.clockin.tenants` t on t.tenantId = c.tenantId
where clockinsStruct.cloockinDateTime < "2023-10-17" 
--group by 1
--order by 1 desc
order by clockinsStruct.cloockinDateTime desc

Busquei o registro completamente considerando landing e steps:

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

 
  
 
  
  select  struct(
            JSON_VALUE(clockins, '$.clockinDatetimeStr') AS cloockinDateTime,
            JSON_VALUE(clockins, '$.employeePersonId') AS clockinPersonId,
            JSON_VALUE(clockins, '$.deviceCode') AS clockinDeviceCode
          ) clockinsStruct,
          tenantId,
          CONCAT(tenantId, "_", JSON_VALUE(clockins, '$.clockinDatetimeStr'), "_", JSON_VALUE(clockins, '$.employeePersonId'), "_", JSON_VALUE(clockins, '$.deviceCode')) as recordKey, *
  from (
    select l.*, clockins, rs.* except(tenantId)
    FROM `labs-app-mdm-production.intake.records_landing` l, UNNEST(JSON_EXTRACT_ARRAY(payload, "$.")) clockins
    left join `labs-app-mdm-production.intake.records_steps` rs on rs.auditid = l.auditid
    WHERE l.publish_time >= TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL -5 DAY)
    and l.tenantId in ('7c9ab4739cd5405fb4b633280634b35c') --(SELECT tenantId FROM `labs-app-mdm-production.clockin`.tenants)
    and l.stagingType = 'clockinrecords'
  )
where JSON_VALUE(clockins, '$.clockinDatetimeStr') = '2023-10-16T14:35:01.852-03:00'

Aqui filtrando apenas steps para facilitar a leitura:

1

2

3

4

5

 
  
 
  
select *
from `labs-app-mdm-production.intake.records_steps` rs
where rs.auditId = 'cc48cceff2d1d90c'

Eu alterei o relatório para deixar disponivel o auditId, assim teremos ações de validação mais rapidas.

cc @Bruno Furtado , @Danilo Queiroz Mota

Robson Thanael Poffo 20 October 2023, 05:43 Jira Internal Users

@André Pereira de Oliveira , more cases here:

1

2

3

4

5

6

 
  
select tenantName, count(*) total
from `labs-app-mdm-production.a_poffo.divergences_clockinrecords` c
left join `labs-app-mdm-production.clockin.tenants` t on t.tenantId = c.tenantId
where clockinsStruct.cloockinDateTime < "2023-10-19" 
group by 1

A total of 1087 missing records in 30 days for any clockinrecords staging table (we have this staging table in two connectors: clockinmobile and clockinweb).

André Pereira de Oliveira 19 October 2023, 21:55 Jira Internal Users

the step STAGING_PARQUET_WRITER generates a new auditId

  • Create a new Sandbox

  • Create a new APP

  • Install in customer client

  • Sent data to intake (just 1 call)

  • checked the table record_step in BigQuery

  • checked the parquet file was generated

Automation for Jira 13 October 2023, 19:22 Jira Internal Users

@Robson Thanael Poffo ,
@André Pereira de Oliveira , @Geny Isam Hamud Herrera

Comment sent to JIRA TOTVS Brazil - https://jiraproducao.totvs.com.br/browse/DLABSCAPL-2381:

Informamos que esta issue foi incluída em uma sprint com previsão de entrega até 2023-11-03 (podendo ocorrer retorno antes). Não é necessário retornar esta interação, caso não tenha dúvidas. Qualquer alteração de planejamento será novamente notificada nesta issue.

Atenciosamente,
Equipe TOTVS LABS.

Automation for Jira 13 October 2023, 19:22 Jira Internal Users

@Robson Thanael Poffo ,
@Geny Isam Hamud Herrera , @André Pereira de Oliveira , @Geny Isam Hamud Herrera

Comment sent to JIRA TOTVS Brazil - https://jiraproducao.totvs.com.br/browse/DLABSCAPL-2381:

Informamos que esta issue foi incluída em uma sprint com previsão de entrega até 2023-10-31 (podendo ocorrer retorno antes). Não é necessário retornar esta interação, caso não tenha dúvidas. Qualquer alteração de planejamento será novamente notificada nesta issue.

Atenciosamente,
Equipe TOTVS LABS.

Automation for Jira 13 October 2023, 19:22 Jira Internal Users

Message thread link on #red-phone channel:

https://totvslabs.slack.com/archives/C03NT4US9J9/p1697217758230239

Automation for Jira 13 October 2023, 19:22 Jira Internal Users

@Robson Thanael Poffo ,
@Geny Isam Hamud Herrera , @André Pereira de Oliveira , @Geny Isam Hamud Herrera

This issue was planned to be delivered until 2023-10-30. You can check that by consulting the issue in the Due Date field.

Dates already planned for this issue: 2023-10-30

If External Issue Link field is filled, customer was also informed on JIRA TOTVS.