Record did not land in staging (part 2)
Description
Contact
- @Bruno Furtado
Problem
We found a case where a client sent a request to the intake endpoint and received 200 (OK), but the record did not land in the staging table.
In a universe of 1,204,865 records, 454 were sent via intake but did not land in staging (0.037%). It looks like a small percentage, but the customer noticed it and is reporting it.
Context (not necessary, but nice to know)
Product, customer and environment
- We are talking about the product Clockin and its big customer: Telemont.
- We are studying a case where records were sent by the mobile app (`clockinmobile` connector).
- The records should be landing in the `clockinrecords` staging table.
- These 3 columns can be used as a composite PK: `deviceCode`, `employeePersonId` and `clockinDatetimeStr`.
- This customer sends 1 record per request, so the body is an array with 1 item.
- This is the staging table we are talking about: link.
Carol, records landing and steps
- We have an audit table where we can see all the records sent via intake (90-day retention).
SELECT * FROM `labs-app-mdm-production.intake`.records_landing LIMIT 1
- We have an audit table where we can see all the steps of those records (90-day retention).
SELECT * FROM `labs-app-mdm-production.intake`.records_steps LIMIT 1
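For a single record, the two audit tables can be joined on auditId to follow its full trail. A minimal sketch (assuming both tables expose auditId, as the investigation below relies on):
SELECT
  rl.publish_time AS publish_time_landing,
  rl.message_id,
  rl.auditId,
  rs.step,
  rs.responseCode,
  rs.publish_time AS publish_time_step
FROM
  `labs-app-mdm-production.intake`.records_landing AS rl
LEFT JOIN
  `labs-app-mdm-production.intake`.records_steps AS rs
ON
  rl.auditId = rs.auditId
WHERE
  rl.auditId = 'c9200ea61b506a3b'  -- example auditId, isolated in step 4 below
ORDER BY
  publish_time_step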
Investigation steps
1. Created table with all the records sent by intake
The main goal here is to filter all the records sent via intake for Telemont, connector `clockinmobile`, staging `clockinrecords`.
CREATE TABLE `labs-app-mdm-production.clockin.records_landing_telemont_20231006` AS (
SELECT
publish_time,
message_id,
auditId,
tenantId,
connectorId,
stagingType,
step,
responseCode,
JSON_VALUE(payload, '$[0].deviceCode') AS deviceCode,
JSON_VALUE(payload, '$[0].employeePersonId') AS employeePersonId,
JSON_VALUE(payload, '$[0].clockinDatetime') AS clockinDatetime,
JSON_VALUE(payload, '$[0].clockinDatetimeStr') AS clockinDatetimeStr
FROM
`labs-app-mdm-production.intake`.records_landing
WHERE
tenantId = 'e62a09d1e6ae4e9cab42be68d2a1006e'
AND connectorId = '9b827450be5945369b77bc76d5d30a30'
AND stagingType = 'clockinrecords'
)
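As a sanity check, the row count of this table should be in line with the 1,204,865 records mentioned in the Problem section. A sketch; the exact number depends on the audit retention window and on whether records_landing logs exactly one row per request:
SELECT
  COUNT(*) AS rows_in_landing_audit,
  COUNT(DISTINCT auditId) AS distinct_audit_ids
FROM
  `labs-app-mdm-production.clockin.records_landing_telemont_20231006`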
2. Created a table with the records sent by intake but not on staging
The goal here is to store the records that were sent via intake but did not land in the `clockinrecords` staging table of the `clockinmobile` connector.
CREATE TABLE `labs-app-mdm-production.clockin.records_not_landing_telemont_20231006` AS (
SELECT
rlt.*
FROM
`labs-app-mdm-production.clockin.records_landing_telemont_20231006` AS rlt
LEFT OUTER JOIN
`carol-e62a09d1e6ae4e9cab42.e62a09d1e6ae4e9cab42be68d2a1006e`.stg_clockinmobile_clockinrecords AS scc
USING (deviceCode, employeePersonId, clockinDatetimeStr)
WHERE
rlt.message_id IS NOT NULL
AND scc.mdmId IS NULL
)
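A quick count over this table should line up with the 454 records reported in the Problem section (a sketch; the DISTINCT guards against the same record having been sent more than once):
SELECT
  COUNT(*) AS rows_not_landed,
  COUNT(DISTINCT CONCAT(
    IFNULL(deviceCode, ''), '|',
    IFNULL(employeePersonId, ''), '|',
    IFNULL(clockinDatetimeStr, ''))) AS distinct_missing_records
FROM
  `labs-app-mdm-production.clockin.records_not_landing_telemont_20231006`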
3. Created a table with the records and steps that are not on staging
In addition to knowing which records sent via intake are not in staging, we also need to know which steps those records went through (`records_steps_telemont_20231006` below is presumably a filtered copy of the `records_steps` audit table, built the same way as step 1).
CREATE TABLE `labs-app-mdm-production.clockin.records_not_landing_steps_telemont_20231006` AS (
SELECT
rnlt.auditId,
rnlt.tenantId,
rnlt.connectorId,
rnlt.stagingType,
rnlt.publish_time AS publish_time_landing,
rst.publish_time AS publish_time_step,
rnlt.message_id AS message_id_landing,
rst.message_id AS message_id_step,
rst.step,
rst.responseCode,
rnlt.deviceCode,
rnlt.employeePersonId,
rnlt.clockinDatetimeStr
FROM
`labs-app-mdm-production.clockin.records_not_landing_telemont_20231006` AS rnlt
LEFT JOIN
`labs-app-mdm-production.clockin.records_steps_telemont_20231006` AS rst USING (auditId)
ORDER BY
deviceCode,
employeePersonId,
clockinDatetimeStr,
step
)
4. Isolating a case
This is a case we can study:
SELECT
*
FROM
`labs-app-mdm-production.clockin.records_not_landing_steps_telemont_20231006`
WHERE
auditId = 'c9200ea61b506a3b'
ORDER BY
message_id_step
These are the primary keys for audit id c9200ea61b506a3b:
- deviceCode: `00c83e1f3c492558`
- employeePersonId: `07398779682`
- clockinDatetimeStr: `2023-10-04T17:25:19.000-03:00`
This record is not in the staging table:
- Link to the query
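The linked query is not reproduced here; a minimal sketch of the staging check, using the PK values above, would be:
SELECT
  mdmId,
  deviceCode,
  employeePersonId,
  clockinDatetimeStr
FROM
  `carol-e62a09d1e6ae4e9cab42.e62a09d1e6ae4e9cab42be68d2a1006e`.stg_clockinmobile_clockinrecords
WHERE
  deviceCode = '00c83e1f3c492558'
  AND employeePersonId = '07398779682'
  AND clockinDatetimeStr = '2023-10-04T17:25:19.000-03:00'
-- expected result: 0 rows, confirming the record is not in staging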
Just to confirm, it is also not in the data model:
- Link to the query
5. Double check if the request was sent
As we can see in our logs, the request was sent and the response was 200.
https://cloudlogging.app.goo.gl/n7DYa41nDEkt77h4A
6. Record was not found in the unified environment either
Telemont is linked to a unified tenant, but the record is not there either:
SELECT
*
FROM
`carol-10d04fe8037e4c5a98a4.10d04fe8037e4c5a98a44e254db092ce`.stg_clockinmobile_clockinrecords
WHERE
mdmTenantId = 'e62a09d1e6ae4e9cab42be68d2a1006e'
AND deviceCode = '00c83e1f3c492558'
AND employeePersonId = '07398779682'
AND clockinDatetimeStr = '2023-10-04T17:25:19.000-03:00'
7. Records are in CDS
We exported CDS to BQ using this script (2023 records only):
#!/bin/bash
# Export the 2023 CDS parquet files of the clockinmobile connector as a
# BigQuery external table in the tenant dataset.
CAROL_TENANT_ID="e62a09d1e6ae4e9cab42be68d2a1006e"
CAROL_CONNECTOR_ID="9b827450be5945369b77bc76d5d30a30"
CAROL_CONNECTOR_NAME="clockinmobile"
GCP_PROJECT_ID="carol-e62a09d1e6ae4e9cab42"
BQ_DATASET_ID="${CAROL_TENANT_ID}"

gcloud config set project ${GCP_PROJECT_ID}

while read table
do
  CAROL_STAGING_ID="${table}"
  # Build an external table definition pointing at the 2023 parquet files in GCS
  bq mkdef \
    --source_format=PARQUET \
    "gs://prod-mdm-1-carol-internal-${CAROL_TENANT_ID}/staging-output/parquet/${CAROL_CONNECTOR_ID}_${CAROL_STAGING_ID}/2023-*.parquet" \
    > external_${CAROL_CONNECTOR_ID}_${CAROL_STAGING_ID}.json
  # Register the external table in the tenant dataset
  bq mk \
    --external_table_definition=external_${CAROL_CONNECTOR_ID}_${CAROL_STAGING_ID}.json \
    ${BQ_DATASET_ID}.external_${CAROL_CONNECTOR_NAME}_${CAROL_STAGING_ID}
done <<EOF
clockinrecords
EOF
Then we created a temporary table to speed up the queries:
CREATE TABLE `labs-app-mdm-production.clockin.records_cds_telemont_20231008` AS (
SELECT
mdmId,
mdmTenantId,
mdmCreated,
mdmLastUpdated,
deviceCode,
employeePersonId,
clockinDatetime,
clockinDatetimeStr
FROM
`carol-e62a09d1e6ae4e9cab42.e62a09d1e6ae4e9cab42be68d2a1006e`.external_clockinmobile_clockinrecords
)
And it looks like most of the records are in CDS (only 2 are not, and 1 has no PKs):
SELECT
rnlt.*
FROM
`labs-app-mdm-production.clockin.records_not_landing_telemont_20231006` AS rnlt
LEFT JOIN
`labs-app-mdm-production.clockin.records_cds_telemont_20231008` AS rct
USING (deviceCode, employeePersonId, clockinDatetimeStr)
WHERE
rnlt.auditId IS NOT NULL
AND rct.mdmId IS NULL
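A hedged breakdown of the same join by CDS status (a sketch only; it should show most records in CDS, 2 not in CDS and 1 without PKs, matching the note above):
SELECT
  CASE
    WHEN rnlt.deviceCode IS NULL
      OR rnlt.employeePersonId IS NULL
      OR rnlt.clockinDatetimeStr IS NULL THEN 'missing PKs'
    WHEN rct.mdmId IS NULL THEN 'not in CDS'
    ELSE 'in CDS'
  END AS cds_status,
  COUNT(*) AS records
FROM
  `labs-app-mdm-production.clockin.records_not_landing_telemont_20231006` AS rnlt
LEFT JOIN
  `labs-app-mdm-production.clockin.records_cds_telemont_20231008` AS rct
ON
  rnlt.deviceCode = rct.deviceCode
  AND rnlt.employeePersonId = rct.employeePersonId
  AND rnlt.clockinDatetimeStr = rct.clockinDatetimeStr
GROUP BY
  cds_status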
More details
- Spreadsheet with all the records that did not land for Telemont: link
AC
- The parquet file should carry the same auditId that is logged in the steps table in BigQuery.
- Create two new steps, BIGQUERY_CUSTOMER_WRITER and BIGQUERY_UNIFIED_WRITER, to be written to the steps table when records are received from intake.
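Once the new steps exist, a check along these lines could flag auditIds that are missing either writer step. This is a sketch only; it assumes records_steps carries tenantId and stagingType the same way records_landing does, which should be confirmed against the actual schema:
SELECT
  auditId,
  COUNTIF(step = 'BIGQUERY_CUSTOMER_WRITER') AS customer_writes,
  COUNTIF(step = 'BIGQUERY_UNIFIED_WRITER') AS unified_writes
FROM
  `labs-app-mdm-production.intake`.records_steps
WHERE
  tenantId = 'e62a09d1e6ae4e9cab42be68d2a1006e'   -- assumed column, mirroring records_landing
  AND stagingType = 'clockinrecords'              -- assumed column, mirroring records_landing
GROUP BY
  auditId
HAVING
  customer_writes = 0 OR unified_writes = 0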
@Robson Thanael Poffo ,
@André Pereira de Oliveira ,
You have just sent your prior comment as information to the customer in JIRA TOTVS issue https://jiraproducao.totvs.com.br/browse/DLABSCAPL-2381.
This issue was automatically transitioned to DONE after being deployed to production on Github.
https://github.com/totvslabs/mdm/releases/tag/v4.81.2
#inform
We would like to inform you that there was an internal change here at TOTVS LABS, and this action may have fully or partially resolved the item reported in this issue. Please validate it again and, if you have any questions, align with your focal point here at TOTVS LABS (Epic Owner: Eduardo Jacometti De Souza).
This issue was automatically transitioned to WAITING DEPLOY, as its PR was just merged into master branch in Github.
@André Pereira de Oliveira Regression is OK for this branch.
Manual tests
The second batch of records during STREAMING
This issue was automatically transitioned to QA REVIEW, as its PR was just approved in Github.
@André Pereira de Oliveira TEST FAILED
Regression tests: Test Case: TestSuite Carol Apps
Scenario: 500 Internal Server Error when trying to create a new organization
https://cloudlogging.app.goo.gl/CkaeqEiVHqXHtEwGA
Access Carol as a Carol admin
Try to create a new org
2. Manual tests
We've noticed the BIGQUERY_UNIFIED_WRITER step is not being populated for records that already exist on the customer tenant before the copy data is triggered. Follow these steps:
Create a customer tenant
Create a unified tenant
Create a Carol app with a git repo attached to it
Attach the app to the unified tenant
Release the app
Approve the app and install it on both tenants
Send records to the customer tenant (these will be the records that won't have the BIGQUERY_UNIFIED_WRITER step after being copied)
Trigger copy data from the unified tenant (in case your app is Hybrid)
Check in the intake steps table whether the copied records have the BIGQUERY_UNIFIED_WRITER step (see the sketch after the links below)
sandbox: https://totvslatreuspen.qarol.ai/newcustomer/carol-ui/connectors/nlp/purchaseorder
table stg_nlp_purchaseorder
Query: https://console.cloud.google.com/bigquery?project=carol-c267e3fc807391178c81&ws=!1m22!1m3!8m2!1s1003340884739!2s875441604bfa485f8c5f9a602febe65e!1m4!1m3!1scarol-c267e3fc807391178c81!2sbquxjob_3d6838fb_17d57825eaa!3sUS!1m4!4m3!1slabs-app-mdm-qa!2sintake!3srecords_steps!1m7!8m2!1s1003340884739!2sb785d864de854d82b0549d1c42891454!14m3!1scarol-c267e3fc807391178c81!2sbquxjob_3c8fd6fb_18b8fe29621!3sUS&j=bq:US:bquxjob_3d6838fb_17d57825eaa&page=queryresults
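For the last step, a check like the one below can be run against the QA intake steps table referenced in the query link above. A sketch only: the stagingType value 'purchaseorder' is an assumption derived from the stg_nlp_purchaseorder naming, and the column layout is assumed to mirror the production records_steps schema.
SELECT
  auditId,
  step,
  publish_time,
  responseCode
FROM
  `labs-app-mdm-qa.intake`.records_steps
WHERE
  stagingType = 'purchaseorder'  -- assumed value, from stg_nlp_purchaseorder
  AND step IN ('BIGQUERY_CUSTOMER_WRITER', 'BIGQUERY_UNIFIED_WRITER')
ORDER BY
  publish_time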
Evidence below. @Gabriel DAmore Marciano If necessary, we can talk about it tomorrow in a parking-lot session, before the daily.
INTAKE DATA SENT
TRIGGERING COPY DATA
@André Pereira de Oliveira Regression Running:
https://jenkins.qa-mdm-1.gke.totvslabs.com/job/ParallelTestSoapUISandbox/628/
https://jenkins.qa-mdm-1.gke.totvslabs.com/job/ParallelTestSoapUISandbox/629/
https://jenkins.qa-mdm-1.gke.totvslabs.com/job/ParallelTestSoapUISandbox/630/
This issue was automatically transitioned to REVIEW, as its PR (not DRAFT and not WIP) was just created in Github.
PR Title: fix: https://totvslabs.atlassian.net/browse/CAPL-4859#icft=CAPL-4859 save auditId on step STAGING_PARQUET_WRITER
@Robson Thanael Poffo ,
@André Pereira de Oliveira , @Cindy de Araujo Soares Moore , @Geny Isam Hamud Herrera
Flag was removed since you have just transitioned the issue status/column.
@Robson Thanael Poffo ,
@André Pereira de Oliveira , @Cindy de Araujo Soares Moore , @Geny Isam Hamud Herrera
Comment sent to JIRA TOTVS Brazil - https://jiraproducao.totvs.com.br/browse/DLABSCAPL-2381:
We would like to inform you that this issue has been included in a sprint with expected delivery by 2023-11-20 (it may be delivered earlier). There is no need to reply to this interaction if you have no questions. Any change in planning will be communicated again in this issue.
Best regards,
TOTVS LABS Team.
@Robson Thanael Poffo ,
@Geny Isam Hamud Herrera , @André Pereira de Oliveira , @Cindy de Araujo Soares Moore
This issue is planned to be delivered by 2023-11-20. You can check that in the issue's Due Date field.
Dates already planned for this issue: 2023-10-30, 2023-11-20
If the External Issue Link field is filled, the customer was also informed on JIRA TOTVS.
This issue was automatically transitioned to QA REVIEW, as its PR was just approved in Github.
• A technical question related to this card's AC, regarding the `BIGQUERY_CUSTOMER_WRITER` & `BIGQUERY_UNIFIED_WRITER` steps
• When querying the intake_step table in BQ, we can only see the steps for the customer tenant, even after a copy data has been executed on the unified tenant.
• In the screenshots below we can see the queries on the customer tenant vs the unified tenant
• Question: in order to improve data tracking, especially since it is the unified tenant that processes the records via SQL Processing, wouldn't it make sense to also write the `BIGQUERY_UNIFIED_WRITER` step on the unified tenant?
• If this is a topic for another card / a separate fix, please confirm in this thread. Heads-up: it is not clear where the steps should be written
• Since this is a technical topic tied to the card's goal of data-loss tracking, we did not involve Product initially. If this is a topic that should involve them, please let us know here.
• Unified tenantId: `80ebb92328694a42ac96b21daeb05f74`
• Customer tenantId: `31486b4a09dc4fb8beabd63b970458d9`
• cc
Edited on Slack
@André Pereira de Oliveira Topic shared with the BE team. Awaiting confirmation: https://totvslabs.slack.com/archives/G0189HHUK18/p1698682402529919
cc @Gabriel DAmore Marciano
@André Pereira de Oliveira TEST FAILED
BIGQUERY_UNIFIED_WRITER is missing from the customer tenant even after a copy data was performed on the unified tenant: https://totvsedykameo.qarol.ai/unifiedbqwriter/carol-ui/carol-app/7020a187bdf94f518887fc5f9e9ef033/tenants
https://console.cloud.google.com/bigquery?project=carol-c267e3fc807391178c81&ws=!1m18!1m4!1m3!1scarol-c267e3fc807391178c81!2sbquxjob_2e1f6f25_18b814eb55c!3sUS!1m4!1m3!1scarol-c267e3fc807391178c81!2sbquxjob_5a045c36_18b80ba894a!3sUS!1m7!8m2!1s1003340884739!2s6ff8f71760984d2c97f6fd527b476b25!14m3!1scarol-c267e3fc807391178c81!2sbquxjob_1f2aa3f0_18b81535ccf!3sUS&j=bq:US:bquxjob_3d6838fb_17d57825eaa&page=queryresults
Github user olivandre has just committed and the issue was sent back to the REVIEW column.
This issue was automatically transitioned to QA REVIEW, as its PR was just approved in Github.
@Robson Thanael Poffo ,
@André Pereira de Oliveira , @Geny Isam Hamud Herrera
Comment sent to JIRA TOTVS Brazil - https://jiraproducao.totvs.com.br/browse/DLABSCAPL-2381:
We have started working on your issue and will keep you informed!
Best regards,
TOTVS LABS Team.
@André Pereira de Oliveira Regression Running
https://jenkins.qa-mdm-1.gke.totvslabs.com/job/ParallelTestSoapUISandbox/606/
https://jenkins.qa-mdm-1.gke.totvslabs.com/job/ParallelTestSoapUISandbox/607/
https://jenkins.qa-mdm-1.gke.totvslabs.com/job/ParallelTestSoapUISandbox/608/
This issue was automatically transitioned to REVIEW, as its PR (not DRAFT and not WIP) was just created in Github.
PR Title: fix: https://totvslabs.atlassian.net/browse/CAPL-4859#icft=CAPL-4859 save auditId on step STAGING_PARQUET_WRITER
This issue was automatically transitioned to IN PROGRESS, as its branch was just created in Github.
Branch Name: CAPL-4859-fix-use-auditid-to-step-STAGING_PARQUET_WRITER
@André Pereira de Oliveira , @Geny Isam Hamud Herrera , @henrique.cavarsan
Here is an analysis of a recent case:
I picked a case from the report:
[query snippet not preserved]
I looked up the full record, considering both landing and steps:
[query snippet not preserved]
Here, filtering only the steps to make it easier to read:
[query snippet not preserved]
I updated the report to expose the auditId, so validation actions will be faster.
cc @Bruno Furtado , @Danilo Queiroz Mota
@André Pereira de Oliveira , more cases here:
[query/results not preserved]
A total of 1,087 missing records in 30 days across the clockinrecords staging tables (we have this staging table in two connectors: clockinmobile and clockinweb).
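For the clockinmobile side, the 30-day figure can be reproduced with something like the sketch below. Hedged: publish_time is assumed to be a TIMESTAMP, and the clockinweb connector would need its own connectorId and staging table, assumed to follow the stg_<connector>_clockinrecords naming.
SELECT
  COUNT(*) AS missing_last_30_days
FROM
  `labs-app-mdm-production.intake`.records_landing AS rl
LEFT JOIN
  `carol-e62a09d1e6ae4e9cab42.e62a09d1e6ae4e9cab42be68d2a1006e`.stg_clockinmobile_clockinrecords AS scc
ON
  JSON_VALUE(rl.payload, '$[0].deviceCode') = scc.deviceCode
  AND JSON_VALUE(rl.payload, '$[0].employeePersonId') = scc.employeePersonId
  AND JSON_VALUE(rl.payload, '$[0].clockinDatetimeStr') = scc.clockinDatetimeStr
WHERE
  rl.tenantId = 'e62a09d1e6ae4e9cab42be68d2a1006e'
  AND rl.connectorId = '9b827450be5945369b77bc76d5d30a30'
  AND rl.stagingType = 'clockinrecords'
  AND rl.publish_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND scc.mdmId IS NULL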
The step STAGING_PARQUET_WRITER generates a new auditId (see the query sketch after these steps):
Create a new sandbox
Create a new app
Install it on a customer tenant
Send data to intake (just 1 call)
Check the record_step table in BigQuery
Check that the parquet file was generated
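A sketch of the auditId check, assuming the QA project has the same intake records_landing / records_steps audit tables as production: STAGING_PARQUET_WRITER steps whose auditId never appears in the landing table would indicate the step minted its own auditId instead of reusing the intake one.
SELECT
  rs.auditId,
  rs.publish_time,
  rs.responseCode
FROM
  `labs-app-mdm-qa.intake`.records_steps AS rs
LEFT JOIN
  `labs-app-mdm-qa.intake`.records_landing AS rl
ON
  rs.auditId = rl.auditId
WHERE
  rs.step = 'STAGING_PARQUET_WRITER'
  AND rl.auditId IS NULL   -- no matching landing row: the auditId was newly generated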
@Robson Thanael Poffo ,
@André Pereira de Oliveira , @Geny Isam Hamud Herrera
Comment sent to JIRA TOTVS Brazil - https://jiraproducao.totvs.com.br/browse/DLABSCAPL-2381:
We would like to inform you that this issue has been included in a sprint with expected delivery by 2023-11-03 (it may be delivered earlier). There is no need to reply to this interaction if you have no questions. Any change in planning will be communicated again in this issue.
Best regards,
TOTVS LABS Team.
@Robson Thanael Poffo ,
@Geny Isam Hamud Herrera , @André Pereira de Oliveira
Comment sent to JIRA TOTVS Brazil - https://jiraproducao.totvs.com.br/browse/DLABSCAPL-2381:
We would like to inform you that this issue has been included in a sprint with expected delivery by 2023-10-31 (it may be delivered earlier). There is no need to reply to this interaction if you have no questions. Any change in planning will be communicated again in this issue.
Best regards,
TOTVS LABS Team.
Message thread link on #red-phone channel:
https://totvslabs.slack.com/archives/C03NT4US9J9/p1697217758230239
@Robson Thanael Poffo ,
@Geny Isam Hamud Herrera , @André Pereira de Oliveira
This issue is planned to be delivered by 2023-10-30. You can check that in the issue's Due Date field.
Dates already planned for this issue: 2023-10-30
If the External Issue Link field is filled, the customer was also informed on JIRA TOTVS.