Response of endpoint for Tenant app Settings is not consistent (204)

Description

01 - PERSON OF CONTACT (PERSON THAT CAN ANSWER QUESTIONS ABOUT THE PROBLEM):

@Lucas Noetzold @Bruno Furtado

02 - PROBLEM (WHAT'S THE ISSUE?):

@Bruno Furtado reported that the endpoint for listing Tenant App configurations, {{GET /tenantApps/

{tenantAppId}/settings}}, is showing inconsistent and incorrect behavior: the endpoint might return 200 OK and moments later, for the same endpoint, same tenant and same app id, it returns 204 NO CONTENT, again, after a while, it returns 200 OK with the correct content.

This erratic behavior can be seen in the logs.

log query
select
timestamp,
insert_id,
json_value(json_payload.environmentId) as tenantId,
json_value(json_payload.audit.userLogin) as userLogin,
json_value(json_payload.audit.pathParams.tenantAppId) as tenantAppId,
json_value(json_payload.audit.responseCode) as responseCode,
json_value(json_payload.traceId) as traceId,
json_payload
from
`labs-app-mdm-production.global.mdmAudit._AllLogs`
where json_value(json_payload.audit.urlTemplate) like '%/tenantApps/{tenantAppId}

/settings'
and json_value(json_payload.audit.responseCode) = "204"
limit 100

resource.labels.namespace_name="mdm"
resource.type="k8s_container"
resource.labels.container_name="ui"
jsonPayload.audit.urlTemplate=("/api/v1/tenantApps/{tenantAppId}/settings" OR "/api/v3/tenantApps/{tenantAppId}/settings")
jsonPayload.audit.responseCode="204"

On 16/02/24 we saw a huge spike in such errors, which caused this card to be issued, after it was reported by the customers.

Additional info from first analysis

First occurrence found in logs is of 2023-08-20 19:04:25.143 (might just be because older logs were already discarded).

Looking at the code, this endpoint is actually configured to return 204 NO CONTENT if a RecordNotFoundException is thrown, at TenantAppSettingResource::findAll:157-158:

Logging this RecordNotFoundException might be a good first step, so we can have more information to analyze.

This resembles a problem we are already familiar with where, when overloaded, Elasticsearch returns 404 NOT FOUND for entities that actually do exist, in code that results in a RecordNotFoundException. This could be the reason but, looking at the graphs, ES didn’t seem to be overloaded at the time most of the errors happened:

Still, it is worth investigating. You can look at all metrics for Elasticsearch at the time here.

04 - LINKS (ADD A LINK TO THE BUG OR TO THE TENANT):

PRDE issue:

Jira TOTVS issue: https://jiraproducao.totvs.com.br/browse/DLABSCAPL-4101

05 - EXPECTED BEHAVIOR (LIST THE EXPECTED BEHAVIORS TO CONSIDER THIS BUG AS DONE):

  1. API response should be consistent and correct
    1. 204 if there is no data, consistently
    2. 200 when there is data, consistently

Findings

Finding 1

The period with the highest count of 204 is 15/02 between 9:54 and 13:00.

The issue does not seem to be related to elasticsearch, all its metrics were stable during that period.

Seems like the 204 response we’re getting is the one produced at TenantAppSettingResource:158, logs were added to it with a marker TEMP1 to help analyze, and we can see that the TEMP1 marker strongly correlate with the 204 responses produced by the audit filter for the same endpoint:

Finding 2

Picking on a specific log you can see that the exception message says:

Tenant app or settings not found for id or name "clockinapp", for tenant ID d053c53b04d2412d892ada20c4100e65

Record not found for the type mdmTenantAppFile, field mdmTenantAppId with key c927959f92e6434bb048b1d0c2c2b19d in the tenant d053c53b04d2412d892ada20c4100e65

Now, if you look at elasticsearch, the app clockinapp fot the tenant d053c53b04d2412d892ada20c4100e65 is:

{
  "mdmAppIconUrl": "https://cdn.carol.ai/static/carolApps/icons/carolApp--c.png",
  "mdmAppPictureUrl": [],
  "mdmAppPoweredBy": {
      "en-US": "Carol"
  },
  "mdmAppSubscriptionUrl": "/apps/clockinapp/12.1.24020008/index.html",
  "mdmAppVersion": "12.1.24020008",
  "mdmCarolAppId": "89a5f410558345fd94f74cde07886ef4",
  "mdmConnectorGroupSelected": {
      "mdmName": "totvsrm",
      "mdmLabel": {
          "en-US": "TOTVS RM"
      },
      "mdmConnectorNames": [
          "totvs",
          "clockinweb",
          "clockinmobile"
      ]
  },
  "mdmCreated": "2021-09-28T23:53:44.888Z",
  "mdmCreatedUser": "adm.clockin@ima.sp.gov.br",
  "mdmDescription": {
      "en-US": "Aplicativo para permitir o registro eletrônico do ponto eletrônico através do reconhecimento facial. Aplicativo móvel disponível na Play Store (dispositivos Android)."
  },
  "mdmEntityType": "mdmTenantApp",
  "mdmId": "5e847920ca774683a8947b15a8deddf2",
  "mdmInstallationTaskId": "4673ffd0d2f14923bbdbcec8614c3ebc",
  "mdmInstallationTaskStatus": "COMPLETED",
  "mdmLabel": {
      "en-US": "TOTVS RH Clock In Backoffice"
  },
  "mdmLastUpdated": "2024-02-23T22:53:24.518Z",
  "mdmName": "clockinapp",
  "mdmStatus": "ACTIVE",
  "mdmTenantId": "d053c53b04d2412d892ada20c4100e65",
  "mdmUpdatedUser": "adm.clockin@ima.sp.gov.br",
  "mdmPipelineAllowedEnvironments": []
}

You can see that the app ID is 5e847920ca774683a8947b15a8deddf2, whilst the log says it is looking for app settings with the ID c927959f92e6434bb048b1d0c2c2b19d in this tenant. I.e. it’s looking for the wrong app ID.

An app with the ID c927959f92e6434bb048b1d0c2c2b19d do exist, but for another tenant.

Finding 3

My initial suspicion, that the app c927959f92e6434bb048b1d0c2c2b19d was cached for tenants that were not his own, turned out not to be true. We cleaned the TENANT_APP and TENANT_APP_KEY_BY_NAME, but the errors did not stop after that (not even for a time).

Activity

Automation for Jira 18 March 2024, 18:19 Jira Internal Users

@Bruno Furtado ,
@MARCOS STUMPF , @Cindy de Araujo Soares Moore , @Cindy de Araujo Soares Moore , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/PRDE-3456#icft=PRDE-3456, according to mention in the description field.

Automation for Jira 12 March 2024, 20:23 Jira Internal Users

Github user douglascoimbra has just approved a PR (added as Shard Assignee in this Jira issue).

fix: https://totvslabs.atlassian.net/browse/CAPL-5518#icft=CAPL-5518 Fix in TenantApp cache

Automation for Jira 12 March 2024, 19:06 Jira Internal Users

Github user Damore has just approved a PR (added as Shard Assignee in this Jira issue).

fix: https://totvslabs.atlassian.net/browse/CAPL-5518#icft=CAPL-5518 Fix in TenantApp cache

Automation for Jira 12 March 2024, 17:27 Jira Internal Users

This issue was automatically transitioned to WAITING DEPLOY, as its PR was just merged into master branch in Github.

Automation for Jira 12 March 2024, 17:26 Jira Internal Users

Github user douglascoimbra has just approved a PR (added as Shard Assignee in this Jira issue).

fix: https://totvslabs.atlassian.net/browse/CAPL-5518#icft=CAPL-5518 Filtrando TenantApp por name e tenantId

Automation for Jira 12 March 2024, 16:21 Jira Internal Users

This issue was automatically transitioned to QA REVIEW, as its PR was just approved in Github.

Automation for Jira 12 March 2024, 14:53 Jira Internal Users

Card validado pelo time de QA. Regression e testes manuais OK. Pendente o code review. cc

• Para prosseguirmos, pendente apenas o deep dive voltado as (Notification Schema e SQL processing) nao mais executando como Internal. (Em production as tasks estao corretamente como Internal)

Edited on Slack - plataforma-carol-internal - Douglas Coimbra Lopes

Automation for Jira 12 March 2024, 14:47 Jira Internal Users

Card validado pelo time de QA. Regression e testes manuais OK. Pendente o code review. cc

• Para prosseguirmos, pendente apenas o deep dive voltado as (Notification Schema e SQL processing) nao mais executando como Internal

Edited on Slack - plataforma-carol-internal - Douglas Coimbra Lopes

Automation for Jira 12 March 2024, 14:46 Jira Internal Users

Card validado pelo time de QA. Regression e testes manuais OK. Pendente o code review. cc

• Para prosseguirmos, pendente apenas o deep dive voltado as Internal tasks (Notification Schema e SQL processing) nao mais executando como Internal

Sent by Slack - plataforma-carol-internal - Douglas Coimbra Lopes

Douglas Coimbra Lopes 12 March 2024, 14:44 Jira Internal Users

BUSCA DE SETTING NO CUSTOMER ANTES DE DELETAR A SETTING PELO DEV APP OWNER

DELETANDO APP SETTING DA DEV TENANT E LIBERANDO NOVA VERSAO

INSTALANDO APP NA UNIFIED + CHECKOUT

CHECKOUT REALIZADO

CUSTOMER COM O APP SETTING

CUSTOMER INSTALANDO AGORA A NEW VERSION SEM APP SETTING

Gabriel DAmore Marciano 8 March 2024, 01:14 Jira Internal Users

Boa noite @Bruno Furtado Devido a semana de integração do time, devo trazer mais atualizações sobre o tema na segunda-feira

Bruno Furtado 6 March 2024, 18:18 Jira Internal Users

Olá pessoal, temos alguma atualização que possa ser passada ao cliente? O item tende a ser entregue até o final da sprint no dia 25/mar? Obrigado.

Automation for Jira 1 March 2024, 19:01 Jira Internal Users

@Bruno Furtado ,
@Pedro Buzzi , @Lucas Noetzold , @Cindy de Araujo Soares Moore , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was planned to be delivered until 2024-03-25. You can check that by consulting the issue in the Due Date field.

Dates already planned for this issue: 2024-03-01, 2024-03-25

If External Issue Link field is filled, customer was also informed on JIRA TOTVS.

Lucas Noetzold 29 February 2024, 01:48 Jira Internal Users

@Cindy de Araujo Soares Moore segue um resumo, para dar continuidade na análise:

  1. Bruno detectou o problema: endpoint GET tenantApps/{tenantAppId}/settings, que lista todos os settings de um tenant app (por id ou nome), estava retornando 200 OK (com dados) e 204 NO CONTENT de forma intermitente (para um mesmo app);

  2. Identifiquei que no TenantAppSettingsResource::findAll, linha 158, é onde esse endpoint retorna 204 quando há uma RecordNotFoundException

  3. Adicionei um log com um marker temporário TEMP1 para ver a stack e identificar de onde essa RecordNotFoundException estava vindo (podia ser do settings, ou do próprio app, ou sei lá…);

  4. com esse marker deu pra ver que todos os RecordNotFound estão acontecendo quando esse endpoint é feito a busca por nome do clockinapp, e não conseguiram encontrar o tenant app com id c927959f92e6434bb048b1d0c2c2b19d

    1. esse tenant app existe, c927959f92e6434bb048b1d0c2c2b19d, ele é o id do clockin da tenant clockinunified

    2. já os logs vêm de várias tenants, e não da clockinunified

    3. ele não encontra o app por que esse app não é da tenant que está buscando

    4. os RecordNotFound são produzidos quando ele tenta buscar os settings, e não o tenantApp

    5. tem alguns logs com esse marker que não são do clockin, mas se você investigar vai ver que esses são justamente os casos corretos, onde o app realmente não tem settings e o endpoint funcionou como deveria;

  5. com base nisso, olhando o código, dá pra deduzir que:

    1. na linha 128, ou 132, ele está obtendo o tenant app da unificada, com id c927959f92e6434bb048b1d0c2c2b19d

    2. na linha 151 ele tenta buscar os settings e aí não encontra o app, pois de fato ele não pertence à tenant, ou seja, o problema é que outras tenants estão obtendo uma referencia para o TenantApp do clockin da unificada, na linha 128 ou 132;

  6. hoje cedo limpei os caches de TENANT_APP e TENANT_APP_KEY_BY_NAME, pra eliminar a possibilidade de ser apenas cache poluído, mas não foi o caso, o problema permaneceu;

  7. já analisei bastante o código, não encontrei nenhuma falha de lógica, dentro dos services, que poderia justificar ele estar encontrando o app da tenant errada, mas.. alguma coisa deve ter!

coloquei mais links e informações na descrição do próprio card, pra ajudar na investigação

Automation for Jira 28 February 2024, 14:07 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/PRDE-3456#icft=PRDE-3456, according to mention in the description field.

Automation for Jira 28 February 2024, 13:59 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/PRDE-3456#icft=PRDE-3456, according to mention in the description field.

Automation for Jira 27 February 2024, 23:06 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/PRDE-3456#icft=PRDE-3456, according to mention in the description field.

Automation for Jira 27 February 2024, 15:01 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/PRDE-3456#icft=PRDE-3456, according to mention in the description field.

Automation for Jira 27 February 2024, 15:00 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/PRDE-3456#icft=PRDE-3456, according to mention in the description field.

Lucas Noetzold 27 February 2024, 14:43 Jira Internal Users

@Bruno Furtado finalmente encontrei a causa (preenchi nos findings), agora vou trabalhar na solução.

Automation for Jira 27 February 2024, 14:39 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/PRDE-3456#icft=PRDE-3456, according to mention in the description field.

Automation for Jira 27 February 2024, 14:32 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/PRDE-3456#icft=PRDE-3456, according to mention in the description field.

Automation for Jira 27 February 2024, 14:31 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/PRDE-3456#icft=PRDE-3456, according to mention in the description field.

Automation for Jira 27 February 2024, 13:35 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/PRDE-3456#icft=PRDE-3456, according to mention in the description field.

Automation for Jira 26 February 2024, 15:14 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/PRDE-3456#icft=PRDE-3456, according to mention in the description field.

Automation for Jira 26 February 2024, 13:23 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/PRDE-3456#icft=PRDE-3456, according to mention in the description field.

Automation for Jira 23 February 2024, 15:17 Jira Internal Users

This issue was automatically transitioned to REGRESSION, as its PR was just merged into qa branch in Github.

Automation for Jira 23 February 2024, 15:17 Jira Internal Users

This issue was automatically transitioned to REGRESSION, as its PR was just merged into qa branch in Github.

Automation for Jira 23 February 2024, 14:46 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Lucas Noetzold

This issue was just linked to issue(s) https://totvslabs.atlassian.net/browse/CAPL-5563#icft=CAPL-5563, according to mention in a prior comment.

Lucas Noetzold 23 February 2024, 14:46 Jira Internal Users

@Bruno Furtado estamos implementando alguns logs extras para ajudar na análise desse issue. Até o momento não consegui replicar o cenário dessa inconsistência, os logs vão me permitir enteder o ambiente com mais detalhes, para descobrir o que está acontecendo.

https://totvslabs.atlassian.net/browse/CAPL-5563

Bruno Furtado 23 February 2024, 13:33 Jira Internal Users

Obrigado pelo retorno @Lucas Noetzold . Quando houver novas atualizações, pode compartilhar por aqui? Por hora estou controlando o cliente.

Lucas Noetzold 21 February 2024, 17:28 Jira Internal Users

@Bruno Furtado estou finalizando a issue atual e essa será a próxima, creio que inicio ainda hoje

Bruno Furtado 21 February 2024, 16:35 Jira Internal Users

@Lucas Noetzold , chegamos a atuar nessa task?

Automation for Jira 16 February 2024, 14:46 Jira Internal Users

@Bruno Furtado ,
@Lucas Noetzold , @Lucas Noetzold

This issue was planned to be delivered until 2024-03-01. You can check that by consulting the issue in the Due Date field.

Dates already planned for this issue: 2024-03-01

If External Issue Link field is filled, customer was also informed on JIRA TOTVS.

Automation for Jira 16 February 2024, 14:29 Jira Internal Users

Message thread link on #red-phone channel:

https://totvscarol.slack.com/archives/C03NT4US9J9/p1708093745798999