Response of endpoint for Tenant app Settings is not consistent (204)
Description
01 - PERSON OF CONTACT (PERSON THAT CAN ANSWER QUESTIONS ABOUT THE PROBLEM):
@Lucas Noetzold @Bruno Furtado
02 - PROBLEM (WHAT'S THE ISSUE?):
@Bruno Furtado reported that the endpoint for listing Tenant App configurations, {{GET /tenantApps/
{tenantAppId}/settings}}, is showing inconsistent and incorrect behavior: the endpoint might return200 OK
and moments later, for the same endpoint, same tenant and same app id, it returns 204 NO CONTENT
, again, after a while, it returns 200 OK
with the correct content.
This erratic behavior can be seen in the logs.
log query
timestamp,
insert_id,
json_value(json_payload.environmentId) as tenantId,
json_value(json_payload.audit.userLogin) as userLogin,
json_value(json_payload.audit.pathParams.tenantAppId) as tenantAppId,
json_value(json_payload.audit.responseCode) as responseCode,
json_value(json_payload.traceId) as traceId,
json_payload
from
`labs-app-mdm-production.global.mdmAudit._AllLogs`
where json_value(json_payload.audit.urlTemplate) like '%/tenantApps/{tenantAppId}
/settings'
and json_value(json_payload.audit.responseCode) = "204"
limit 100
resource.labels.namespace_name="mdm"
resource.type="k8s_container"
resource.labels.container_name="ui"
jsonPayload.audit.urlTemplate=("/api/v1/tenantApps/{tenantAppId}/settings" OR "/api/v3/tenantApps/{tenantAppId}/settings")
jsonPayload.audit.responseCode="204"
On 16/02/24 we saw a huge spike in such errors, which caused this card to be issued, after it was reported by the customers.
Additional info from first analysis
First occurrence found in logs is of 2023-08-20 19:04:25.143
(might just be because older logs were already discarded).
Looking at the code, this endpoint is actually configured to return 204 NO CONTENT
if a RecordNotFoundException
is thrown, at TenantAppSettingResource::findAll:157-158
:
Logging this RecordNotFoundException
might be a good first step, so we can have more information to analyze.
This resembles a problem we are already familiar with where, when overloaded, Elasticsearch returns 404 NOT FOUND
for entities that actually do exist, in code that results in a RecordNotFoundException
. This could be the reason but, looking at the graphs, ES didn’t seem to be overloaded at the time most of the errors happened:
Still, it is worth investigating. You can look at all metrics for Elasticsearch at the time here.
04 - LINKS (ADD A LINK TO THE BUG OR TO THE TENANT):
PRDE issue:
Jira TOTVS issue: https://jiraproducao.totvs.com.br/browse/DLABSCAPL-4101
05 - EXPECTED BEHAVIOR (LIST THE EXPECTED BEHAVIORS TO CONSIDER THIS BUG AS DONE):
- API response should be consistent and correct
- 204 if there is no data, consistently
- 200 when there is data, consistently
Findings
Finding 1
The period with the highest count of 204 is 15/02 between 9:54 and 13:00.
The issue does not seem to be related to elasticsearch, all its metrics were stable during that period.
Seems like the 204 response we’re getting is the one produced at TenantAppSettingResource:158
, logs were added to it with a marker TEMP1
to help analyze, and we can see that the TEMP1
marker strongly correlate with the 204 responses produced by the audit filter for the same endpoint:
Finding 2
Picking on a specific log you can see that the exception message says:
Tenant app or settings not found for id or name "clockinapp", for tenant ID
d053c53b04d2412d892ada20c4100e65
Record not found for the type mdmTenantAppFile, field mdmTenantAppId with key
c927959f92e6434bb048b1d0c2c2b19d
in the tenantd053c53b04d2412d892ada20c4100e65
Now, if you look at elasticsearch, the app clockinapp
fot the tenant d053c53b04d2412d892ada20c4100e65
is:
{
"mdmAppIconUrl": "https://cdn.carol.ai/static/carolApps/icons/carolApp--c.png",
"mdmAppPictureUrl": [],
"mdmAppPoweredBy": {
"en-US": "Carol"
},
"mdmAppSubscriptionUrl": "/apps/clockinapp/12.1.24020008/index.html",
"mdmAppVersion": "12.1.24020008",
"mdmCarolAppId": "89a5f410558345fd94f74cde07886ef4",
"mdmConnectorGroupSelected": {
"mdmName": "totvsrm",
"mdmLabel": {
"en-US": "TOTVS RM"
},
"mdmConnectorNames": [
"totvs",
"clockinweb",
"clockinmobile"
]
},
"mdmCreated": "2021-09-28T23:53:44.888Z",
"mdmCreatedUser": "adm.clockin@ima.sp.gov.br",
"mdmDescription": {
"en-US": "Aplicativo para permitir o registro eletrônico do ponto eletrônico através do reconhecimento facial. Aplicativo móvel disponível na Play Store (dispositivos Android)."
},
"mdmEntityType": "mdmTenantApp",
"mdmId": "5e847920ca774683a8947b15a8deddf2",
"mdmInstallationTaskId": "4673ffd0d2f14923bbdbcec8614c3ebc",
"mdmInstallationTaskStatus": "COMPLETED",
"mdmLabel": {
"en-US": "TOTVS RH Clock In Backoffice"
},
"mdmLastUpdated": "2024-02-23T22:53:24.518Z",
"mdmName": "clockinapp",
"mdmStatus": "ACTIVE",
"mdmTenantId": "d053c53b04d2412d892ada20c4100e65",
"mdmUpdatedUser": "adm.clockin@ima.sp.gov.br",
"mdmPipelineAllowedEnvironments": []
}
You can see that the app ID is 5e847920ca774683a8947b15a8deddf2
, whilst the log says it is looking for app settings with the ID c927959f92e6434bb048b1d0c2c2b19d
in this tenant. I.e. it’s looking for the wrong app ID.
An app with the ID c927959f92e6434bb048b1d0c2c2b19d
do exist, but for another tenant.
Finding 3
My initial suspicion, that the app c927959f92e6434bb048b1d0c2c2b19d
was cached for tenants that were not his own, turned out not to be true. We cleaned the TENANT_APP
and TENANT_APP_KEY_BY_NAME
, but the errors did not stop after that (not even for a time).
@Bruno Furtado ,
@MARCOS STUMPF , @Cindy de Araujo Soares Moore , @Cindy de Araujo Soares Moore , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in the description field.
Github user douglascoimbra has just approved a PR (added as Shard Assignee in this Jira issue).
fix: Fix in TenantApp cache
Github user Damore has just approved a PR (added as Shard Assignee in this Jira issue).
fix: Fix in TenantApp cache
This issue was automatically transitioned to WAITING DEPLOY, as its PR was just merged into master branch in Github.
Github user douglascoimbra has just approved a PR (added as Shard Assignee in this Jira issue).
fix: Filtrando TenantApp por name e tenantId
This issue was automatically transitioned to QA REVIEW, as its PR was just approved in Github.
Card validado pelo time de QA. Regression e testes manuais OK. Pendente o code review. cc
•
• Para prosseguirmos, pendente apenas o deep dive voltado as (Notification Schema e SQL processing) nao mais executando como Internal. (Em production as tasks estao corretamente como Internal)
Edited on Slack - plataforma-carol-internal - Douglas Coimbra Lopes
Card validado pelo time de QA. Regression e testes manuais OK. Pendente o code review. cc
•
• Para prosseguirmos, pendente apenas o deep dive voltado as (Notification Schema e SQL processing) nao mais executando como Internal
Edited on Slack - plataforma-carol-internal - Douglas Coimbra Lopes
Card validado pelo time de QA. Regression e testes manuais OK. Pendente o code review. cc
•
• Para prosseguirmos, pendente apenas o deep dive voltado as Internal tasks (Notification Schema e SQL processing) nao mais executando como Internal
Sent by Slack - plataforma-carol-internal - Douglas Coimbra Lopes
BUSCA DE SETTING NO CUSTOMER ANTES DE DELETAR A SETTING PELO DEV APP OWNER
DELETANDO APP SETTING DA DEV TENANT E LIBERANDO NOVA VERSAO
INSTALANDO APP NA UNIFIED + CHECKOUT
CHECKOUT REALIZADO
CUSTOMER COM O APP SETTING
CUSTOMER INSTALANDO AGORA A NEW VERSION SEM APP SETTING
Boa noite @Bruno Furtado Devido a semana de integração do time, devo trazer mais atualizações sobre o tema na segunda-feira
Olá pessoal, temos alguma atualização que possa ser passada ao cliente? O item tende a ser entregue até o final da sprint no dia 25/mar? Obrigado.
@Bruno Furtado ,
@Pedro Buzzi , @Lucas Noetzold , @Cindy de Araujo Soares Moore , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was planned to be delivered until 2024-03-25. You can check that by consulting the issue in the Due Date field.
Dates already planned for this issue: 2024-03-01, 2024-03-25
If External Issue Link field is filled, customer was also informed on JIRA TOTVS.
@Cindy de Araujo Soares Moore segue um resumo, para dar continuidade na análise:
Bruno detectou o problema: endpoint
GET tenantApps/{tenantAppId}/settings
, que lista todos os settings de um tenant app (por id ou nome), estava retornando200 OK
(com dados) e204 NO CONTENT
de forma intermitente (para um mesmo app);Identifiquei que no
TenantAppSettingsResource::findAll
, linha 158, é onde esse endpoint retorna 204 quando há umaRecordNotFoundException
Adicionei um log com um marker temporário
TEMP1
para ver a stack e identificar de onde essaRecordNotFoundException
estava vindo (podia ser do settings, ou do próprio app, ou sei lá…);com esse marker deu pra ver que todos os
RecordNotFound
estão acontecendo quando esse endpoint é feito a busca por nome doclockinapp
, e não conseguiram encontrar o tenant app com idc927959f92e6434bb048b1d0c2c2b19d
esse tenant app existe,
c927959f92e6434bb048b1d0c2c2b19d
, ele é o id do clockin da tenantclockinunified
já os logs vêm de várias tenants, e não da
clockinunified
ele não encontra o app por que esse app não é da tenant que está buscando
os RecordNotFound são produzidos quando ele tenta buscar os settings, e não o tenantApp
tem alguns logs com esse marker que não são do clockin, mas se você investigar vai ver que esses são justamente os casos corretos, onde o app realmente não tem settings e o endpoint funcionou como deveria;
com base nisso, olhando o código, dá pra deduzir que:
na linha 128, ou 132, ele está obtendo o tenant app da unificada, com id
c927959f92e6434bb048b1d0c2c2b19d
na linha 151 ele tenta buscar os settings e aí não encontra o app, pois de fato ele não pertence à tenant, ou seja, o problema é que outras tenants estão obtendo uma referencia para o TenantApp do clockin da unificada, na linha 128 ou 132;
hoje cedo limpei os caches de
TENANT_APP
eTENANT_APP_KEY_BY_NAME
, pra eliminar a possibilidade de ser apenas cache poluído, mas não foi o caso, o problema permaneceu;já analisei bastante o código, não encontrei nenhuma falha de lógica, dentro dos services, que poderia justificar ele estar encontrando o app da tenant errada, mas.. alguma coisa deve ter!
coloquei mais links e informações na descrição do próprio card, pra ajudar na investigação
@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in the description field.
@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in the description field.
@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in the description field.
@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in the description field.
@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in the description field.
@Bruno Furtado finalmente encontrei a causa (preenchi nos findings), agora vou trabalhar na solução.
@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in the description field.
@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in the description field.
@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in the description field.
@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in the description field.
@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in the description field.
@Bruno Furtado ,
@Lucas Noetzold , @Douglas Coimbra Lopes , @Gabriel DAmore Marciano , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in the description field.
This issue was automatically transitioned to REGRESSION, as its PR was just merged into qa branch in Github.
This issue was automatically transitioned to REGRESSION, as its PR was just merged into qa branch in Github.
@Bruno Furtado ,
@Lucas Noetzold , @Lucas Noetzold
This issue was just linked to issue(s) , according to mention in a prior comment.
@Bruno Furtado estamos implementando alguns logs extras para ajudar na análise desse issue. Até o momento não consegui replicar o cenário dessa inconsistência, os logs vão me permitir enteder o ambiente com mais detalhes, para descobrir o que está acontecendo.
Obrigado pelo retorno @Lucas Noetzold . Quando houver novas atualizações, pode compartilhar por aqui? Por hora estou controlando o cliente.
@Bruno Furtado estou finalizando a issue atual e essa será a próxima, creio que inicio ainda hoje
@Lucas Noetzold , chegamos a atuar nessa task?
@Bruno Furtado ,
@Lucas Noetzold , @Lucas Noetzold
This issue was planned to be delivered until 2024-03-01. You can check that by consulting the issue in the Due Date field.
Dates already planned for this issue: 2024-03-01
If External Issue Link field is filled, customer was also informed on JIRA TOTVS.
Message thread link on #red-phone channel:
https://totvscarol.slack.com/archives/C03NT4US9J9/p1708093745798999