Rankmi access problems

Incident Report for Rankmi

Postmortem

On Thursday, April 3, we detected an AWS instance type unavailability affecting our Kubernetes cluster. As a result, our services lost the ability to scale and to recover from failures. We traced this to a problem in the systems we use to provision instances for scaling our services.

The incident began at approximately 5:30 PM and lasted 25 minutes, until we stabilized the affected services.

Actions Taken:

Identified the failure in the availability of instance types in AWS.
Changed the instance types at the cluster manager level.

Action Plan:

We have added support for additional instance types to our infrastructure, which allows us to obtain compute resources from a broader range of available instance types.
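
As a rough illustration of why a broader list of instance types helps, the sketch below simulates provisioning with a fallback list. It is not our actual provisioning code: the function `provision_with_fallback`, the `capacity` snapshot, and the specific instance type names are all hypothetical and chosen only for the example.

```python
from typing import Callable, Optional, Sequence


def provision_with_fallback(
    instance_types: Sequence[str],
    has_capacity: Callable[[str], bool],
) -> Optional[str]:
    """Return the first instance type that has capacity, or None if none do."""
    for instance_type in instance_types:
        if has_capacity(instance_type):
            return instance_type
    return None


# Hypothetical capacity snapshot (assumption): the preferred type is unavailable.
capacity = {"m5.large": False, "m5a.large": True, "m6i.large": True}


def check(instance_type: str) -> bool:
    # Unknown types are treated as unavailable.
    return capacity.get(instance_type, False)


# With a single allowed type, scaling fails when that type is unavailable.
single = provision_with_fallback(["m5.large"], check)

# With several allowed types, provisioning falls through to the next one.
diversified = provision_with_fallback(["m5.large", "m5a.large", "m6i.large"], check)

print(single, diversified)
```

In this simulation the single-type configuration cannot obtain an instance, while the diversified one falls back to the next type with capacity.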

Posted Apr 04, 2025 - 08:18 GMT-03:00

Resolved

The incident has been resolved. The root cause is under investigation.

Posted Apr 03, 2025 - 18:28 GMT-03:00

Monitoring

The issue has been fixed at the infrastructure level. We are monitoring the application's behavior.

Posted Apr 03, 2025 - 18:07 GMT-03:00

Identified

Access and performance issues have been detected on the platform.
We have identified the problem, and the team is working to restore the service.

Posted Apr 03, 2025 - 18:02 GMT-03:00

This incident affected: APIs - Servicios (API, Home).