Debugging a 500 Internal Server Error in Kubernetes: A Detective Story

4 min read · Kubernetes · Langfuse · Observability · DevOps

Sometimes the best debugging sessions are the ones that take you down unexpected paths. What started as a simple “500 ISE” turned into a deep dive through Kubernetes probes, node resource allocation, and graceful shutdown mechanics. Here’s how it unfolded.

The Problem

Our self-hosted Langfuse instance on Google Kubernetes Engine was throwing 500 Internal Server Errors. Users couldn’t access the platform, and the pod was stuck in a crash loop.

langfuse-web-767f9598c-dsxfl   0/1   CrashLoopBackOff   19 (5m ago)   55m

19 restarts in under an hour. Something was very wrong.

Following the Trail

Step 1: Check the Obvious

First, let’s see what the pod is telling us:

kubectl logs -n langfuse langfuse-web-767f9598c-dsxfl --tail=50

The logs showed the app starting up fine:

✓ Ready in 14.7s
2025-12-23T04:38:52.197Z info   MCP feature registered: prompts
Signal:  SIGTERM
SIGTERM / SIGINT received. Shutting down in 110 seconds.

Wait - the app starts successfully, then immediately receives SIGTERM? That’s Kubernetes killing the container. But why?

Step 2: Dig into the Events

kubectl describe pod -n langfuse langfuse-web-767f9598c-dsxfl

The events revealed the culprit:

Warning  Unhealthy  Liveness probe failed: context deadline exceeded
Warning  Unhealthy  Readiness probe failed: HTTP probe failed with statuscode: 500

Two different failures:

  1. Liveness probe timing out (taking longer than 5 seconds)
  2. Readiness probe returning 500
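
To see what these probes were actually configured with, you can read them straight off the langfuse-web Deployment (one way to do it, via jsonpath):

kubectl get deployment -n langfuse langfuse-web -o \
  jsonpath='{.spec.template.spec.containers[0].livenessProbe}{"\n"}{.spec.template.spec.containers[0].readinessProbe}{"\n"}'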

Step 3: Test the Endpoints Directly

Let’s see what these endpoints actually return:

kubectl exec -n langfuse langfuse-web-767f9598c-dsxfl -- \
  wget -qO- http://localhost:3000/api/public/health

# Output: {"status":"OK","version":"3.137.0"}

Health endpoint works! What about ready?

kubectl exec -n langfuse langfuse-web-767f9598c-dsxfl -- \
  wget -qO- http://localhost:3000/api/public/ready

# Output: HTTP/1.1 500 Internal Server Error

The health check passes but the readiness check fails with 500. Why would they behave differently?

Step 4: Read the Source Code

A quick look at Langfuse’s ready.ts revealed the answer:

if (isSigtermReceived()) {
  logger.info("Readiness check failed: SIGTERM / SIGINT received, shutting down.");
  return res.status(500).json({
    status: "SIGTERM / SIGINT received, shutting down",
    version: VERSION.replace("v", ""),
  });
}

The ready endpoint checks if the app has received a termination signal. If yes, it returns 500 to tell Kubernetes “don’t send me traffic, I’m shutting down.”

This created a vicious cycle:

  1. Liveness probe times out
  2. Kubernetes sends SIGTERM
  3. App sets sigtermReceived = true
  4. Ready endpoint returns 500
  5. Container gets killed and restarted
  6. Repeat
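
As a quick sanity check that Kubernetes (and not something inside the container) was terminating the process, the container’s last terminated state records the reason and exit code, assuming the pod is still crash-looping:

kubectl get pod -n langfuse langfuse-web-767f9598c-dsxfl -o \
  jsonpath='{.status.containerStatuses[0].lastState.terminated}{"\n"}'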

The root cause wasn’t the 500 - it was the liveness probe timing out in the first place.

Step 5: Why is the Liveness Probe Timing Out?

Time to check resource allocation:

kubectl describe node gk3-langfuse-pool-2-2b69afef-8sqn | grep -A 5 "Allocated resources"
Allocated resources:
  Resource           Requests           Limits
  --------           --------           ------
  cpu                15327m (96%)       23960m (150%)
  memory             61251344256 (99%)  60655043072 (99%)

The node was 99% memory allocated and 150% CPU overcommitted. The langfuse-web pod was being starved of resources, causing it to respond slowly to health checks.
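
Allocation is only half the picture; actual usage is worth a quick look too. Assuming metrics-server is available (GKE ships it by default), kubectl top gives a rough read:

kubectl top node gk3-langfuse-pool-2-2b69afef-8sqn
kubectl top pods -n langfuse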

The Fix

Immediate Relief: Move the Pod

First, I needed to get the pod onto a less loaded node. This is where I discovered two useful kubectl commands:

kubectl cordon - Mark a node as unschedulable. New pods won’t be placed here, but existing pods keep running.

kubectl cordon gk3-langfuse-pool-2-2b69afef-8sqn
# node/gk3-langfuse-pool-2-2b69afef-8sqn cordoned

kubectl uncordon - Remove the unschedulable mark. The node can accept new pods again.

kubectl uncordon gk3-langfuse-pool-2-2b69afef-8sqn
# node/gk3-langfuse-pool-2-2b69afef-8sqn uncordoned
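
Cordoned nodes are easy to spot - they show up as Ready,SchedulingDisabled in the STATUS column:

kubectl get nodes
# the cordoned node shows STATUS Ready,SchedulingDisabled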

After cordoning the overloaded node, I deleted the pod to force rescheduling:

kubectl delete pod -n langfuse langfuse-web-767f9598c-dsxfl

GKE’s autoscaler kicked in and created a fresh node with plenty of resources.
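
The wide output confirms where the replacement pod landed, since it includes a NODE column:

kubectl get pods -n langfuse -o wide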

But the Problem Persisted

Even on the new node, probes were still timing out. The 5-second timeout simply wasn’t enough for a Next.js app that takes ~17 seconds to fully initialize.

The original probe configuration:

  • initialDelaySeconds: 20 (barely enough time)
  • timeoutSeconds: 5 (too aggressive)
  • failureThreshold: 3 (not enough buffer)

The Real Fix: Adjust Probe Settings

kubectl patch deployment -n langfuse langfuse-web --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/initialDelaySeconds", "value": 60},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 15},
  {"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/failureThreshold", "value": 5},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/initialDelaySeconds", "value": 30},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value": 15},
  {"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/failureThreshold", "value": 5}
]'
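
Patching the pod template kicks off a new rollout, which you can watch before checking the pods:

kubectl rollout status deployment/langfuse-web -n langfuse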

After the rollout:

kubectl get pods -n langfuse -l app=web
# NAME                            READY   STATUS    RESTARTS   AGE
# langfuse-web-5b5db7d8bb-wb78t   1/1     Running   0          69s

Finally, 1/1 Ready!

Making it Permanent

To persist this fix through future Terraform applies, I added the probe settings to the Helm values:

langfuse:
  livenessProbe:
    initialDelaySeconds: 60
    timeoutSeconds: 15
    failureThreshold: 5
  readinessProbe:
    initialDelaySeconds: 30
    timeoutSeconds: 15
    failureThreshold: 5
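
After the next terraform apply, it’s worth confirming the values actually landed (assuming the Helm release is named langfuse; adjust to your release name):

helm get values langfuse -n langfuse
kubectl describe deployment -n langfuse langfuse-web | grep -E "Liveness|Readiness"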

Key Takeaways

1. The 500 Error Was a Symptom, Not the Cause

The actual error message was misleading. The 500 from /api/public/ready was the app correctly reporting “I’m shutting down” - the real problem was why it was shutting down in the first place.

2. Probe Timeouts Need to Match Your App’s Reality

Default probe settings (5s timeout) work for lightweight apps, but heavier frameworks like Next.js need more breathing room, especially during startup.

3. Node Resource Pressure is Silent but Deadly

The pod wasn’t OOMKilled or showing obvious resource errors. It was just… slow. Slow enough that health checks failed. Always check kubectl describe node when debugging mysterious slowness.

4. Cordon and Uncordon are Your Friends

These commands let you control pod placement without disrupting running workloads:

# Stop new pods from being scheduled on a node
kubectl cordon <node-name>

# Allow scheduling again
kubectl uncordon <node-name>

# Drain a node (cordon + evict all pods)
kubectl drain <node-name> --ignore-daemonsets

5. Always Check the Source Code

When debugging why an endpoint returns an unexpected status code, reading the actual implementation beats guessing every time.

The Debugging Checklist

For future reference, when a Kubernetes pod is crash-looping:

  1. Check logs: kubectl logs <pod> --previous
  2. Check events: kubectl describe pod <pod>
  3. Test endpoints directly: kubectl exec <pod> -- curl localhost:port/endpoint
  4. Check node resources: kubectl describe node <node-name>
  5. Check pod resource requests/limits: Are they reasonable?
  6. Review probe settings: Are timeouts appropriate for your app?

Sometimes the fix is simple, but finding it requires following the breadcrumbs wherever they lead.