Hunting Zombie Processes: From 6,500 Zombies to a Bug in Git

Feb 23, 2026

The Discovery

It started with an expired access token. I was trying to git fetch in one of my repos and got an authentication failure. While poking around, I noticed something odd — thousands of zombie processes on my machine:

$ ps aux | grep -w Z | wc -l
6530

Over six thousand zombies, all [git] and [ssh] defunct processes. Every single one was parented by the same PID: the houndd process running inside a Docker container.

Hound is a code search tool. It periodically runs git fetch to keep its indexed repositories up to date. Two of my configured repositories pointed to Azure DevOps with an expired token, so every fetch attempt was failing. And every failure was leaking a zombie.

Understanding the Setup

Hound runs as PID 1 inside its Docker container. This matters: when a process’s parent exits, the orphaned child is re-parented to PID 1 of its PID namespace (unless a subreaper has been configured). If PID 1 never calls wait() on those adopted children, each one stays a zombie after it exits.

The hound Go code spawns git in several places:

run() — uses exec.Command().CombinedOutput() for git fetch, git reset, git remote show
HeadRev() — uses cmd.Start() + cmd.Wait() for git rev-parse
Clone() — uses cmd.CombinedOutput() for git clone
AutoGeneratedFiles() — uses cmd.Start() + cmd.Wait() for git ls-files and git check-attr

There was already a fix attempt on the fix/zombie-processes branch that added cmd.Wait() calls to error paths in HeadRev() and AutoGeneratedFiles(). The fix was deployed. But the zombies kept coming — about 4 per minute.

First Theory: Missing Wait() in Hound

The natural assumption was that hound’s Go code wasn’t properly waiting on some git child processes. CombinedOutput() internally calls Wait(), so those paths should be safe. The fix addressed HeadRev() and AutoGeneratedFiles(), but maybe there were other leak paths?

To find out which process was actually becoming a zombie, I needed to intercept every git invocation.

The Git Proxy

Hound’s Go code is not the only thing that runs git; git also spawns its own internal subprocesses. To log both kinds of invocation, I built a proxy binary:

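package main

import (
    "fmt"
    "os"
    "strings"
    "syscall"
)

func main() {
    // Resolve our own absolute path; argv[0] may be a bare "git".
    self, _ := os.Readlink("/proc/self/exe")
    realGit := self + ".real"

    // Write one log file per PID so concurrent invocations never collide.
    os.MkdirAll("/srv/hound/git-proxy-logs", 0755)
    pid := os.Getpid()
    f, _ := os.Create(fmt.Sprintf("/srv/hound/git-proxy-logs/%d.log", pid))
    fmt.Fprintf(f, "pid=%d cmd=%s %s\n", pid, self, strings.Join(os.Args[1:], " "))
    f.Close()

    // Replace this process with the real git: same PID, same parent.
    syscall.Exec(realGit, os.Args, os.Environ())
}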

Key design decisions:

syscall.Exec replaces the proxy process with real git — same PID, same parent, no interference with git’s stdout/stderr
Log to files, not stderr — hound depends on git’s output; we can’t pollute it
Per-PID filenames — no contention between concurrent git processes
/proc/self/exe to resolve the real path — because argv[0] might just be "git" without a path

I replaced both /usr/bin/git and /usr/lib/git-core/git with the proxy (renaming the originals to *.real), rebuilt the container to run as root (to have permission to replace system binaries), and deployed.

The Breakthrough

After letting it run and accumulate zombies, I matched zombie PIDs to log files. The PIDs needed translation: ps on the host shows host-namespace PIDs, but the proxy logs container-namespace PIDs. The mapping lives in the NSpid line of /proc/<pid>/status, read from the host.

The result was unambiguous:

host=2783115 container=87 comm=git.real log=pid=87 cmd=/usr/lib/git-core/git remote-https origin https://<redacted>
host=2783237 container=170 comm=git.real log=pid=170 cmd=/usr/lib/git-core/git remote-https origin https://<redacted>
host=2783495 container=354 comm=git.real log=pid=354 cmd=/usr/lib/git-core/git remote-https origin https://<redacted>

Every zombie was git remote-https — the HTTPS transport helper. When hound runs git fetch, git spawns git remote-https as a child to handle the HTTPS protocol. When authentication fails, git fetch exits, but its git remote-https child hasn’t been waited on. The orphaned child gets re-parented to PID 1 (houndd), and since houndd doesn’t reap adopted children, it becomes a zombie.

The Root Cause in Git

With the smoking gun pointing at git remote-https, I cloned git’s source (v2.39.5 to match the container). The zombie was a transport helper, and there’s a file literally called transport-helper.c — that’s where I started looking.

Inside, get_helper() is the function that spawns the helper process. Line 139 confirms this is the right place, since it constructs the command name:

strvec_pushf(&helper->args, "remote-%s", data->name);

When fetching over HTTPS, data->name is "https", producing remote-https. Combined with helper->git_cmd = 1 (which tells start_command to run it as a git subcommand), this is exactly what spawns /usr/lib/git-core/git remote-https — our zombie.

The helper is started a few lines later:

code = start_command(helper);

Since start_command() forks the child, there must be a matching waitpid() somewhere. In git’s codebase, that’s wrapped in finish_command(). Grepping for finish_command in the file finds exactly one call — inside disconnect_helper():

static int disconnect_helper(struct transport *transport)
{
    // ...
    res = finish_command(data->helper);  // calls waitpid()
    FREE_AND_NULL(data->helper);
    // ...
}

So disconnect_helper() is the only place the helper gets reaped. The next question is: does every code path reach it? Searching for exit( in the file reveals the answer — no. There are at least 6 exit(128) calls scattered across the file:

if (recvline(data, &buf))
    exit(128);  // helper child is never waited on!

When git remote-https reports an authentication failure, recvline() fails (the helper’s output pipe closes), and git calls exit(128) directly — never going through disconnect_helper(), never calling finish_command(), never calling waitpid().

Why `atexit`

One approach would be to patch each exit(128) site to call disconnect_helper(transport) first. But this has problems:

It’s error-prone — miss one and you still have zombies
Some of those exit() sites don’t have transport in scope, so threading it through would be invasive
Future code could add new exit() calls without knowing about the cleanup requirement

Instead, I used an atexit handler: a safety net that runs on every exit() call, with zero changes to existing control flow. Handlers registered with atexit do not run after _exit(), abort(), or a fatal signal, but the leaking sites found above all call exit(128). Git itself uses this pattern in several places, most notably in run-command.c, where cleanup_children_on_exit is registered via atexit to kill and reap child processes on abnormal exit, with clear_child_for_cleanup to deregister them on normal cleanup. The same pattern applies here: register the helper for reaping, and clear it once it has been properly disconnected.

The Fix

The fix is an atexit handler:

static struct child_process *helper_to_reap;

static void cleanup_helper_on_exit(void)
{
    if (helper_to_reap)
        finish_command(helper_to_reap);
}

Registered right after the helper starts:

data->helper = helper;
helper_to_reap = helper;
atexit(cleanup_helper_on_exit);

And cleared on normal cleanup:

res = finish_command(data->helper);
helper_to_reap = NULL;  // atexit handler becomes a no-op
FREE_AND_NULL(data->helper);

This catches all the exit(128) paths without modifying any of them. The atexit handler runs during process teardown and reaps the transport helper child.

The normal cleanup path and the atexit handler do not conflict. On the normal path, disconnect_helper() calls finish_command() to reap the child, then sets helper_to_reap = NULL. When the process eventually exits, the atexit handler sees NULL and does nothing. On the error path, exit(128) is called without going through disconnect_helper(), so helper_to_reap is still set, and the atexit handler reaps the child.

The same pattern applies to connect.c, where finish_connect() can be bypassed when callers die() or exit() before reaching it. An identical atexit handler ensures the SSH or proxy child is reaped.

Verification

HTTPS transport (`transport-helper.c`)

I built the patched git inside a debian:bookworm container (matching the target environment to avoid glibc mismatches), deployed it, and watched:

=== Zombies after 2 minutes ===
0

Zero zombies, despite continuous authentication failures on two repositories every 15 seconds.

SSH transport (`connect.c`)

To verify the connect.c fix independently, I used a red/green approach:

Red (before fix): Built the hound image with git patched only in transport-helper.c (no connect.c atexit handler). Hid SSH keys (mv ~/.ssh ~/.ssh_hidden) to force SSH authentication failures. Within seconds, SSH zombies accumulated:

$ ps aux | grep defunct
user     3253967  [ssh] <defunct>
user     3253977  [ssh] <defunct>
user     3254057  [ssh] <defunct>
user     3254063  [ssh] <defunct>
user     3254069  [ssh] <defunct>

All parented to houndd (PID 1 in the container), confirming the finish_connect() path was being bypassed on exit(128).

Green (after fix): Rebuilt git with the connect.c atexit handler, rebuilt the hound image, and restarted the container — still with SSH keys hidden. The same SSH failures occurred (Host key verification failed, exit status 128), but:

$ ps aux | grep defunct | grep -v grep | wc -l
0

Zero zombies. The atexit handler in connect.c successfully reaps the ssh child process on abnormal exit.

Lessons Learned

Zombie processes are always about waitpid() — but the question is who should be calling it and which child isn’t being waited on.
PID 1 matters in containers. Most programs aren’t designed to be PID 1. When they are, orphaned grandchildren become their responsibility. This is why docker run --init (which uses tini as PID 1) exists.
The obvious suspect isn’t always the culprit. The initial fix targeted hound’s Go code, but the real bug was in git itself. The git proxy approach was what finally revealed the truth.
Process names in ps tell you less than you think. A zombie’s /proc/<pid>/cmdline reads as empty. The comm field survives, but it’s just the executable name, not the arguments. We needed the proxy’s log files to see the full picture.
Namespace translation is essential for container debugging. Host PIDs and container PIDs are different. /proc/<pid>/status with NSpid is the bridge.

Resolution

Both transport paths are now fixed and verified:

HTTPS: transport-helper.c — atexit handler reaps git-remote-https
SSH/proxy/local: connect.c — atexit handler reaps the connection child

The fix is available on GitHub and has been submitted to the Git mailing list. The bug exists in git’s current master branch as of February 2026.