Hunting Zombie Processes: From 6,500 Zombies to a Bug in Git

The Discovery

It started with an expired access token. I was trying to git fetch in one of my repos and got an authentication failure. While poking around, I noticed something odd — thousands of zombie processes on my machine:

$ ps aux | grep -w Z | wc -l
6530

Over six thousand zombies, all [git] and [ssh] defunct processes. Every single one parented by the same PID — the houndd process running inside a Docker container.

Hound is a code search tool. It periodically runs git fetch to keep its indexed repositories up to date. Two of my configured repositories pointed to Azure DevOps with an expired token, so every fetch attempt was failing. And every failure was leaking a zombie.

Understanding the Setup

Hound runs as PID 1 inside its Docker container. This is important — when a process’s parent exits, the orphaned child gets re-parented to PID 1. If PID 1 doesn’t call wait() on adopted children, they become zombies.

The hound Go code spawns git in several places, including CombinedOutput() call sites and the HeadRev() and AutoGeneratedFiles() functions.

There was already a fix attempt on the fix/zombie-processes branch that added cmd.Wait() calls to error paths in HeadRev() and AutoGeneratedFiles(). The fix was deployed. But the zombies kept coming — about 4 per minute.

First Theory: Missing Wait() in Hound

The natural assumption was that hound’s Go code wasn’t properly waiting on some git child processes. CombinedOutput() internally calls Wait(), so those paths should be safe. The fix addressed HeadRev() and AutoGeneratedFiles(), but maybe there were other leak paths?

To find out which process was actually becoming a zombie, I needed to intercept every git invocation.

The Git Proxy

To catch every git invocation — whether from hound’s Go code or from git’s own internal subprocesses — I built a proxy binary:

package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

func main() {
	// The real git was renamed to "<path>.real" alongside the proxy.
	self, _ := os.Readlink("/proc/self/exe")
	realGit := self + ".real"

	// One log file per PID; the PID survives the Exec below, so a
	// zombie's PID can later be matched to its log.
	os.MkdirAll("/srv/hound/git-proxy-logs", 0755)
	pid := os.Getpid()
	f, _ := os.Create(fmt.Sprintf("/srv/hound/git-proxy-logs/%d.log", pid))
	fmt.Fprintf(f, "pid=%d cmd=%s %s\n", pid, self, strings.Join(os.Args[1:], " "))
	f.Close()

	// Exec replaces this process with the real git, preserving the PID.
	syscall.Exec(realGit, os.Args, os.Environ())
}

Key design decisions:

  1. syscall.Exec replaces the proxy with the real git in place, so the PID is preserved and a zombie's PID can later be matched to its log file.

  2. One log file per PID avoids interleaved writes from concurrent git invocations.

  3. Resolving /proc/self/exe means the proxy finds its paired *.real binary regardless of which path it was invoked through.

I replaced both /usr/bin/git and /usr/lib/git-core/git with the proxy (renaming the originals to *.real), rebuilt the container to run as root (to have permission to replace system binaries), and deployed.

The Breakthrough

After letting it run and accumulate zombies, I matched zombie PIDs to log files. The PIDs needed translation — ps shows host-namespace PIDs, but the proxy logs container-namespace PIDs. The mapping lives in /proc/<pid>/status under NSpid.

The result was unambiguous:

host=2783115 container=87 comm=git.real log=pid=87 cmd=/usr/lib/git-core/git remote-https origin https://<redacted>
host=2783237 container=170 comm=git.real log=pid=170 cmd=/usr/lib/git-core/git remote-https origin https://<redacted>
host=2783495 container=354 comm=git.real log=pid=354 cmd=/usr/lib/git-core/git remote-https origin https://<redacted>

Every zombie was git remote-https — the HTTPS transport helper. When hound runs git fetch, git spawns git remote-https as a child to handle the HTTPS protocol. When authentication fails, git fetch exits, but its git remote-https child hasn’t been waited on. The orphaned child gets re-parented to PID 1 (houndd), and since houndd doesn’t reap adopted children, it becomes a zombie.

The Root Cause in Git

With the smoking gun pointing at git remote-https, I cloned git’s source (v2.39.5 to match the container). The zombie was a transport helper, and there’s a file literally called transport-helper.c — that’s where I started looking.

Inside, get_helper() is the function that spawns the helper process. Confirming this is the right place, line 139 constructs the command name:

strvec_pushf(&helper->args, "remote-%s", data->name);

When fetching over HTTPS, data->name is "https", producing remote-https. Combined with helper->git_cmd = 1 (which tells start_command to run it as a git subcommand), this is exactly what spawns /usr/lib/git-core/git remote-https — our zombie.

The helper is started a few lines later:

code = start_command(helper);

Since start_command() forks the child, there must be a matching waitpid() somewhere. In git’s codebase, that’s wrapped in finish_command(). Grepping for finish_command in the file finds exactly one call — inside disconnect_helper():

static int disconnect_helper(struct transport *transport)
{
    // ...
    res = finish_command(data->helper);  // calls waitpid()
    FREE_AND_NULL(data->helper);
    // ...
}

So disconnect_helper() is the only place the helper gets reaped. The next question is: does every code path reach it? Searching for exit( in the file reveals the answer — no. There are at least 6 exit(128) calls scattered across the file:

if (recvline(data, &buf))
    exit(128);  // helper child is never waited on!

When git remote-https reports an authentication failure, recvline() fails (the helper’s output pipe closes), and git calls exit(128) directly — never going through disconnect_helper(), never calling finish_command(), never calling waitpid().

Initial Fix: Custom atexit Handler

One approach would be to patch each exit(128) site to call disconnect_helper(transport) first. But this is fragile: every current and future exit path would need the call, and missing a single one reintroduces the leak.

My initial patch used a custom atexit handler — a safety net that catches all exit paths with zero changes to existing control flow. The handler called finish_command() to reap the transport helper child during process teardown, and was cleared on the normal cleanup path to avoid double-waiting. The same pattern was applied to connect.c for SSH/proxy children.

This worked and eliminated the zombies in my testing, so I submitted it to the Git mailing list.

Mailing List Review

The patch went through four revisions on the mailing list. The key turning point came in the review of v2, when Jeff King (peff) pointed out a subtle problem with calling finish_command() in an atexit handler:

This waits for the command to exit. Are we sure it will always do so, and it won’t sometimes be waiting on us to do something (like close a pipe that is feeding it)? If not, then we can get deadlocks.

I think you actually want to kill(), then wait. There is already support for this in run-command.[ch]. You just need to set the clean_on_exit flag of the child_process struct.

He was right. If the child process was blocked waiting for the parent to close a pipe (e.g., stdin), calling finish_command() — which just waits — would deadlock. The child waits for the parent to close the pipe, and the parent waits for the child to exit. Neither proceeds.

Git’s run-command.c already has infrastructure for this exact problem: the clean_on_exit flag. When set, git registers the child process for cleanup on exit. The cleanup sends SIGTERM first, then waits — ensuring the child terminates promptly. It also handles signal-based exits, not just atexit. A companion flag, wait_after_clean, tells the cleanup to wait for the child to actually exit after sending the signal, ensuring it doesn’t become a zombie.

The Fix

The final fix is remarkably simple — just setting two flags on the child_process struct before starting the command:

In transport-helper.c:

helper->clean_on_exit = 1;
helper->wait_after_clean = 1;
code = start_command(helper);

In connect.c (for both git_proxy_connect and git_connect):

conn->clean_on_exit = 1;
conn->wait_after_clean = 1;
if (start_command(conn))
    die(_("unable to fork"));

No custom atexit handlers, no new global state, no changes to existing control flow. The existing run-command.c cleanup infrastructure handles everything — it sends SIGTERM to the child, waits for it to exit, and works on both normal exit and signal-based exit paths.

Verification

HTTPS transport (transport-helper.c)

I built the patched git inside a debian:bookworm container (matching the target environment to avoid glibc mismatches), deployed it, and watched:

=== Zombies after 2 minutes ===
0

Zero zombies, despite continuous authentication failures on two repositories every 15 seconds.

SSH transport (connect.c)

To verify the connect.c fix independently, I used a red/green approach:

  1. Red (before fix): Built the hound image with git patched only in transport-helper.c (no connect.c fix). Hid SSH keys (mv ~/.ssh ~/.ssh_hidden) to force SSH authentication failures. Within seconds, SSH zombies accumulated:
$ ps aux | grep defunct
user     3253967  [ssh] <defunct>
user     3253977  [ssh] <defunct>
user     3254057  [ssh] <defunct>
user     3254063  [ssh] <defunct>
user     3254069  [ssh] <defunct>

All parented to houndd (PID 1 in the container), confirming the finish_connect() path was being bypassed on exit(128).

  2. Green (after fix): Rebuilt git with the connect.c fix, rebuilt the hound image, and restarted the container — still with SSH keys hidden. The same SSH failures occurred (Host key verification failed, exit status 128), but:
$ ps aux | grep defunct | grep -v grep | wc -l
0

Zero zombies. The clean_on_exit mechanism in connect.c successfully reaps the ssh child process on abnormal exit.

Lessons Learned

  1. Zombie processes are always about waitpid() — but the question is who should be calling it and which child isn’t being waited on.

  2. PID 1 matters in containers. Most programs aren’t designed to be PID 1. When they are, orphaned grandchildren become their responsibility. This is why docker run --init (which uses tini as PID 1) exists.

  3. The obvious suspect isn’t always the culprit. The initial fix targeted hound’s Go code, but the real bug was in git itself. The git proxy approach was what finally revealed the truth.

  4. Process names in ps tell you less than you think. Zombie processes lose their /proc/<pid>/cmdline. The comm field survives, but it’s just the executable name — not the arguments. We needed the proxy’s log files to see the full picture.

  5. Namespace translation is essential for container debugging. Host PIDs and container PIDs are different. /proc/<pid>/status with NSpid is the bridge.

  6. Code review catches what testing doesn’t. My custom atexit fix passed all my tests, but the mailing list review identified a deadlock risk I hadn’t considered. The final fix was simpler and more robust — it leveraged existing infrastructure instead of reinventing it.

Resolution

The patch went through four revisions on the Git mailing list — from custom atexit handlers to leveraging git’s built-in clean_on_exit mechanism — and has been merged to upstream. Both transport paths are now fixed: HTTPS (transport-helper.c) and SSH (connect.c).