Hunting Zombie Processes: From 6,500 Zombies to a Bug in Git
The Discovery
It started with an expired access token. I was trying to git fetch in one of my repos and got an authentication failure. While poking around, I noticed something odd — thousands of zombie processes on my machine:
$ ps aux | grep -w Z | wc -l
6530
Over six thousand zombies, all [git] and [ssh] defunct processes. Every single one parented by the same PID — the houndd process running inside a Docker container.
Hound is a code search tool. It periodically runs git fetch to keep its indexed repositories up to date. Two of my configured repositories pointed to Azure DevOps with an expired token, so every fetch attempt was failing. And every failure was leaking a zombie.
Understanding the Setup
Hound runs as PID 1 inside its Docker container. This is important — when a process’s parent exits, the orphaned child gets re-parented to PID 1. If PID 1 doesn’t call wait() on adopted children, they become zombies.
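To make the zombie state concrete, here is a minimal C sketch (Linux-only, since it reads /proc) that creates a zombie on purpose and then reaps it. The helper names proc_state, make_zombie, and reap are illustrative and do not come from hound or git.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return the one-letter state from /proc/<pid>/stat ('Z' = zombie),
 * or 0 if the process no longer exists. Linux-specific. */
static char proc_state(pid_t pid)
{
	char path[64], buf[512];
	snprintf(path, sizeof(path), "/proc/%d/stat", (int)pid);
	FILE *f = fopen(path, "r");
	if (!f)
		return 0;
	size_t n = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[n] = '\0';
	/* Layout is "pid (comm) state ..."; comm may itself contain ')',
	 * so search for the last one. */
	char *p = strrchr(buf, ')');
	return (p && p[1] == ' ') ? p[2] : 0;
}

/* Fork a child that exits immediately, then block until it has exited
 * WITHOUT reaping it (WNOWAIT), so it is left behind as a zombie. */
static pid_t make_zombie(void)
{
	pid_t pid = fork();
	if (pid == 0)
		_exit(0);
	if (pid > 0) {
		siginfo_t info;
		waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT);
	}
	return pid;
}

/* Reap one specific child; returns its wait status, or -1 on failure. */
static int reap(pid_t pid)
{
	int status;
	assert(pid > 0); /* 0 and -1 have special meanings for waitpid() */
	return waitpid(pid, &status, 0) == pid ? status : -1;
}
```

Calling make_zombie() and then proc_state() on the returned PID should report 'Z'. After reap(), the /proc entry disappears and proc_state() returns 0. The only thing that changed in between is that the parent called waitpid().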
The hound Go code spawns git in several places:
- run(): uses exec.Command().CombinedOutput() for git fetch, git reset, git remote show
- HeadRev(): uses cmd.Start() + cmd.Wait() for git rev-parse
- Clone(): uses cmd.CombinedOutput() for git clone
- AutoGeneratedFiles(): uses cmd.Start() + cmd.Wait() for git ls-files and git check-attr
There was already a fix attempt on the fix/zombie-processes branch that added cmd.Wait() calls to error paths in HeadRev() and AutoGeneratedFiles(). The fix was deployed. But the zombies kept coming — about 4 per minute.
First Theory: Missing Wait() in Hound
The natural assumption was that hound’s Go code wasn’t properly waiting on some git child processes. CombinedOutput() internally calls Wait(), so those paths should be safe. The fix addressed HeadRev() and AutoGeneratedFiles(), but maybe there were other leak paths?
To find out which process was actually becoming a zombie, I needed to intercept every git invocation.
The Git Proxy
To catch every git invocation — whether from hound’s Go code or from git’s own internal subprocesses — I built a proxy binary:
package main

import (
	"fmt"
	"os"
	"strings"
	"syscall"
)

func main() {
	self, _ := os.Readlink("/proc/self/exe") // argv[0] may be a bare "git"
	realGit := self + ".real"
	os.MkdirAll("/srv/hound/git-proxy-logs", 0755)
	pid := os.Getpid()
	f, _ := os.Create(fmt.Sprintf("/srv/hound/git-proxy-logs/%d.log", pid)) // one log file per PID
	fmt.Fprintf(f, "pid=%d cmd=%s %s\n", pid, self, strings.Join(os.Args[1:], " "))
	f.Close()
	syscall.Exec(realGit, os.Args, os.Environ()) // replaces this process: same PID, same parent
}
Key design decisions:
- syscall.Exec replaces the proxy process with real git: same PID, same parent, no interference with git’s stdout/stderr
- Log to files, not stderr: hound depends on git’s output; we can’t pollute it
- Per-PID filenames: no contention between concurrent git processes
- /proc/self/exe to resolve the real path, because argv[0] might just be "git" without a path
I replaced both /usr/bin/git and /usr/lib/git-core/git with the proxy (renaming the originals to *.real), rebuilt the container to run as root (to have permission to replace system binaries), and deployed.
The Breakthrough
After letting it run and accumulate zombies, I matched zombie PIDs to log files. The PIDs needed translation — ps shows host-namespace PIDs, but the proxy logs container-namespace PIDs. The mapping lives in /proc/<pid>/status under NSpid.
The result was unambiguous:
host=2783115 container=87 comm=git.real log=pid=87 cmd=/usr/lib/git-core/git remote-https origin https://<redacted>
host=2783237 container=170 comm=git.real log=pid=170 cmd=/usr/lib/git-core/git remote-https origin https://<redacted>
host=2783495 container=354 comm=git.real log=pid=354 cmd=/usr/lib/git-core/git remote-https origin https://<redacted>
Every zombie was git remote-https — the HTTPS transport helper. When hound runs git fetch, git spawns git remote-https as a child to handle the HTTPS protocol. When authentication fails, git fetch exits, but its git remote-https child hasn’t been waited on. The orphaned child gets re-parented to PID 1 (houndd), and since houndd doesn’t reap adopted children, it becomes a zombie.
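The host-to-container PID translation used to match zombies to proxy logs can be sketched in C as follows. This is an illustration, not the script used during the investigation, and parse_nspid and innermost_pid are hypothetical helper names. It relies on the documented layout of the NSpid line: the PID as seen from the namespace that mounted procfs comes first, followed by the PID in each successively nested namespace.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Parse the "NSpid:" line out of the text of /proc/<pid>/status.
 * Stores up to max PIDs (outermost namespace first) and returns how
 * many were found, or 0 if the line is missing. */
static size_t parse_nspid(const char *status_text, long *pids, size_t max)
{
	assert(status_text && pids);
	const char *line = status_text;
	while (line && *line) {
		if (strncmp(line, "NSpid:", 6) == 0) {
			const char *p = line + 6;
			size_t n = 0;
			while (n < max) {
				while (*p == ' ' || *p == '\t')
					p++;
				if (*p < '0' || *p > '9')
					break; /* end of line or unexpected text */
				char *end;
				pids[n++] = strtol(p, &end, 10);
				p = end;
			}
			return n;
		}
		line = strchr(line, '\n');
		if (line)
			line++; /* step past the newline to the next line */
	}
	return 0;
}

/* Read /proc/<host_pid>/status and return the PID as seen in the
 * innermost namespace (the last NSpid entry), or -1 on failure. */
static long innermost_pid(long host_pid)
{
	char path[64], buf[4096];
	long pids[32];
	snprintf(path, sizeof(path), "/proc/%ld/status", host_pid);
	FILE *f = fopen(path, "r");
	if (!f)
		return -1;
	size_t len = fread(buf, 1, sizeof(buf) - 1, f);
	fclose(f);
	buf[len] = '\0';
	size_t n = parse_nspid(buf, pids, 32);
	return n ? pids[n - 1] : -1;
}
```

Run from the host against one of the zombies above, innermost_pid(2783115) would be expected to yield 87, which is the number to look for among the proxy’s per-PID log files.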
The Root Cause in Git
With the smoking gun pointing at git remote-https, I cloned git’s source (v2.39.5 to match the container). The zombie was a transport helper, and there’s a file literally called transport-helper.c — that’s where I started looking.
Inside, get_helper() is the function that spawns the helper process. Confirming this is the right place, line 139 constructs the command name:
strvec_pushf(&helper->args, "remote-%s", data->name);
When fetching over HTTPS, data->name is "https", producing remote-https. Combined with helper->git_cmd = 1 (which tells start_command to run it as a git subcommand), this is exactly what spawns /usr/lib/git-core/git remote-https — our zombie.
The helper is started a few lines later:
code = start_command(helper);
Since start_command() forks the child, there must be a matching waitpid() somewhere. In git’s codebase, that’s wrapped in finish_command(). Grepping for finish_command in the file finds exactly one call — inside disconnect_helper():
static int disconnect_helper(struct transport *transport)
{
	// ...
	res = finish_command(data->helper); // calls waitpid()
	FREE_AND_NULL(data->helper);
	// ...
}
So disconnect_helper() is the only place the helper gets reaped. The next question is: does every code path reach it? Searching for exit( in the file reveals the answer — no. There are at least 6 exit(128) calls scattered across the file:
if (recvline(data, &buf))
	exit(128); // helper child is never waited on!
When git remote-https reports an authentication failure, recvline() fails (the helper’s output pipe closes), and git calls exit(128) directly — never going through disconnect_helper(), never calling finish_command(), never calling waitpid().
Why atexit
One approach would be to patch each exit(128) site to call disconnect_helper(transport) first. But this has problems:
- It’s error-prone — miss one and you still have zombies
- Some of those exit() sites don’t have transport in scope, so threading it through would be invasive
- Future code could add new exit() calls without knowing about the cleanup requirement
Instead, I used an atexit handler — a safety net that catches all exit paths with zero changes to existing control flow. Git itself uses this pattern in several places, most notably in run-command.c where cleanup_children_on_exit is registered via atexit to kill and reap child processes on abnormal exit, with clear_child_for_cleanup to deregister them on normal cleanup. The same pattern applies here — register the helper for reaping, clear it when properly disconnected.
The Fix
The fix is an atexit handler:
static struct child_process *helper_to_reap;

static void cleanup_helper_on_exit(void)
{
	if (helper_to_reap)
		finish_command(helper_to_reap);
}
Registered right after the helper starts:
data->helper = helper;
helper_to_reap = helper;
atexit(cleanup_helper_on_exit);
And cleared on normal cleanup:
res = finish_command(data->helper);
helper_to_reap = NULL; // atexit handler becomes a no-op
FREE_AND_NULL(data->helper);
This catches all the exit(128) paths without modifying any of them. The atexit handler runs during process teardown and reaps the transport helper child.
The normal cleanup path and the atexit handler do not conflict. On the normal path, disconnect_helper() calls finish_command() to reap the child, then sets helper_to_reap = NULL. When the process eventually exits, the atexit handler sees NULL and does nothing. On the error path, exit(128) is called without going through disconnect_helper(), so helper_to_reap is still set, and the atexit handler reaps the child.
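Outside of git, the register/clear pattern can be reduced to a small standalone C sketch. It uses a raw PID and waitpid() in place of git’s struct child_process and finish_command(), and every name in it (child_to_reap, start_child, finish_child, reap_registered_child) is made up for illustration. It is not the submitted patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pid_t child_to_reap; /* 0 means: nothing left to reap */

/* Reap the registered child, if any. Returns the reaped PID,
 * 0 if nothing was registered, or -1 if waitpid() failed. */
static pid_t reap_registered_child(void)
{
	pid_t pid = child_to_reap, r;
	int status;

	if (pid <= 0)
		return 0;
	child_to_reap = 0;
	do {
		r = waitpid(pid, &status, 0);
	} while (r < 0 && errno == EINTR);
	return r;
}

/* atexit() handlers take no arguments and return nothing. */
static void reap_child_on_exit(void)
{
	reap_registered_child();
}

/* Start a child that fails immediately and register it for reaping. */
static pid_t start_child(void)
{
	static int handler_installed;
	pid_t pid = fork();

	if (pid == 0)
		_exit(1); /* child: _exit() skips atexit handlers inherited from the parent */
	if (pid > 0) {
		child_to_reap = pid;
		if (!handler_installed) {
			atexit(reap_child_on_exit);
			handler_installed = 1;
		}
	}
	return pid;
}

/* Normal path: reap explicitly, then disarm the exit handler. */
static int finish_child(pid_t pid)
{
	int status;

	assert(pid > 0);
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (child_to_reap == pid)
		child_to_reap = 0;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller that goes through finish_child() leaves the handler with nothing to do. A caller that bails out with exit(128) right after start_child() still gets the child reaped during process teardown. The sketch installs the handler only once because the C standard only guarantees room for 32 atexit registrations.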
The same pattern applies to connect.c, where finish_connect() can be bypassed when callers die() or exit() before reaching it. An identical atexit handler ensures the SSH or proxy child is reaped.
Verification
HTTPS transport (transport-helper.c)
I built the patched git inside a debian:bookworm container (matching the target environment to avoid glibc mismatches), deployed it, and watched:
=== Zombies after 2 minutes ===
0
Zero zombies, despite continuous authentication failures on two repositories every 15 seconds.
SSH transport (connect.c)
To verify the connect.c fix independently, I used a red/green approach:
- Red (before fix): Built the hound image with git patched only in transport-helper.c (no connect.c atexit handler). Hid SSH keys (mv ~/.ssh ~/.ssh_hidden) to force SSH authentication failures. Within seconds, SSH zombies accumulated:
$ ps aux | grep defunct
user 3253967 [ssh] <defunct>
user 3253977 [ssh] <defunct>
user 3254057 [ssh] <defunct>
user 3254063 [ssh] <defunct>
user 3254069 [ssh] <defunct>
All parented to houndd (PID 1 in the container), confirming the finish_connect() path was being bypassed on exit(128).
- Green (after fix): Rebuilt git with the connect.c atexit handler, rebuilt the hound image, and restarted the container, still with SSH keys hidden. The same SSH failures occurred (Host key verification failed, exit status 128), but:
$ ps aux | grep defunct | grep -v grep | wc -l
0
Zero zombies. The atexit handler in connect.c successfully reaps the ssh child process on abnormal exit.
Lessons Learned
- Zombie processes are always about waitpid(), but the question is who should be calling it and which child isn’t being waited on.
- PID 1 matters in containers. Most programs aren’t designed to be PID 1. When they are, orphaned grandchildren become their responsibility. This is why docker run --init (which uses tini as PID 1) exists.
- The obvious suspect isn’t always the culprit. The initial fix targeted hound’s Go code, but the real bug was in git itself. The git proxy approach was what finally revealed the truth.
- Process names in ps tell you less than you think. Zombie processes lose their /proc/<pid>/cmdline. The comm field survives, but it’s just the executable name, not the arguments. We needed the proxy’s log files to see the full picture.
- Namespace translation is essential for container debugging. Host PIDs and container PIDs are different. /proc/<pid>/status with NSpid is the bridge.
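As a rough illustration of the PID 1 lesson, the core of what a minimal init process does for adopted children is a non-blocking waitpid(-1, ...) loop. The C sketch below is an assumption about the general shape of such a loop, not code from tini or hound; reap_exited_children and spawn_exited_child are hypothetical names, and the second one exists only to give the first something to reap.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap every child (own or adopted) that has already exited, without
 * blocking on children that are still running. Returns the number reaped. */
static int reap_exited_children(void)
{
	int reaped = 0;

	for (;;) {
		int status;
		pid_t pid = waitpid(-1, &status, WNOHANG);

		if (pid > 0) {
			reaped++;
			continue;
		}
		if (pid < 0 && errno == EINTR)
			continue;
		/* 0: remaining children still running; -1 is expected only with ECHILD */
		assert(pid == 0 || errno == ECHILD);
		return reaped;
	}
}

/* Demo helper: fork a child that exits at once and block until it has
 * exited but is not yet reaped (WNOWAIT). Returns its PID, or -1. */
static pid_t spawn_exited_child(void)
{
	siginfo_t info;
	pid_t pid = fork();

	if (pid == 0)
		_exit(0);
	if (pid < 0 || waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) != 0)
		return -1;
	return pid;
}
```

An init would typically run something like reap_exited_children() whenever SIGCHLD arrives. In a Go program such as hound, a blanket waitpid(-1) can race with os/exec waiting on the children it started itself, which is one reason delegating the job to an external init is usually the simpler option.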
Resolution
Both transport paths are now fixed and verified:
- HTTPS (transport-helper.c): atexit handler reaps git-remote-https
- SSH/proxy/local (connect.c): atexit handler reaps the connection child
The fix is available on GitHub and has been submitted to the Git mailing list. The bug exists in git’s current master branch as of February 2026.