Crazy high CPU usage on Snow Leopard and a surprising culprit


After coming back from a team-building trip, I started to notice things were going a bit slow on my work laptop. I took it with me for casual emailing and working on the train, but spent about 4 days not really using it and certainly not ‘working’ on it. It came out once to display the lyrics to a song about Pithivier, and once to check emails, but that was about it.

On my return to work everything ran a little bit slower than I remembered. I’m building a web app, so I have a vertical stack of Postgres, Python and Chrome running at the same time. To top it off I had a brief excursion into someone else’s code which I didn’t know as intimately as my own. For a while I put it down to ‘I’ve been away and forgotten what it’s like to use’. Soon came ‘Python is a slow language’. Not long after that, ‘and so is JavaScript’. No late-binding language was safe.

I put up with this for about a week. Things were usable, just sluggish. It gave me a bit of nostalgia, taking me back to Windows days when things would just slow down after a couple of years. I noticed that Firefox was the main culprit. I’ve used Firefox for 6 or 7 years, and it has had, on occasion, some terrible performance. This wasn’t such a surprise. I tend to favour Chrome anyway, so I just got on with it.

But this evening it came to a point when I had a handful of tabs open, Firefox was converging on 200% CPU and I thought ‘this is my machine not yours’ and took a closer look.

The first thing I noticed was that, of the CPU cycles used by Firefox, the Firefox code itself wasn’t actually guilty.

Notice all that red? That’s System CPU usage. And that green skimming on top? That’s User CPU. It’s been over a year since I’ve been able to take an interest in this kind of thing, but if I’m correct, the red is system calls, called as a result of userland code, executed in protected kernel space. The green is user code in userland space.
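If you want to see the same user/system split without a graph, here is a minimal sketch in Python. It assumes a Unix-like system (the `resource` module isn't available on Windows), and the helper names `cpu_split`, `user_heavy` and `system_heavy` are made up for illustration. It measures the calling process only, so it shows the idea rather than how I diagnosed Firefox.

```python
import os
import resource


def cpu_split(work):
    """Run work() and return (user_seconds, system_seconds) it consumed."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    work()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_utime - before.ru_utime,
            after.ru_stime - before.ru_stime)


def user_heavy():
    # Pure computation: the interpreter stays in userland.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


def system_heavy():
    # Many tiny reads: each os.read() is a system call into the kernel.
    fd = os.open(os.devnull, os.O_RDONLY)
    try:
        for _ in range(200_000):
            os.read(fd, 1)
    finally:
        os.close(fd)


if __name__ == "__main__":
    for name, work in (("user_heavy", user_heavy),
                       ("system_heavy", system_heavy)):
        user, system = cpu_split(work)
        print(f"{name}: user={user:.3f}s system={system:.3f}s")
```

I'd expect the second workload to show a larger share of system time than the first, but the exact numbers depend on the machine and the OS.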

I’m not sure why I didn’t notice this before. This is more serious then: it’s not buggy Firefox code, but something’s going wrong with the system. Perhaps it’s a kext gone berserk, or hardware damage resulting in repeated attempts to perform low-level tasks. Perhaps it’s a bug in Mach, heaven forfend. I have been on the brink of wiping and re-installing my Mac, and this nudged me a bit closer.

I’m a fairly vanilla user. I don’t have tonnes of services running, or kexts and drivers installed. Launchd is pretty much as God intended. I don’t have any extra daemons running, at elevated privileges or otherwise.

I had noticed other processes use high CPU. TextMate was one culprit, and I’m sure Chrome did this too from time to time. Python was affected, but the closest I got to finding the problem was identifying a stack-maintenance function in CPython, which isn’t such a surprise. Firefox was the most noteworthy recidivist, and it was doing it right now.

So I decided to run Shark on the Firefox process. The results were as follows:

 

Aha! So it is kernel code responsible. I know from my copy of Mac OS X Internals: A Systems Approach (it’s an excellent book), as well as it being fairly obvious from the name, that vm_fault is the kernel routine that handles page faults.
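For anyone who wants to watch page faults from userland, here is a rough sketch in Python, again assuming a Unix-like system. It counts the faults the current process takes while touching freshly allocated memory. The helper names `page_faults` and `touch_fresh_memory` are hypothetical, and this is nothing like what Shark does: Shark samples stacks inside the kernel, whereas this only reads the counters the OS keeps per process.

```python
import resource


def page_faults(work):
    """Run work() and return (minor_faults, major_faults) taken meanwhile."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    work()
    after = resource.getrusage(resource.RUSAGE_SELF)
    return (after.ru_minflt - before.ru_minflt,
            after.ru_majflt - before.ru_majflt)


def touch_fresh_memory(size=64 * 1024 * 1024):
    # Allocate a large buffer, then write one byte per page. The first
    # write to a page that isn't mapped yet makes the kernel fault it in.
    step = resource.getpagesize()
    buf = bytearray(size)
    for offset in range(0, size, step):
        buf[offset] = 1
    return len(buf)


if __name__ == "__main__":
    minor, major = page_faults(touch_fresh_memory)
    print(f"minor faults: {minor}, major faults: {major}")
```

Minor faults are serviced without going to disk and major faults need I/O. How many of each you see depends on the allocator and the OS, so treat the numbers as indicative only.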

Whilst this was running I was searching the web for random high CPU usage. I came across this discussion which blamed BTServer. I just happened to notice that Bluetooth was enabled. We’d tried to use it to get the Pithivier words transferred to my laptop on the team-building trip. Bluetooth was the first port of call: I’d last done this in 2004, before wifi came as standard, and it was straightforward then. It didn’t work this time, so we used wifi.

I disabled Bluetooth. Behold:

CPU suddenly dropped to what I’d expect. Just like that. I have frankly no idea whatsoever why this happened.

I can’t see a connection between the Bluetooth kext (I imagine it’s a kext) and page faults. I don’t see why it would happen for one codebase / process any more than any other. I can’t see why it would show up in a completely unrelated process rather than in the Bluetooth daemon itself. I can only guess that it somehow shifted some function pointers around, or registered on some interrupts, or something. Perhaps the user_trap was somehow installed by the Bluetooth server, and had some expensive code in it? Perhaps an IOKit driver was installed in the wrong bit of the device tree.

But I don’t really care, problem solved. Except it does leave me casually wondering if the design of Firefox leads to more page faults than Chrome. And apologising to JavaScript and Python and all their late-binding friends, as the problem was almost certainly written in C or C++.

tl;dr if your CPU runs excessively high on Mac OS X Snow Leopard, check to see if you’ve got Bluetooth enabled.

If you do, disable it. It’s 2012.
