How Linux suspend and resume works in the ACPI age

Posted 7 Feb 2007 at 11:16 UTC by mjg59

Back in the APM days, everything was easy. You called an ioctl on /dev/apm, and the kernel made a BIOS call. After that, it was all up to the hardware. Sure, it never really worked properly, and it was basically impossible to debug what the hardware actually did. And then ACPI came along, and nothing worked at all. Several years later, we're almost back to where we were with APM. But what's actually happening when you hit that sleep key?

Without the ability to suspend and resume, laptop users are doomed to spend several hours of their lives waiting for machines to boot and shut down. This is, clearly, suboptimal. APM made it fairly easy to implement this, because almost everything was handled by the BIOS. And that, in a nutshell, is one of the primary reasons why ACPI ended up in charge.

The biggest problem with APM is that it left policy in hardware. Don't want to suspend on lid closure? The OS doesn't get any say in the matter, though if you're lucky there might be a BIOS option to control it. Would prefer it if the BIOS didn't scribble all over the contents of your video registers while it tries to reprogram them (probably back to the defaults of the Windows drivers...)? Sucks to be you. Want the sleep button to trigger suspend to disk, not suspend to RAM? A-ha ha ha.

ACPI deals with that problem by moving almost all the useful functionality out of hardware. The downside of this is that the functionality needs to be reimplemented in the OS. Which, given that the ACPI spec is around 600 pages long, has taken a little time.

(Of course, it turns out that most of the ACPI spec is entirely uninteresting for suspend and resume purposes, but that's not really the point right now)

So, firstly, let's have some ACPI jargon. ACPI itself stands for "Advanced Configuration and Power Interface". It's not just a power management spec - it provides the OS with a description of all the built-in hardware in your system, along with a certain degree of abstraction. It gives you information about interrupt routing, tells you if someone's just removed a hot-pluggable DVD drive from a laptop and may even let you control which video output is being used.

This information is provided in a table called the DSDT (Differentiated System Description Table). The DSDT is in a bytecode called AML (ACPI Machine Language), compiled from a simple language called ASL (ACPI Source Language, shockingly enough). At boot time, the OS reads the DSDT, parses it and executes various methods. These can do pretty much anything, but on the bright side they're being executed in kernel context and (in principle) you can filter out anything that you really don't want to do (such as scribbling all over CMOS or something).

The final relevant piece of ACPI information is something called the FADT, or Fixed ACPI Description Table. This gives the OS information about various register addresses. It's a static structure, and doesn't contain any executable code.
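To make "static structure" a bit more concrete, here's a rough Python sketch that decodes the common 36-byte header every ACPI table starts with and checks the table checksum. The function names and the fake table builder are made up for illustration. On a real machine the raw bytes would have to come from wherever your kernel exposes the tables, which I'm assuming is something like /sys/firmware/acpi/tables/ on recent kernels. Note that the FADT carries the signature "FACP", not "FADT".

```python
import struct

# Common header at the start of every ACPI system description table:
# signature, length, revision, checksum, OEM ID, OEM table ID,
# OEM revision, creator ID, creator revision (36 bytes, little-endian).
ACPI_HEADER = struct.Struct("<4sIBB6s8sI4sI")


def parse_acpi_header(table: bytes) -> dict:
    """Decode the common header of a raw ACPI table."""
    (signature, length, revision, checksum,
     oem_id, oem_table_id, oem_revision,
     creator_id, creator_revision) = ACPI_HEADER.unpack_from(table)
    return {
        "signature": signature.decode("ascii"),
        "length": length,
        "revision": revision,
        "checksum": checksum,
        "oem_id": oem_id.decode("ascii").rstrip("\x00 "),
        "oem_table_id": oem_table_id.decode("ascii").rstrip("\x00 "),
        "oem_revision": oem_revision,
        "creator_id": creator_id.decode("ascii", "replace"),
        "creator_revision": creator_revision,
    }


def checksum_ok(table: bytes) -> bool:
    """A valid table's bytes sum to zero modulo 256."""
    return sum(table) % 256 == 0


def build_fake_table(signature: bytes, payload: bytes) -> bytes:
    """Hypothetical helper: build a synthetic table for experimenting."""
    header = ACPI_HEADER.pack(signature, ACPI_HEADER.size + len(payload),
                              1, 0, b"FAKE  ", b"SKETCH  ", 1, b"TEST", 1)
    raw = bytearray(header + payload)
    raw[9] = (-sum(raw)) % 256  # byte 9 is the checksum field
    return bytes(raw)
```

Feeding it `build_fake_table(b"FACP", b"\x01\x02\x03")` gives a table whose header parses cleanly and whose checksum validates. Everything after the header is table-specific, and the sketch doesn't attempt to decode it.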

So, how does all of this stuff actually work?

First of all, the user hits the sleep key. This triggers a hardware interrupt, which is caught by the embedded controller. That pokes a register in the southbridge, which flags that a general purpose event has just occurred. The OS notices this, and checks the DSDT for what's supposed to happen next. Generally, this just calls a notification event. This is bounced back out to userspace via /proc/acpi/event (currently, though it's going to be moved to the input layer in future) and userspace gets to choose what happens next.
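As a sketch of the userspace end of this, here's roughly what picking apart one of those event lines and applying a policy might look like in Python. I'm assuming the usual four-field layout (device class, bus ID, then two hex numbers). The sample line, the names and the policy table are all invented for illustration, and a real daemon such as acpid does a good deal more than this.

```python
from typing import NamedTuple


class AcpiEvent(NamedTuple):
    device_class: str   # e.g. "button/sleep"
    bus_id: str         # e.g. "SBTN"
    event_type: int     # parsed from hex
    data: int           # parsed from hex


def parse_event(line: str) -> AcpiEvent:
    """Split one event line into its four whitespace-separated fields."""
    device_class, bus_id, event_type, data = line.split()
    return AcpiEvent(device_class, bus_id, int(event_type, 16), int(data, 16))


def choose_action(event: AcpiEvent, policy: dict) -> str:
    """Policy lives in userspace: look up what to do for this event class."""
    return policy.get(event.device_class, "ignore")


# Hypothetical policy table; the action names are placeholders.
POLICY = {
    "button/sleep": "suspend-to-ram",
    "button/lid": "ignore",
}

event = parse_event("button/sleep SBTN 00000080 00000001")
action = choose_action(event, POLICY)
```

The point is that the mapping from "sleep key pressed" to "suspend to RAM" is just a table lookup that the user can change, which is exactly what APM didn't allow.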

Let's concentrate on the common scenario, which is that someone hitting the sleep button wants to suspend to RAM. Via some abstraction (either acpid, gnome-power-manager or kpowersave or something), userspace makes that decision and initiates the suspend to RAM process by either calling a suspend script directly or bouncing via HAL.

Depending on distribution, this ends up running a shell script or binary which attempts to prepare the system for suspend. Right now, this tends to involve a bunch of bandaids around various broken drivers - unloading modules and reloading them is one of the easiest workarounds for breakage. Finally, the string "mem" is written to /sys/power/state.
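The last step of such a script can be sketched in a few lines of Python. The function names are my own, and I'm assuming that reading /sys/power/state returns a space-separated list of the sleep states the kernel supports. The path is a parameter so the logic can be tried against an ordinary file, without root and without actually putting the machine to sleep.

```python
from pathlib import Path


def supported_states(state_file: str = "/sys/power/state") -> list:
    """Return the sleep states the kernel advertises, e.g. ['standby', 'mem', 'disk']."""
    return Path(state_file).read_text().split()


def request_suspend(state: str = "mem",
                    state_file: str = "/sys/power/state") -> None:
    """Ask the kernel to enter the given sleep state.

    Against the real sysfs file this needs root, and the write does not
    return until the machine has resumed (or the suspend attempt failed).
    """
    if state not in supported_states(state_file):
        raise ValueError(f"sleep state {state!r} is not supported here")
    Path(state_file).write_text(state)
```

All the module unloading and other workarounds would go before the call to `request_suspend()`, and the matching cleanup after it, since execution simply carries on from that line once the machine wakes up.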

This jumps back into the kernel. First, userspace is stopped. This stops it getting horribly confused when a load of hardware mysteriously stops working. Then the kernel goes through the device tree and calls suspend methods on each bound driver. Individual drivers have responsibility for storing enough state in order to be able to reprogram the device on resume - ACPI doesn't make guarantees about what the hardware state is going to be when we come back. Once the kernel-side suspend code has been run, we execute a couple of ACPI methods - _PTS (Prepare To Sleep) and _GTS (Going To Sleep). These tend to poke various things that the kernel knows nothing about, and so a certain amount of magic may be involved.
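The driver's side of that bargain can be shown with a toy model in Python. This is emphatically not kernel code: the class, the register names and the choice to suspend devices in the reverse of the order they were registered are all assumptions of the sketch. It only illustrates why a driver has to save state itself, by simulating hardware that comes back with its power-on defaults.

```python
class ToyDevice:
    """A pretend device whose registers reset to defaults on power loss."""

    def __init__(self, name, defaults):
        self.name = name
        self.defaults = dict(defaults)
        self.registers = dict(defaults)
        self.saved = None

    def suspend(self):
        # Driver's job: remember everything needed to reprogram the device.
        self.saved = dict(self.registers)

    def power_cycle(self):
        # Simulate S3: nothing is guaranteed about hardware state afterwards.
        self.registers = dict(self.defaults)

    def resume(self):
        # Reprogram the hardware from the saved copy, if there is one.
        if self.saved is not None:
            self.registers.update(self.saved)
            self.saved = None


def suspend_all(devices):
    """Suspend in reverse registration order (an assumption of this sketch)."""
    for dev in reversed(devices):
        dev.suspend()


def resume_all(devices):
    """Resume in registration order, mirroring suspend_all()."""
    for dev in devices:
        dev.resume()
```

If a driver's `suspend()` forgets even one register, the device comes back with the default value in it, which is the whole class of bug described further down.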

At this point, the system should be fairly quiescent. Only two things to do now. Firstly, the address of the kernel wakeup code is written to the firmware waking vector, a field in the FACS (Firmware ACPI Control Structure) that the FADT points to. Secondly, two magic values from the DSDT are written to registers described in the FADT. This usually causes some sort of system management trap, which makes sure that the memory is put in self-refresh mode and actually sequences the machine into suspend. For the S3 power state, this basically involves shutting the machine (other than the RAM) down completely.
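For the curious, the register write itself is just bit twiddling. The Python sketch below builds the value for one of the PM1 control registers, going by my reading of the spec: the sleep type sits in bits 10-12 and the "go to sleep now" enable bit is bit 13. The function name is made up, and the sleep type value 5 in the example is purely illustrative, since the real value is whatever the DSDT says it is for S3 on your particular machine. A real kernel also takes rather more care over how these writes are sequenced.

```python
SLP_TYP_SHIFT = 10
SLP_TYP_MASK = 0b111 << SLP_TYP_SHIFT   # bits 10-12: requested sleep type
SLP_EN = 1 << 13                        # bit 13: enter the sleep state


def sleep_control_value(current: int, slp_typ: int) -> int:
    """Return the PM1 control value that requests the given sleep type.

    `current` is the register's present contents; bits outside the
    sleep type field are preserved.
    """
    if not 0 <= slp_typ <= 7:
        raise ValueError("sleep type must fit in three bits")
    return (current & ~SLP_TYP_MASK) | (slp_typ << SLP_TYP_SHIFT) | SLP_EN


# Illustrative only: pretend the DSDT gave us sleep type 5 for S3.
value = sleep_control_value(0x0001, 5)
```

There are two magic values because the FADT can describe two such control registers, and each gets its own sleep type.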

Time passes.

The user presses the power button. The system switches on, jumps to the BIOS start address, does a certain amount of setup (programming the memory controller and so on) and then looks at the ACPI status register. This tells it that the machine was previously suspended to RAM, so it then jumps to the wakeup address programmed earlier. This leads it to a bunch of real-mode x86 code provided by the kernel, which programs the CPU back into protected mode and restores register state. Suddenly we're running kernel code again.

From this point onwards, it's much the reverse of the suspend process. We call the ACPI _WAK method, resume all the drivers and restart userspace. The shell script suddenly starts running again and cleans up after itself, reloading any drivers that were unloaded before suspend. As far as userspace is concerned, the only thing that's happened is that the clock has jumped forward.

So why is this difficult?

In a lot of cases, it's just down to bugs in the drivers. Restoring hardware state can be hard, especially if you don't actually have all the documentation for the hardware to start with - traditionally, many Linux drivers have ended up depending on the BIOS to have programmed the hardware into a semi-sane state, and there's no guarantee that that will happen with ACPI. Other cases can just be oversights - for instance, the bug in the APIC (not to be confused with ACPI) code that meant a single register wasn't restored, resulting in some machines resuming without any interrupts being delivered.

The single biggest problem is video hardware. The spec doesn't require the BIOS to reprogram the video hardware at all, and so often it'll come back in an entirely unprogrammed state. This is an issue, since we (in general) have absolutely no idea how to bring a video card up from scratch. One of the easiest workarounds is to execute code from the video BIOS in the same way that the system BIOS does on machine startup. vbetool lets you do this from userspace, and it works a surprisingly large amount of the time. However, there's no guarantee that it'll be successful. Vendors often unmap that section of BIOS after the system has been brought up, since they've got far more BIOS code than will fit in the BIOS region of the legacy address space. In the long run, the only solution is drivers that know how to program an entirely uninitialised chip. The new modesetting branch of the Intel driver aims to do this, as do the developers of nouveau.

Despite all this misery, ACPI support is generally improving. Most machines can now suspend and resume once more. The next big challenge is improving run-time power management in order to get battery life to at least the level it is under Windows, and ideally beyond that.

Thanks, posted 9 Feb 2007 at 20:28 UTC by ncm » (Master)

Thank you, Matthew. I gave up on ACPI-suspend on my Dell D620, and switched to swsuspend2. It mostly works, modulo the occasional OOPS while trying to suspend which I am inclined to blame on the bcm43xx driver. Since the whole machine powers off, the BIOS is obliged to initialize everything when it comes back up.
CPU speed is another area where ACPI seems to over-complicate matters. I see hardly any reason to operate the CPUs in any mode between completely powered-off and as fast as they can go. In particular, "throttling modes" look completely pointless. Maybe they're supposed to reduce power usage when people are running screen savers or other spin-loop processes that shouldn't be running at all?

Probably the only solution, ultimately, to the ACPI/BIOS problem will be to "turn" a big laptop reseller, and count on them to put the screws on the manufacturers. In the meantime it's a holding action.

Now, if only I could find time to trace the code that maps key events to scripts on Debian, or find a page by somebody who did...

reverse-engineering for handhelds, posted 10 Feb 2007 at 22:13 UTC by lkcl » (Master)

we have to deal with this, a lot - nothing to do with ACPI, on the reverse-engineering project to get linux running on HTC (high tech computer corporation) handhelds (smartphones and pdas). it's the one major irritating factor that can stop a device from being useful.
i managed to get linux running entirely on the ipaq hw6915 in just six weeks (because of the common hardware between the htc universal, ipaq hx4700 and a couple of others).

however, i spent a very frustrating further two weeks trying to work out what it was that was stopping the device from being able to resume: exactly as you say, interrupts weren't being enabled [in the end i had to give up as i was running out of time]

now the really annoying thing is that it is terribly difficult to track down why this is.

firstly, you're resuming: you _can't_ do any significant debugging: it either works, or it doesn't. if you get it right, it resumes. if you don't get it right, you've no way of communicating anything to find out why.

secondly, the booting is coming not from startup but from gnu-haret.exe which is the ARM / wince equivalent of LOADLIN.EXE for x86 / win32. so, you're booting into linux with most of the hardware preinitialised. on some devices, it is an absolute _bitch_ to work out the GPIO pin requirements, some of which require switching to alternate states and back with special timings in between to let buggy or glitchy hardware recover, because there's not enough current or something - you just don't know.

and on one device, 272 GPIO pins (192 on the CPU, 16 on a chip we've named EGPIO for 'extended gpio', and 64 on a separate custom I/O chip which we've named ASIC3) ... actually weren't enough, so the designers had to _borrow_ some of the I/O pins on the GSM radio rom (another 64 or so gpio pins - it's another ARM processor) and so you have to, unbelievably, communicate proprietary commands over a _serial_ line to specify which set of speakers are to be used! (yes, this device i'm describing i think it has 5 speakers 2 of which are stereo speakers and 3 microphones and a headphone socket _and_ bluetooth audio _and_ a car-kit)

physics, posted 13 Feb 2007 at 05:10 UTC by ncm » (Master)

Reverse engineering is like physics. There's certainly a design because it (universe, gadget) exists. There's equally no guarantee the design can be elucidated from obtainable evidence. The only saving grace for reverse engineers is that increased complexity drives designers to more standardized components, because the designers have to be able to understand it themselves. We just have to hold out until Linux is one of those components.
Physicists, OTOH, are probably SOL.

tell me about it :), posted 13 Feb 2007 at 13:17 UTC by lkcl » (Master)

yes. the increased commoditisation and success of HTC means that they are endeavouring to drive down the development costs.
the HTC smartphone/PDAs (HTC designs all the iPAQs) from the past three or so years have all used the AKM AK4641 sound chip instead of the rather esoteric philips UDA1380 (which is a pain to program in some of the hardware configurations it's been used in : the ADC and DAC clocks can be on separate ... it's a long story :) )

the first HTC device (the wallaby) was an _absolute_ bitch, and the only saving grace was that the next two revisions kept the majority of the components and also that one of the chips was also in the iPAQs, which the HP/Compaq engineers had at least some scant internal documentation on.

but anyway - this is off-topic for the original article so i'll shut up now :)

[linuxkernelnewbies] Advogato: How Linux suspend and resume works in the ACPI age