
Solving PC Battery Drain Issues: A Case Study of Endless Laptop’s Technical Journey

Originally posted at Endless OS Foundation

In June 2022, a community of low-income women entrepreneurs in the United States was approached with the possibility of obtaining their first personal computer under Endless Laptop’s innovative financing plans. They attended a local event, made a small down payment, and walked away with a powerful personal computer full of tools and content ready to support the growth of their businesses and the education of their families.

The Endless Laptop

Our device access program was finally underway, reaching families that had never had a PC in their home before. This initiative was swiftly extended to underserved communities in Guatemala and Mexico, but as the user base grew, a significant flaw emerged: the laptop battery would run down very quickly when the system was in sleep mode.

In this article I wish to share the story of how Endless OS Foundation’s highly skilled Tech team relentlessly chased this issue through several twists into the guts of the PC architecture, eventually pinpointing the problem to the most unlikely of things: a misbehaving device driver corresponding to a type of disk not even present in the product.

Due to the complexity and specificity of this problem, this recap unavoidably needs to draw on fairly technical jargon. I hope it is accessible to those who have at least a loose interest in PC and operating systems architecture!

Solving excessive battery drain through a stroke of good luck?

We first became aware of issues caused by rapid battery depletion in sleep mode at the same time that our Taipei lab staff were coincidentally debugging a different power management issue on a newer variant of the same laptop. It turned out that the new product, which was being considered as a successor, could not sleep at all: upon waking up, it would lose access to the disk.

Our software platform, Endless OS, is based on open source software (Linux), which allows us to tap into an extensive public developer community. Through wider discussion with that community, we tested and approved a workaround that would have both laptop models revert to an older implementation of sleep mode.

Besides allowing the newer product to sleep and wake without issue, this change had the side effect of greatly reducing the amount of power used during sleep on the original Endless Laptop model that we had already delivered to our users. This exercise had coincidentally solved the power usage issue they were seeing. “That’s handy!”, we thought, as we swiftly rolled out this change to our user base via a software update.

Failures to wake up from sleep mode

Our attention was called back to this issue when we later identified a slow but steady stream of support requests from our users reporting that the device would occasionally get stuck in sleep mode. When this happened, the system would be unresponsive to any attempt to wake it up from the low power state. The failure was very hard to reproduce, but we were eventually able to trigger it and characterize it in detail.

Our workaround to the battery drain issue above was causing these systems to use S3 legacy suspend, a historical implementation of PC sleep mode. In this mode, control of the device is fully handed over to the system firmware when going into sleep mode, and the CPU is powered down while RAM is kept in a low-power self-refresh state. Because the wakeup failure was happening in this mode, it was apparent that the issue was emerging at system firmware level, beyond the reach of the operating system. It is perhaps not surprising that such a firmware issue may exist: this product was not designed for S3 legacy suspend, S3 is likely untested and unsupported on this device, and we should probably not be using it.
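For readers who want to see which sleep implementation a Linux system is using, the kernel normally exposes the choice in /sys/power/mem_sleep, where `deep` corresponds to S3 legacy suspend and `s2idle` to suspend-to-idle, in which the operating system stays in control. The following is a minimal sketch rather than anything taken from Endless OS; the helper name `parse_mem_sleep` is hypothetical, and it only parses text in that file's format.

```python
def parse_mem_sleep(text):
    """Return (available_modes, selected_mode) from mem_sleep-style text.

    The kernel lists every supported mode on one line and wraps the
    active one in square brackets, for example "s2idle [deep]".
    """
    available = []
    selected = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]  # strip the brackets around the active mode
            selected = token
        available.append(token)
    return available, selected


# Example with a literal string; on a real system the text would come
# from reading /sys/power/mem_sleep.
modes, active = parse_mem_sleep("s2idle [deep]")
print(modes, active)
```

On a system that has been switched to legacy suspend, the selected mode would be `deep`.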

Despite the initial indication that we had got lucky with the workaround to use S3 legacy suspend, it turned out to be unreliable. We knew we had to drop it and go back to understand the original problem in more depth. We had two questions to answer:

  1. Why was the newer variant of the product failing to access the disk after waking up, before we put the (problematic) workaround in place?
  2. Why was the device draining so much power during sleep mode, before we put the (problematic) workaround in place?

Intel Volume Management – failed disk access after wakeup

We compared the two product variants closely and spotted the reason why the newer variant had disk access issues after wakeup: it had the disks configured differently.

The original product had been rolled out with Intel VMD (Volume Management Device), a system function enabling powerful data storage setups that are not particularly relevant to our home PC use case. The newer sample had been configured to access the disks in the traditional way, without VMD, and it was this non-VMD configuration that was losing disk access after waking up.

We looked closely and found that our Linux-based operating system was completely powering down the disk in non-VMD sleep mode. This makes sense, because you want to save as much power as possible while the system is sleeping. But we observed that the device was unable to restore power to the disk from that state, and using advanced debugging tools, we observed that Windows, a different operating system, was not cutting the disk power during sleep mode on this product.

Advanced disk power state debugging

We still don’t know why the power is retained in that configuration. Nevertheless, we updated the Linux behaviour to match. The problem was now avoided, but this time in a way where we had a far more precise understanding of the issue.

Modern Standby: understanding power usage

Now we had both laptop models able to sleep and wake up, regardless of disk configuration, without using the problematic legacy suspend method. It was time to return our attention to the original problem: why is so much power consumed when the system is asleep?

This product uses a Modern Standby design where the core system processor and operating system actually remain active during sleep mode. However, the operating system attempts to turn off as many hardware components as it can (screen, Wi-Fi, disk, etc.), pause all apps, and get the processor into an ultra-low power mode where it has almost no work to do. The goal is for power consumption to drop so drastically that the system can be in sleep mode for days, even though technically you could regard the core system as being awake and running.

In our case, clearly this power consumption goal was not being met. The battery was being drained in a matter of hours in sleep mode.

We called upon some low-level debugging features of the Intel processor that identify which specific parts of the system are reaching their lowest power states during suspend, and which are not. This revealed that the SATA disk controller was preventing the CPU from going into low power mode.
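The exact tooling is not named here, so the following only illustrates the kind of measurement involved. On many Intel systems the `intel_pmc_core` driver exposes a counter called `slp_s0_residency_usec` under debugfs (typically in /sys/kernel/debug/pmc_core/), which accumulates the time spent in the deepest platform sleep state. Sampling it before and after a suspend cycle shows what fraction of the sleep actually reached that state. The function name and the sample numbers below are hypothetical.

```python
def s0ix_residency_percent(before_usec, after_usec, suspended_seconds):
    """Percentage of a suspend period spent in the deepest platform state.

    before_usec / after_usec are two readings of a residency counter in
    microseconds; suspended_seconds is how long the system was asleep.
    """
    if suspended_seconds <= 0:
        raise ValueError("suspended_seconds must be positive")
    residency_usec = after_usec - before_usec
    if residency_usec < 0:
        raise ValueError("counter went backwards; readings are not comparable")
    percent = 100.0 * residency_usec / (suspended_seconds * 1_000_000)
    return min(percent, 100.0)  # guard against timing jitter above 100%


# Hypothetical readings for a 10 minute suspend.
print(s0ix_residency_percent(1_000_000, 541_000_000, 600))  # healthy system
print(s0ix_residency_percent(1_000_000, 1_000_000, 600))    # something is blocking
```

A result near zero, as in the second call, is the signature of a component that keeps the processor out of its low power mode.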

This was a very surprising finding. SATA refers to a type of disk, but this product uses a more modern type of storage (NVMe) – not SATA! Why on earth would the unused SATA controller be getting in our way? What could cause it to prevent the CPU from deep sleep?

The mysterious Tiger Lake SATA power savings issue

Harnessing the power of the open source community, we were able to put those questions directly to Intel engineers highly familiar with the workings of the hardware. That quickly gave us the exact direction we needed: SATA power savings had been intentionally disabled for this specific Intel “Tiger Lake” processor family on Linux. When power savings had been enabled at an earlier point, it had caused multiple users to mysteriously find themselves unable to boot their computers; nobody knew why.

This suggested that there was probably a whole range of products suffering from this power drain issue. It also meant that we would have to solve this SATA disk issue in order to make progress, despite our product not even making use of SATA.

Refocusing around this challenge, Endless’s Jian-Hong Pan impressed us all by quickly spotting a peculiar detail that had evaded everyone else for years: the code being used to turn on power savings for Intel SATA controllers was quietly and unexpectedly activating an additional behavior change for these devices. Much older Intel SATA controllers needed a “quirk” in order to support multiple disks, and this behavior change had been intended to be restricted to Intel hardware up to around 2017. Six years later, however, Linux was inadvertently applying the quirk to most present-day Intel SATA controllers. And for whatever reason, applying this obsolete quirk to the Intel Tiger Lake processors would cause the SATA disks to become completely inaccessible.
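The shape of this kind of bug can be shown with a deliberately simplified model. This is not the Linux driver code; every name below is hypothetical, and it only illustrates how bundling two unrelated behaviours under one hardware category lets a request for one silently bring along the other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoardType:
    """Hypothetical bundle of behaviours a driver applies to a controller."""
    name: str
    power_savings: bool
    legacy_multi_disk_quirk: bool


# Before: asking for power savings also selects the legacy quirk.
LOW_POWER_BUNDLED = BoardType("low_power", True, True)

# After: the quirk is its own explicit property, kept for old hardware only.
LOW_POWER = BoardType("low_power", True, False)
LEGACY_LOW_POWER = BoardType("legacy_low_power", True, True)


def disks_accessible(board, controller_tolerates_quirk):
    """Disks stay reachable unless an intolerant controller gets the quirk."""
    return not board.legacy_multi_disk_quirk or controller_tolerates_quirk


# A modern controller that does not tolerate the quirk (hypothetical).
print(disks_accessible(LOW_POWER_BUNDLED, False))  # disks lost
print(disks_accessible(LOW_POWER, False))          # disks fine, savings on
```

Separating the two properties is what allows power savings to be switched on for newer controllers without touching the behaviour meant for older ones.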

Mission accomplished, time to sleep

Thanks to our findings, the Linux SATA maintainers were able to restrict the application of the SATA quirk and activate power savings for Tiger Lake SATA, which should improve power usage on a whole range of devices in addition to ours. We then prevented our disk being problematically turned off during sleep mode and re-enabled Modern Standby for this product, which is now able to achieve around a week of battery life in sleep mode. These fixes were all incorporated into official versions of Linux, and rapidly rolled out to our user base in Endless OS 5.1.2. With the problem incidence rate having subsequently dropped to zero, we can comfortably conclude that Endless Laptop’s first-time PC users in underserved communities are now enjoying long battery life on their devices.
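To put “a matter of hours” versus “around a week” into perspective, standby time is simply battery capacity divided by average power draw. The sketch below works through that arithmetic; the 40 Wh capacity and 5 W draw are hypothetical figures chosen for illustration, not measurements from the Endless Laptop.

```python
def standby_hours(capacity_wh, average_draw_w):
    """Hours a battery lasts at a constant average power draw."""
    if average_draw_w <= 0:
        raise ValueError("average_draw_w must be positive")
    return capacity_wh / average_draw_w


def draw_for_target(capacity_wh, target_hours):
    """Average draw in watts that would make the battery last target_hours."""
    if target_hours <= 0:
        raise ValueError("target_hours must be positive")
    return capacity_wh / target_hours


CAPACITY_WH = 40.0  # hypothetical battery capacity

# A system that fails to reach its low power state might idle at several watts.
print(standby_hours(CAPACITY_WH, 5.0))

# Lasting a full week (168 hours) requires a draw of roughly a quarter watt.
print(round(draw_for_target(CAPACITY_WH, 7 * 24), 3))
```

Under these assumed numbers, moving from hours to a week of standby means cutting the average draw by a factor of about twenty, which is why a single component blocking the low power state matters so much.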

That was a long, hard, fascinating ride. What started with power usage issues took us through suspend mechanisms, firmware issues, disk power management, and quirks for unrelated hardware predating our product by several years. This example demonstrates the skill and resilience of the Endless team, the power of open source communities, and the importance of solving technical issues through truly understanding their root cause, no matter how deep you have to go.

Credit to Jian-Hong Pan and Cassidy Blaede at Endless for their detailed investigation of this issue, David Box and Mika Westerberg from Intel for their speedy and invaluable direction, and Linux SATA maintainers Niklas Cassel and Mario Limonciello for pushing the crucial fixes over the finish line.

PCI development project

I’m looking to find someone to take over some part-time contract work that I’ve been doing. I’m only stopping as I am about to start some full-time summer work.

The project is Linux driver development for a PCI frame grabber. Kernel experience is essential, and the important areas of knowledge are PCI and DMA. Location does not matter, as this is a remote development project.

If you’re interested, or know anyone that might be, please drop me an email.

Lorin Olivier’s GL860 driver

Lorin Olivier has created a kernel driver for his GL860 webcam. Lorin’s device is the 05e3:f191 variant, whereas mine is the more common 05e3:0503. There are differences between the devices that we don’t have much of a grasp on. The code we’ve written for each device is incompatible with the other, even though there are some protocol similarities.

Lorin reports that his driver works reasonably well with his device – it works with camorama, xawtv, ekiga, amsn, mplayer and xsane. He has also determined how to adjust various camera settings (luminosity, saturation, hue, sharpness, retrolighting, mirror effects, light source, AC power frequency).

Although Lorin doesn’t actually own an 0503 device, he’s attempted to implement support for it based on my earlier efforts. Given that I didn’t get very far, it probably doesn’t work that well. I haven’t had a chance to try it, but there’s no point me sitting on this any longer.

It’s in my git repository in the nvgl subdirectory:
git://projects.reactivated.net/~dsd/gl860.git (gitweb).

All credit goes to Lorin here – thanks! He’s done a great job, but do remember that it’s experimental code based on a reverse-engineered protocol, so don’t expect it to be flawless.

GL860: more devices, colour images

Lorin Olivier also has a GL860 with a different USB ID (05e3:f191) in an Asus F5RL laptop. He had some success with my code, but the images look to be in a different format when running my software. He’s contributed traffic logs from Windows, which I’ve put alongside mine in the git repository.

Simon (Sur3) also tried it with his 05e3:0503 device and got a seemingly different image format as well. He also took an image that came back from mine and decoded the Bayer colour space, so I can now get images back in colour!
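For anyone unfamiliar with Bayer decoding, here is a minimal sketch of the idea. It is not the code Simon used, and it assumes an RGGB layout, which has not been confirmed for this sensor; the function name is hypothetical. Each 2x2 cell of raw samples is collapsed into a single RGB pixel, so the output has half the resolution of the raw frame.

```python
def demosaic_rggb(raw, width, height):
    """Convert a raw Bayer frame into a half-resolution RGB image.

    raw is a flat sequence of 8-bit samples, row by row. Each 2x2 cell is
    assumed to be laid out as R G on the first row and G B on the second.
    Returns a list of rows, each a list of (r, g, b) tuples.
    """
    if width % 2 or height % 2:
        raise ValueError("width and height must be even")
    if len(raw) != width * height:
        raise ValueError("raw data does not match the given dimensions")
    image = []
    for y in range(0, height, 2):
        row = []
        for x in range(0, width, 2):
            r = raw[y * width + x]
            g1 = raw[y * width + x + 1]
            g2 = raw[(y + 1) * width + x]
            b = raw[(y + 1) * width + x + 1]
            row.append((r, (g1 + g2) // 2, b))  # average the two greens
        image.append(row)
    return image


# A single 2x2 cell becomes one colour pixel.
print(demosaic_rggb(bytes([10, 20, 30, 40]), 2, 2))
```

If the colours come out swapped on a real frame, the sensor most likely uses one of the other Bayer orderings, and the four sample positions need to be rearranged accordingly.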

It’s great to see other people getting involved in these efforts, as I will probably not be able to put much time towards this for a while.

GL860 driver code

More webcam hacking. I can get proper images now, minus colour. I’ve published my code: git://projects.reactivated.net/~dsd/gl860.git (gitweb interface). Nightly snapshots will be generated here.

So far it just includes my experimental programs to try and make sense of the protocol and capture images. It works, sort of, but there’s a lot to be done. It also requires libusb-1.0 due to the isochronous endpoint. Only try it if you’re interested in development or are just very keen and curious.

libfprint v0.0.6 and other new devices

Although I’m not really working on the “old” code any more, I released libfprint v0.0.6 today. It fixes compatibility with newer DigitalPersona scanners including the ones in Covadis products (who kindly donated hardware to allow for this development). It also adds Gustavo Chain’s driver for the SecuGen Hamster III.

Gavin Smalley donated a Veridicom 5thSense scanner, which I reverse engineered and produced a driver for. This driver is only available from the highly volatile libfprint development repository. It works well.

System76 generously donated a laptop with one of the dreaded 147e:2016 UPEK scanners so that I can work on getting it supported in fprint. It’s too early to discuss driver practicalities, but I have almost figured out the image format.

The laptop also includes an integrated Genesys Logic GL860 USB webcam (05e3:0503), not standards compliant and not usable under Linux. I’ll probably also be working on a driver for this device. Again, I have already almost determined the image format, but have not looked at the rest of the traffic.

Critical Linux kernel vmsplice security issues

There have been 2 significant security flaws found in the Linux kernel, accompanied by plenty of misinformation and confusion. This is my attempt to clear things up a bit.

The short story: If you are running Linux 2.6.17 or newer then any user who has local console or SSH terminal access to your machine can easily become root or crash the system. If this is a problem for you, then you need to upgrade to gentoo-sources-2.6.23-r8 or gentoo-sources-2.6.24-r2. At the time of writing, there are no official released upstream kernels which solve the issues – Linux 2.6.24.1 and 2.6.23.15 are vulnerable.

The longer story:

There are actually two separate security issues in question here. However, they both have the same impact (any user can adjust kernel memory and hence become root), and both issues exist within the implementation of the vmsplice() system call. vmsplice() was added in Linux 2.6.17 and is built into every kernel build – there is no configuration option to exclude vmsplice. Two separate exploits have been publicly released which exploit each of the two issues respectively.

The first security issue under discussion was added in Linux 2.6.23 (obviously unintentionally!). This means that 2.6.22 and older are not vulnerable to the first exploit. This issue was fixed by this patch in Linux 2.6.23.15 and Linux 2.6.24.1. This vulnerability has been classified with two codes: CVE-2008-0009 and CVE-2008-0010.

The second security issue is more serious. Firstly, it has existed for the entire lifetime of vmsplice() which means that any kernel version 2.6.17 or newer is vulnerable. Secondly, it is not fixed in any upstream kernel release at time of writing, but the fix has been merged into Linus’ upstream development tree. This vulnerability has been assigned ID CVE-2008-0600.

gentoo-sources-2.6.23-r7 and gentoo-sources-2.6.24-r1 include the fix for the first issue, but are still vulnerable to the second (which is equally serious).

gentoo-sources-2.6.23-r8 and gentoo-sources-2.6.24-r2 include the fix for the second issue and are hence secured against all known vmsplice exploits at this point in time. 2.6.23-r8 will be marked stable when I wake up 7-8 hours from now, so testing of that release would be appreciated.

UPDATE: gentoo-sources-2.6.23-r8 is now stable, and upstream have also released the following which fix all currently known issues: Linux 2.6.23.16, 2.6.24.2 and 2.6.25-rc1.
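The version rules in this post can be condensed into a small check. This is only a sketch of the statements made above, with hypothetical function names. It handles plain dotted upstream versions only (not -rc tags or gentoo-sources revisions), and it treats any stable series for which this post names no fix as still affected, which is an assumption.

```python
# Fix points for upstream kernels as stated in this post.
FIRST_ISSUE_INTRODUCED = (2, 6, 23)
SECOND_ISSUE_INTRODUCED = (2, 6, 17)
FIRST_ISSUE_FIXED = {(2, 6, 23): 15, (2, 6, 24): 1}
SECOND_ISSUE_FIXED = {(2, 6, 23): 16, (2, 6, 24): 2}
FIXED_FROM_SERIES = (2, 6, 25)  # 2.6.25-rc1 already carries both fixes


def parse_version(text):
    """Split '2.6.23.15' into ((2, 6, 23), 15); a missing part counts as 0."""
    parts = [int(p) for p in text.split(".")]
    while len(parts) < 4:
        parts.append(0)
    return tuple(parts[:3]), parts[3]


def open_vmsplice_issues(version):
    """Return the set of vmsplice issues ('first', 'second') still open."""
    series, stable = parse_version(version)
    issues = set()
    if series >= FIXED_FROM_SERIES:
        return issues
    if series >= FIRST_ISSUE_INTRODUCED:
        fixed_in = FIRST_ISSUE_FIXED.get(series)
        if fixed_in is None or stable < fixed_in:
            issues.add("first")
    if series >= SECOND_ISSUE_INTRODUCED:
        fixed_in = SECOND_ISSUE_FIXED.get(series)
        if fixed_in is None or stable < fixed_in:
            issues.add("second")
    return issues


for v in ("2.6.16", "2.6.22", "2.6.23.15", "2.6.24.2"):
    print(v, sorted(open_vmsplice_issues(v)))
```

An empty result means the version is not affected by either issue according to the rules above.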

Gentoo kernel project contributors

On the Gentoo kernel maintenance front, I’ve been slacking lately. After launching the project, my fingerprint scanning efforts soon started to eat almost all of the time I’m willing to spend in front of a computer. Then came a busy Christmas and New Year, a quick week in the US, exam revision and now exams; it’s been a few months since I put proper time into Gentoo kernel work. I’m feeling a little guilty, as this inactivity started at pretty much the same time as I became the kernel project lead.

Yet, the Gentoo kernel bug list shows only 23 bugs open, plus no critical/widespread unsolved issues at a cursory glance (when I was doing this singlehandedly, I usually had problems keeping this count below 40). This is all thanks to Maarten Bressers, Duane Griffin and Mike Pagano. Unfortunately Maarten is tied up with other issues at the moment, but Duane pops up from time to time and singlehandedly solves some tricky-looking issues and Mike is very active and is doing a fine job keeping things shipshape.

Before getting involved with Gentoo kernel bugs and genpatches maintenance, none of the three had any prior involvement with the kernel. One of the things that prompted me to write this post was getting up today and seeing an IRC conversation in which Mike used some diagnostic knowledge he’d gained from a Gentoo kernel bug to make a suggestion to another user who was having trouble booting their system (which I am quite confident will solve the issue). Definitive proof that Mike has become a skilled and efficient bug-attacking machine.

If other developers are wondering how I managed to turn these “newbies” into enthusiastic and productive contributors, my process was as follows:

  1. Write a maintenance guide giving people enough information to get started.
  2. Encourage the interested respondents to ask lots of questions (I think this is the most important part: be clear that you’re available to be consulted).
  3. Advertise it in the Gentoo Weekly Newsletter.
  4. Wait for some questions to come in (and answer them).

All in all, it was quite time consuming to write the initial document and then answer questions, but the fact that I can then be largely inactive for a few months and still have things running smoothly tells me that it was worth the investment.

Recent ramblings

Recent writing-related updates: