From popped caps to a peculiar kernel error

In my younger years I was known to my friends as the ‘hardware man’, someone who gets along nicely with PCs in general. So I had the dubious luck of seeing some of my friends’ relatives’ machines from time to time, trying my best to fix them. Sometimes it was easy, sometimes it was… interesting.

Zoltán brought this box with the error report: this is the PC his little cousin uses to browse the net, and it became unstable lately, plus they started a company, and the OS on the machine is not exactly legit. With all this info, I gave my preliminary diagnosis for the hardware (suspected the PSU) and the software (thou shalt use open source). Now I’m left with a gray box and my tools. Let’s open ‘er up!

Whoa. It became widely known in the later part of 2002 that some high-capacity electrolytic capacitors from certain manufacturers were not exactly up to industry standards. Their failure rates were unusually high: some examples barely reached 250 hours of duty, then began to lose their capacity exponentially. Soon, a backstory emerged.

A materials scientist working for the Rubycon Co. in Japan left the company and began working for Luminous Town Electric. The scientist promised LTE he could develop a copy of the water-based electrolyte formula he had helped develop at Rubycon, and he did just that. The problem started when people from his team of engineers left LTE for other companies, taking some of LTE’s IP with them. Only snag: the formula they took was only partially complete; it was missing a vital ingredient that prevented decomposition through electrolysis. That made the capacitors unstable (these caps are all-sealed, axial type): the electrolyte began decomposing as the capacitor was charged, and produced an ever so small amount of hydrogen. Every time you charged that cap, a small bubble of hydrogen would form inside, increasing the pressure. These capacitors are sealed at the bottom with a rubber plug, and the tops are indented with a visible ‘X’ shape. These are precautions in case the user overvoltages or damages the cap in any way: the excess gas either vents at the top, or pushes past the rubber seal at the bottom.

Looks OK, doesn't it? Look in the bottom left corner.

It seems this mainboard was built with the defective capacitors; all of them are visibly damaged. I’m looking at a big repair here. Out of curiosity, I also opened the PSU. More blown caps. Darn.

got to love when deaf people use computers

Now, when repairing equipment that uses switching-mode supplies (such as the CPU’s 1.5V feed, nowadays up to 60-70A), keep in mind that when a capacitor starts to go, the voltages also destabilize. The controlling logic tries to compensate by increasing the switching frequency of the transistor charging the cap. This can go on for a while, but once the capacitor has leaked out enough fluid, all the redundancies in the circuit are not enough to keep the output voltage constant, and the switching transistors may overheat and get damaged.

Therefore, if the board does not spring to life after replacing all the capacitors, it’s scrap. Knowing this, I took my older soldering iron out of the drawer (my newer, temperature-controlled unit does not get hot enough to desolder components from a 4-layer PCB). I replaced all capacitors with ones of the same rated capacitance and the same rated voltage, but with a higher temperature rating (105°C instead of 85°C). I put everything back together, took a deep breath and flicked the PSU on. No smoke, no buzzing, the standby LED lit up on the board… so far, so good. I measured the standby 5V supply: it looked smooth enough, and the voltage was 4.99V, still OK. Pushed the power button.

To my great relief, the machine sprang to life and the POST ran. Because I had taken the CMOS battery out for the repair, I had some settings to redo, but hey, I rescued an old machine from the dismantlers. Great. Let’s install Linux onto it!

I had several boot disks at hand, but because this PC would be used by someone not experienced in anything but Microsoft, I decided to use an Ubuntu desktop distro (back then Ubuntu also had an LTS distro which was, to say the least, suboptimal). Everything was fine until the installer started booting. I got this error:

BUG: Int 14: CR2 ffffb0f0
EDI 00000000 ESI 00000000 EBP c0731f3c ESP c0731f1c

Interrupt 14 is the page fault interrupt, thrown by the processor when it attempts to access a page in memory which is marked as not present. Now, this should not cause a problem, because normally a page fault is only visible to a paging daemon, and it signals to the daemon to load that page back into memory. As I’m not a paging daemon, I should not see this message unless something untoward happened. That value after CR2 looked familiar… 4096 multiplied twice by 1024? Nearly: it sits just below that mark. Not that it matters, as this register merely holds the memory address whose access caused the page fault.
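For the curious, the arithmetic on that CR2 value can be checked with a few lines of Python. This is only a sketch: it assumes plain 4 KiB pages (the usual case on 32-bit x86), and `split_address` is a helper name made up for this illustration, not anything from the kernel.

```python
PAGE_SIZE = 4096  # assumption: ordinary 4 KiB pages, no large pages


def split_address(addr, page_size=PAGE_SIZE):
    """Split an address into (page number, offset within that page)."""
    return addr // page_size, addr % page_size


cr2 = 0xFFFFB0F0                  # value printed after "CR2" in the error
four_gib = 4096 * 1024 * 1024     # 4096 multiplied twice by 1024

page, offset = split_address(cr2)
distance_below_4gib = four_gib - cr2

print(f"CR2 = {cr2:#010x} ({cr2} decimal)")
print(f"page number {page:#x}, offset {offset:#x}")
print(f"{distance_below_4gib} bytes below the 4 GiB mark")
```

So the faulting address lives in one of the very last pages of the 32-bit address space, a handful of pages short of the 4 GiB boundary.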

At this point, I ran two different memory diagnostic programs for 2-3 full cycles each, but the RAM checked out (I was afraid the fluctuating voltages had done some damage to the sensitive 3.3V DRAM chips, but no, they were lucky…).

Out of curiosity I dd’ed my Debian desktop over onto the patient, but no luck, I was still getting the page fault. This was getting frustrating. To make matters worse, when I tried to install my old Windows XP, it installed and ran fine! But the licensing costs for the Microsoft OS are far more than this whole PC is worth, so there had to be a way around this.

Let’s think: the RAM is good, but there is a page fault. The Linux kernel has its quirks, but it would not shut itself out of good memory, so what’s left? The BIOS also has a say in how memory is set up, so let’s look around. Luckily I’m a bit older and still remember the olden days of PCs, when nobody got away with simply plugging in a new sound card: you also had to set up the card’s base address and DMA channel, and even find a free interrupt to use! Also, some graphics cards used the PC’s memory addresses between 0x00F00000 and 0x00FFFFFF to map their video memory. The BIOSes had (and retained for many years) a setting called “Memory Hole 15>16MB” for the purpose of protecting the memory contents. This was back when the price of 4MB of RAM could also buy you a brand new monitor!

So I looked for the setting, and sure enough, the default in the machine’s BIOS was ‘Enabled’. I disabled it, and the installer (and later the operating system) ran fine. The machine is still in service at their business; it only got relegated to inventory duties.
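If the name of that BIOS setting looks cryptic, a quick sketch shows where the "15>16MB" comes from. The two address constants are the ones quoted above; the `in_hole` helper and the sample addresses are invented purely for illustration.

```python
MIB = 1024 * 1024

HOLE_START = 0x00F00000  # first byte of the reserved range
HOLE_END = 0x00FFFFFF    # last byte of the reserved range (inclusive)


def in_hole(addr):
    """Return True if a physical address falls inside the memory hole."""
    return HOLE_START <= addr <= HOLE_END


start_mib = HOLE_START // MIB                  # where the hole begins, in MiB
size_mib = (HOLE_END + 1 - HOLE_START) // MIB  # how large the hole is, in MiB

print(f"hole starts at {start_mib} MiB and is {size_mib} MiB long")
print(in_hole(0x00F80000))  # an address in the middle of the range
print(in_hole(0x01000000))  # the first byte past 16 MiB
```

In other words, the range is exactly the sixteenth megabyte of the address space, which is what the setting’s name is hinting at.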

That old PC is a fighter, I tell you.
