Bob posted on Mon, 05 Nov 2012 11:04:54 +0000 as excerpted:

> Have been having a heat problem lately. Mostly since Mint 12.
> Problem started with Pan, and have been active on the gmane pan user
> group and have some resolution of problems by reducing the number of
> headers that my groups contain. This helped in some instances.
>
> I am running a home built system. It has a gigabyte ma790GPT-UD3H MB,
> 8Gb DDR3 1333 memory with an Amd 965 processor with stock cooling.
> Prior to the upgrade into Linux 3 kernels, I had no known heating
> problems.
That would appear to be the AMD Phenom II X4 965. Quad-core, 125 or 140W
depending on generation, 3.4 GHz. Max ambient CPU pkg temp, 62 or 65C.
But see below.

> Until recently, the cpu temp would, with no major activity other than
> firefox or a movie player, hover around 92 to 99 degrees F.

~33-37C

> When Pan was downloading headers, the temp could easily soar into the
> 120 deg, F. range.

49C+

> I pulled the 965, heat-sink and fan, removed and re-applied the
> thermal grease and after cleaning everything, reinstalled it all.
>
> That took care of some of the problems, The idle core temp would still
> hover around 92 degrees, but would peak at around 110.

Hover @ 33C, peak @ 43C.

> Finally, I went into the bios, and turned off the AMD cool and quiet.
> This basically lets the fan run at full speed all the time.
>
> Now, the idle temp runs around 88 degrees and peaks at around 100.

Hover @ 31C, peak @ 38C.

> Just an FYI if anyone is seeing this type of problem. Eye of Mate also
> causes the heat up problem in slide-show mode with less than a 4 second
> delay between changes in displayed image.
>
> All of this since the change from the Linux Kernel 2 series to
> Kernel 3.
>
> Right now, typing this in, Core Temp is 87 deg and ambient temp is 81
> deg.

30.5C core, 27C ambient.

> I posted this to alt.os.linux.mint and received the following.
>
> =========
> <http://duckduckgo.COM/?q=linux+power+regression>

I remember seeing that controversy when it was "live", as I follow FLOSS
community news and blogs reasonably closely (several different feeds),
paying special attention to the kernel as I run Linus-mainline git
kernels, bug reporting when I hit one, etc.

> There is a lot there. Read some of the messages if you are having the
> same type of problems with Pan hanging. I installed a libsensor panel
> applet to keep an eye on the system temp while I run the system.
>
> I posted the following after reading up and making a change to grub to
> install the workaround.
>
> Tried the `pcie_aspm=force' mod in /etc/default/grub and it seems to have
> helped dramatically. I ran eye of mate in slideshow mode for about 30
> seconds at a frame every 2 seconds and the temp went from 90 f, to 97F.

32C, 36C.

> Then tried manually at one frame per second for 45 seconds. Temp rose to
> 100F and would drop back and forth from 100 to 98 back and forth.

38C, 36.5-38C

> Stopping the display caused the temp to drop to 92F in around 5 seconds.

33C

> Before it would just keep rising to 105F to 110F or higher before
> locking up.

40.5C-43C

> // Just as a test, I downloaded over 4000 images with Pan(0.139), all in
> one continuous run, and the system temp never exceeded 93F. Before
> making the change, Pan would have locked up after the first 100 or so
> images.//

34C

Just as a note, standard computer system temps are normally stated in
Celsius, often even here in the US, where most temps are in F. I'd
strongly recommend at least reporting in C, as that's a whole lot easier
to compare to specs and to other comments seen on the net. I'd actually
recommend setting the display to C as well, and doing a manual convert
to F only when comparing to room temp (in F), etc. That's why I
converted all those to C.

Meanwhile, I don't know what kernel hwmon drivers you're running and
thus what you're actually monitoring, but I'd assume your reported core
temps are from the k10temp module/driver. That driver is "interesting",
because the hardware it's reporting on is "interesting". The reported
temp is *NOT* a standard temperature at all, but rather a hardware value
only indirectly related to the actual temp and reported relative to a
specified standard.

Assuming you have kernel sources available, it's worth reading
Documentation/hwmon/k10temp . (If your kernel sources are at the
standard /usr/src/linux location, that would put the file in question at
/usr/src/linux/Documentation/hwmon/k10temp .)
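For anyone wanting to redo the F-to-C conversions interleaved above, here
is a minimal Python sketch. The function name and the labels are mine,
purely for illustration; the Fahrenheit figures are the ones Bob
reported.

```python
def fahrenheit_to_celsius(deg_f):
    """Convert an absolute temperature from Fahrenheit to Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


# Readings quoted in this thread, in Fahrenheit (labels are illustrative).
readings_f = {
    "idle before changes": 92,
    "Pan downloading headers": 120,
    "idle, fan at full speed": 88,
    "core while typing": 87,
    "ambient while typing": 81,
}

for label, deg_f in readings_f.items():
    print(f"{label}: {deg_f}F is {fahrenheit_to_celsius(deg_f):.1f}C")
```

Note this is for absolute readings only. A temperature *difference* in F
converts with the 5/9 factor alone, without subtracting 32.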
I found this out on my new "bulldozer" system, when the reported core
temps were 23C (73F) or so, with air cooling and an ambient room temp of
28C (82.5F) or so. Obviously that doesn't make sense as an absolute
temperature value, since with air cooling there's no way the core could
be cooler than the ambient room temp! Investigating why, I discovered
the kernel's k10temp document as well as the git-commit comments
associated with the original driver commit, etc.

FWIW, for my bulldozer, AMD says the critical value is 70C, which
apparently does correspond to a real 70C (158F). But below or above
that, the hardware apparently takes into account other factors as well,
including what current power usage is relative to rated thermal
dissipation. (If the real temp is high at idle power usage, the reported
value will, I believe, be closer to 70 than the real temp would suggest,
because it means there's less actual thermal dissipation headroom.
Conversely, if the CPU is actively cooled and thus still running
relatively cool at near rated power usage, the reported temp will likely
be lower than real temp, representing more headroom than would normally
be expected.)

Searching for x4 965, yours is probably either 62 or 65C. Running
sensors in a terminal window should report it as the "crit" temp.

I'm not sure I particularly like that, but it's the way the hardware
works, so there's not a lot to be done about it. But that could go some
way toward explaining strange coretemp readings, if you see 'em. The
core "temp" isn't actually temp at all, but a synthetic value intended
to reflect real TDP headroom against ratings more accurately than actual
temp would.

(It's also worth noting that for early systems, this monitor was buggy
and the driver won't report anything for it at all unless forced to do
so. But socket AM3 and above shouldn't have that issue and it doesn't
appear to apply to either of us.)
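If you'd rather read the k10temp value and its "crit" limit directly
instead of via sensors, something like the following Python sketch
should do it. It assumes the usual hwmon sysfs layout (a name file plus
temp1_input and temp1_crit in millidegrees C under /sys/class/hwmon/),
and it also looks in a device/ subdirectory, since I believe some
kernels put the attributes there. The function names are mine, and I
haven't verified this on your hardware.

```python
from pathlib import Path


def read_degrees_c(path):
    """Read a hwmon temp file (millidegrees C); return degrees C or None."""
    try:
        return int(Path(path).read_text().strip()) / 1000.0
    except (OSError, ValueError):
        return None


def k10temp_readings(root="/sys/class/hwmon"):
    """Return a list of (current, crit) tuples for each k10temp device."""
    root_path = Path(root)
    if not root_path.is_dir():
        return []
    results = []
    for hwmon in sorted(root_path.glob("hwmon*")):
        # Assumption: attributes live in hwmonN/ or in hwmonN/device/.
        for base in (hwmon, hwmon / "device"):
            name_file = base / "name"
            if not name_file.is_file():
                continue
            if name_file.read_text().strip() != "k10temp":
                continue
            results.append((read_degrees_c(base / "temp1_input"),
                            read_degrees_c(base / "temp1_crit")))
            break
    return results


if __name__ == "__main__":
    for current, crit in k10temp_readings():
        if current is None or crit is None:
            print("k10temp found, but a value could not be read")
            continue
        # Headroom on the driver's synthetic scale, not a real temperature.
        print(f"reported {current:.1f}C, crit {crit:.1f}C, "
              f"headroom {crit - current:.1f}C")
```

The headroom printed is on the same synthetic scale as the reading
itself, so treat it as margin against the rating rather than as a real
temperature difference.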
Meanwhile, most mobos have a CPU-socket-mounted temperature sensor as
well, which should report cpu package "real" temps. FWIW, I monitor
both, as well as a whole host of other system health and performance
factors. (CPU and memory voltage, cpu, external northbridge and
southbridge temps, cpu power usage, gpu temp, cpu and system exhaust fan
speeds, hard drive temps, core user/system/nice/wait/total CPU usage and
CPUFreq for each of the 6 cores separately, app/cache/buffer/total
physical memory usage, swap usage, 1 minute load average, network
inbound and outbound thruput... all monitored, graphed and text-value
reported once per second via a superkaramba theme I set up. The same
superkaramba theme reports the last ~20 syslog entries (10s updates
IIRC), along with the top three memory and top 6 CPU using apps (1s
updates), local time and date and day of week, UTC time, boottime and
last repo sync time. All this is displayed in an 1800x170 (306 kpx) bar
across most of the top of my (1920x2160, 4.15Mpx) desktop, thus using
~7.4% of my available desktop space.)

But while your CPU appears to be rated to 62 (143.5F) or 65C synthetic
core "temp", you're reporting lockups at 49C (120F) or so. Something's
still wrong. Forcing ASPM and full fan speeds has helped work around the
problem by keeping temps lower, but you should have a clearance of about
13C (say 23F) at those reported temps, real or synthetic.

That said, other than replacing the CPU, it doesn't look like you have
much choice but to live with it at this point. You've re-seated the
heatsink, set the fan to constant 100%, and forced ASPM, but there's no
getting around the fact that you're seeing lockups at temps 13C/23F+
lower than the thing's supposedly rated. That's not good. =:^( And
both pan and eye of mate are borderline on your system when they
shouldn't be, due to that missing thermal headroom.
=:^(

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

_______________________________________________
Pan-users mailing list
Pan-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/pan-users