Re: [Linaro-validation] Overheating Pandas

2013-07-03 Thread James Tunnicliffe
I believe that in the LAVA lab there are a few pandas with USB keys
that are used for builds to try and overcome some reliability
problems. Don't know if it was a temperature problem or something
else. With any luck someone who knows more about that issue can speak
up and share what they found. You could also try running "stress --cpu
4 --vm 2" and see if any errors show. I find that on my desktop
running 2x the number of CPU stress threads as I have CPUs is about
right to eat all available resources. That will just stress RAM and
CPU, not disk I/O, which should help narrow down the problem. Plenty of other
options 
(http://www.hecticgeek.com/2012/11/stress-test-your-ubuntu-computer-with-stress/)...
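
As a rough sketch of the "2x the number of CPUs" rule of thumb, the
Python below builds the stress command line from the CPU count. The
helper name stress_command and its defaults are made up for
illustration; only the --cpu and --vm options come from the command
quoted above.

```python
import os


def stress_command(cpus, multiplier=2, vm_workers=2):
    """Build a `stress` command line with `multiplier` CPU workers per core.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    workers = max(1, cpus * multiplier)
    return ["stress", "--cpu", str(workers), "--vm", str(vm_workers)]


if __name__ == "__main__":
    # os.cpu_count() may return None on unusual platforms, so fall back to 1.
    print(" ".join(stress_command(os.cpu_count() or 1)))
```

On a dual-core board this works out to the same "stress --cpu 4 --vm 2"
as above.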

Is running at 100% of the thermal limit really an issue? Isn't the
point that it is the limit, which itself should have some safety built
in? I don't know off hand if the OMAP 4 SoCs incorporate hardware
frequency limiting or if it is entirely software, in which case the
kernel frequency governor should (at a guess) be throttling back.
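
For what it's worth, the monitoring idea in the quoted message (log at
roughly 75% of the limit, report more often at 95%) could be sketched
along these lines. This is not the attached monitor: the sysfs paths
are assumptions based on the generic Linux thermal interface and may
well differ on the Panda kernels, and the function names are invented
for the example.

```python
from pathlib import Path

# Assumed sysfs locations; the real paths depend on the kernel and board.
TEMP_PATH = Path("/sys/class/thermal/thermal_zone0/temp")
LIMIT_PATH = Path("/sys/class/thermal/thermal_zone0/trip_point_0_temp")

WARN_FRACTION = 0.75      # start logging here
CRITICAL_FRACTION = 0.95  # report more often here


def read_millidegrees(path):
    """Read an integer temperature (millidegrees C in the generic thermal sysfs)."""
    return int(Path(path).read_text().strip())


def classify(temp, limit):
    """Return 'ok', 'warn' or 'critical' for a temperature relative to its limit."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    fraction = temp / limit
    if fraction >= CRITICAL_FRACTION:
        return "critical"
    if fraction >= WARN_FRACTION:
        return "warn"
    return "ok"


if __name__ == "__main__":
    temp = read_millidegrees(TEMP_PATH)
    limit = read_millidegrees(LIMIT_PATH)
    print(f"{temp / 1000:.1f}C of {limit / 1000:.1f}C: {classify(temp, limit)}")
```

Run from cron or a small loop, the "warn" and "critical" results could
be written to syslog, which is about what the proposal asks for.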

I did have a panda give up on me about a year ago. It wasn't being
worked hard, but did refuse to get through a boot most of the time (it
did power on and get part way through booting). Those boards aren't
designed for high reliability and it may be that you just need to get
a couple of replacements.

James

On 3 July 2013 14:13, Renato Golin wrote:
> Hi Folks,
>
> I'm running two buildbots here at home and am getting consistent failures
> from the Pandas because of overheating. I've set up a monitor that will tell
> me the current CPU temperature and the allowed maximum, and when the bot
> passes 90%, it shuts itself off.
>
> The problem is that I'm running with heat-sinks and the boards are on top of
> three fans, so there really isn't much more I can do to solve this problem.
>
> I personally think this is a hardware problem, since everything is in the
> same die, CPU, GPU and RAM, and the physical dimensions of the chip are
> quite small. I remember when Intel started overheating (around 486DX66) and
> the die was huge (more heat dissipation), plus RAM and GPU were separate,
> and it still needed a hefty heat-sink.
>
> It's true that gates are far smaller today, but it's not true that a dual
> core 1.3GHz + GPU + RAM will produce less heat on a small die than a 66MHz
> CPU on a huge die, so why anyone would think it's a good idea to release a 1+GHz
> chip without *any* form of heat dissipation is beyond my comprehension.
>
> Manufacturers only got away with it, so far, because people rarely use 100%
> of the CPU power for extended periods of time, because ARM devices end up as
> set-top boxes, mobile phones and tablets. However, even those devices will
> heat up when playing 2 h films or games, and they do have some form of heat
> sink.
>
> We, at the toolchain group, make things worse by using 100% CPU, 24 / 7,
> something that Panda boards, or Arndales were not designed to do. However,
> with ARM moving into the server space, their designs will have to be
> re-thought, and what better place than Linaro for making sure we get it
> right?
>
> For the time being, I believe we *must* have air conditioning in the Lab all
> the time, and we *must* have heat-sinks on every board, and we *must*
> monitor the CPU temperature of the boards, at least until we're comfortable
> that they're not failing all the time.
>
> Can we make a temperature monitor (like the one attached) a default feature
> on Linaro Ubuntu distributions? We could dump that info to the syslog/dmesg
> whenever it crosses the (say) 75% threshold, and report more often when it
> crosses the 95%, possibly dumping the process(es) that are consuming more
> CPU at the time, to enable post-mortem debugging.
>
> cheers,
> --renato
>
> As a side note, the quad-A9 ODroid does ship with a massive heat-sink, which
> also serves as a fancy case. Quite clever, really.
>
> ___
> linaro-validation mailing list
> linaro-validat...@lists.linaro.org
> http://lists.linaro.org/mailman/listinfo/linaro-validation
>



-- 
James Tunnicliffe

___
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
http://lists.linaro.org/mailman/listinfo/linaro-toolchain


Re: [Linaro-validation] Overheating Pandas

2013-07-03 Thread James Tunnicliffe
On 3 July 2013 17:41, Renato Golin wrote:
> On 3 July 2013 17:22, Mans Rullgard wrote:
>>
>> I repeat, the 4460 will run at 1.2GHz indefinitely without thermal
>> management.
>
>
> My mistake, I said 1.3GHz when it was actually 1.2GHz. So, at 1.2GHz, it
> freezes every few hours on full load on both 4430 and 4460.
>
> linaro@linaro-panda-01:~$ cat
> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
> 120
>
> Now what?

Are you using the same set up as the LAVA lab in terms of OS, kernel,
software versions? If the Cbuild/LAVA boards run reliably (and I don't
think they have any direct cooling or a heatsink on them), then that
is a useful place to start.
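
When comparing against the lab boards it may help to record the
cpufreq settings on both sides. A minimal sketch, assuming the usual
cpufreq sysfs directory for cpu0 (the same one used in the quoted cat
command); the function names are invented here, and the kernel reports
these frequencies in kHz:

```python
from pathlib import Path

# Standard cpufreq sysfs directory for CPU 0; assumed to exist on the board.
CPUFREQ_DIR = Path("/sys/devices/system/cpu/cpu0/cpufreq")
FIELDS = ("scaling_governor", "scaling_max_freq", "cpuinfo_max_freq")


def read_cpufreq(directory=CPUFREQ_DIR, fields=FIELDS):
    """Return {field: value} for the cpufreq files that are readable."""
    settings = {}
    for name in fields:
        try:
            settings[name] = (Path(directory) / name).read_text().strip()
        except OSError:
            settings[name] = None  # file missing or unreadable on this kernel
    return settings


def khz_to_ghz(value):
    """cpufreq reports frequencies in kHz; convert a string value to GHz."""
    return int(value) / 1_000_000


if __name__ == "__main__":
    for name, value in read_cpufreq().items():
        print(f"{name}: {value}")
```

Diffing that output between a home board and a lab board would show
quickly whether the governor or the clock ceiling differs.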

--
James Tunnicliffe



Re: [Linaro-validation] Overheating Pandas

2013-07-04 Thread James Tunnicliffe
On 4 July 2013 12:27, Renato Golin wrote:
> On 3 July 2013 18:33, Richard Earnshaw wrote:
>>
>> keep lowering the clock limit (.../cpufreq/scaling_max_freq) until you get
>> stability.  If you don't, then it isn't a heating problem.
>
>
> It might be a bit too soon, but I just got a few 7h builds out of the boards
> at 920MHz without a single glitch, whereas before, they wouldn't run for
> more than 4h in a row. Both boards have been running non-stop since 8pm
> yesterday.
>
> I'll keep them running during Connect in the exact same place as they are
> now (and were before), just to be sure, but I'm still betting that they
> cannot run at 1.2GHz on full steam for long periods without some serious
> cooling.

Faster clocks also drink more power. Now that you are using slower
clocks, those boards will be stressing the PSU less. There are plenty of other
components on the board, any of which could be causing the problem,
including the peripherals that you plugged in.
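
To put a rough number on the clock/power point: CMOS dynamic power
scales approximately with V^2 * f. The sketch below compares 920MHz
with 1.2GHz using that approximation. The default voltages are
placeholders (assumed unchanged), not measured OMAP4 values, and
leakage is ignored.

```python
def relative_dynamic_power(freq_new, freq_old, volt_new=1.0, volt_old=1.0):
    """Ratio of dynamic power after/before, using the P ~ C * V^2 * f approximation.

    Capacitance cancels out; static (leakage) power is ignored.
    """
    if freq_old <= 0 or volt_old <= 0:
        raise ValueError("reference frequency and voltage must be positive")
    return (freq_new / freq_old) * (volt_new / volt_old) ** 2


if __name__ == "__main__":
    # Frequency change alone (voltage assumed unchanged).
    print(f"{relative_dynamic_power(920, 1200):.2f}")
```

From frequency alone that is already a drop of about a quarter, and
more if the lower operating point also runs at a lower voltage, so a
marginal supply would look a lot like a thermal problem.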

If you don't have another 5V PSU to try but do have a spare ATX PSU
then it isn't difficult to hook up the 0V and 5V rails from a molex
connector. May be worth a go. You could easily run all the boards from
one ATX PSU. To turn the PSU on, short pin 16 (green, on a 24-pin
connector) to a black pin: http://en.wikipedia.org/wiki/ATX#Power_supply

James
