Re: [boinc_dev] [boinc-android-testing] Re: POGS Computation Errors - Calling for help.

David Anderson Wed, 07 Aug 2013 22:23:47 -0700

I couldn't immediately see the problem.
A stack trace would help;
all we know is it's crashing somewhere in process_tree_cpu_time().


Let me explain how this code works, in case anyone else wants to review it.
The purpose of process_tree_cpu_time(pid) is to get the current CPU time
of the given process and all its descendants.
Unix doesn't provide an easy way to do this;
getrusage() only reports on children that have exited.
So instead we enumerate the set of all processes (using /proc),
find the ones that are descendants of pid,
and add up their CPU time.
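
For anyone reviewing, here is a minimal sketch of reading the per-process numbers from a line in the /proc/<pid>/stat format. It is not the BOINC code: StatFields, parse_stat_line() and cpu_seconds() are hypothetical names used only for illustration. It relies on the documented Linux layout (ppid is field 4, utime and stime are fields 14 and 15, both in clock ticks).

```cpp
#include <cstdio>
#include <string>
#include <unistd.h>

// Hypothetical record for illustration; not BOINC's PROCINFO.
struct StatFields {
    int pid = 0;
    int ppid = 0;
    unsigned long utime_ticks = 0;  // user-mode CPU time, in clock ticks
    unsigned long stime_ticks = 0;  // kernel-mode CPU time, in clock ticks
};

// Parse one line in the format of /proc/<pid>/stat.
// The command name (field 2) is wrapped in parentheses and may itself
// contain spaces or ')', so everything after it is located via the LAST ')'.
bool parse_stat_line(const std::string& line, StatFields& out) {
    size_t close = line.rfind(')');
    if (close == std::string::npos) return false;
    if (sscanf(line.c_str(), "%d", &out.pid) != 1) return false;
    char state;
    // Fields after the command name:
    // state(3) ppid(4) pgrp(5) session(6) tty_nr(7) tpgid(8) flags(9)
    // minflt(10) cminflt(11) majflt(12) cmajflt(13) utime(14) stime(15)
    int n = sscanf(line.c_str() + close + 1,
        " %c %d %*d %*d %*d %*d %*u %*lu %*lu %*lu %*lu %lu %lu",
        &state, &out.ppid, &out.utime_ticks, &out.stime_ticks);
    return n == 4;
}

// Convert ticks to seconds using the system clock-tick rate.
double cpu_seconds(const StatFields& s) {
    return double(s.utime_ticks + s.stime_ticks) / sysconf(_SC_CLK_TCK);
}
```

A scan of /proc would call something like this once per numeric directory name and keep pid, ppid and the CPU time for the next step.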

The functions involved are:
procinfo_setup()
    scans /proc and builds a data structure,
    namely a std::map that maps pid to PROCINFO
    (a structure that describes a process, and which contains
    a std::vector of the PIDs of its children)
procinfo_app()
    given a PID, sums the CPU time of its descendants.
    The summing is done by a recursive function add_child_totals().
    add_child_totals(), BTW, has a safeguard that prevents infinite recursion
    if the process tree has a cycle (i.e. is not actually a tree);
    this happens sometimes in Windows.
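
The structure described above can be sketched roughly as follows. This is a simplified illustration under assumptions, not the actual BOINC source: ProcInfo, ProcMap, find_children(), add_descendants() and tree_cpu_time() are hypothetical names, and the visited flag is just one possible way to implement the cycle safeguard.

```cpp
#include <map>
#include <vector>

// Hypothetical stand-in for PROCINFO; the field names are assumptions.
struct ProcInfo {
    int parent_id = 0;
    double cpu_time = 0;        // seconds used by this process alone
    std::vector<int> children;  // PIDs of direct children
    bool visited = false;       // cycle guard
};
typedef std::map<int, ProcInfo> ProcMap;

// Fill in each entry's children list from the parent_id links.
void find_children(ProcMap& pm) {
    for (auto& kv : pm) {
        auto parent = pm.find(kv.second.parent_id);
        if (parent != pm.end() && parent->first != kv.first) {
            parent->second.children.push_back(kv.first);
        }
    }
}

// Recursively add the CPU time of every descendant of pid to total.
// A process that was already counted is skipped, so a cycle in the
// parent/child links cannot cause infinite recursion.
void add_descendants(ProcMap& pm, int pid, double& total) {
    auto it = pm.find(pid);
    if (it == pm.end()) return;
    for (int child : it->second.children) {
        auto c = pm.find(child);
        if (c == pm.end() || c->second.visited) continue;
        c->second.visited = true;
        total += c->second.cpu_time;
        add_descendants(pm, child, total);
    }
}

// CPU time of pid plus all of its descendants; 0 if pid is unknown.
double tree_cpu_time(ProcMap& pm, int pid) {
    for (auto& kv : pm) kv.second.visited = false;
    auto it = pm.find(pid);
    if (it == pm.end()) return 0;
    it->second.visited = true;
    double total = it->second.cpu_time;
    add_descendants(pm, pid, total);
    return total;
}
```

In this sketch every lookup goes through find() and is checked against end(), so a child PID that is missing from the map is ignored rather than dereferenced.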

Daniel, if you want to put in more printf()s you could try to narrow
the crash down further among these functions.

-- David

On 07-Aug-2013 8:46 PM, Daniel Carrion wrote:

Hello

Iris has attached to my test project site. It looks like all tasks exited at
task.cpu_time(), as before.

See task stderr:
http://akira.onburde.net/burdetest/results.php?hostid=2&offset=0&show_names=0&state=6&appid=

I'm going to pull this section out to confirm that it is this function call
causing problems, and release a new version.

Cheers

Daniel

On Thu, Aug 8, 2013 at 7:43 AM, Daniel Carrion <[email protected]> wrote:

Oh and Thanks again David for helping out.


On Thu, Aug 8, 2013 at 7:42 AM, Daniel Carrion <[email protected]> wrote:

Hopefully we can get a few more occurrences in case that was a one-off. If it
occurs again I might just have it print output before and after that point and
remove the "first 180 second" threshold. Logging every 10 seconds won't make
the stderr.txt file too full.

I've got a test project with POGS-like tasks set up here:
http://akira.onburde.net/burdetest/. At the moment the worker (fit_sed) is a
dummy app that outputs like fit_sed but runs shorter. I'll see if Iris wants
to jump on there for a few runs to see if it faults with short tasks as well.
If pogstest has to be brought down next week I can continue troubleshooting
on this site.

Cheers

Daniel


On Thu, Aug 8, 2013 at 7:12 AM, David Anderson <[email protected]> wrote:

That's a help. I'll take a close look at that code. -- David


On 07-Aug-2013 12:39 PM, Daniel Carrion wrote:

Hey David/Kevin

Please see attached. Managed to catch one. Looks like it was calling
task.cpu_time().

Watching out for more to see if it consistently happens at this point. My
test user base is minimal for this so have to wait a bit. I'm pretty much
relying on Iris' device and one other.

Regards

Daniel

On Wed, Aug 7, 2013 at 2:35 PM, Daniel Carrion <[email protected]> wrote:

Actually, I'll have the output cycle as the stuff we're interested in is
during the polling loop.


On Wed, Aug 7, 2013 at 2:13 PM, Daniel Carrion <[email protected]> wrote:

Hi David

I'm just testing this out now. I'm a bit worried about the size of this
stderr.txt file on users' devices. It could end up over 50MB. I'll give people
a heads up before releasing.

Regards

Daniel


On Wed, Aug 7, 2013 at 4:31 AM, David Anderson <[email protected]> wrote:

The code below (execv()) is executed in the child process. It looks like
what's getting the SIGSEGV is the parent process. I'd try putting a
fprintf(stderr, ...) at the start and end of TASK::poll(), and before the
call to task.cpu_time(), and before the call to boinc_report_app_start(). The
problem is likely in one of those. -- David


On 06-Aug-2013 7:47 AM, Daniel Carrion wrote:

Hey David/Kevin

Iris has run a couple jobs on the test instances along with a couple of
others. Here's the output of one of his tasks:

http://54.208.29.24/pogstest/result.php?resultid=294

Seems as though it's crashing as it's going to execute task?

08:05:25 (5741): wrapper (main): task poll begin
08:05:25 (6818): wrapper (TASK::run): in child proc of the fork
08:05:25 (6818): wrapper (TASK::run): construct argv
08:05:25 (6818): wrapper (TASK::run): set up env variables
08:05:25 (6818): wrapper (TASK::run): executing app
SIGSEGV: segmentation violation

From modified code (TASK::run):

        if (nvars > 0) {
            set_up_env_vars(&env_vars, nvars);
            fprintf(stderr, "%s wrapper (TASK::run): executing app with vars\n",
                boinc_msg_prefix(buf, sizeof(buf))
            );
            retval = execve(app_path, argv, env_vars);
        } else {
            fprintf(stderr, "%s wrapper (TASK::run): executing app\n",
                boinc_msg_prefix(buf, sizeof(buf))
            );
            retval = execv(app_path, argv);
        }


I guess it could be coming from wrapper during poll but it seems like it has
something to do with fit_sed starting. Possibly some memory allocation
problems as the app is starting? It's proving quite difficult to get dumps
out of people's phones.

I'm going to continue prodding and poking and try to get a stack trace into
stderr from the wrapper itself. I'll also try different fit_sed builds,
including a C port.

I just wish I could fire up gdb server on their phones and attach over the
internet =D.

Regards

Daniel

On Sat, Aug 3, 2013 at 12:23 AM, Daniel Carrion <[email protected]> wrote:

Thanks David, I'll have a look into that...

I'll have the "debug ready" wrapper to push onto the test instance on Monday.
Hopefully we can then grab a few users having this problem to jump on and try
it.

I'll have to think of a way of getting those crash dumps out without having
to use NativeBOINC. Some Android devices seem to drop them in /data/tombstones,
and most just log crash dump data directly to the system log for viewing via
logcat. We could probably get the rooted users to open a terminal and leave
"logcat -s DEBUG > /sdcard/Download/logcat.txt" running before jobs run so we
get dumps in the attached format (this is just me testing/causing a segfault).
Non-rooted users can either call this via adb or try an app that essentially
does the same thing.

Sorry, just me thinking out loud.

Regards

Daniel

On Fri, Aug 2, 2013 at 4:28 PM, David Anderson <[email protected]> wrote:



On 01-Aug-2013 11:15 PM, Daniel Carrion wrote:


David, is there anything else you suggest for debugging purposes? E.g.
catching SIGILL and SIGSEGV somehow?


That should do it; catching the signals would help only if we can then
generate a stack trace. In principle backtrace(3) can do this, but it may
not work on Android; see
http://stackoverflow.com/questions/10864882/stacktrace-arm-linux-gcc

-- David
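
For reference, a minimal sketch of what catching the signals and dumping a trace could look like on a glibc system. This is an assumption-laden illustration, not wrapper code: crash_handler() and install_crash_handlers() are hypothetical names, and <execinfo.h> is a glibc facility that may simply be unavailable on Android, as noted above.

```cpp
#include <execinfo.h>   // backtrace(), backtrace_symbols_fd(): glibc, may be missing on Android
#include <signal.h>
#include <unistd.h>

// Hypothetical handler, for illustration only.
// Caveat: backtrace() is not formally async-signal-safe (its first call
// may load a shared library), so this is a best-effort debugging aid.
extern "C" void crash_handler(int sig) {
    static const char msg[] = "caught fatal signal; stack trace:\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    void* frames[64];
    int n = backtrace(frames, 64);
    // Writes one line per frame straight to the descriptor, without malloc().
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    // Restore the default action and re-raise, so the process still
    // terminates with the original signal.
    signal(sig, SIG_DFL);
    raise(sig);
}

// Install the handler for SIGSEGV and SIGILL; returns true on success.
bool install_crash_handlers() {
    struct sigaction sa = {};
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGSEGV, &sa, nullptr) == 0
        && sigaction(SIGILL, &sa, nullptr) == 0;
}
```

Calling install_crash_handlers() once at startup would be enough. The addresses printed are only useful if the binary keeps its symbols (or the addresses are resolved against an unstripped copy afterwards).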

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
