Re: [boinc_dev] [boinc-android-testing] Re: POGS Computation Errors - Calling for help.

Daniel Carrion Wed, 07 Aug 2013 23:21:20 -0700

Hi David

Thanks for that explanation.


It would definitely be useful to get a stack trace. Unfortunately, only one
person with this problem is volunteering their phone time, and I cannot get
an Android stack trace from their phone: nothing shows up when they run
logcat or look in /data/tombstones. This seems to be a known problem with
new Android releases. I actually think they pulled it out. The NativeBOINC
stack trace (which uses ptrace, I think) doesn't give us anything useful. It
would be handy to attach gdbserver to the task, but that is probably not
feasible. I'll continue to think of ways to get a useful stack trace.

I released a version on pogstest that omits checking cpu_time, just to see
if it runs through on the problematic device. I will put more printf()s in
the next release. I'm guessing this time they should go in the functions
in lib/procinfo.cpp? I'll probably release this version to the test project
site I set up, as I can more easily control the worker task there (it runs
shorter).

Regards

Daniel


On Thu, Aug 8, 2013 at 3:19 PM, David Anderson <[email protected]> wrote:

> I couldn't immediately see the problem.
> A stack trace would help;
> all we know is it's crashing somewhere in process_tree_cpu_time().
>
> Let me explain how this code works, in case anyone else wants to review it.
> The purpose of process_tree_cpu_time(pid) is to get the current CPU time
> of the given process and all its descendants.
> Unix doesn't provide an easy way to do this;
> getrusage() only reports on children that have exited.
> So instead we enumerate the set of all processes (using /proc),
> find the ones that are descendants of pid,
> and add up their CPU time.
>
> The functions involved are:
> procinfo_setup()
>     scans /proc and builds a data structure,
>     namely a std::map that maps pid to PROCINFO
>     (a structure that describes a process, and which contains
>     a std::vector of the PIDs of its children)
> procinfo_app()
>     given a PID, sum the CPU time of its descendants.
>     The summing is done by a recursive function add_child_totals().
>     add_child_totals(), BTW, has a safeguard that prevents infinite
>     recursion if the process tree has a cycle (i.e. is not actually
>     a tree); this happens sometimes in Windows.
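
The approach David describes can be sketched roughly as follows. This is a minimal illustration and not the code in lib/procinfo.cpp: the `ProcInfo` struct, its field names, and the functions `link_children()`, `subtree_cpu_time()` and `process_tree_cpu_time_sketch()` are all hypothetical, and the process table is filled in by hand instead of by scanning /proc. The sketch counts the given process itself plus all of its descendants, and uses a visited set as one possible cycle safeguard.

```cpp
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for PROCINFO; the field names are assumptions.
struct ProcInfo {
    int parent_id = 0;          // PID of the parent process
    double cpu_time = 0.0;      // CPU seconds used by this process alone
    std::vector<int> children;  // PIDs of direct children
};

using ProcMap = std::map<int, ProcInfo>;

// Rebuild every entry's children list from the parent_id fields.
void link_children(ProcMap& pm) {
    for (auto& kv : pm) kv.second.children.clear();
    for (auto& kv : pm) {
        auto parent = pm.find(kv.second.parent_id);
        if (parent != pm.end() && parent->first != kv.first) {
            parent->second.children.push_back(kv.first);
        }
    }
}

// Recursively add up the CPU time of pid and its descendants.
// The visited set stops the recursion if the "tree" contains a cycle.
double subtree_cpu_time(const ProcMap& pm, int pid, std::set<int>& visited) {
    auto it = pm.find(pid);
    if (it == pm.end()) return 0.0;               // unknown PID
    if (!visited.insert(pid).second) return 0.0;  // already counted
    double total = it->second.cpu_time;
    for (int child : it->second.children) {
        total += subtree_cpu_time(pm, child, visited);
    }
    return total;
}

// CPU time of the given process plus all of its descendants.
double process_tree_cpu_time_sketch(const ProcMap& pm, int pid) {
    std::set<int> visited;
    return subtree_cpu_time(pm, pid, visited);
}
```

A caller would fill a `ProcMap` from whatever process listing is available, call `link_children()` once, and then query any PID with `process_tree_cpu_time_sketch()`.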
>
> Daniel, if you want to put in more printf()s you could try to narrow
> the crash down more among these functions.
>
> -- David
>
>
> On 07-Aug-2013 8:46 PM, Daniel Carrion wrote:
>
>> Hello
>>
>> Iris has attached to my test project site. Looks like all tasks exited
>> out at task.cpu_time as per before.
>>
>> See task stderr:
>> http://akira.onburde.net/burdetest/results.php?hostid=2&offset=0&show_names=0&state=6&appid=
>>
>> I'm going to pull this section out to confirm that it is this function
>> call causing problems and release new version.
>>
>> Cheers
>>
>> Daniel
>>
>> On Thu, Aug 8, 2013 at 7:43 AM, Daniel Carrion <[email protected]> wrote:
>>
>> Oh and Thanks again David for helping out.
>>
>>
>> On Thu, Aug 8, 2013 at 7:42 AM, Daniel Carrion <[email protected]> wrote:
>>
>> Hopefully we can get a few more occurrences in case that was a one-off.
>> If it occurs again I might just have it spit out before and after that
>> point and remove the "first 180 second" threshold. Every 10 seconds won't
>> make the stderr.txt file too full.
>>
>> I've got a test project with POGS-like tasks set up here:
>> http://akira.onburde.net/burdetest/
>> At the moment the worker (fit_sed) is a dummy app that outputs like
>> fit_sed but runs shorter. I'll see if Iris wants to jump on there for a
>> few runs to see if it faults with short tasks as well. If pogstest has to
>> be brought down next week I can continue troubleshooting on this site.
>>
>> Cheers
>>
>> Daniel
>>
>>
>> On Thu, Aug 8, 2013 at 7:12 AM, David Anderson <[email protected]> wrote:
>>
>> That's a help. I'll take a close look at that code. -- David
>>
>>
>> On 07-Aug-2013 12:39 PM, Daniel Carrion wrote:
>>
>> Hey David/Kevin
>>
>> Please see attached. Managed to catch one. Looks like it was calling
>> task.cpu_time().
>>
>> Watching out for more to see if it consistently happens at this point. My
>> test user base is minimal for this so have to wait a bit. I'm pretty much
>> relying on Iris' device and one other.
>>
>> Regards
>>
>> Daniel
>>
>> On Wed, Aug 7, 2013 at 2:35 PM, Daniel Carrion <[email protected]> wrote:
>>
>> Actually, I'll have the output cycle as the stuff we're interested in is
>> during the polling loop.
>>
>>
>> On Wed, Aug 7, 2013 at 2:13 PM, Daniel Carrion <[email protected]> wrote:
>>
>> Hi David
>>
>> I'm just testing this out now. I'm a bit worried about the size of this
>> stderr.txt file on users' devices; it could end up over 50MB. I'll give
>> people a heads up before releasing.
>>
>> Regards
>>
>> Daniel
>>
>>
>> On Wed, Aug 7, 2013 at 4:31 AM, David Anderson <[email protected]> wrote:
>>
>> The code below (execv()) is executed in the child process. It looks like
>> what's getting the SIGSEGV is the parent process. I'd try putting a
>> fprintf(stderr, ...) at the start and end of TASK::poll(), and before the
>> call to task.cpu_time(), and before the call to boinc_report_app_start().
>> The problem is likely in one of those. -- David
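
One way to get the begin/end messages David suggests without repeating fprintf() calls is a small scope guard. This is only a sketch: `ScopeTrace` and `poll_once()` are hypothetical names, not part of the BOINC wrapper, and the body of `poll_once()` is a placeholder for the real polling code. The useful property is that the destructor does not run if the process dies on a signal, so a "begin" line with no matching "end" line points at the scope where the crash happened.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: logs "begin" on entry and "end" on normal exit.
class ScopeTrace {
public:
    ScopeTrace(FILE* out, std::string name) : out_(out), name_(std::move(name)) {
        std::fprintf(out_, "trace: %s begin\n", name_.c_str());
        std::fflush(out_);  // flush so the line survives a later crash
    }
    ~ScopeTrace() {
        std::fprintf(out_, "trace: %s end\n", name_.c_str());
        std::fflush(out_);
    }
    ScopeTrace(const ScopeTrace&) = delete;
    ScopeTrace& operator=(const ScopeTrace&) = delete;

private:
    FILE* out_;
    std::string name_;
};

// Placeholder for a polling function; the real one would query the task.
double poll_once(FILE* log) {
    ScopeTrace poll_trace(log, "poll");
    double cpu = 0.0;
    {
        ScopeTrace cpu_trace(log, "cpu_time");
        cpu = 1.5;  // stand-in for the real CPU time query
    }
    return cpu;
}
```

Passing `stderr` as the log stream would send the trace to the task's stderr.txt; the size concern raised elsewhere in this thread applies, since every poll writes four lines.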
>>
>>
>> On 06-Aug-2013 7:47 AM, Daniel Carrion wrote:
>>
>> Hey David/Kevin
>>
>> Iris has run a couple jobs on the test instances along with a couple of
>> others. Here's the output of one of his tasks:
>>
>> http://54.208.29.24/pogstest/result.php?resultid=294
>>
>> Seems as though it's crashing as it's going to execute task?
>>
>> 08:05:25 (5741): wrapper (main): task poll begin
>> 08:05:25 (6818): wrapper (TASK::run): in child proc of the fork
>> 08:05:25 (6818): wrapper (TASK::run): construct argv
>> 08:05:25 (6818): wrapper (TASK::run): set up env variables
>> 08:05:25 (6818): wrapper (TASK::run): executing app
>> SIGSEGV: segmentation violation
>>
>> From modified code (TASK::run):
>>
>>         if (nvars > 0) {
>>             set_up_env_vars(&env_vars, nvars);
>>             fprintf(stderr, "%s wrapper (TASK::run): executing app with vars\n",
>>                 boinc_msg_prefix(buf, sizeof(buf))
>>             );
>>             retval = execve(app_path, argv, env_vars);
>>         } else {
>>             fprintf(stderr, "%s wrapper (TASK::run): executing app\n",
>>                 boinc_msg_prefix(buf, sizeof(buf))
>>             );
>>             retval = execv(app_path, argv);
>>         }
>>
>>
>> I guess it could be coming from wrapper during poll but it seems like it
>> has something to do with fit_sed starting. Possibly some memory allocation
>> problems as the app is starting? It's proving quite difficult to get dumps
>> out of people's phones.
>>
>> I'm going to continue prodding and poking and try and get a stack-trace
>> into stderr from the wrapper itself. I'll also try different fit_sed
>> compilation as well, including a C port.
>>
>> I just wish I could fire up gdb server on their phones and attach over the
>> internet =D.
>>
>> Regards
>>
>> Daniel
>>
>> On Sat, Aug 3, 2013 at 12:23 AM, Daniel Carrion <[email protected]> wrote:
>>
>> Thanks David, I'll have a look into that...
>>
>> I'll have the "debug ready" wrapper to push onto the test instance on
>> Monday. Hopefully we can then grab a few users having this problem to jump
>> on and try it.
>>
>> I'll have to think of a way of getting those crash dumps out without
>> having to use NativeBOINC. Some Android devices seem to drop them in
>> /data/tombstones, and most just log crash dump data directly to the system
>> log for viewing via logcat. We could probably get the rooted users to open
>> a terminal and leave "logcat -s DEBUG > /sdcard/Download/logcat.txt"
>> running before jobs run, so we get dumps in the attached format (this is
>> just me testing/causing a segfault). Non-rooted users can either call this
>> via adb or try an app that essentially does the same thing.
>>
>> Sorry, just me thinking out loud.
>>
>> Regards
>>
>> Daniel
>>
>> On Fri, Aug 2, 2013 at 4:28 PM, David Anderson <[email protected]> wrote:
>>
>>
>>
>> On 01-Aug-2013 11:15 PM, Daniel Carrion wrote:
>>
>>
>> David, is there anything else you suggest for debugging purposes? E.g.
>> catching SIGILL and SIGSEGV somehow?
>>
>>
>> That should do it; catching the signals would help only if we can then
>> generate a stack trace. In principle backtrace(3) can do this, but it may
>> not work on Android, see
>> http://stackoverflow.com/questions/10864882/stacktrace-arm-linux-gcc
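
As a rough illustration of catching SIGSEGV/SIGILL and emitting a stack trace without backtrace(3), the sketch below walks the stack with `_Unwind_Backtrace()` from `<unwind.h>`, which is supplied by the GCC and Clang runtimes. This is a swapped-in technique and an assumption on my part: nothing in this thread confirms that it works with the Android toolchain in use here, and on ARM it generally needs unwind tables to be compiled in. The names `capture_backtrace()`, `write_hex()`, `crash_handler()` and `install_crash_handler()` are hypothetical. The handler prints raw addresses only, which would have to be resolved to symbols offline.

```cpp
#include <unwind.h>
#include <signal.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

namespace {

struct TraceState {
    void** frames;  // output buffer
    size_t count;   // frames stored so far
    size_t max;     // buffer capacity
};

// Called once per stack frame by _Unwind_Backtrace().
_Unwind_Reason_Code on_frame(struct _Unwind_Context* ctx, void* arg) {
    TraceState* st = static_cast<TraceState*>(arg);
    if (st->count >= st->max) return _URC_END_OF_STACK;
    uintptr_t ip = static_cast<uintptr_t>(_Unwind_GetIP(ctx));
    if (ip != 0) st->frames[st->count++] = reinterpret_cast<void*>(ip);
    return _URC_NO_REASON;
}

}  // namespace

// Store up to max return addresses in frames; returns how many were stored.
size_t capture_backtrace(void** frames, size_t max) {
    TraceState st = {frames, 0, max};
    _Unwind_Backtrace(on_frame, &st);
    return st.count;
}

// Print one address as hex using only write(2), avoiding stdio in a handler.
void write_hex(int fd, uintptr_t value) {
    char buf[2 + sizeof(value) * 2 + 1];
    int n = 0;
    buf[n++] = '0';
    buf[n++] = 'x';
    for (int shift = static_cast<int>(sizeof(value)) * 8 - 4; shift >= 0; shift -= 4) {
        buf[n++] = "0123456789abcdef"[(value >> shift) & 0xf];
    }
    buf[n++] = '\n';
    ssize_t ignored = write(fd, buf, static_cast<size_t>(n));
    (void)ignored;
}

// Dump the frame addresses to stderr, then re-raise with the default action.
extern "C" void crash_handler(int sig) {
    void* frames[32];
    size_t n = capture_backtrace(frames, 32);
    for (size_t i = 0; i < n; i++) {
        write_hex(STDERR_FILENO, reinterpret_cast<uintptr_t>(frames[i]));
    }
    signal(sig, SIG_DFL);
    raise(sig);
}

// Install the handler for SIGSEGV and SIGILL; returns true on success.
bool install_crash_handler() {
    struct sigaction sa = {};
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = crash_handler;
    return sigaction(SIGSEGV, &sa, nullptr) == 0 &&
           sigaction(SIGILL, &sa, nullptr) == 0;
}
```

Calling `install_crash_handler()` early in main() would be the intended use. Unwinding from inside a signal handler is not guaranteed to be safe, so output from this should be treated as a best-effort hint.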
>>
>> -- David
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
