Re: [boinc_dev] [boinc-android-testing] Re: POGS Computation Errors - Calling for help.

Daniel Carrion Thu, 08 Aug 2013 12:09:43 -0700

Hi David

Two of these reported back on my test server:


DEBUG: START
DEBUG: - process_tree_cpu_time(pid)
DEBUG: -- procinfo_setup(pm)
DEBUG: -- procinfo_app(procinfo, NULL, pm, NULL)
DEBUG: END
DEBUG: START
DEBUG: - process_tree_cpu_time(pid)
DEBUG: -- procinfo_setup(pm)
SIGSEGV: segmentation violation

Adding more logging in procinfo_setup in procinfo_unix.cpp and
generating more work.
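
For anyone following along, the DEBUG lines above come from bracketing each call with a flushed write to stderr. A minimal sketch of that kind of helper follows; format_trace() and debug_trace() are hypothetical names made up for illustration, not functions in BOINC or the wrapper:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper (not part of BOINC): builds one trace line with a
// "DEBUG: " prefix and one dash per nesting level, as in the output above.
std::string format_trace(int depth, const std::string& what) {
    assert(depth >= 0);
    std::string line = "DEBUG: ";
    if (depth > 0) {
        line.append(static_cast<std::size_t>(depth), '-');
        line += ' ';
    }
    line += what;
    return line;
}

// Writes the trace line to stderr and flushes right away, so the last
// line printed before a crash is not left sitting in a buffer.
void debug_trace(int depth, const std::string& what) {
    std::fprintf(stderr, "%s\n", format_trace(depth, what).c_str());
    std::fflush(stderr);
}
```

Calling debug_trace(2, "procinfo_setup(pm)") just before the call, and again after it returns, shows which call never came back.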

Cheers

Daniel



On Thu, Aug 8, 2013 at 5:32 PM, David Anderson <[email protected]> wrote:

> If it's reproducible and happens quickly,
> one can find the location of the crash by putting in lots of printf()s,
> although of course it's tedious.
> I can help with this if needed.
> -- David
>
>
> On 07-Aug-2013 11:20 PM, Daniel Carrion wrote:
>
>> Hi David
>>
>> Thanks for that explanation.
>>
>> It would definitely be useful to get a stack trace. Unfortunately, only one
>> person with this problem is volunteering their phone time, and I cannot get
>> an Android stack trace from their phone. Nothing shows up when they run
>> logcat or look in /data/tombstones. This seems to be a known problem with
>> new Android releases. I actually think they pulled it out. The NativeBOINC
>> stack trace (which uses ptrace, I think) doesn't give us anything useful. It
>> would be handy to attach gdbserver to the task, but that is probably not
>> feasible. I'll continue to think of ways to get a useful stack trace.
>>
>> I released a version on pogstest that omits checking cpu_time, just to see
>> if it runs through on the problematic device. I will put more printf()s in
>> the next release. I'm guessing this time in functions called in
>> lib/procinfo.cpp? I'll probably release this version to the test project
>> site I set up, as I can more easily control the worker task (shorter).
>>
>> Regards
>>
>> Daniel
>>
>>
>> On Thu, Aug 8, 2013 at 3:19 PM, David Anderson <[email protected]> wrote:
>>
>>     I couldn't immediately see the problem.
>>     A stack trace would help;
>>     all we know is it's crashing somewhere in process_tree_cpu_time().
>>
>>     Let me explain how this code works, in case anyone else wants to review it.
>>     The purpose of process_tree_cpu_time(pid) is to get the current CPU time
>>     of the given process and all its descendants.
>>     Unix doesn't provide an easy way to do this;
>>     getrusage() only reports on children that have exited.
>>     So instead we enumerate the set of all processes (using /proc),
>>     find the ones that are descendants of pid,
>>     and add up their CPU time.
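
To make the /proc step concrete: on Linux, each process has a /proc/<pid>/stat file whose single line carries the parent PID (field 4) and the user and kernel CPU times in clock ticks (fields 14 and 15). A rough sketch of parsing one such line is below. StatFields and parse_stat_line() are illustrative names, and this is not the code in lib/procinfo_unix.cpp, which may read these values differently:

```cpp
#include <sstream>
#include <string>

// Illustrative record only; BOINC's real PROCINFO holds more than this.
struct StatFields {
    int pid = 0;
    int ppid = 0;
    unsigned long utime_ticks = 0;  // user-mode CPU time, in clock ticks
    unsigned long stime_ticks = 0;  // kernel-mode CPU time, in clock ticks
};

// Parses one line in the format of /proc/<pid>/stat.
// Returns false if the line does not have the expected shape.
bool parse_stat_line(const std::string& line, StatFields& out) {
    // The command name is wrapped in parentheses and may itself contain
    // spaces or ')', so split on the LAST ')' rather than on whitespace.
    std::string::size_type open = line.find('(');
    std::string::size_type close = line.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close < open) {
        return false;
    }
    std::istringstream head(line.substr(0, open));
    if (!(head >> out.pid)) return false;

    std::istringstream rest(line.substr(close + 1));
    std::string state;
    if (!(rest >> state >> out.ppid)) return false;

    // Skip fields 5 through 13 (pgrp ... cmajflt) to reach utime and stime.
    std::string skipped;
    for (int i = 0; i < 9; i++) {
        if (!(rest >> skipped)) return false;
    }
    return static_cast<bool>(rest >> out.utime_ticks >> out.stime_ticks);
}
```

Dividing the tick counts by sysconf(_SC_CLK_TCK) converts them to seconds.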
>>
>>     The functions involved are:
>>     procinfo_setup()
>>          scans /proc and builds a data structure,
>>          namely a std::map that maps PID to PROCINFO
>>          (a structure that describes a process, and which contains
>>          a std::vector of the PIDs of its children)
>>     procinfo_app()
>>          given a PID, sums the CPU time of its descendants.
>>          The summing is done by a recursive function, add_child_totals().
>>          add_child_totals(), BTW, has a safeguard that prevents infinite recursion
>>          if the process tree has a cycle (i.e. is not actually a tree);
>>          this happens sometimes in Windows.
>>
>>     Daniel, if you want to put in more printf()s, you could try to narrow
>>     the crash down more among these functions.
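
The data structure and the guarded recursion described above can be sketched roughly as follows. This is a simplified stand-in, not the BOINC source: ProcInfo, link_children(), add_tree_totals() and tree_cpu_time() are illustrative names, the map is filled by hand rather than from /proc, and the sketch counts the starting process itself along with its descendants:

```cpp
#include <map>
#include <vector>

// Simplified stand-in for BOINC's PROCINFO; field names are illustrative.
struct ProcInfo {
    int parent_id = 0;
    double cpu_time = 0.0;       // user + kernel seconds for this process
    std::vector<int> children;   // PIDs of direct children
    bool scanned = false;        // set once visited; guards against cycles
};

typedef std::map<int, ProcInfo> ProcMap;

// Fills in each entry's children vector from the parent_id fields.
void link_children(ProcMap& pm) {
    for (auto& entry : pm) {
        ProcMap::iterator parent = pm.find(entry.second.parent_id);
        if (parent != pm.end()) {
            parent->second.children.push_back(entry.first);
        }
    }
}

// Adds the CPU time of pid and all of its descendants to total.
// An entry that was already scanned is skipped, so a cycle in the
// parent/child links cannot cause infinite recursion.
void add_tree_totals(ProcMap& pm, int pid, double& total) {
    ProcMap::iterator it = pm.find(pid);
    if (it == pm.end() || it->second.scanned) return;
    it->second.scanned = true;
    total += it->second.cpu_time;
    for (int child : it->second.children) {
        add_tree_totals(pm, child, total);
    }
}

// Resets the scanned flags, then sums the subtree rooted at pid.
double tree_cpu_time(ProcMap& pm, int pid) {
    for (auto& entry : pm) entry.second.scanned = false;
    double total = 0.0;
    add_tree_totals(pm, pid, total);
    return total;
}
```

Nothing is inserted into the map during the walk, so the iterator held across the recursive calls stays valid.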
>>
>>     -- David
>>
>>
>>     On 07-Aug-2013 8:46 PM, Daniel Carrion wrote:
>>
>>         Hello
>>
>>         Iris has attached to my test project site. Looks like all tasks exited
>>         out at task.cpu_time, as before.
>>
>>         See task stderr:
>>         http://akira.onburde.net/burdetest/results.php?hostid=2&offset=0&show_names=0&state=6&appid=
>>
>>         I'm going to pull this section out to confirm that it is this function
>>         call causing problems and release a new version.
>>
>>         Cheers
>>
>>         Daniel
>>
>>         On Thu, Aug 8, 2013 at 7:43 AM, Daniel Carrion <[email protected]> wrote:
>>
>>         Oh and Thanks again David for helping out.
>>
>>
>>         On Thu, Aug 8, 2013 at 7:42 AM, Daniel Carrion <[email protected]> wrote:
>>
>>         Hopefully we can get a few more occurrences in case that was a one-off.
>>         If it occurs again I might just have it spit out before and after that
>>         point and remove the "first 180 second" threshold. Every 10 seconds
>>         won't make the stderr.txt file too full.
>>
>>         I've got a test project with POGS-like tasks set up here:
>>         http://akira.onburde.net/burdetest/
>>         At the moment the worker (fit_sed) is a dummy app that outputs like
>>         fit_sed but runs shorter. I'll see if Iris wants to jump on there for
>>         a few runs to see if it faults with short tasks as well.
>>         If pogstest has to be brought down next week I can continue
>>         troubleshooting on this site.
>>
>>         Cheers
>>
>>         Daniel
>>
>>
>>         On Thu, Aug 8, 2013 at 7:12 AM, David Anderson <[email protected]> wrote:
>>
>>         That's a help. I'll take a close look at that code. -- David
>>
>>
>>         On 07-Aug-2013 12:39 PM, Daniel Carrion wrote:
>>
>>         Hey David/Kevin
>>
>>         Please see attached. Managed to catch one. Looks like it was calling
>>         task.cpu_time().
>>
>>         Watching out for more to see if it consistently happens at this point.
>>         My test user base is minimal for this, so I have to wait a bit. I'm
>>         pretty much relying on Iris' device and one other.
>>
>>         Regards
>>
>>         Daniel
>>
>>         On Wed, Aug 7, 2013 at 2:35 PM, Daniel Carrion <[email protected]> wrote:
>>
>>         Actually, I'll have the output cycle, as the stuff we're interested in
>>         is during the polling loop.
>>
>>
>>         On Wed, Aug 7, 2013 at 2:13 PM, Daniel Carrion <[email protected]> wrote:
>>
>>         Hi David
>>
>>         I'm just testing this out now. A bit worried about the size of this
>>         stderr.txt file on users' devices. Could end up over 50MB. I'll give
>>         people a heads up before releasing.
>>
>>         Regards
>>
>>         Daniel
>>
>>
>>         On Wed, Aug 7, 2013 at 4:31 AM, David Anderson <[email protected]> wrote:
>>
>>         The code below (execv()) is executed in the child process. It looks like
>>         what's getting the SIGSEGV is the parent process. I'd try putting a
>>         fprintf(stderr, ...) at the start and end of TASK::poll(), and before
>>         the call to task.cpu_time(), and before the call to
>>         boinc_report_app_start(). The problem is likely in one of those. -- David
>>
>>
>>         On 06-Aug-2013 7:47 AM, Daniel Carrion wrote:
>>
>>         Hey David/Kevin
>>
>>         Iris has run a couple of jobs on the test instances along with a couple of
>>         others. Here's the output of one of his tasks:
>>
>>         http://54.208.29.24/pogstest/result.php?resultid=294
>>
>>         Seems as though it's crashing as it's going to execute task?
>>
>>         08:05:25 (5741): wrapper (main): task poll begin
>>         08:05:25 (6818): wrapper (TASK::run): in child proc of the fork
>>         08:05:25 (6818): wrapper (TASK::run): construct argv
>>         08:05:25 (6818): wrapper (TASK::run): set up env variables
>>         08:05:25 (6818): wrapper (TASK::run): executing app
>>         SIGSEGV: segmentation violation
>>
>>         From modified code (TASK::run):
>>
>>             if (nvars > 0) {
>>                 set_up_env_vars(&env_vars, nvars);
>>                 fprintf(stderr, "%s wrapper (TASK::run): executing app with vars\n",
>>                     boinc_msg_prefix(buf, sizeof(buf))
>>                 );
>>                 retval = execve(app_path, argv, env_vars);
>>             } else {
>>                 fprintf(stderr, "%s wrapper (TASK::run): executing app\n",
>>                     boinc_msg_prefix(buf, sizeof(buf))
>>                 );
>>                 retval = execv(app_path, argv);
>>             }
>>
>>
>>         I guess it could be coming from the wrapper during poll, but it seems like
>>         it has something to do with fit_sed starting. Possibly some memory
>>         allocation problems as the app is starting? It's proving quite difficult
>>         to get dumps out of people's phones.
>>
>>         I'm going to continue prodding and poking and try to get a stack trace
>>         into stderr from the wrapper itself. I'll also try different fit_sed
>>         compilations as well, including a C port.
>>
>>         I just wish I could fire up gdbserver on their phones and attach over the
>>         internet =D.
>>
>>         Regards
>>
>>         Daniel
>>
>>         On Sat, Aug 3, 2013 at 12:23 AM, Daniel Carrion <[email protected]> wrote:
>>
>>         Thanks David, I'll have a look into that...
>>
>>         I'll have the "debug ready" wrapper to push onto the test instance on
>>         Monday. Hopefully we can then grab a few users having this problem to
>>         jump on and try it.
>>
>>         I'll have to think of a way of getting those crash dumps out without
>>         having to use NativeBOINC. Some Android devices seem to drop them in
>>         /data/tombstones, and most just log crash dump data directly to the
>>         system log for viewing via logcat. Could probably get the rooted users
>>         to open a terminal and leave
>>         "logcat -s DEBUG > /sdcard/Download/logcat.txt" running before jobs run,
>>         so we get dumps in the attached format (this is just me testing/causing
>>         a segfault). Non-rooted users can either call this via adb or try an app
>>         that essentially does the same thing.
>>
>>         Sorry, just me thinking out loud.
>>
>>         Regards
>>
>>         Daniel
>>
>>         On Fri, Aug 2, 2013 at 4:28 PM, David Anderson <[email protected]> wrote:
>>
>>
>>
>>         On 01-Aug-2013 11:15 PM, Daniel Carrion wrote:
>>
>>
>>         David, is there anything else you suggest for debugging purposes? E.g.
>>         catching SIGILL and SIGSEGV somehow?
>>
>>
>>         That should do it; catching the signals would help only if we can then
>>         generate a stack trace. In principle backtrace(3) can do this, but it
>>         may not work on Android, see
>>         http://stackoverflow.com/questions/10864882/stacktrace-arm-linux-gcc
>>
>>         -- David
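
For reference, catching the signals might look roughly like the sketch below. It only records and reports the signal; it does not produce a stack trace, since (as noted above) backtrace(3) may not be usable on Android. crash_handler() and install_crash_handler() are illustrative names, and treating only SIGSEGV and SIGILL as fatal is an assumption of this sketch rather than anything taken from the wrapper:

```cpp
#include <signal.h>
#include <string.h>
#include <unistd.h>

// Set by the handler so code outside it can see which signal arrived.
volatile sig_atomic_t g_last_signal = 0;

// The handler sticks to async-signal-safe calls: write(), signal(), raise().
extern "C" void crash_handler(int sig) {
    g_last_signal = sig;
    static const char msg[] = "wrapper: caught signal\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    if (sig == SIGSEGV || sig == SIGILL) {
        // Assumption: these two are fatal here. Restore the default action
        // and re-raise so the process still terminates with that signal.
        signal(sig, SIG_DFL);
        raise(sig);
    }
}

// Installs crash_handler for one signal; returns true on success.
bool install_crash_handler(int sig) {
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, nullptr) == 0;
}
```

Calling install_crash_handler(SIGSEGV) and install_crash_handler(SIGILL) early in main() would at least get a line into stderr.txt before the process dies.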
>>
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
