Hi David,

Thanks for that explanation.
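
For reference, the process-tree summing described in the quoted message below can be sketched roughly as follows. This is a minimal sketch, not the code in lib/procinfo.cpp: ProcInfo, ProcMap, link_children(), add_tree_totals() and tree_cpu_time() are hypothetical, simplified names, the map is assumed to be filled in by the caller rather than by scanning /proc, and the total here includes the root process itself.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical, simplified stand-in for a per-process record.
struct ProcInfo {
    int pid;
    int parent_pid;
    double cpu_time;            // CPU seconds used by this process alone
    std::vector<int> children;  // PIDs of direct children
};

using ProcMap = std::map<int, ProcInfo>;

// Rebuild every entry's children list from the parent_pid fields.
void link_children(ProcMap& pm) {
    for (auto& kv : pm) kv.second.children.clear();
    for (auto& kv : pm) {
        auto parent = pm.find(kv.second.parent_pid);
        // Skip unknown parents and processes that claim to be their own parent.
        if (parent != pm.end() && parent->first != kv.first) {
            parent->second.children.push_back(kv.first);
        }
    }
}

// Add up the CPU time of pid and everything below it.
// 'visited' stops the recursion if the parent links form a cycle.
double add_tree_totals(const ProcMap& pm, int pid, std::set<int>& visited) {
    if (!visited.insert(pid).second) return 0.0;  // already counted
    auto it = pm.find(pid);
    if (it == pm.end()) return 0.0;               // unknown or vanished process
    double total = it->second.cpu_time;
    for (int child : it->second.children) {
        total += add_tree_totals(pm, child, visited);
    }
    return total;
}

// CPU time of the process tree rooted at pid (root included).
double tree_cpu_time(const ProcMap& pm, int pid) {
    std::set<int> visited;
    return add_tree_totals(pm, pid, visited);
}
```

A caller would fill a ProcMap, call link_children() once, and then call tree_cpu_time() with the PID of the task. The visited set is one simple way to get the cycle safeguard; the real code may do this differently.
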
It would definitely be useful to get a stack trace. Unfortunately, there is only one person with this problem volunteering their phone time, and I cannot get an Android stack trace from their phone: nothing shows up when they run logcat or look in /data/tombstones. This seems to be a known problem with new Android releases. I actually think they pulled it out. The NativeBOINC stack trace (which uses ptrace, I think) doesn't give us anything useful. It would be handy to attach gdbserver to the task, but that is probably not feasible. I'll continue to think of ways to get a useful stack trace.

I released a version on pogstest that omits checking cpu_time, just to see if it runs through on the problematic device.

I will put more printf()s in the next release. I'm guessing this time they should go in the functions called in lib/procinfo.cpp? I'll probably release this version to the test project site I set up, as I can more easily control the worker task (shorter).

Regards

Daniel

On Thu, Aug 8, 2013 at 3:19 PM, David Anderson <[email protected]> wrote:

> I couldn't immediately see the problem.
> A stack trace would help;
> all we know is it's crashing somewhere in process_tree_cpu_time().
>
> Let me explain how this code works, in case anyone else wants to review it.
> The purpose of process_tree_cpu_time(pid) is to get the current CPU time
> of the given process and all its descendants.
> Unix doesn't provide an easy way to do this;
> getrusage() only reports on children that have exited.
> So instead we enumerate the set of all processes (using /proc),
> find the ones that are descendants of pid,
> and add up their CPU time.
>
> The functions involved are:
> procinfo_setup()
>     scans /proc, and builds a data structure,
>     namely a std::map that maps pid to PROCINFO
>     (a structure that describes a process, and which contains
>     a std::vector of the PIDs of its children)
> procinfo_app()
>     given a PID, sum the CPU time of its descendants.
>     The summing is done by a recursive function add_child_totals().
> add_child_totals(), BTW, has a safeguard that prevents infinite recursion
> if the process tree has a cycle (i.e. is not actually a tree);
> this happens sometimes in Windows.
>
> Daniel, if you want to put in more printf()s you could try to narrow
> the crash down more among these functions.
>
> -- David
>
> On 07-Aug-2013 8:46 PM, Daniel Carrion wrote:
>
>> Hello
>>
>> Iris has attached to my test project site. Looks like all tasks exited out at
>> task.cpu_time as per before.
>>
>> See task stderr:
>> http://akira.onburde.net/burdetest/results.php?hostid=2&offset=0&show_names=0&state=6&appid=
>>
>> I'm going to pull this section out to confirm that it is this function call
>> causing problems and release new version.
>>
>> Cheers
>>
>> Daniel
>>
>> On Thu, Aug 8, 2013 at 7:43 AM, Daniel Carrion <[email protected]> wrote:
>>
>> Oh and Thanks again David for helping out.
>>
>> On Thu, Aug 8, 2013 at 7:42 AM, Daniel Carrion <[email protected]> wrote:
>>
>> Hopefully we can get a few more occurrences in case that was a one off. If it
>> occurs again I might just have it spit out before and after that point and
>> remove the "first 180 second" threshold. Every 10 seconds won't make the
>> stderr.txt file too full.
>>
>> I've got a test project with POGS like tasks set up here:
>> http://akira.onburde.net/burdetest/
>> At the moment the worker (fit_sed) is a
>> dummy app that outputs like fit_sed but runs shorter. I'll see if Iris wants
>> to jump on there for a few runs to see if it faults with short tasks as well.
>> If pogstest has to be brought down next week I can continue troubleshooting
>> on this site.
>>
>> Cheers
>>
>> Daniel
>>
>> On Thu, Aug 8, 2013 at 7:12 AM, David Anderson <[email protected]> wrote:
>>
>> That's a help. I'll take a close look at that code. -- David
>>
>> On 07-Aug-2013 12:39 PM, Daniel Carrion wrote:
>>
>> Hey David/Kevin
>>
>> Please see attached. Managed to catch one. Looks like it was calling
>> task.cpu_time().
>>
>> Watching out for more to see if it consistently happens at this point. My
>> test user base is minimal for this so have to wait a bit. I'm pretty much
>> relying on Iris' device and one other.
>>
>> Regards
>>
>> Daniel
>>
>> On Wed, Aug 7, 2013 at 2:35 PM, Daniel Carrion <[email protected]> wrote:
>>
>> Actually, I'll have the output cycle as the stuff we're interested in is
>> during the polling loop.
>>
>> On Wed, Aug 7, 2013 at 2:13 PM, Daniel Carrion <[email protected]> wrote:
>>
>> Hi David
>>
>> I'm just testing this out now. A bit worried about the size of this
>> stderr.txt file on users' devices. Could end up over 50MB. I'll give people a
>> heads up before releasing.
>>
>> Regards
>>
>> Daniel
>>
>> On Wed, Aug 7, 2013 at 4:31 AM, David Anderson <[email protected]> wrote:
>>
>> The code below (execv()) is executed in the child process. It looks like
>> what's getting the SIGSEGV is the parent process. I'd try putting a
>> fprintf(stderr, ...) at the start and end of TASK::poll(), and before the
>> call to task.cpu_time(), and before the call to boinc_report_app_start(). The
>> problem is likely in one of those. -- David
>>
>> On 06-Aug-2013 7:47 AM, Daniel Carrion wrote:
>>
>> Hey David/Kevin
>>
>> Iris has run a couple jobs on the test instances along with a couple of others.
>> Here's the output of one of his tasks:
>>
>> http://54.208.29.24/pogstest/result.php?resultid=294
>>
>> Seems as though it's crashing as it's going to execute task?
>>
>> 08:05:25 (5741): wrapper (main): task poll begin
>> 08:05:25 (6818): wrapper (TASK::run): in child proc of the fork
>> 08:05:25 (6818): wrapper (TASK::run): construct argv
>> 08:05:25 (6818): wrapper (TASK::run): set up env variables
>> 08:05:25 (6818): wrapper (TASK::run): executing app
>> SIGSEGV: segmentation violation
>>
>> From modified code (TASK::run):
>>
>>     if (nvars > 0) {
>>         set_up_env_vars(&env_vars, nvars);
>>         fprintf(stderr, "%s wrapper (TASK::run): executing app with vars\n",
>>             boinc_msg_prefix(buf, sizeof(buf))
>>         );
>>         retval = execve(app_path, argv, env_vars);
>>     } else {
>>         fprintf(stderr, "%s wrapper (TASK::run): executing app\n",
>>             boinc_msg_prefix(buf, sizeof(buf))
>>         );
>>         retval = execv(app_path, argv);
>>     }
>>
>> I guess it could be coming from wrapper during poll but it seems like it has
>> something to do with fit_sed starting. Possibly some memory allocation
>> problems as the app is starting? It's proving quite difficult to get dumps
>> out of people's phones.
>>
>> I'm going to continue prodding and poking and try and get a stack-trace into
>> stderr from the wrapper itself. I'll also try different fit_sed compilation
>> as well, including a C port.
>>
>> I just wish I could fire up gdb server on their phones and attach over the
>> internet =D.
>>
>> Regards
>>
>> Daniel
>>
>> On Sat, Aug 3, 2013 at 12:23 AM, Daniel Carrion <[email protected]> wrote:
>>
>> Thanks David, I'll have a look into that...
>>
>> I'll have the "debug ready" wrapper to push onto the test instance on Monday.
>> Hopefully we can then grab a few users having this problem to jump on and try
>> it.
>>
>> I'll have to think of a way of getting those crash dumps out without having
>> to use NativeBOINC. Some Android devices seem to drop them in /data/tombstones
>> and most just log crash dump data directly to system log for viewing via
>> logcat. Could probably get the rooted users to open a terminal and leave
>> "logcat -s DEBUG > /sdcard/Download/logcat.txt" running before jobs run so we
>> get dumps in the attached format (this is just me testing/causing segfault).
>> Non rooted can either call this via adb or try an app that essentially does
>> the same thing.
>>
>> Sorry, just me thinking out loud.
>>
>> Regards
>>
>> Daniel
>>
>> On Fri, Aug 2, 2013 at 4:28 PM, David Anderson <[email protected]> wrote:
>>
>> On 01-Aug-2013 11:15 PM, Daniel Carrion wrote:
>>
>> David, is there anything else you suggest for debugging purpose? E.g.
>> Catching SIGILL and SIGSEGV somehow?
>>
>> That should do it; catching the signals would help only if we can then
>> generate a stack trace.
>> In principle backtrace(3) can do this, but it may
>> not work on Android, see
>> http://stackoverflow.com/questions/10864882/stacktrace-arm-linux-gcc
>>
>> -- David

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
