Hi David

Two of these reported back on my test server:

DEBUG: START
DEBUG: - process_tree_cpu_time(pid)
DEBUG: -- procinfo_setup(pm)
DEBUG: -- procinfo_app(procinfo, NULL, pm, NULL)
DEBUG: END
DEBUG: START
DEBUG: - process_tree_cpu_time(pid)
DEBUG: -- procinfo_setup(pm)
SIGSEGV: segmentation violation

Adding more logging in procinfo_setup() in procinfo_unix.cpp and generating more work.

Cheers

Daniel

On Thu, Aug 8, 2013 at 5:32 PM, David Anderson <[email protected]> wrote:

> If it's reproducible and happens quickly,
> one can find the location of the crash by putting in lots of printf()s,
> although of course it's tedious.
> I can help with this if needed.
> -- David
>
> On 07-Aug-2013 11:20 PM, Daniel Carrion wrote:
>
>> Hi David
>>
>> Thanks for that explanation.
>>
>> It would definitely be useful to get a stack trace. Unfortunately, there is only
>> one person with this problem volunteering their phone time and I cannot get an
>> Android stack trace from their phone. Nothing when they run logcat or look in
>> /data/tombstones. This seems to be a known problem with new Android releases. I
>> actually think they pulled it out. The NativeBOINC stack trace (which uses
>> ptrace I think) doesn't give us anything useful. It would be handy to attach
>> gdbserver to the task but probably not feasible. I'll continue to think of ways
>> to get a useful stack trace.
>>
>> I released a version on pogstest that omits checking cpu_time just to see if it
>> runs through on the problematic device. I will put more printf()s in the next
>> release. I'm guessing this time in functions called in lib/procinfo.cpp? I'll
>> probably release this version to the test project site I set up as I can more
>> easily control the worker task (shorter).
>>
>> Regards
>>
>> Daniel
>>
>> On Thu, Aug 8, 2013 at 3:19 PM, David Anderson <[email protected]> wrote:
>>
>> I couldn't immediately see the problem.
>> A stack trace would help;
>> all we know is it's crashing somewhere in process_tree_cpu_time().
>>
>> Let me explain how this code works, in case anyone else wants to review it.
>> The purpose of process_tree_cpu_time(pid) is to get the current CPU time
>> of the given process and all its descendants.
>> Unix doesn't provide an easy way to do this;
>> getrusage() only reports on children that have exited.
>> So instead we enumerate the set of all processes (using /proc),
>> find the ones that are descendants of pid,
>> and add up their CPU time.
>>
>> The functions involved are:
>>
>> procinfo_setup()
>>     scans /proc and builds a data structure,
>>     namely a std::map that maps pid to PROCINFO
>>     (a structure that describes a process, and which contains
>>     a std::vector of the PIDs of its children)
>> procinfo_app()
>>     given a PID, sums the CPU time of its descendants.
>>     The summing is done by a recursive function add_child_totals().
>>     add_child_totals(), BTW, has a safeguard that prevents infinite recursion
>>     if the process tree has a cycle (i.e. is not actually a tree);
>>     this happens sometimes in Windows.
>>
>> Daniel, if you want to put in more printf()s you could try to narrow
>> the crash down more among these functions.
>>
>> -- David
>>
>> On 07-Aug-2013 8:46 PM, Daniel Carrion wrote:
>>
>> Hello
>>
>> Iris has attached to my test project site. Looks like all tasks exited out at
>> task.cpu_time as per before.
>>
>> See task stderr:
>> http://akira.onburde.net/burdetest/results.php?hostid=2&offset=0&show_names=0&state=6&appid=
>>
>> I'm going to pull this section out to confirm that it is this function call
>> causing problems and release new version.
>>
>> Cheers
>>
>> Daniel
>>
>> On Thu, Aug 8, 2013 at 7:43 AM, Daniel Carrion <[email protected]> wrote:
>>
>> Oh and Thanks again David for helping out.
>>
>> On Thu, Aug 8, 2013 at 7:42 AM, Daniel Carrion <[email protected]> wrote:
>>
>> Hopefully we can get a few more occurrences in case that was a one off. If it
>> occurs again I might just have it spit out before and after that point and
>> remove the "first 180 second" threshold. Every 10 seconds won't make the
>> stderr.txt file too full.
>>
>> I've got a test project with POGS like tasks set up here:
>>
>> http://akira.onburde.net/burdetest/
>>
>> At the moment the worker (fit_sed) is a
>> dummy app that outputs like fit_sed but runs shorter. I'll see if Iris wants
>> to jump on there for a few runs to see if it faults with short tasks as well.
>> If pogstest has to be brought down next week I can continue troubleshooting
>> on this site.
>>
>> Cheers
>>
>> Daniel
>>
>> On Thu, Aug 8, 2013 at 7:12 AM, David Anderson <[email protected]> wrote:
>>
>> That's a help. I'll take a close look at that code. -- David
>>
>> On 07-Aug-2013 12:39 PM, Daniel Carrion wrote:
>>
>> Hey David/Kevin
>>
>> Please see attached. Managed to catch one. Looks like it was calling
>> task.cpu_time().
>>
>> Watching out for more to see if it consistently happens at this point. My
>> test user base is minimal for this so have to wait a bit. I'm pretty much
>> relying on Iris' device and one other.
>>
>> Regards
>>
>> Daniel
>>
>> On Wed, Aug 7, 2013 at 2:35 PM, Daniel Carrion <[email protected]> wrote:
>>
>> Actually, I'll have the output cycle as the stuff we're interested in is
>> during the polling loop.
>>
>> On Wed, Aug 7, 2013 at 2:13 PM, Daniel Carrion <[email protected]> wrote:
>>
>> Hi David
>>
>> I'm just testing this out now. A bit worried about the size of this
>> stderr.txt file on users devices. Could end up over 50MB. I'll give people a
>> heads up before releasing.
>>
>> Regards
>>
>> Daniel
>>
>> On Wed, Aug 7, 2013 at 4:31 AM, David Anderson <[email protected]> wrote:
>>
>> The code below (execv()) is executed in the child process.
>> It looks like what's getting the SIGSEGV is the parent process. I'd try putting a
>> fprintf(stderr, ...) at the start and end of TASK::poll(), and before the
>> call to task.cpu_time(), and before the call to boinc_report_app_start(). The
>> problem is likely in one of those. -- David
>>
>> On 06-Aug-2013 7:47 AM, Daniel Carrion wrote:
>>
>> Hey David/Kevin
>>
>> Iris has run a couple jobs on the test instances along with a couple of
>> others. Here's the output of one of his tasks:
>>
>> http://54.208.29.24/pogstest/result.php?resultid=294
>>
>> Seems as though it's crashing as it's going to execute task?
>>
>> 08:05:25 (5741): wrapper (main): task poll begin
>> 08:05:25 (6818): wrapper (TASK::run): in child proc of the fork
>> 08:05:25 (6818): wrapper (TASK::run): construct argv
>> 08:05:25 (6818): wrapper (TASK::run): set up env variables
>> 08:05:25 (6818): wrapper (TASK::run): executing app
>> SIGSEGV: segmentation violation
>>
>> From modified code (TASK::run):
>>
>>     if (nvars > 0) {
>>         set_up_env_vars(&env_vars, nvars);
>>         fprintf(stderr, "%s wrapper (TASK::run): executing app with vars\n",
>>             boinc_msg_prefix(buf, sizeof(buf))
>>         );
>>         retval = execve(app_path, argv, env_vars);
>>     } else {
>>         fprintf(stderr, "%s wrapper (TASK::run): executing app\n",
>>             boinc_msg_prefix(buf, sizeof(buf))
>>         );
>>         retval = execv(app_path, argv);
>>     }
>>
>> I guess it could be coming from wrapper during poll but it seems like it has
>> something to do with fit_sed starting. Possibly some memory allocation
>> problems as the app is starting? It's proving quite difficult to get dumps
>> out of people's phones.
>>
>> I'm going to continue prodding and poking and try and get a stack-trace into
>> stderr from the wrapper itself. I'll also try different fit_sed compilation
>> as well, including a C port.
>>
>> I just wish I could fire up gdb server on their phones and attach over the
>> internet =D.
>>
>> Regards
>>
>> Daniel
>>
>> On Sat, Aug 3, 2013 at 12:23 AM, Daniel Carrion <[email protected]> wrote:
>>
>> Thanks David, I'll have a look into that...
>>
>> I'll have the "debug ready" wrapper to push onto the test instance on Monday.
>> Hopefully we can then grab a few users having this problem to jump on and try
>> it.
>>
>> I'll have to think of a way of getting those crash dumps out without having
>> to use NativeBOINC. Some Android devices seem to drop in /data/tombstones
>> and most just log crash dump data directly to system log for viewing via
>> logcat. Could probably get the rooted users to open a terminal and leave
>> "logcat -s DEBUG > /sdcard/Download/logcat.txt" running before jobs run so we
>> get dumps in the attached format (this is just me testing/causing segfault).
>> Non rooted can either call this via adb or try an app that essentially does
>> the same thing.
>>
>> Sorry, just me thinking out loud.
>>
>> Regards
>>
>> Daniel
>>
>> On Fri, Aug 2, 2013 at 4:28 PM, David Anderson <[email protected]> wrote:
>>
>> On 01-Aug-2013 11:15 PM, Daniel Carrion wrote:
>>
>> David, is there anything else you suggest for debugging purpose? E.g.
>> Catching SIGILL and SIGSEGV somehow?
>>
>> That should do it; catching the signals would help only if we can then
>> generate a stack trace.
>> In principle backtrace(3) can do this, but it may
>> not work on Android, see
>> http://stackoverflow.com/questions/10864882/stacktrace-arm-linux-gcc
>>
>> -- David

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and (near bottom of page) enter your email address.
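
As an illustration of the approach David describes in the thread (a map from PID to process record, per-process child lists, and a recursive sum with a cycle guard), here is a minimal C++ sketch. It is not the BOINC code: ProcEntry, link_children(), add_totals(), tree_cpu_time() and sample_table() are hypothetical names, and the table is filled in by hand rather than by scanning /proc, so it only shows the shape of the summation step.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for a process record; only the fields the sketch needs.
struct ProcEntry {
    int parent_pid;
    double cpu_time;            // CPU seconds used by this process alone
    std::vector<int> children;  // filled in by link_children()
};

using ProcMap = std::map<int, ProcEntry>;

// Record each process in its parent's child list.
void link_children(ProcMap& pm) {
    for (auto& kv : pm) kv.second.children.clear();
    for (auto& kv : pm) {
        auto parent = pm.find(kv.second.parent_pid);
        if (parent != pm.end() && parent->first != kv.first) {
            parent->second.children.push_back(kv.first);
        }
    }
}

// Sum the CPU time of pid and all of its descendants.
// `seen` is the cycle guard: a PID is only ever counted once, so a
// malformed "tree" that loops back on itself cannot recurse forever.
double add_totals(const ProcMap& pm, int pid, std::set<int>& seen) {
    auto it = pm.find(pid);
    if (it == pm.end() || !seen.insert(pid).second) return 0.0;
    double total = it->second.cpu_time;
    for (int child : it->second.children) {
        total += add_totals(pm, child, seen);
    }
    return total;
}

// Works on a copy of the table so the caller's data is left untouched.
double tree_cpu_time(ProcMap pm, int pid) {
    link_children(pm);
    std::set<int> seen;
    return add_totals(pm, pid, seen);
}

// Hand-built example data (assumed values, not taken from any real device).
ProcMap sample_table() {
    ProcMap pm;
    pm[100] = ProcEntry{1, 1.5, {}};    // wrapper
    pm[101] = ProcEntry{100, 2.0, {}};  // child of 100
    pm[102] = ProcEntry{101, 0.5, {}};  // grandchild of 100
    pm[200] = ProcEntry{1, 4.0, {}};    // unrelated process
    pm[300] = ProcEntry{301, 1.0, {}};  // 300 and 301 claim each other
    pm[301] = ProcEntry{300, 1.0, {}};  // as parent: a cycle
    return pm;
}
```

Calling tree_cpu_time(sample_table(), 100) adds up processes 100, 101 and 102 and ignores 200. The 300/301 pair shows why the guard matters: without the `seen` set the recursion would never terminate.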
