Good! Next question: is it crashing in find_children()?
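For anyone following along, here is a minimal sketch of the approach described in the quoted message below: record each process in its parent's list of children (presumably the step that find_children() is responsible for), then sum CPU time recursively with a guard against cycles. This is not the BOINC code. ProcInfo, link_children(), tree_cpu_time() and sample_snapshot() are hypothetical names, and the process table is hand-written rather than read from /proc.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for PROCINFO; only the fields this sketch needs.
struct ProcInfo {
    int parent_pid;
    double cpu_time;            // seconds used by this process alone
    std::vector<int> children;  // filled in by link_children()
};

using ProcMap = std::map<int, ProcInfo>;

// Record each process in its parent's children list.
// A parent PID that is not in the map is skipped, never inserted.
void link_children(ProcMap& pm) {
    for (auto& entry : pm) {
        auto parent = pm.find(entry.second.parent_pid);
        if (parent == pm.end()) continue;
        parent->second.children.push_back(entry.first);
    }
}

// Sum CPU time over pid and all its descendants. The visited set stops
// the recursion if the parent links form a cycle.
double tree_cpu_time(const ProcMap& pm, int pid, std::set<int>& visited) {
    if (!visited.insert(pid).second) return 0.0;  // already counted
    auto it = pm.find(pid);
    if (it == pm.end()) return 0.0;               // unknown or stale PID
    double total = it->second.cpu_time;
    for (int child : it->second.children) {
        total += tree_cpu_time(pm, child, visited);
    }
    return total;
}

// Hand-written snapshot standing in for a scan of /proc (assumption).
ProcMap sample_snapshot() {
    ProcMap pm;
    pm[100] = {1, 1.5, {}};     // e.g. the wrapper
    pm[101] = {100, 2.0, {}};   // e.g. the worker it started
    pm[102] = {101, 0.25, {}};  // a grandchild
    pm[200] = {1, 9.0, {}};     // unrelated process, must not be counted
    return pm;
}
```

To use it, build the map, call link_children() once, then call tree_cpu_time() with an empty std::set<int>. Note that the lookups use find() rather than operator[], so a missing PID never adds an entry to the map while it is being walked.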
On 08-Aug-2013 12:08 PM, Daniel Carrion wrote:
Hi David

Two of these reported back on my test server:

DEBUG: START
DEBUG: - process_tree_cpu_time(pid)
DEBUG: -- procinfo_setup(pm)
DEBUG: -- procinfo_app(procinfo, NULL, pm, NULL)
DEBUG: END
DEBUG: START
DEBUG: - process_tree_cpu_time(pid)
DEBUG: -- procinfo_setup(pm)
SIGSEGV: segmentation violation

Adding more logging in procinfo_setup in procinfo_unix.cpp and generating more work.

Cheers
Daniel

On Thu, Aug 8, 2013 at 5:32 PM, David Anderson wrote:

If it's reproducible and happens quickly, one can find the location of the crash by putting in lots of printf()s, although of course it's tedious. I can help with this if needed.

-- David

On 07-Aug-2013 11:20 PM, Daniel Carrion wrote:

Hi David

Thanks for that explanation. It would definitely be useful to get a stack trace. Unfortunately, there is only one person with this problem volunteering their phone time and I cannot get an Android stack trace from their phone. Nothing shows up when they run logcat or look in /data/tombstones. This seems to be a known problem with new Android releases. I actually think they pulled it out. The NativeBOINC stack trace (which uses ptrace I think) doesn't give us anything useful. It would be handy to attach gdbserver to the task but probably not feasible. I'll continue to think of ways to get a useful stack trace.

I released a version on pogstest that omits checking cpu_time just to see if it runs through on the problematic device. I will put more printf()s in the next release. I'm guessing this time in functions called in lib/procinfo.cpp? I'll probably release this version to the test project site I set up, as I can more easily control the worker task (shorter).

Regards
Daniel

On Thu, Aug 8, 2013 at 3:19 PM, David Anderson wrote:

I couldn't immediately see the problem.
A stack trace would help; all we know is it's crashing somewhere in process_tree_cpu_time().

Let me explain how this code works, in case anyone else wants to review it.

The purpose of process_tree_cpu_time(pid) is to get the current CPU time of the given process and all its descendants. Unix doesn't provide an easy way to do this; getrusage() only reports on children that have exited. So instead we enumerate the set of all processes (using /proc), find the ones that are descendants of pid, and add up their CPU time.

The functions involved are:

procinfo_setup()
    scans /proc and builds a data structure, namely a std::map that maps pid to PROCINFO (a structure that describes a process, and which contains a std::vector of the PIDs of its children)

procinfo_app()
    given a PID, sums the CPU time of its descendants. The summing is done by a recursive function add_child_totals().

add_child_totals(), BTW, has a safeguard that prevents infinite recursion if the process tree has a cycle (i.e. is not actually a tree); this happens sometimes in Windows.

Daniel, if you want to put in more printf()s you could try to narrow the crash down more among these functions.

-- David

On 07-Aug-2013 8:46 PM, Daniel Carrion wrote:

Hello

Iris has attached to my test project site. Looks like all tasks exited out at task.cpu_time as per before. See task stderr:

http://akira.onburde.net/burdetest/results.php?hostid=2&offset=0&show_names=0&state=6&appid=

I'm going to pull this section out to confirm that it is this function call causing problems and release a new version.
Cheers
Daniel

On Thu, Aug 8, 2013 at 7:43 AM, Daniel Carrion wrote:

Oh and thanks again David for helping out.

On Thu, Aug 8, 2013 at 7:42 AM, Daniel Carrion wrote:

Hopefully we can get a few more occurrences in case that was a one-off. If it occurs again I might just have it spit out before and after that point and remove the "first 180 second" threshold. Every 10 seconds won't make the stderr.txt file too full.

I've got a test project with POGS-like tasks set up here:

http://akira.onburde.net/burdetest/

At the moment the worker (fit_sed) is a dummy app that outputs like fit_sed but runs shorter. I'll see if Iris wants to jump on there for a few runs to see if it faults with short tasks as well. If pogstest has to be brought down next week I can continue troubleshooting on this site.

Cheers
Daniel

On Thu, Aug 8, 2013 at 7:12 AM, David Anderson wrote:

That's a help. I'll take a close look at that code.

-- David

On 07-Aug-2013 12:39 PM, Daniel Carrion wrote:

Hey David/Kevin

Please see attached. Managed to catch one. Looks like it was calling task.cpu_time(). Watching out for more to see if it consistently happens at this point. My test user base is minimal for this so I have to wait a bit.
I'm pretty much relying on Iris' device and one other.

Regards
Daniel

On Wed, Aug 7, 2013 at 2:35 PM, Daniel Carrion wrote:

Actually, I'll have the output cycle as the stuff we're interested in is during the polling loop.

On Wed, Aug 7, 2013 at 2:13 PM, Daniel Carrion wrote:

Hi David

I'm just testing this out now. A bit worried about the size of this stderr.txt file on users' devices. Could end up over 50MB. I'll give people a heads up before releasing.

Regards
Daniel

On Wed, Aug 7, 2013 at 4:31 AM, David Anderson wrote:

The code below (execv()) is executed in the child process. It looks like what's getting the SIGSEGV is the parent process. I'd try putting a fprintf(stderr, ...) at the start and end of TASK::poll(), and before the call to task.cpu_time(), and before the call to boinc_report_app_start(). The problem is likely in one of those.

-- David

On 06-Aug-2013 7:47 AM, Daniel Carrion wrote:

Hey David/Kevin

Iris has run a couple jobs on the test instances along with a couple of others. Here's the output of one of his tasks:

http://54.208.29.24/pogstest/result.php?resultid=294

Seems as though it's crashing as it's going to execute task?
08:05:25 (5741): wrapper (main): task poll begin
08:05:25 (6818): wrapper (TASK::run): in child proc of the fork
08:05:25 (6818): wrapper (TASK::run): construct argv
08:05:25 (6818): wrapper (TASK::run): set up env variables
08:05:25 (6818): wrapper (TASK::run): executing app
SIGSEGV: segmentation violation

From modified code (TASK::run):

    if (nvars > 0) {
        set_up_env_vars(&env_vars, nvars);
        fprintf(stderr, "%s wrapper (TASK::run): executing app with vars\n",
            boinc_msg_prefix(buf, sizeof(buf))
        );
        retval = execve(app_path, argv, env_vars);
    } else {
        fprintf(stderr, "%s wrapper (TASK::run): executing app\n",
            boinc_msg_prefix(buf, sizeof(buf))
        );
        retval = execv(app_path, argv);
    }

I guess it could be coming from wrapper during poll but it seems like it has something to do with fit_sed starting. Possibly some memory allocation problems as the app is starting? It's proving quite difficult to get dumps out of people's phones. I'm going to continue prodding and poking and try and get a stack-trace into stderr from the wrapper itself. I'll also try different fit_sed compilation as well, including a C port. I just wish I could fire up gdb server on their phones and attach over the internet =D.
Regards
Daniel

On Sat, Aug 3, 2013 at 12:23 AM, Daniel Carrion wrote:

Thanks David, I'll have a look into that... I'll have the "debug ready" wrapper to push onto the test instance on Monday. Hopefully we can then grab a few users having this problem to jump on and try it.

I'll have to think of a way of getting those crash dumps out without having to use NativeBOINC. Some Android devices seem to drop them in /data/tombstones and most just log crash dump data directly to the system log for viewing via logcat. Could probably get the rooted users to open a terminal and leave "logcat -s DEBUG > /sdcard/Download/logcat.txt" running before jobs run so we get dumps in the attached format (this is just me testing/causing segfault). Non-rooted users can either call this via adb or try an app that essentially does the same thing. Sorry, just me thinking out loud.
Regards
Daniel

On Fri, Aug 2, 2013 at 4:28 PM, David Anderson wrote:

On 01-Aug-2013 11:15 PM, Daniel Carrion wrote:

David, is there anything else you suggest for debugging purpose? E.g. Catching SIGILL and SIGSEGV somehow?

That should do it; catching the signals would help only if we can then generate a stack trace.
In principle backtrace(3) can do this, but it may not work on Android; see

http://stackoverflow.com/questions/10864882/stacktrace-arm-linux-gcc

-- David
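As a rough illustration of the idea at the end of the thread (catch SIGSEGV and SIGILL, then try to dump a stack trace to stderr), here is a minimal sketch. It assumes a glibc-style <execinfo.h> providing backtrace() and backtrace_symbols_fd(); as noted above, that may not be available or useful on Android. The names crash_handler() and install_crash_handlers() are hypothetical and are not part of the wrapper.

```cpp
#include <cassert>
#include <execinfo.h>  // backtrace(), backtrace_symbols_fd(): glibc; may be missing on Android
#include <signal.h>
#include <unistd.h>

// Hypothetical handler: write the raw stack frames to stderr, then exit.
// Only write(), backtrace_symbols_fd() and _exit() are used here, to stay
// away from stdio and malloc inside a signal handler.
extern "C" void crash_handler(int sig) {
    static const char msg[] = "caught fatal signal, stack frames follow:\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    void* frames[64];
    int n = backtrace(frames, 64);
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(128 + sig);
}

// Install the handler for SIGSEGV and SIGILL. Returns true on success.
bool install_crash_handlers() {
    // Call backtrace() once up front so that any lazy library loading it
    // needs happens here rather than inside the signal handler.
    void* warmup[1];
    backtrace(warmup, 1);

    struct sigaction sa = {};
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sa.sa_handler = crash_handler;
    return sigaction(SIGSEGV, &sa, nullptr) == 0
        && sigaction(SIGILL, &sa, nullptr) == 0;
}
```

install_crash_handlers() would be called once near the start of main(). The output is a list of addresses (with symbol names only where the binary exports them), so it would still need to be resolved against an unstripped build.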
_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
