Re: [boinc_dev] [boinc-android-testing] Re: POGS Computation Errors - Calling for help.

David Anderson Thu, 08 Aug 2013 12:25:22 -0700

Good!  Next question: is it crashing in find_children()?


On 08-Aug-2013 12:08 PM, Daniel Carrion wrote:

Hi David

Two of these reported back on my test server:

DEBUG: START
DEBUG: - process_tree_cpu_time(pid)
DEBUG: -- procinfo_setup(pm)
DEBUG: -- procinfo_app(procinfo, NULL, pm, NULL)
DEBUG: END
DEBUG: START
DEBUG: - process_tree_cpu_time(pid)
DEBUG: -- procinfo_setup(pm)
SIGSEGV: segmentation violation

Adding more logging in procinfo_setup in procinfo_unix.cpp and generating more 
work.

Cheers

Daniel



On Thu, Aug 8, 2013 at 5:32 PM, David Anderson <[email protected]> wrote:

    If it's reproducible and happens quickly,
    one can find the location of the crash by putting in lots of printf()s,
    although of course it's tedious.
    I can help with this if needed.
    -- David


    On 07-Aug-2013 11:20 PM, Daniel Carrion wrote:

        Hi David

        Thanks for that explanation.

        It would definitely be useful to get a stack trace. Unfortunately, only
        one person with this problem is volunteering their phone time, and I
        cannot get an Android stack trace from their phone: nothing shows up
        when they run logcat or look in /data/tombstones. This seems to be a
        known problem with new Android releases; I actually think they pulled
        it out. The NativeBOINC stack trace (which uses ptrace, I think)
        doesn't give us anything useful. It would be handy to attach gdbserver
        to the task, but that is probably not feasible. I'll continue to think
        of ways to get a useful stack trace.

        I released a version on pogstest that omits checking cpu_time just to
        see if it runs through on the problematic device. I will put more
        printf()s in the next release, this time, I'm guessing, in the
        functions called in lib/procinfo.cpp? I'll probably release this
        version to the test project site I set up, as I can more easily
        control the worker task there (shorter).

        Regards

        Daniel


        On Thu, Aug 8, 2013 at 3:19 PM, David Anderson <[email protected]> wrote:

             I couldn't immediately see the problem.
             A stack trace would help;
             all we know is it's crashing somewhere in process_tree_cpu_time().

             Let me explain how this code works, in case anyone else wants to
             review it.
             The purpose of process_tree_cpu_time(pid) is to get the current CPU
             time of the given process and all its descendants.
             Unix doesn't provide an easy way to do this;
             getrusage() only reports on children that have exited.
             So instead we enumerate the set of all processes (using /proc),
             find the ones that are descendants of pid,
             and add up their CPU time.

             The functions involved are:
             procinfo_setup()
                  scans /proc and builds a data structure,
                  namely a std::map that maps PID to PROCINFO
                  (a structure that describes a process, and which contains
                  a std::vector of the PIDs of its children)
             procinfo_app()
                  given a PID, sums the CPU time of its descendants.
                  The summing is done by a recursive function,
                  add_child_totals().
                  add_child_totals(), BTW, has a safeguard that prevents
                  infinite recursion if the process tree has a cycle
                  (i.e. is not actually a tree); this happens sometimes
                  on Windows.

             Daniel, if you want to put in more printf()s, you could try to
             narrow the crash down further among these functions.

             -- David
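
The approach David describes above (a PID-keyed map, a child list per
process, and a recursive sum with a cycle guard) can be sketched roughly as
follows. This is only an illustration, not the BOINC source: ProcInfo,
link_children() and tree_cpu_time() are hypothetical names, and the map is
assumed to have been filled in already (the real code scans /proc).

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for BOINC's PROCINFO; the field names are assumptions.
struct ProcInfo {
    int pid = 0;
    int parent_pid = 0;
    double cpu_time = 0.0;      // seconds of CPU time used so far
    std::vector<int> children;  // PIDs of direct children
    bool visited = false;       // cycle guard for the recursive sum
};

using ProcMap = std::map<int, ProcInfo>;

// Fill in each entry's children list from the parent_pid fields.
void link_children(ProcMap& pm) {
    for (auto& kv : pm) {
        assert(kv.first == kv.second.pid);  // map key must match stored PID
        auto parent = pm.find(kv.second.parent_pid);
        if (parent != pm.end() && parent->first != kv.first) {
            parent->second.children.push_back(kv.first);
        }
    }
}

// Sum the CPU time of pid and all of its descendants.
// The visited flag stops the recursion if the "tree" contains a cycle.
// The flags are not reset, so rebuild the map before each new query.
double tree_cpu_time(ProcMap& pm, int pid) {
    auto it = pm.find(pid);
    if (it == pm.end() || it->second.visited) return 0.0;
    it->second.visited = true;
    double total = it->second.cpu_time;
    for (int child : it->second.children) {
        total += tree_cpu_time(pm, child);
    }
    return total;
}
```

A caller would populate the map with one entry per process, call
link_children() once, and then call tree_cpu_time() with the PID of interest.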


             On 07-Aug-2013 8:46 PM, Daniel Carrion wrote:

                 Hello

                 Iris has attached to my test project site. Looks like all tasks
                 exited out at task.cpu_time, as before.

                 See task stderr:
                 http://akira.onburde.net/burdetest/results.php?hostid=2&offset=0&show_names=0&state=6&appid=

                 I'm going to pull this section out to confirm that it is this
                 function call causing problems, and release a new version.

                 Cheers

                 Daniel

                 On Thu, Aug 8, 2013 at 7:43 AM, Daniel Carrion <[email protected]> wrote:

                 Oh and Thanks again David for helping out.


                 On Thu, Aug 8, 2013 at 7:42 AM, Daniel Carrion <[email protected]> wrote:

                 Hopefully we can get a few more occurrences in case that was a
                 one-off. If it occurs again I might just have it print before
                 and after that point and remove the "first 180 second"
                 threshold. Every 10 seconds won't make the stderr.txt file
                 too full.

                 I've got a test project with POGS-like tasks set up here:
                 <http://akira.onburde.net/burdetest/>. At the moment the worker
                 (fit_sed) is a dummy app that outputs like fit_sed but runs
                 shorter. I'll see if Iris wants to jump on there for a few
                 runs to see if it faults with short tasks as well. If pogstest
                 has to be brought down next week I can continue
                 troubleshooting on this site.

                 Cheers

                 Daniel


                 On Thu, Aug 8, 2013 at 7:12 AM, David Anderson <[email protected]> wrote:

                 That's a help. I'll take a close look at that code. -- David


                 On 07-Aug-2013 12:39 PM, Daniel Carrion wrote:

                 Hey David/Kevin

                 Please see attached. Managed to catch one. Looks like it was
                 calling task.cpu_time().

                 Watching out for more to see if it consistently happens at this
                 point. My test user base is minimal for this, so I have to wait
                 a bit. I'm pretty much relying on Iris' device and one other.

                 Regards

                 Daniel

                 On Wed, Aug 7, 2013 at 2:35 PM, Daniel Carrion <[email protected]> wrote:

                 Actually, I'll have the output cycle as the stuff we're
                 interested in is during the polling loop.


                 On Wed, Aug 7, 2013 at 2:13 PM, Daniel Carrion <[email protected]> wrote:

                 Hi David

                 I'm just testing this out now. A bit worried about the size of
                 this stderr.txt file on users' devices. Could end up over 50MB.
                 I'll give people a heads up before releasing.

                 Regards

                 Daniel


                 On Wed, Aug 7, 2013 at 4:31 AM, David Anderson <[email protected]> wrote:

                 The code below (execv()) is executed in the child process. It
                 looks like what's getting the SIGSEGV is the parent process.
                 I'd try putting a fprintf(stderr, ...) at the start and end of
                 TASK::poll(), and before the call to task.cpu_time(), and
                 before the call to boinc_report_app_start(). The problem is
                 likely in one of those. -- David
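
The marker technique suggested above might look something like the sketch
below. It is an assumption-laden illustration rather than the wrapper's actual
code: trace_point() and traced_call() are hypothetical helpers. The one detail
that matters is flushing after every marker, so the last line written before a
crash is not lost in a buffer.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: write one marker line and flush it immediately, so it
// reaches the file even if the very next statement crashes.
// Returns the number of characters written, or a negative value on error.
inline int trace_point(std::FILE* out, const char* where) {
    assert(out != nullptr && where != nullptr);
    int n = std::fprintf(out, "DEBUG: %s\n", where);
    std::fflush(out);
    return n;
}

// Hypothetical wrapper: bracket a suspect call with markers. If the first
// marker appears in the log without the second, the crash is inside fn().
template <typename Fn>
void traced_call(std::FILE* out, const char* name, Fn&& fn) {
    trace_point(out, name);
    fn();
    trace_point(out, "returned");
}
```

For example, a call such as task.cpu_time() could be wrapped as
traced_call(stderr, "cpu_time", [&] { /* suspect call here */ });
and the markers then moved inward one level at a time until the failing
statement is isolated.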


                 On 06-Aug-2013 7:47 AM, Daniel Carrion wrote:

                 Hey David/Kevin

                 Iris has run a couple of jobs on the test instances along with
                 a couple of others. Here's the output of one of his tasks:

                 http://54.208.29.24/pogstest/result.php?resultid=294

                 Seems as though it's crashing as it's going to execute task?

                 08:05:25 (5741): wrapper (main): task poll begin
                 08:05:25 (6818): wrapper (TASK::run): in child proc of the fork
                 08:05:25 (6818): wrapper (TASK::run): construct argv
                 08:05:25 (6818): wrapper (TASK::run): set up env variables
                 08:05:25 (6818): wrapper (TASK::run): executing app
                 SIGSEGV: segmentation violation

                 From modified code (TASK::run):

                     if (nvars > 0) {
                         set_up_env_vars(&env_vars, nvars);
                         fprintf(stderr, "%s wrapper (TASK::run): executing app with vars\n",
                             boinc_msg_prefix(buf, sizeof(buf))
                         );
                         retval = execve(app_path, argv, env_vars);
                     } else {
                         fprintf(stderr, "%s wrapper (TASK::run): executing app\n",
                             boinc_msg_prefix(buf, sizeof(buf))
                         );
                         retval = execv(app_path, argv);
                     }


                 I guess it could be coming from the wrapper during poll, but it
                 seems like it has something to do with fit_sed starting.
                 Possibly some memory allocation problems as the app is
                 starting? It's proving quite difficult to get dumps out of
                 people's phones.

                 I'm going to continue prodding and poking and try to get a
                 stack trace into stderr from the wrapper itself. I'll also try
                 different fit_sed compilations, including a C port.

                 I just wish I could fire up gdbserver on their phones and
                 attach over the internet =D.

                 Regards

                 Daniel

                 On Sat, Aug 3, 2013 at 12:23 AM, Daniel Carrion <[email protected]> wrote:

                 Thanks David, I'll have a look into that...

                 I'll have the "debug ready" wrapper to push onto the test
                 instance on Monday. Hopefully we can then grab a few users
                 having this problem to jump on and try it.

                 I'll have to think of a way of getting those crash dumps out
                 without having to use NativeBOINC. Some Android devices seem
                 to drop them in /data/tombstones, and most just log crash dump
                 data directly to the system log for viewing via logcat. Could
                 probably get the rooted users to open a terminal and leave
                 "logcat -s DEBUG > /sdcard/Download/logcat.txt" running before
                 jobs run so we get dumps in the attached format (this is just
                 me testing/causing a segfault). Non-rooted users can either
                 call this via adb or try an app that essentially does the same
                 thing.

                 Sorry, just me thinking out loud.

                 Regards

                 Daniel

                 On Fri, Aug 2, 2013 at 4:28 PM, David Anderson <[email protected]> wrote:



                 On 01-Aug-2013 11:15 PM, Daniel Carrion wrote:


                 David, is there anything else you suggest for debugging
                 purposes? E.g. catching SIGILL and SIGSEGV somehow?


                 That should do it; catching the signals would help only if we
                 can then generate a stack trace. In principle backtrace(3) can
                 do this, but it may not work on Android; see
                 http://stackoverflow.com/questions/10864882/stacktrace-arm-linux-gcc

                 -- David
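
As a rough sketch of the signal-catching idea discussed above, a handler can
be installed with sigaction() and restricted to async-signal-safe calls. The
names crash_handler() and install_crash_handler() are hypothetical, and the
sketch deliberately stops short of producing a stack trace: as noted in the
thread, backtrace(3) may not be available on Android, so where it could go is
only marked with a comment.

```cpp
#include <cassert>
#include <signal.h>
#include <unistd.h>

// Handler limited to async-signal-safe calls: report with write(2), then
// restore the default action and re-raise so the process still terminates
// with the original signal.
extern "C" void crash_handler(int signo) {
    const char msg[] = "wrapper: caught fatal signal\n";
    ssize_t ignored = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)ignored;
    // Where backtrace(3) exists (e.g. glibc), a stack dump could be
    // attempted here; that is an assumption, not something this sketch does.
    signal(signo, SIG_DFL);
    raise(signo);
}

// Hypothetical helper: install crash_handler for one signal.
// Returns true if sigaction() succeeded.
bool install_crash_handler(int signo) {
    assert(signo > 0);
    struct sigaction sa = {};
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(signo, &sa, nullptr) == 0;
}
```

Calling install_crash_handler(SIGSEGV) and install_crash_handler(SIGILL) early
in main() would at least put a timestamped marker in stderr.txt at the moment
of the crash, which can be combined with printf()-style markers to narrow down
the location.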

_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
