If it's reproducible and happens quickly,
one can find the location of the crash by putting in lots of printf()s,
although of course it's tedious.
I can help with this if needed.
-- David

On 07-Aug-2013 11:20 PM, Daniel Carrion wrote:
Hi David

Thanks for that explanation.

It would definitely be useful to get a stack trace. Unfortunately, only one
person with this problem is volunteering their phone time, and I cannot get an
Android stack trace from their phone: nothing shows up when they run logcat or
look in /data/tombstones. This seems to be a known problem with new Android
releases; I actually think they pulled it out. The NativeBOINC stack trace
(which uses ptrace, I think) doesn't give us anything useful. It would be handy
to attach gdbserver to the task, but that is probably not feasible. I'll
continue to think of ways to get a useful stack trace.

I released a version on pogstest that omits checking cpu_time, just to see if it
runs through on the problematic device. I will put more printf()s in the next
release. I'm guessing this time they go in the functions called in
lib/procinfo.cpp? I'll probably release this version to the test project site I
set up, as I can more easily control the worker task there (it runs shorter).

Regards

Daniel


On Thu, Aug 8, 2013 at 3:19 PM, David Anderson <[email protected]> wrote:

    I couldn't immediately see the problem.
    A stack trace would help;
    all we know is it's crashing somewhere in process_tree_cpu_time().

    Let me explain how this code works, in case anyone else wants to review it.
    The purpose of process_tree_cpu_time(pid) is to get the current CPU time
    of the given process and all its descendants.
    Unix doesn't provide an easy way to do this;
    getrusage() only reports on children that have exited.
    So instead we enumerate the set of all processes (using /proc),
    find the ones that are descendants of pid,
    and add up their CPU time.
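
As an illustration of the /proc scan described above, here is a minimal sketch
of pulling the parent PID and CPU ticks out of one /proc/<pid>/stat line. It is
not the code in lib/procinfo.cpp; the StatFields struct and parse_stat_line()
are hypothetical names, and the field positions (ppid is field 4, utime and
stime are fields 14 and 15) are taken from the proc(5) man page.

```cpp
#include <sstream>
#include <string>

// Hypothetical record for this sketch; not BOINC's PROCINFO.
struct StatFields {
    int pid = 0;
    int ppid = 0;
    unsigned long utime_ticks = 0;   // user-mode time, in clock ticks
    unsigned long stime_ticks = 0;   // kernel-mode time, in clock ticks
};

// Parse one line in the format of /proc/<pid>/stat.
// Returns false if the line is malformed.
bool parse_stat_line(const std::string& line, StatFields& out) {
    // The command name is wrapped in parentheses and may itself contain
    // spaces or ')', so locate the last ')' instead of splitting on spaces.
    std::string::size_type open = line.find('(');
    std::string::size_type close = line.rfind(')');
    if (open == std::string::npos || close == std::string::npos || close < open) {
        return false;
    }

    std::istringstream head(line.substr(0, open));
    if (!(head >> out.pid)) return false;

    std::istringstream rest(line.substr(close + 1));
    char state;
    if (!(rest >> state >> out.ppid)) return false;

    // Skip fields 5..13 (pgrp through cmajflt); tpgid can be negative.
    long long skipped;
    for (int i = 0; i < 9; i++) {
        if (!(rest >> skipped)) return false;
    }

    // Fields 14 and 15: utime and stime.
    return static_cast<bool>(rest >> out.utime_ticks >> out.stime_ticks);
}
```

Dividing the tick counts by sysconf(_SC_CLK_TCK) would give seconds. Keying the
parsed records by pid and linking each one to its ppid yields the kind of map
the functions listed below work on.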

    The functions involved are:
    procinfo_setup()
         scans /proc and builds a data structure,
         namely a std::map that maps pid to PROCINFO
         (a structure that describes a process, and which contains
         a std::vector of the PIDs of its children)
    procinfo_app()
         given a PID, sum the CPU time of its descendants.
         The summing is done by a recursive function add_child_totals().
         add_child_totals(), BTW, has a safeguard that prevents infinite
         recursion if the process tree has a cycle (i.e. is not actually
         a tree); this happens sometimes in Windows.
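
For anyone reviewing, the recursive summing with a cycle safeguard can be
sketched roughly as follows. This is a simplified stand-in, not the actual
procinfo code: ProcRecord, ProcMap, add_subtree() and tree_cpu_time() are
hypothetical names, and the safeguard here is a visited set, which may differ
from how add_child_totals() really does it. The sketch counts the root process
as well as its descendants.

```cpp
#include <map>
#include <set>
#include <vector>

// Hypothetical stand-in for PROCINFO, reduced to what the sketch needs.
struct ProcRecord {
    double cpu_time = 0.0;        // CPU seconds used by this process alone
    std::vector<int> children;    // PIDs of its direct children
};

using ProcMap = std::map<int, ProcRecord>;

// Add up the CPU time of pid and everything reachable below it.
// `visited` stops the recursion if the parent/child links form a cycle.
static double add_subtree(const ProcMap& procs, int pid, std::set<int>& visited) {
    if (!visited.insert(pid).second) return 0.0;   // already counted
    auto it = procs.find(pid);
    if (it == procs.end()) return 0.0;             // not in the scan
    double total = it->second.cpu_time;
    for (int child : it->second.children) {
        total += add_subtree(procs, child, visited);
    }
    return total;
}

// CPU time of root_pid plus all of its descendants.
double tree_cpu_time(const ProcMap& procs, int root_pid) {
    std::set<int> visited;
    return add_subtree(procs, root_pid, visited);
}
```

A PID that is listed as a child but missing from the map contributes zero
rather than being dereferenced, which is one of the places worth checking when
hunting a crash in this kind of code.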

    Daniel, if you want to put in more printf()s you could try to narrow
    the crash down more among these functions.

    -- David


    On 07-Aug-2013 8:46 PM, Daniel Carrion wrote:

        Hello

        Iris has attached to my test project site. It looks like all tasks
        exited at task.cpu_time, as before.

        See task stderr:
        
        http://akira.onburde.net/burdetest/results.php?hostid=2&offset=0&show_names=0&state=6&appid=

        I'm going to pull this section out to confirm that it is this function
        call causing problems, and release a new version.

        Cheers

        Daniel

        On Thu, Aug 8, 2013 at 7:43 AM, Daniel Carrion <[email protected]> wrote:

        Oh and Thanks again David for helping out.


        On Thu, Aug 8, 2013 at 7:42 AM, Daniel Carrion <[email protected]> wrote:

        Hopefully we can get a few more occurrences in case that was a one-off.
        If it occurs again I might just have it spit out before and after that
        point and remove the "first 180 second" threshold. Every 10 seconds
        won't make the stderr.txt file too full.

        I've got a test project with POGS-like tasks set up here:
        http://akira.onburde.net/burdetest/
        At the moment the worker (fit_sed) is a dummy app that outputs like
        fit_sed but runs shorter. I'll see if Iris wants to jump on there for a
        few runs to see if it faults with short tasks as well. If pogstest has
        to be brought down next week I can continue troubleshooting on this
        site.

        Cheers

        Daniel


        On Thu, Aug 8, 2013 at 7:12 AM, David Anderson <[email protected]> wrote:

        That's a help. I'll take a close look at that code. -- David


        On 07-Aug-2013 12:39 PM, Daniel Carrion wrote:

        Hey David/Kevin

        Please see attached. Managed to catch one. Looks like it was calling
        task.cpu_time().

        Watching out for more to see if it consistently happens at this point.
        My test user base is minimal for this so I have to wait a bit. I'm
        pretty much relying on Iris' device and one other.

        Regards

        Daniel

        On Wed, Aug 7, 2013 at 2:35 PM, Daniel Carrion <[email protected]> wrote:

        Actually, I'll have the output cycle as the stuff we're interested in is
        during the polling loop.


        On Wed, Aug 7, 2013 at 2:13 PM, Daniel Carrion <[email protected]> wrote:

        Hi David

        I'm just testing this out now. A bit worried about the size of this
        stderr.txt file on users' devices. Could end up over 50MB. I'll give
        people a heads-up before releasing.

        Regards

        Daniel


        On Wed, Aug 7, 2013 at 4:31 AM, David Anderson <[email protected]> wrote:

        The code below (execv()) is executed in the child process. It looks like
        what's getting the SIGSEGV is the parent process. I'd try putting a
        fprintf(stderr, ...) at the start and end of TASK::poll(), before the
        call to task.cpu_time(), and before the call to
        boinc_report_app_start(). The problem is likely in one of those.
        -- David


        On 06-Aug-2013 7:47 AM, Daniel Carrion wrote:

        Hey David/Kevin

        Iris has run a couple jobs on the test instances along with a couple of
        others. Here's the output of one of his tasks:

        http://54.208.29.24/pogstest/result.php?resultid=294

        Seems as though it's crashing as it's going to execute task?

        08:05:25 (5741): wrapper (main): task poll begin
        08:05:25 (6818): wrapper (TASK::run): in child proc of the fork
        08:05:25 (6818): wrapper (TASK::run): construct argv
        08:05:25 (6818): wrapper (TASK::run): set up env variables
        08:05:25 (6818): wrapper (TASK::run): executing app
        SIGSEGV: segmentation violation

         From modified code (TASK::run):

            if (nvars > 0) {
                set_up_env_vars(&env_vars, nvars);
                fprintf(stderr, "%s wrapper (TASK::run): executing app with vars\n",
                    boinc_msg_prefix(buf, sizeof(buf))
                );
                retval = execve(app_path, argv, env_vars);
            } else {
                fprintf(stderr, "%s wrapper (TASK::run): executing app\n",
                    boinc_msg_prefix(buf, sizeof(buf))
                );
                retval = execv(app_path, argv);
            }


        I guess it could be coming from the wrapper during poll, but it seems
        like it has something to do with fit_sed starting. Possibly some memory
        allocation problems as the app is starting? It's proving quite difficult
        to get dumps out of people's phones.

        I'm going to continue prodding and poking and try to get a stack trace
        into stderr from the wrapper itself. I'll also try different fit_sed
        compilations, including a C port.

        I just wish I could fire up gdbserver on their phones and attach over
        the internet =D.

        Regards

        Daniel

        On Sat, Aug 3, 2013 at 12:23 AM, Daniel Carrion <[email protected]> wrote:

        Thanks David, I'll have a look into that...

        I'll have the "debug ready" wrapper to push onto the test instance on
        Monday. Hopefully we can then grab a few users having this problem to
        jump on and try it.

        I'll have to think of a way of getting those crash dumps out without
        having to use NativeBOINC. Some Android devices seem to drop them in
        /data/tombstones, and most just log crash dump data directly to the
        system log for viewing via logcat. Could probably get the rooted users
        to open a terminal and leave
        "logcat -s DEBUG > /sdcard/Download/logcat.txt" running before jobs run,
        so we get dumps in the attached format (this is just me testing/causing
        a segfault). Non-rooted users can either call this via adb or try an app
        that essentially does the same thing.

        Sorry, just me thinking out loud.

        Regards

        Daniel

        On Fri, Aug 2, 2013 at 4:28 PM, David Anderson <[email protected]> wrote:



        On 01-Aug-2013 11:15 PM, Daniel Carrion wrote:


        David, is there anything else you suggest for debugging purpose? E.g.
        Catching SIGILL and SIGSEGV somehow?


        That should do it; catching the signals would help only if we can then
        generate a stack trace. In principle backtrace(3) can do this, but it
        may not work on Android, see
        http://stackoverflow.com/questions/10864882/stacktrace-arm-linux-gcc

        -- David
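
For reference, a rough sketch of the catch-the-signal-and-dump approach on a
system where backtrace(3) is available. It assumes a glibc-style <execinfo.h>,
which, as noted above, may not exist or work on Android. crash_handler() and
install_crash_handlers() are hypothetical names, not wrapper or BOINC API
functions.

```cpp
#include <execinfo.h>   // backtrace(), backtrace_symbols_fd(): assumes glibc
#include <signal.h>
#include <unistd.h>

// Hypothetical handler: dump the raw stack frames to stderr, then let the
// default action terminate the process.
extern "C" void crash_handler(int sig) {
    static const char msg[] = "fatal signal caught, backtrace follows\n";
    ssize_t unused = write(STDERR_FILENO, msg, sizeof(msg) - 1);
    (void)unused;

    void* frames[64];
    int n = backtrace(frames, 64);
    // Writes straight to the descriptor instead of returning malloc'd strings.
    backtrace_symbols_fd(frames, n, STDERR_FILENO);

    // SA_RESETHAND has already restored the default action; re-raise so the
    // process still dies with the original signal.
    raise(sig);
}

// Install the handler for SIGSEGV and SIGILL. Returns false on failure.
bool install_crash_handlers() {
    // Call backtrace() once up front so any lazy library loading happens
    // outside the signal handler.
    void* warmup[1];
    backtrace(warmup, 1);

    struct sigaction sa = {};
    sa.sa_handler = crash_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESETHAND;

    const int signals[] = {SIGSEGV, SIGILL};
    for (int sig : signals) {
        if (sigaction(sig, &sa, nullptr) != 0) return false;
    }
    return true;
}
```

The output is a list of addresses (with symbol names only where the binary
exports them), so the unstripped executable is still needed to turn the trace
into function names and line numbers.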










_______________________________________________
boinc_dev mailing list
[email protected]
http://lists.ssl.berkeley.edu/mailman/listinfo/boinc_dev
To unsubscribe, visit the above URL and
(near bottom of page) enter your email address.
