Even with NFSv3? It seems like fsid=0 is required for NFSv4, but does it have any impact on NFSv3? Honestly, I'm not an expert on the details of NFS. For me, it has always "just worked", and performance was never an issue, so I never had much reason to dig into the details of tweaking, debugging, or optimizing NFS.

Prentice

On 04/19/2017 02:07 PM, John Hanks wrote:
I've had far fewer unexplained (although admittedly there was a limited search for the guilty) NFS issues since I started using fsid= in my NFS exports. If you aren't setting that it might be worth a try. NFS seems to be much better at recovering from problems with an fsid assigned to the root of exports.
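
For anyone who wants to try this, here is a minimal sketch (not from the thread) that lists exports lacking an fsid= option. It assumes the plain "path client(options)" layout described in exports(5) and ignores quoted paths, line continuations, and default-option lines. The sample paths, subnet, host name, and the function name are hypothetical.

```python
def exports_missing_fsid(exports_text):
    """Return (path, client) pairs whose option list has no fsid= setting."""
    missing = []
    for raw in exports_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        parts = line.split(None, 1)
        path = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        for spec in rest.split():
            # "client(opt1,opt2)" -> client name and a list of options
            client, _, opts = spec.partition("(")
            options = opts.rstrip(")").split(",") if opts else []
            if not any(opt.startswith("fsid=") for opt in options):
                missing.append((path, client))
    return missing


# Hypothetical exports content, for illustration only.
SAMPLE = """
# node-local scratch space
/local   192.168.1.0/24(rw,sync,fsid=1)
/scratch 192.168.1.0/24(rw,sync) node01(ro)
"""

for path, client in exports_missing_fsid(SAMPLE):
    print(f"{path} exported to {client} without fsid=")
```

On a real server the text would come from /etc/exports. According to exports(5), fsid= accepts a small integer, a UUID, or "root" (equivalent to 0).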

jbh

On Wed, Apr 19, 2017 at 8:58 PM Prentice Bisbal <pbis...@pppl.gov> wrote:

    Here's the sequence of events:

    1. First job(s) run fine on the node and complete without error.

    2. Eventually a job fails with a 'permission denied' error when it
    tries to access /l/hostname.

    Since no jobs fail with a file I/O error, it's hard to confirm that
    the jobs themselves are causing the problem. However, if these
    particular jobs are the only thing running on the cluster and should
    be the only jobs accessing these NFS shares, what else could be
    causing the errors?
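
One way to narrow this down is to record the exact errno a node sees when it touches the automounted path, since "permission denied" (EACCES) and a stale file handle (ESTALE) are different errors at the system-call level. The following is a minimal sketch, not something used in the thread. The function name `probe` is hypothetical, and /l/hostname is the placeholder path from the message above.

```python
import errno
import os


def probe(path):
    """Classify what happens when this process tries to list `path`."""
    try:
        os.listdir(path)
    except OSError as exc:
        if exc.errno == errno.ESTALE:
            return "stale file handle"
        if exc.errno in (errno.EACCES, errno.EPERM):
            return "permission denied"
        if exc.errno == errno.ENOENT:
            return "missing"
        # Fall back to the symbolic errno name, e.g. "EIO".
        return errno.errorcode.get(exc.errno, "unknown error")
    return "ok"


if __name__ == "__main__":
    # Placeholder path from the thread; substitute a real node name.
    print(probe("/l/hostname"))
```

Running this from a job prolog or a cron entry on each node, and logging the result with a timestamp, could show whether the failure first appears as ESTALE or as EACCES, and when.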

    All these systems are getting their user information from LDAP. Since
    some jobs run before these errors appear, lack of, or inaccurate user
    info doesn't seem to be a likely source of this problem, but I'm not
    ruling anything out at this point.
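
Because NFSv3 with the default AUTH_SYS security flavor sends numeric UIDs and GIDs rather than names, one way to rule user information in or out is to compare what each node resolves for the same account. A minimal sketch under that assumption follows. The function name `identity` and the account name "someuser" are hypothetical, and the lookups go through whatever NSS sources the node is configured with (LDAP here).

```python
import os
import pwd


def identity(username):
    """Return the numeric identity resolved for `username`, or None."""
    try:
        entry = pwd.getpwnam(username)
    except KeyError:
        return None  # the account could not be resolved on this node
    groups = sorted(os.getgrouplist(username, entry.pw_gid))
    return {"uid": entry.pw_uid, "gid": entry.pw_gid, "groups": groups}


if __name__ == "__main__":
    # "someuser" is a placeholder; use the account the jobs run as.
    print(identity("someuser"))
```

If the output differs between the client and the node exporting /local, or is None on either side at the time of a failure, that would point toward the directory service rather than NFS itself.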

    Important detail: This is NFSv3.

    Prentice Bisbal
    Lead Software Engineer
    Princeton Plasma Physics Laboratory
    http://www.pppl.gov

    On 04/19/2017 12:20 PM, Ryan Novosielski wrote:
    > Are you saying they can’t mount the filesystem, or they can’t
    > write to a mounted filesystem? Where does this system get its user
    > information from, if the latter?
    >
    > --
    > ____
    > || \\UTGERS,     |---------------------------*O*---------------------------
    > ||_// the State  | Ryan Novosielski - novos...@rutgers.edu
    > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
    > ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
    >      `'
    >
    >> On Apr 19, 2017, at 12:09, Prentice Bisbal <pbis...@pppl.gov> wrote:
    >>
    >> Beowulfers,
    >>
    >> I've been trying to troubleshoot a problem for the past two
    >> weeks with no luck. We have a cluster here that runs only one
    >> application (although the details of that application change
    >> significantly from run to run). Each node in the cluster has an
    >> NFS export, /local, that can be automounted by every other node in
    >> the cluster as /l/hostname.
    >>
    >> Starting about two weeks ago, when jobs would try to access
    >> /l/hostname, they would get permission denied messages. I tried
    >> analyzing this problem by turning on all NFS/RPC logging with
    >> rpcdebug and also using tcpdump while trying to manually mount one
    >> of the remote systems. Both approaches indicated that stale file
    >> handles were preventing the share from being mounted.
    >>
    >> Since it has been 6-8 weeks since there were any seemingly
    >> relevant system config changes, I suspect it's an application
    >> problem (naturally). On the other hand, the application
    >> developers/users insist that they haven't made any changes to
    >> their code, either. To be honest, there's no significant evidence
    >> indicating either is at fault. Any suggestions on how to debug
    >> this and definitively find the root cause of these stale file handles?
    >>
    >> --
    >> Prentice
    >> _______________________________________________
    >> Beowulf mailing list, Beowulf@beowulf.org
    <mailto:Beowulf@beowulf.org> sponsored by Penguin Computing
    >> To change your subscription (digest mode or unsubscribe) visit
    http://www.beowulf.org/mailman/listinfo/beowulf


--
‘[A] talent for following the ways of yesterday, is not sufficient to improve the world of today.’
 - King Wu-Ling, ruler of the Zhao state in northern China, 307 BC
