On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov <[email protected]> wrote:
>
> I suspect that the validate action is run as a non-root user.

As far as I know, both the direct command and crm_resource **should** be
running the agent as the same user, as long as Madison is running both
commands as the same user.

For what it's worth, I copied your test script to my machine (Fedora 36
using the current upstream main of Pacemaker) and it worked fine both
directly and via crm_resource. At the moment I'm not able to dig very
deeply, but I do wonder if it's either a bug that's since been fixed, or
perhaps an environment issue. To try to rule out the former, do you have a
test environment where you can try to reproduce it on the latest Pacemaker
from upstream?

>
> Madison Kelly <[email protected]> wrote on January 11, 2023 at 07:06:55:
>
>> On 2023-01-11 00:21, Madison Kelly wrote:
>>>
>>> On 2023-01-11 00:14, Madison Kelly wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Edit: Last message was in HTML format, sorry about that.
>>>>
>>>> I've got a hell of a weird problem, and I am absolutely stumped on
>>>> what's going on.
>>>>
>>>> The short of it is; if my RA is called from the command line, it's
>>>> fine. If a resource exists, monitor, enable, disable, all that stuff
>>>> works just fine. If I try to create a resource, it hangs on the
>>>> validate stage. Specifically, it hangs when 'pcs' calls:
>>>>
>>>> crm_resource --validate --output-as xml --class ocf --agent server
>>>> --provider alteeve --option name=<resource_name>
>>>>
>>>> Specifically, it hangs when it tries to make a shell call (to
>>>> virsh, specifically, but that doesn't matter). So to debug, I started
>>>> stripping down my RA simpler and simpler until I was left with the
>>>> very most basic of programs;
>>>>
>>>> https://pastebin.com/VtSpkwMr
>>>>
>>>> That is literally the simplest program I could write that made the
>>>> shell call. The 'open()' call is where it hangs.
>>>>
>>>> When I call directly;
>>>>
>>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server
>>>> srv04-test; echo rc:$?
>>>>
>>>> ====
>>>> real 0m0.061s
>>>> user 0m0.037s
>>>> sys 0m0.014s
>>>> rc:0
>>>> ====
>>>>
>>>> It's just fine. I can see in the log the output from the 'virsh' call
>>>> as well. However, when I call from crm_resource;
>>>>
>>>> time crm_resource --validate --output-as xml --class ocf --agent
>>>> server --provider alteeve --option name=srv04-test; echo rc:$?
>>>>
>>>> ====
>>>> <pacemaker-result api-version="2.25" request="crm_resource --validate
>>>> --output-as xml --class ocf --agent server --provider alteeve --option
>>>> name=srv04-test">
>>>> <resource-agent-action action="validate" class="ocf" type="server"
>>>> provider="alteeve">
>>>> <overrides/>
>>>> <agent-status code="1" message="error" execution_code="2"
>>>> execution_message="Timed Out" reason="Resource agent did not exit
>>>> within specified timeout"/>
>>>> </resource-agent-action>
>>>> <status code="1" message="Error occurred">
>>>> <errors>
>>>> <error>crm_resource: Error performing operation: Error
>>>> occurred</error>
>>>> </errors>
>>>> </status>
>>>> </pacemaker-result>
>>>>
>>>> real 0m20.521s
>>>> user 0m0.022s
>>>> sys 0m0.010s
>>>> rc:1
>>>> ====
>>>>
>>>> In the log file, I see (from line 20 of the super-simple-test-script):
>>>>
>>>> ====
>>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1;
>>>> /usr/bin/echo return_code:0 |]
>>>> ====
>>>>
>>>> Then nothing else.
>>>>
>>>> The strace output is: https://pastebin.com/raw/UCEUdBeP
>>>>
>>>> Environment;
>>>>
>>>> * selinux is permissive
>>>> * Pacemaker 2.1.5-4.el8
>>>> * pcs 0.10.15
>>>> * 4.18.0-408.el8.x86_64
>>>> * CentOS Stream release 8
>>>>
>>>> Any help is appreciated, I am stumped. :/
>>>
>>>
>>> After sending this, I tried having my "RA" call 'hostname', and that
>>> worked fine. I switched back to 'virsh list --all', and that hangs. So
>>> it seems to somehow be related to call 'virsh' specifically.
>>>
>>
>> OK, so more info...
>> Knowing now that it's a problem with the virsh call
>> specifically (but only when validating, existing VMs monitor, enable,
>> disable fine, all which repeatedly call virsh), I noticed a few things.
>>
>> First, I see in the logs:
>>
>> ====
>> Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data:
>> Connection reset by peer
>> ====
>>
>> So with this, I further simplified my test script to this:
>>
>> https://pastebin.com/Ey8FdL1t
>>
>> Then when I ran my test script directly, the strace output is:
>>
>> Good: https://pastebin.com/Trbq67ub
>>
>> When my script is called via crm_resource, the strace is this:
>>
>> Bad: https://pastebin.com/jtbzHrUM
>>
>> The first difference I can see happens around line 929 in the good
>> paste, the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0"
>> exists, which doesn't in the bad paste. Shortly after, I start seeing:
>>
>> ====
>> line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
>> line: [brk(NULL) = 0x562b7877d000]
>> line: [brk(0x562b787aa000) = 0x562b787aa000]
>> line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
>> ====
>>
>> Around line 959 in the bad paste. There are more brk() lines, and not
>> long after the output stops.
>>
>> --
>> Madison Kelly
>> Alteeve's Niche!
>> Chief Technical Officer
>> c: +1-647-471-0951
>> https://alteeve.com/
>>
>> _______________________________________________
>> Manage your subscription:
>> https://lists.clusterlabs.org/mailman/listinfo/users
>>
>> ClusterLabs home: https://www.clusterlabs.org/

--
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker
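
One way to check the "same user" and "environment issue" questions raised in this thread would be to have the test script record who it is running as and with what environment, then compare a direct run against a crm_resource run. The following is only a sketch: the function name dump_context, the RA_CONTEXT_FILE variable, and the /tmp output path are hypothetical and are not part of Pacemaker, pcs, or the original agent.

```shell
#!/bin/sh
# Hypothetical debugging helper (not part of Pacemaker or the original RA).
# Writes the effective uid/gid, working directory, whether stdin is a
# terminal, and the sorted environment to the file named in $1.
dump_context() {
    out="$1"
    {
        echo "uid=$(id -u) gid=$(id -g)"
        echo "cwd=$(pwd)"
        # stdin is not redirected here, so this reflects the caller's stdin
        if [ -t 0 ]; then
            echo "stdin_is_tty=yes"
        else
            echo "stdin_is_tty=no"
        fi
        echo "--- environment ---"
        env | sort
    } > "$out"
}

# Example call: RA_CONTEXT_FILE and the /tmp default are assumptions made
# for illustration; $$ keeps the two runs from overwriting each other.
dump_context "${RA_CONTEXT_FILE:-/tmp/ra-context.$$.txt}"
```

Calling something like this at the start of the validate path, once directly and once through crm_resource, leaves two files that can be compared with diff. If the uid lines match, the non-root theory is less likely; differences in the environment section or the stdin line would point more toward an environment issue.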
