On Wed, Jan 11, 2023 at 12:48 PM Madison Kelly <[email protected]> wrote:
>
> On 2023-01-11 01:59, Reid Wahl wrote:
> > On Tue, Jan 10, 2023 at 10:14 PM Vladislav Bogdanov
> > <[email protected]> wrote:
> >>
> >> I suspect that the validate action is run as a non-root user.
> >
> > As far as I know, both the direct command and crm_resource **should**
> > be running the agent as the same user, as long as Madison is running
> > both commands as the same user.
> >
> > For what it's worth, I copied your test script to my machine (Fedora
> > 36 using the current upstream main of Pacemaker) and it worked fine
> > both directly and via crm_resource. At the moment I'm not able to dig
> > very deeply, but I do wonder if it's either a bug that's since been
> > fixed, or perhaps an environment issue.
> >
> > To try to rule out the former, do you have a test environment where
> > you can try to reproduce it on the latest Pacemaker from upstream?
>
> I built the pacemaker source RPM from Fedora 37, then realized I'm
> already running 2.1.5 on CS8, so I'm already on the latest release.
> Looking at git, 2.1.5 is the latest tagged release... Are you running
> newer than that?

I'm running on the current main, which contains commits that came
after the 2.1.5 release. I don't really expect this to be a Pacemaker
bug, especially with how recent your version is, but I would like to
rule that out if possible.

>
> >> On January 11, 2023 at 07:06:55, Madison Kelly <[email protected]> wrote:
> >>
> >>> On 2023-01-11 00:21, Madison Kelly wrote:
> >>>>
> >>>> On 2023-01-11 00:14, Madison Kelly wrote:
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> Edit: Last message was in HTML format, sorry about that.
> >>>>>
> >>>>> I've got a hell of a weird problem, and I am absolutely stumped on
> >>>>> what's going on.
> >>>>>
> >>>>> The short of it is; if my RA is called from the command line, it's
> >>>>> fine. If a resource exists, monitor, enable, disable, all that stuff
> >>>>> works just fine. If I try to create a resource, it hangs on the
> >>>>> validate stage. Specifically, it hangs when 'pcs' calls:
> >>>>>
> >>>>> crm_resource --validate --output-as xml --class ocf --agent server
> >>>>> --provider alteeve --option name=<resource_name>
> >>>>>
> >>>>> Specifically, it hangs when it tries to make a shell call (to
> >>>>> virsh, specifically, but that doesn't matter). So to debug, I started
> >>>>> stripping down my RA simpler and simpler until I was left with the
> >>>>> very most basic of programs;
> >>>>>
> >>>>> https://pastebin.com/VtSpkwMr
> >>>>>
> >>>>> That is literally the simplest program I could write that made the
> >>>>> shell call. The 'open()' call is where it hangs.
> >>>>>
> >>>>> When I call directly;
> >>>>>
> >>>>> time /usr/lib/ocf/resource.d/alteeve/server --validate-all --server
> >>>>> srv04-test; echo rc:$?
> >>>>>
> >>>>> ====
> >>>>> real 0m0.061s
> >>>>> user 0m0.037s
> >>>>> sys 0m0.014s
> >>>>> rc:0
> >>>>> ====
> >>>>>
> >>>>> It's just fine. I can see in the log the output from the 'virsh' call
> >>>>> as well.
> >>>>> However, when I call from crm_resource;
> >>>>>
> >>>>> time crm_resource --validate --output-as xml --class ocf --agent
> >>>>> server --provider alteeve --option name=srv04-test; echo rc:$?
> >>>>>
> >>>>> ====
> >>>>> <pacemaker-result api-version="2.25" request="crm_resource --validate
> >>>>> --output-as xml --class ocf --agent server --provider alteeve --option
> >>>>> name=srv04-test">
> >>>>> <resource-agent-action action="validate" class="ocf" type="server"
> >>>>> provider="alteeve">
> >>>>> <overrides/>
> >>>>> <agent-status code="1" message="error" execution_code="2"
> >>>>> execution_message="Timed Out" reason="Resource agent did not exit
> >>>>> within specified timeout"/>
> >>>>> </resource-agent-action>
> >>>>> <status code="1" message="Error occurred">
> >>>>> <errors>
> >>>>> <error>crm_resource: Error performing operation: Error
> >>>>> occurred</error>
> >>>>> </errors>
> >>>>> </status>
> >>>>> </pacemaker-result>
> >>>>>
> >>>>> real 0m20.521s
> >>>>> user 0m0.022s
> >>>>> sys 0m0.010s
> >>>>> rc:1
> >>>>> ====
> >>>>>
> >>>>> In the log file, I see (from line 20 of the super-simple-test-script):
> >>>>>
> >>>>> ====
> >>>>> Calling: [/usr/bin/virsh dumpxml --inactive srv04-test 2>&1;
> >>>>> /usr/bin/echo return_code:0 |]
> >>>>> ====
> >>>>>
> >>>>> Then nothing else.
> >>>>>
> >>>>> The strace output is: https://pastebin.com/raw/UCEUdBeP
> >>>>>
> >>>>> Environment;
> >>>>>
> >>>>> * selinux is permissive
> >>>>> * Pacemaker 2.1.5-4.el8
> >>>>> * pcs 0.10.15
> >>>>> * 4.18.0-408.el8.x86_64
> >>>>> * CentOS Stream release 8
> >>>>>
> >>>>> Any help is appreciated, I am stumped. :/
> >>>>
> >>>>
> >>>> After sending this, I tried having my "RA" call 'hostname', and that
> >>>> worked fine. I switched back to 'virsh list --all', and that hangs. So
> >>>> it seems to somehow be related to call 'virsh' specifically.
> >>>>
> >>>
> >>> OK, so more info...
> >>> Knowing now that it's a problem with the virsh call
> >>> specifically (but only when validating, existing VMs monitor, enable,
> >>> disable fine, all which repeatedly call virsh), I noticed a few things.
> >>>
> >>> First, I see in the logs:
> >>>
> >>> ====
> >>> Jan 11 00:30:43 mk-a07n02.digimer.ca libvirtd[2937]: Cannot recv data:
> >>> Connection reset by peer
> >>> ====
> >>>
> >>> So with this, I further simplified my test script to this:
> >>>
> >>> https://pastebin.com/Ey8FdL1t
> >>>
> >>> Then when I ran my test script directly, the strace output is:
> >>>
> >>> Good: https://pastebin.com/Trbq67ub
> >>>
> >>> When my script is called via crm_resource, the strace is this:
> >>>
> >>> Bad: https://pastebin.com/jtbzHrUM
> >>>
> >>> The first difference I can see happens around line 929 in the good
> >>> paste, the line "futex(0x7f48b0001ca0, FUTEX_WAKE_PRIVATE, 1) = 0"
> >>> exists, which doesn't in the bad paste. Shortly after, I start seeing:
> >>>
> >>> ====
> >>> line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
> >>> line: [brk(NULL) = 0x562b7877d000]
> >>> line: [brk(0x562b787aa000) = 0x562b787aa000]
> >>> line: [write(4, "\1\0\0\0\0\0\0\0", 8) = 8]
> >>> ====
> >>>
> >>> Around line 959 in the bad paste. There are more brk() lines, and not
> >>> long after the output stops.
> >>>
> >>> --
> >>> Madison Kelly
> >>> Alteeve's Niche!
> >>> Chief Technical Officer
> >>> c: +1-647-471-0951
> >>> https://alteeve.com/
> >>>
> >>> _______________________________________________
> >>> Manage your subscription:
> >>> https://lists.clusterlabs.org/mailman/listinfo/users
> >>>
> >>> ClusterLabs home: https://www.clusterlabs.org/
> >>
> >>
> >> _______________________________________________
> >> Manage your subscription:
> >> https://lists.clusterlabs.org/mailman/listinfo/users
> >>
> >> ClusterLabs home: https://www.clusterlabs.org/
> >
> >
> >
>
> --
> Madison Kelly
> Alteeve's Niche!
> Chief Technical Officer
> c: +1-647-471-0951
> https://alteeve.com/
>

--
Regards,

Reid Wahl (He/Him)
Senior Software Engineer, Red Hat
RHEL High Availability - Pacemaker
_______________________________________________
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/
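
One way to test the "environment issue" theory raised above is to record the process context from inside the agent in both runs (direct and via crm_resource) and compare the two. The following is only a rough sketch in Python, not anything Pacemaker or the agent in this thread provides: `snapshot_context`, `dump_context`, and `diff_context` are hypothetical helper names, the `/proc/self/fd` lookup assumes Linux, and an agent written in another language would need to collect the same fields with its own facilities.

```python
import json
import os


def snapshot_context():
    """Collect details that often differ between a shell run and a wrapped run."""
    fd_dir = "/proc/self/fd"  # Linux-specific assumption; skipped if absent
    fds = {}
    if os.path.isdir(fd_dir):
        for name in sorted(os.listdir(fd_dir), key=int):
            try:
                fds[name] = os.readlink(os.path.join(fd_dir, name))
            except OSError:
                # The descriptor used to list the directory is already closed.
                continue
    return {
        "uid": os.getuid(),
        "euid": os.geteuid(),
        "cwd": os.getcwd(),
        "stdin_is_tty": os.isatty(0),
        "env": dict(os.environ),
        "fds": fds,
    }


def dump_context(path):
    """Write the current snapshot to `path` as JSON (hypothetical helper)."""
    with open(path, "w") as handle:
        json.dump(snapshot_context(), handle, indent=2, sort_keys=True)


def diff_context(a, b):
    """Return {key: (a_value, b_value)} for every entry that differs."""
    changes = {}
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key), b.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Compare nested maps (env, fds) entry by entry.
            for sub in sorted(set(left) | set(right)):
                if left.get(sub) != right.get(sub):
                    changes[f"{key}.{sub}"] = (left.get(sub), right.get(sub))
        elif left != right:
            changes[key] = (left, right)
    return changes
```

Calling `dump_context()` with a different output path in each run, then loading both files and passing them to `diff_context()`, would show whether the user, the environment variables, or the inherited file descriptors differ between the two invocations. It would not by itself explain the hang, only narrow down where to look.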
