https://bugs.kde.org/show_bug.cgi?id=458915

            Bug ID: 458915
           Summary: syscall sometimes returns its number instead of return
                    code
           Product: valgrind
           Version: 3.18.1
          Platform: Ubuntu Packages
                OS: Linux
            Status: REPORTED
          Severity: normal
          Priority: NOR
         Component: general
          Assignee: jsew...@acm.org
          Reporter: libor.pel...@nic.cz
  Target Milestone: ---

We test our software (Knot DNS) heavily with Valgrind (on the Linux/x86_64
architecture). Very rarely, we encounter strange failures when running under
Valgrind. I have tracked them down far enough to see what happens:

When my application makes a syscall (e.g. futex), the syscall number (202 in
this case) is returned instead of the proper return code (e.g. -110 ==
-ETIMEDOUT). In this case, it leads to an assertion failure inside glibc.
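
The symptom described above can be checked for with a small probe. The following is only a sketch, not code from Knot DNS or glibc: the helper names are hypothetical, and it assumes Linux with the libc syscall() wrapper, which reports errors as -1 plus errno, so the result is converted back to the kernel-style negative code mentioned above.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <linux/futex.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: true if a syscall result equals the syscall number,
 * which is the symptom described in this report. */
static int looks_like_syscall_number(long result, long sysno)
{
    return result == sysno;
}

/* Hypothetical helper: wait 1 ms on a private futex word that nobody wakes.
 * Returns the result in kernel style (negative errno on failure).
 * The expected outcome is -ETIMEDOUT. */
static long futex_wait_kernel_style(void)
{
    uint32_t word = 0;
    struct timespec timeout = { .tv_sec = 0, .tv_nsec = 1000000 };

    /* The word still holds 0, so the kernel blocks until the timeout. */
    long ret = syscall(SYS_futex, &word, FUTEX_WAIT_PRIVATE, 0,
                       &timeout, NULL, 0);
    return (ret == -1) ? -(long)errno : ret;
}
```

Calling futex_wait_kernel_style() in a loop from several threads and testing each result with looks_like_syscall_number(result, SYS_futex) would flag the faulty case without relying on a glibc assertion. Whether such a loop actually triggers the problem is not known.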

Similar behavior has been observed in the past. For example, we created
a workaround for poll() occasionally returning the value 7 under Valgrind,
which we did not understand at the time. Indeed, the syscall number of poll
is 7. We also have another workaround for epoll_wait() sometimes returning 232
(again, its syscall number) under Valgrind.
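
For illustration, a workaround of this kind could look like the sketch below. This is not the actual Knot DNS workaround; the function names and the retry policy are assumptions. It relies only on the documented fact that poll() returns -1, 0, or a count no larger than nfds, so any larger value is treated as bogus.

```c
#include <poll.h>

/* Hypothetical check: poll() may legitimately return -1 (error), 0 (timeout),
 * or the number of ready descriptors, which can never exceed nfds. */
static int poll_result_is_plausible(int ret, nfds_t nfds)
{
    if (ret < -1) {
        return 0;
    }
    return ret <= 0 || (nfds_t)ret <= nfds;
}

/* Hypothetical wrapper: retry poll() a bounded number of times when the
 * result is implausible. Note that each retry restarts the full timeout. */
static int checked_poll(struct pollfd *fds, nfds_t nfds, int timeout_ms,
                        int max_retries)
{
    int ret;
    int attempts = 0;

    do {
        ret = poll(fds, nfds, timeout_ms);
    } while (!poll_result_is_plausible(ret, nfds) && attempts++ < max_retries);

    return ret;
}
```

This check cannot catch the case where 7 or more descriptors are polled, because 7 is then a legitimate result. It only masks the symptom and does nothing about the cause.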

In any case, this happens extremely rarely: the program must run under
Valgrind for minutes to trigger it, and it can usually be reproduced only on a
specific machine. Running the same workload on a different machine with the
same system usually does not reproduce it, or does so far more rarely. It is
probably related to thread timing or something similar.

Looking at the Valgrind source code, this led me to the file
coregrind/m_syswrap/syscall-amd64-linux.S, but I cannot see any obvious
bug in this assembly code. What intrigues me is that both the syscall number
and the return value are held in the RAX register at some point.
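
To make the double role of RAX concrete, here is a minimal sketch of the x86_64 Linux syscall convention. It is not taken from Valgrind; raw_syscall0 is a hypothetical name, and on other architectures it simply falls back to the libc wrapper. RAX carries the syscall number into the kernel and the return value back out, so a caller that sees RAX as it was before the kernel overwrote it would observe exactly the syscall number.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a syscall that takes no arguments. */
static long raw_syscall0(long number)
{
#if defined(__x86_64__)
    long ret;
    /* Input:  RAX = syscall number.
     * Output: RAX = return value (negative errno on failure).
     * The syscall instruction clobbers RCX and R11. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(number)
                      : "rcx", "r11", "memory");
    return ret;
#else
    /* Fallback for other architectures: use the libc wrapper. */
    return syscall(number);
#endif
}
```

For example, raw_syscall0(SYS_getpid) should return the same value as getpid(). This only shows why the number and the result share a register; it does not establish where in Valgrind the wrong value comes from.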

I tried examining strace output, but it gave me no hint. From the
kernel/strace perspective, the syscalls return the proper codes. The issue
must arise somewhere between the kernel and the application.

Thank you for any help and hints. I'd be happy to add more information as
requested, or run some experimental code.
