YottaDB is a key-value NoSQL databaase with a C API. To make it more
friendly to Golang users, we have created a Go wrapper (see
https://godoc.org/lang.yottadb.com/go/yottadb). A Rust wrapper is in
development, with other languages to follow.
With the Golang wrapper, we have run into a variety of panics and crashes
that take several forms. Those forms in order of frequency are as follows:
-
Panics due to “sweep increased allocation count”.
-
Panics due to invalid address (bad pointer) in Go heap.
-
SIG-11 failures in various places usually dealing with garbage
collection.
-
Panics due to “unknown return address” when trying to return from
functions.
-
Mutex deadlocks (everyone waiting and nobody holding) and HUGE memory
usage.
Researching these issues notes that the issue is “usually” either due to
bad cgo code or “race” conditions. The problems get MUCH worse when GOGC is
set to 1. Running with either of the options “-msan” or “-race” has not
flagged any issues. I am using Golang 1.12.5 on Ubuntu 18.04.
A bit of background:
The C API that we are wrapping has parameters that are pointers to
structures that themselves contain pointers to information. Consequently,
we decided to use Golang structures mostly as anchor points to C allocation
memory holding the actual structures since cgo does not allow Golang
structures passed to C to have pointers to Golang memory. Since Golang is
not a language where a coder needs to track memory and release it, we make
use of Finalizers when the C storage is allocated so when the base Golang
structure is garbage collected, a function is run that releases the
allocated C memory attached to it. So far so good.
Here is the basic C structure that this code uses (C.ydb_buffer_t):
typedef struct{ unsigned int len_alloc; unsigned int len_used;
char *buf_addr;} ydb_buffer_t;
Here are the related Golang structures:
type BufferT struct { cbuft *internalBufferT // Allows BufferT to be
copied via assignment ownsBuff bool // If true, we should clean the
cbuft when Free'd}
type internalBufferT struct { cbuft *C.ydb_buffer_t // C flavor of the
ydb_buffer_t struct}
type BufferTArray struct { cbuftary *internalBufferTArray}
type internalBufferTArray struct { elemUsed uint32 // Number
of elements used elemAlloc uint32 // Number of elements in array
cbuftary *[]C.ydb_buffer_t // C flavor of ydb_buffer_t array}
The internal variant of each of BufferT and BufferTArray is so the
BufferT/BufferTArray can be copied by an assignment statement without
double free issues or using of storage after free, etc. It is technique we
found was suggested to deal with this issue as was use of the Finalizer to
cause the internal structure to release its C storage which involves not
only releasing the C control block but also the buffer that it points to –
and in the case of the array, the same for each element in the array.
Our API necessarily does a lot with cgo and due to the array of C
structures (and cgo not supporting arrays or variadic plists – both of
which we require), has to do its own pointer arithmetic to access the
element we want when dealing with individual array elements anchored by
internalBufferTArray. We have a similar issue with calling one of the C API
calls (ydb_lock_st()) because it takes a variadic plist and again, cgo
doesn’t support variadic plists. So we’ve cooked up a C function that takes
an array of uintptr values and turns it into a variadic plist call. So
again, we’re loading up elements of this parameter array using pointer
arithmetic and converting those pointers to uinptr values which seems to
put them on GC’s radar if the go structures are not subsequently used.
One issue we found and solved was with this latter usage. The scenario was
this:
1.
Function call created a key that contained a variable name and two
subscripts that were passed in to the LockST() wrapper call. Each of these
was ultimately housed in a ydb_buffer_t structure though the subscripts are
in the array version.
2.
The LockST() code looked at the structures and pulled out the C pointers
to the ydb_buffer_t structures then turning those addresses into uintptr to
store in the parameter array.
3.
At this point, it seems, since there were no more references to the
passed-in Golang structures and the C pointers we had pulled out of them
had been morphed to uintptr values, that the Golang structures became
eligible for garbage collection. But because there was a Finalizer set for
these structures, *the C storage was released when we were not finished
with it yet – in fact had not even made the call yet to the C routine.*
4.
We got errors in the C code and saw that though the structures were
well-formed in the Go routine, the stuff that ended up being passed in was
complete garbage.
5.
We “fixed” this issue by putting a meaningless reference to the golang
variadic list that held all these values just prior to returning from the
routine which kept them from being garbage collected such that it ran
successfully.
That fixed one of around 6 different failure types we were having. The test
that mostly generates the failures (though others do it too) is where we
are launching 10 goroutines that each do random operation types (set, get,
lock, delete, next, previous, etc) for a set period of time, then exit. It
is only when launching goroutines simultaneously that we see these errors.
If we launch workers as separate processes, we do not see these issues. We
also do not see these issues when the driver program is C instead of Golang.
The main C code base can be found at https://gitlab.com/YottaDB/DB/YDB
(master branch) while the go wrapper can be found at
https://gitlab.com/YottaDB/Lang/YDBGo (develop branch) – note LockST() is
in simple_api.go and you can see the added reference “fix” at the end of
it. But quite simply, nearly every call in this API has failed in one way
or another with one of the symptoms mentioned above.
Is there some better way to do this? Should we be doing something different
to escape the Finalizer/GC race to eliminate storage while we’re still
using pointers to that storage? So far, we’ve been dealing with
best-guesses as we do not know Golang’s internals. Although I’ve read some
articles on Golang’s garbage collection, the best-guess I’ve been able to
come up with is we have a race between our usage of C and Golang’s
eagerness to GC the main structures which then pulls the C memory out from
under us while we are using it.
Note this gets slightly murkier when one understands that one of the API
calls (TpST()) initiates a transaction and then calls back into a given Go
routine to actually execute the transaction. And indeed some of the cores
we’ve seen show us that we are either in or just coming out of a
transaction so the transaction itself garbage collected things we are still
using – or perhaps its usage just reused some released storage from a
previously executed transaction. <head spins> But even cutting down that
test so that only one request type runs is getting both the “bad pointer in
Go heap” and the “sweep increased allocation count” errors when run in 10
goroutines.
Here are the tops of some actual failures. I’ve not included the entire
failure as it references a lot of different threads so it fairly long but
let me know if you want to see them:
runtime: nelems=21 nalloc=5 previous allocCount=4 nfreed=65535fatal error:
sweep increased allocation count
goroutine 38 [running]:runtime.throw(0x5288be, 0x20)
/snap/go/3739/src/runtime/panic.go:617 +0x72 fp=0xc000157ab8 sp=0xc000157a88
pc=0x4395c2runtime.(*mspan).sweep(0x7fae4f5d7258, 0x4c0600, 0x118e800)
/snap/go/3739/src/runtime/mgcsweep.go:329 +0x8ef fp=0xc000157ba0
sp=0xc000157ab8 pc=0x43015fruntime.sweepone(0x0)
/snap/go/3739/src/runtime/mgcsweep.go:136 +0x282 fp=0xc000157c08
sp=0xc000157ba0 pc=0x42f612runtime.deductSweepCredit(0x2000, 0x0)
/snap/go/3739/src/runtime/mgcsweep.go:438 +0x80 fp=0xc000157c30 sp=0xc000157c08
pc=0x4302c0runtime.(*mcentral).cacheSpan(0x7f5460, 0x4143f5)
/snap/go/3739/src/runtime/mcentral.go:43 +0x61 fp=0xc000157c90 sp=0xc000157c30
pc=0x424fa1runtime.(*mcache).refill(0x7fae4f5cc6d0, 0x4)
/snap/go/3739/src/runtime/mcache.go:135 +0x86 fp=0xc000157cb0 sp=0xc000157c90
pc=0x424cd6runtime.(*mcache).nextFree(0x7fae4f5cc6d0, 0x118ea04, 0x448441,
0x4d8420, 0x2582c18) /snap/go/3739/src/runtime/malloc.go:786 +0x88
fp=0xc000157ce8 sp=0xc000157cb0 pc=0x41b148runtime.mallocgc(0x10, 0x502040,
0x4c6e01, 0x44db90) /snap/go/3739/src/runtime/malloc.go:939 +0x7a8
fp=0xc000157d98 sp=0xc000157ce8 pc=0x41ba98runtime.growslice(0x502040, 0x0,
0x0, 0x0, 0x1, 0xc000157e98, 0x0, 0x7fae240015a0)
/snap/go/3739/src/runtime/slice.go:181 +0x228 fp=0xc000157e00 sp=0xc000157d98
pc=0x44ddb8main.genName(0x4c76a8, 0x0, 0x448441)
/home/estess/go/src/rwalk4/randomWalk.go:66 +0x2d3 fp=0xc000157eb8
sp=0xc000157e00 pc=0x4c6e13main.runProc(0x0, 0x0, 0xa, 0x3fc999999999999a, 0x0,
0x0) /home/estess/go/src/rwalk4/randomWalk.go:82 +0x45 fp=0xc000157f78
sp=0xc000157eb8 pc=0x4c7145main.main.func2(0xc0000b8020, 0xc0000b8028, 0x0,
0xc0000b8010) /home/estess/go/src/rwalk4/randomWalk.go:141 +0x68
fp=0xc000157fc0 sp=0xc000157f78 pc=0x4c76a8runtime.goexit()
/snap/go/3739/src/runtime/asm_amd64.s:1337 +0x1 fp=0xc000157fc8 sp=0xc000157fc0
pc=0x464671created by main.main /home/estess/go/src/rwalk4/randomWalk.go:133
+0x178
The other type of failure:
runtime: pointer 0xc000202648 to unallocated span span.base()=0xc000200000
span.limit=0xc000204000 span.state=0runtime: found in object at
*(0xc000137840+0x10)object=0xc000137840 s.base()=0xc000136000
s.limit=0xc000138000 s.spanclass=10 s.elemsize=64 s.state=mSpanInUse
*(object+0) = 0x524d82
*(object+8) = 0xa
*(object+16) = 0xc000202648 <==
*(object+24) = 0x4
*(object+32) = 0xc000202658
*(object+40) = 0x4
*(object+48) = 0xc000202668
*(object+56) = 0x4fatal error: found bad pointer in Go heap (incorrect use of
unsafe or cgo?)
runtime stack:runtime.throw(0x52b6cc, 0x3e)
/snap/go/3739/src/runtime/panic.go:617 +0x72 fp=0x7f593affcd00
sp=0x7f593affccd0 pc=0x4395c2runtime.findObject(0xc000202648, 0xc000137840,
0x10, 0x0, 0x0, 0x0) /snap/go/3739/src/runtime/mbitmap.go:397 +0x3bd
fp=0x7f593affcd50 sp=0x7f593affcd00 pc=0x42237druntime.scanobject(0xc000137840,
0xc00003a670) /snap/go/3739/src/runtime/mgcmark.go:1174 +0x240
fp=0x7f593affcde0 sp=0x7f593affcd50 pc=0x42dc20runtime.gcDrain(0xc00003a670,
0x7) /snap/go/3739/src/runtime/mgcmark.go:932 +0x209 fp=0x7f593affce38
sp=0x7f593affcde0 pc=0x42d3f9runtime.gcBgMarkWorker.func2()
/snap/go/3739/src/runtime/mgc.go:1926 +0x16c fp=0x7f593affce78
sp=0x7f593affce38 pc=0x46018cruntime.systemstack(0x0)
/snap/go/3739/src/runtime/asm_amd64.s:351 +0x66 fp=0x7f593affce80
sp=0x7f593affce78 pc=0x462616runtime.mstart()
/snap/go/3739/src/runtime/proc.go:1153 fp=0x7f593affce88 sp=0x7f593affce80
pc=0x43db80
goroutine 82 [GC worker (idle)]:runtime.systemstack_switch()
/snap/go/3739/src/runtime/asm_amd64.s:311 fp=0xc00004df60 sp=0xc00004df58
pc=0x4625a0runtime.gcBgMarkWorker(0xc000039400)
/snap/go/3739/src/runtime/mgc.go:1890 +0x1be fp=0xc00004dfd8 sp=0xc00004df60
pc=0x429aferuntime.goexit() /snap/go/3739/src/runtime/asm_amd64.s:1337
+0x1 fp=0xc00004dfe0 sp=0xc00004dfd8 pc=0x464671created by
runtime.gcBgMarkStartWorkers /snap/go/3739/src/runtime/mgc.go:1784 +0x77
Running these tests is a bit complicated. We are working on putting a
docker container together that has the runtime all setup so all that is
needed to fire up the test. If desired, this will be made available.
Anyway, any insights as to how to make progress with this issue would be
most appreciated. I’ve tried to keep this as brief as possible. We don’t
understand what the Golang errors are telling us. So if further
explanations somewhere would help, please do let us know.
Steve
--
You received this message because you are subscribed to the Google Groups
"golang-nuts" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To view this discussion on the web visit
https://groups.google.com/d/msgid/golang-nuts/8d879a7a-8031-4a6a-93d8-5960a4d448c2%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.