I was able to create a fork of 3.7.3 with just the *flatbuffers* replaced with the pre-3.6.x version (2.0.0).

This seemed to only require changes to the version asserts and adding an *align* parameter to *Table::VerifyField()* to match the newer API.

https://github.com/heavyai/gdal/tree/simon.eves/release/3.7/downgrade_to_flatbuffers_2.0.0

Our system works correctly and passes all GDAL I/O tests with that version. Obviously this isn't an ideal solution, but this is otherwise a release blocker for us.

I would still very much like to discuss the original problem more deeply, and hopefully come up with a better solution.

Yours hopefully,
Simon

On Thu, Feb 22, 2024 at 10:22 PM Simon Eves <simon.e...@heavy.ai> wrote:

> Thank you, Robert, for the RR tip. I shall try it.
>
> I have new findings to report, however.
>
> First of all, I confirmed that a build against GDAL 3.4.1 (the version we
> were on before) still works. I also confirmed that builds against 3.7.3 and
> 3.8.4 still failed even with no additional library dependencies (just
> sqlite3 and proj), in case it was a side-effect of us also adding more of
> those. I then tried 3.5.3, with the CMake build (same config as we use for
> 3.7.3) and that worked. I then tried 3.6.4 (again, same CMake config) and
> that failed. These were all from bundles.
>
> I then started delving through the GDAL repo itself. I found the common
> root commit of 3.5.3 and 3.6.4, and all the commits in the
> *ogr/ogrsf_frmts/flatgeobuf* sub-project between that one and the final
> of each. For 3.5.3, this was only two. I built and tested both, and they
> were fine. I then tried the very first one that was new in the 3.6.4 chain
> (not in the history of 3.5.3), which was actually a bulk update to the
> *flatbuffers* sub-library, committed by Bjorn Harrtell on May 8 2022 (SHA
> f7d8876). That one had the issue. I then tried the immediately-preceding
> commit (an unrelated docs change) and that one was fine.
>
> My current hypothesis, therefore, is that the *flatbuffers* update
> introduced the issue, or at least, the susceptibility of the issue.
>
> I still cannot explain why it only occurs in an all-static build, and even
> less able to explain why it only occurs in our full system and not with the
> simple test program against the very same static lib build that does the
> very same sequence of GDAL API calls, but I repeated the build tests of the
> commits either side and a few other random ones a bit further away in each
> direction, and the results were consistent. Again, it happens with both GCC
> 11 and Clang 14 builds, Debug or Release.
>
> I will continue tomorrow to look at the actual changes to *flatbuffers* in
> that update, although they are quite significant. Certainly, the
> *vector_downward* class, which is directly involved, was a new file in
> that update (although on inspection of that file's history in the
> *google/flatbuffers* repo, it seems it was just split out of another
> header).
>
> Bjorn, I don't mean to call you out directly, but I am CC'ing you to
> ensure you see this, as you appear to be a significant contributor to the
> *flatbuffers* project itself. Any insight you may have would be very
> welcome. I am of course happy to describe my debugging findings in more
> detail, privately if you wish, rather than spamming the list.
>
> Simon
>
> On Tue, Feb 20, 2024 at 1:49 PM Robert Coup <robert.c...@koordinates.com>
> wrote:
>
>> Hi,
>>
>> On Tue, 20 Feb 2024 at 21:44, Robert Coup <robert.c...@koordinates.com>
>> wrote:
>>
>>> Hi Simon,
>>>
>>> On Tue, 20 Feb 2024 at 21:11, Simon Eves <simon.e...@heavy.ai> wrote:
>>>
>>>> Here's the stack trace for the original assert. Something is stepping
>>>> on scratch_ to make it 0x1000000000 instead of null, which it starts out as
>>>> when the flatbuffer object is created, but by the time it gets to
>>>> allocating memory, it's broken.
>>>>
>>>
>>> What happens if you set a watchpoint in gdb when the flatbuffer is
>>> created?
>>>
>>> watch -l myfb->scratch
>>> or watch *0x1234c0ffee
>>>
>>
>> Or I've also had success with Mozilla's rr: https://rr-project.org/ —
>> you can run to a point where scratch is wrong, set a watchpoint on it, and
>> then run the program backwards to find out what touched it.
>>
>> Rob :)
>>
>
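
For readers following the thread, here is a minimal sketch of the kind of compatibility change described at the top of this message: giving an older two-argument field check an extra *align* parameter so that code written against the newer call shape still compiles. Everything in it is an assumption made for illustration. SketchTable is a hypothetical stand-in, not the real flatbuffers Table class, the signatures are simplified, and ignoring the alignment value is a guess rather than what the fork linked above necessarily does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a table backed by a byte buffer of `size` bytes.
// This is NOT the real flatbuffers::Table; names and signatures are simplified.
struct SketchTable {
    std::size_t size;

    // Older call shape: only checks that a T fits inside the buffer at `offset`.
    template <typename T>
    bool VerifyField(std::size_t offset) const {
        return sizeof(T) <= size && offset <= size - sizeof(T);
    }

    // Compatibility overload with the newer call shape. The extra `align`
    // argument is accepted so newer-style callers compile, but it is ignored
    // here (an assumption made for this sketch).
    template <typename T>
    bool VerifyField(std::size_t offset, std::size_t /*align*/) const {
        return VerifyField<T>(offset);
    }
};

// Small self-check: both call shapes agree for the same offset.
inline void check_compat_overload() {
    SketchTable t{16};
    assert(t.VerifyField<std::uint32_t>(12) ==
           t.VerifyField<std::uint32_t>(12, alignof(std::uint32_t)));
}
```

Calling check_compat_overload() from a test program exercises both overloads. A real port would also have to decide whether the older verifier should enforce the alignment rather than drop it.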
_______________________________________________
gdal-dev mailing list
gdal-dev@lists.osgeo.org
https://lists.osgeo.org/mailman/listinfo/gdal-dev