On 2015-01-31 02:38, Russ Allbery wrote:
> Niels Thykier <ni...@thykier.net> writes:
>
>> The html_reports process itself consumes up to 2GB while processing
>> templates.  It is possible that there is nothing we can do about that
>> as there *is* a lot of data in play.  But even then, we can free it as
>> soon as possible (so we do not keep it while running gnuplot at the
>> end of the run).
>
> I think the code currently takes a very naive approach and loads the
> entire state of the world into memory, and Perl's memory allocation is
> known to aggressively trade space for speed.
>

It does try to share a lot of the inner data structures, though there
are indeed still some deficiencies.  I really wish one could do things
like string interning in Perl.

> If instead it stored the various things it cared about in a local SQLite
> database, it would be a bit slower, but it would consume much less
> memory.  I bet the speed difference wouldn't be too bad.  And this would
> have the possibly useful side effect of creating a SQLite database full of
> interesting statistics that one could run rich queries against.
>

That is definitely worth consideration - thanks for the suggestion.  It
would imply an immense rewrite of html_reports.  While that is certainly
long overdue, it is not something I suspect I will have time (or mental
capacity) to do on my own.

I have started a different approach (see [1] for WIP code).  It is
mostly a parallel track to your idea, so they can certainly co-exist.
The goal of this approach is to:

 * Reduce harness to a "simple" coordinator
 * Remove the Lab as a (primary) data store (it is too fragile)
 * Use the harness state as the data store

The details of my design decisions are:

Harness - simple coordinator
============================

In my opinion, a lot of the (to quote private/TODO) "yuckness" of
harness happens because we want very well determined failure handling,
but never wrote harness with a structure that makes that trivial.
Notably, we do not want harness to crash (without logging it first) and
especially not while working on the Lab (see next section).

By moving logic out of harness, this rewrite will become easier as there
is less to juggle.  Further, by moving it out of harness (and into
another process), we can ensure that any memory consumed by this task
will definitely be freed when the child process terminates.  I have
previously tried to make harness free some of its memory, with no luck.

Removing the Lab as data store
==============================

For me, there are several advantages in this.
Firstly, the lab is very fragile - if anything crashes (or is
interrupted) while updating the lab, the metadata is often trivially out
of sync and the lab is (partly) corrupted[0].  The end result is often
that lintian/harness croaks on importing stuff until someone manually
runs a $lab->repair.  However, this does not fix all types of
corruption (see the FIXME in L::Lab->repair), so... /o\

By removing the Lab as a data store, we can use a simpler and more
robust data store (more on that in the next section) AND use throw-away
labs.  I had a talk with DSA (I think weasel) about getting a tmpfs disk
on another machine for the heavy lifting.  This implies that we *can* in
fact throw away the lab after every run.

Harness state as datastore
==========================

I introduced a "harness state cache" a couple of versions back to track
which packages needed to be reprocessed when we uploaded a new version
of lintian.  This (YAML) file can be trivially extended to contain all
the information required by harness and html_reports to replace the Lab
as a data store.  It already has several advantages over the Lab, namely:

 * Atomic updates of the content (see save_state_cache in harness)
 * Automatically recreated from scratch if it "vanishes".
 * We can add/remove information to/from it without having to update
   the lab metadata.

Certainly, this file can (also?) be replaced by an SQL(-lite) database.
If someone is willing to do or help me with the SQL(-lite) part, I am
definitely open for it.

~Niels

[0] Unless you manage to successfully run $LAB->close - harness does
    not, lintian generally does.

[1] http://anonscm.debian.org/cgit/users/nthykier/lintian.git/log/?h=reporting-rewrite

    NB: Rebased regularly.
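
To illustrate the "atomic updates" and "recreated from scratch" points
about the harness state cache, here is a minimal sketch of the general
technique.  It is an assumption-laden illustration, not the actual
save_state_cache code from harness (which is Perl and writes YAML): it
is in Python, it uses JSON to stay within the standard library, and the
names save_state_atomically and load_state are hypothetical.  The idea
is to write the new state to a temporary file in the same directory and
then rename it over the old file, so a reader sees either the old state
or the new one, never a half-written file.

```python
import json
import os
import tempfile


def save_state_atomically(path, state):
    """Write `state` to `path` so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file must live on the same filesystem as the target,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".state-",
                                    suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())  # ask the OS to flush data to disk
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        # Remove the temporary file; the old state file is left untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_state(path):
    """Return the saved state, or an empty one if the file has vanished."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}
```

A crash before the rename leaves the previous state file intact, and a
missing file simply yields an empty state that the next run can rebuild.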