Hi fellow devs,

I have created a number of NEW packages as a DD, and reviewed a number
of NEW packages in the NEW queue as an FTP trainee. Both kinds of work
involve an important, and sometimes annoying, part: license checking.
People keep complaining about it, and recently there were some related
discussions[1][2] on -project about possible ways to improve it, some
concerning the workflow and others the tooling. In this mail I present
an idea about tooling.
This is a long mail. I've already organized it in a structured format so
that you can skim it quickly.

The problem we are trying to solve
----------------------------------

Given an arbitrary source tree, we shall examine the copyright & license
information for each file node, make sure each node complies with the
DFSG, and make an overall assessment of the whole tree: ACCEPT/REJECT.
Subsequently, the tree is flattened (the tree structure is removed) and
written into debian/copyright in machine-readable format.

Note that automatically parsing a machine-UNreadable debian/copyright
requires a delicate recurrent neural network. That machine-UNreadable
case is too complex, so let's ignore it for now.

Existing tools, workflows, and limitations
------------------------------------------

## Tools

https://wiki.debian.org/CopyrightReviewTools

I'm unfamiliar with most of them, so I'm only describing the two I know.
Both licensecheck (Jonas) and debmake (Osamu) do template/regex matching.

## Workflows

uploader: ??? There doesn't seem to be a standard process that all
uploaders follow to generate debian/copyright. I personally run
`licensecheck -r --deb-machine . > debian/copyright` and manually tweak
the content.

ftp-master: possibly manual review with MC + a custom plugin. I didn't
follow the recommended way. I use `ranger` (vim keybindings, fluent file
browsing with a preview panel) for reviewing packages on
ftp-master.d.o.

## Limitations

* The tree structure is always missing from debian/copyright (and it is
  actually not possible to represent it there). When reviewing someone
  else's NEW package as a trainee, I find it torturous to locate the
  license information for a single file in debian/copyright.
* The tree structure is always missing. After importing a new upstream
  release with a significantly changed directory layout, it is
  inconvenient to locate the parts of debian/copyright that should be
  updated. Things become more complex when new licenses/copyrights
  appear.
* licensecheck dumps garbage when it encounters a binary file, e.g. a
  PNG image. This is not a BUG, as ftp-masters do check the metadata in
  binary files to see whether there is extra copyright/license info.
  But this is something that needs to be improved...
* Generic file browsers are not designed for our special purpose, and
  neither are the commercial tools.
* etc.

My idea
-------

## Motivations

License reviewing is certainly inevitable. Even a tiny improvement in
the efficiency of this process adds up to a large saving for the
community on the specific task we are talking about.

I have a couple of other motivations, but the above one is already
strong enough.

## Core

The core of my idea is a tree-structured intermediate representation
(IR) for the "license reviewing tree". The IR is basically a directory
tree with annotations on the file nodes. The IR can be stored as, say,
a JSON file.

To build such a tree-shaped IR, we need a couple of "backend" tools for
checking the copyright & license info of a SINGLE file. Such "backends"
include but are not limited to:

* `licensecheck`. Given a file FILE, `licensecheck FILE` produces the
  license name.
* `grep` or `ripgrep`. For example, `rg -i copyright FILE` usually works
  well.
* "neighbor". For example, given a source file "F/I/L/E" without any
  copyright & license info, looking for F/I/L/LICENSE, F/I/LICENSE, ...,
  etc., like git does when searching for the ".git" directory, will
  help.

The formatted+filtered output of any combination of these backends can
be attached to the corresponding node of the IR.

In contrast, a "frontend" tool is also needed for dealing with the IR at
a higher level. My imagined "frontend" tool is a `ranger`-like file
browser with some specific design points:

* the user can choose which backend(s) to use. If none is chosen, the
  frontend tool falls back to a general file browser with a preview
  panel.
* the frontend invokes the various backends to generate a template IR,
  and stores it in debian/copyright.json.
  No wildcards or regexes in file paths are allowed in this file.
* when viewing files, the suggestions from the various backends are
  shown. The user can choose to accept or override a suggestion. These
  choices are also recorded in the JSON file. Of course, when the
  backends do not agree with each other, the user has to override the
  suggestion and manually annotate the node.
* when the whole directory tree has been reviewed/annotated, the
  frontend translates the IR (d/copyright.json) into machine-readable
  format (d/copyright).
* ...

From ftp-master's perspective:

* they can review the uploader's IR with the frontend. The good thing is
  that ftp-master can collect all the information for one file at a
  glance: file path, file preview (the header part), backend
  suggestions, human annotation, and the override or manual annotation
  history.
* they don't have to suffer from "locating a file in d/copyright".

From the uploader's perspective:

* in the past, the IR was only built in our minds, and we transformed
  the raw directory and file data directly into the final flattened
  d/copyright. That means we have to rebuild the IR every time we want
  to change/review d/copyright. Explicitly writing that IR down may make
  the process more efficient.

This frontend-backend design somewhat resembles our apt+dpkg, where apt
deals with the dependency tree while dpkg deals with the nodes.

How to proceed
--------------

* a group of interested contributors.
* GSoC / Outreachy sounds good.

Several months ago I started a Python script based on this idea. I'm
struggling with UI programming (I'm really not good in this area).
Specifically, when I found myself stuck at adding custom keybindings
under the urwid framework, I postponed the idea indefinitely.

[1] -project: "Do we still value contributions?"
[2] -project: "possibly exhausted ftp-masters"
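
P.S. To make the "Core" section a bit more concrete, here is a minimal
sketch of what the tree-shaped IR and the final flattening step could
look like. Everything in it is an assumption made for illustration: the
field names ("children", "suggestions", "overridden", ...), the example
files, and the helpers `iter_files` and `flatten` are hypothetical and
not taken from any existing tool. The output also omits the header
stanza and the standalone License stanzas that a real d/copyright needs.

```python
import json
from collections import defaultdict

# Hypothetical IR: a directory node has "children"; a file node carries the
# backend suggestions and the reviewer's final annotation. All field names
# and file contents here are illustrative assumptions.
ir = {"children": {
    "README": {
        "suggestions": {"licensecheck": "UNKNOWN", "neighbor": "Expat"},
        "copyright": "2019 Jane Doe", "license": "Expat",
        "overridden": True,  # backends disagreed, so the reviewer decided
    },
    "src": {"children": {
        "a.c": {
            "suggestions": {"licensecheck": "Expat"},
            "copyright": "2019 Jane Doe", "license": "Expat",
            "overridden": False,
        },
        "b.c": {
            "suggestions": {"licensecheck": "GPL-2+"},
            "copyright": "2018 John Roe", "license": "GPL-2+",
            "overridden": False,
        },
    }},
}}


def iter_files(node, prefix=""):
    """Walk the tree depth-first and yield (full_path, file_node) pairs."""
    for name, child in sorted(node["children"].items()):
        if "children" in child:
            yield from iter_files(child, prefix + name + "/")
        else:
            yield prefix + name, child


def flatten(tree):
    """Group files by (copyright, license) and emit one Files stanza each."""
    groups = defaultdict(list)
    for path, node in iter_files(tree):
        groups[(node["copyright"], node["license"])].append(path)
    stanzas = []
    for (holder, license_name), paths in groups.items():
        lines = ["Files: " + paths[0]]
        lines += [" " + p for p in paths[1:]]  # continuation lines
        lines.append("Copyright: " + holder)
        lines.append("License: " + license_name)
        stanzas.append("\n".join(lines))
    return "\n\n".join(stanzas)


if __name__ == "__main__":
    # The IR survives a JSON round trip, so it can live in a .json file.
    print(flatten(json.loads(json.dumps(ir))))
```

Keeping full paths in the IR and grouping only at the very end is what
lets the JSON file stay free of wildcards while d/copyright stays short.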

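The "neighbor" backend from the "Core" section could be sketched roughly
as follows. This is only an illustration under stated assumptions: the
function name `find_neighbor_license` and the list of candidate file
names are hypothetical, and a real backend would probably know more
names and also read the file it finds.

```python
from pathlib import Path

# Assumed candidate names; a real backend would likely know more of them.
LICENSE_NAMES = ("LICENSE", "LICENCE", "COPYING")


def find_neighbor_license(file_path, root):
    """Search upward from file_path's directory for a license file.

    The search stops at `root` (the top of the source tree), much like git
    walks upward when it looks for the ".git" directory. Returns the path
    of the first license file found, or None.
    """
    root = Path(root).resolve()
    current = Path(file_path).resolve().parent
    while True:
        for name in LICENSE_NAMES:
            candidate = current / name
            if candidate.is_file():
                return candidate
        # Stop at the source tree root, or at the filesystem root if
        # file_path happens to live outside of `root`.
        if current == root or current == current.parent:
            return None
        current = current.parent
```

For "F/I/L/E" this checks F/I/L/, then F/I/, then F/, and finally the
tree root itself, returning the nearest match.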
