On Wed, Aug 10, 2022 at 09:17:02AM +0200, Thomas Lange wrote:
> in #1015198 I also reported useless search results
> similar to #658227 (still open since 2012).

Sorting by date is unlikely to help #658227 - it would prefer the
newest documents which mention the DFSG, and the social contract page
was created a long time ago, and presumably changes very rarely.

Interestingly it's not just our search which struggles with the DFSG
case. Testing with a couple of popular web search engines in a private
window (to try to minimise any bias from previous searches, etc, though
there are likely geographical and maybe other variations still):

On DuckDuckGo (which I think is currently Bing underneath), "Diabetic
Foot Study Group" is top. Second is the Wikipedia page for "our" DFSG.
First d.o hit is https://wiki.debian.org/DFSGLicenses at #5, second d.o
hit is the social contract page which isn't until #10.

Google ranks the Wikipedia page top, first d.o hit is
https://people.debian.org/~bap/dfsg-faq.html at #3 and the social
contract page is #4.

I suspect it doesn't help that the canonical "right answer" here is
actually a page which is primarily about the "Debian Social Contract"
(that's the page title and top heading, and what's talked about most in
the initial part of the page which it's likely search engines put extra
weight on).

> I found this in the xapian docs. Do you think this would be the best
> solution to get the results sorted by date? I'm not sure if it would
> be easy to index all our html documents by date, since the time stamps
> on the files do not reflect the date of the last modification of the
> content. Do you know of any other solutions?

That's the most efficient way to sort by date, but requires extra work
at index time. You can also store the date in a "value slot" and sort
by that, which is easier at index time, but slower at search time.

The last modified date is actually already available to sort by behind
the scenes - here's what it looks like for your `debconf` example:

https://search.debian.org/?q=debconf&HITSPERPAGE=100&DB=en&SORT=-0

Note that paging through results doesn't preserve this sort setting
because the template wasn't written expecting this, so I've set it to
show 100 results, which are dominated by WNPP reports, translation
reports, etc.

Aside from such auto-generated documents (which could probably be
excluded or penalised in the ranking), last modified isn't entirely
helpful anyway - we don't want a page about DebConf 10 to beat the one
about DebConf 22 just because someone recently fixed a typo or updated
a link on the old page.

Sorting by creation date probably doesn't really help either. The
Social Contract page was created long ago, but the most recent DebConf
page was created fairly recently, so neither ascending nor descending
order is good across the two cases you've highlighted.

I suspect boosting based on some link analysis within d.o would help a
lot - both the social contract page and the latest DebConf page will
tend to have more incoming links compared to other pages matching the
same terms, and the various autogenerated pages are unlikely to be
linked to a lot from elsewhere.

I did some initial work on that at DebConf 16, but ran out of time and
sadly haven't managed to get back to it since.

Cheers,
    Olly
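
As a rough illustration of the link-analysis idea above, here is a minimal Python sketch. It is not the work started at DebConf 16, and it does not use Xapian at all: the page paths, the link graph, the relevance scores, and the `1 + weight * log(1 + inlinks)` boost formula are all assumptions made up for the example. It only shows how counting incoming links within a site could lift a well-linked page above auto-generated or stale pages which match the same terms.

```python
import math


def count_inlinks(link_graph):
    """Count distinct pages linking to each target (self-links ignored)."""
    counts = {}
    for source, targets in link_graph.items():
        for target in set(targets):
            if target != source:
                counts[target] = counts.get(target, 0) + 1
    return counts


def boost_by_inlinks(hits, inlinks, weight=1.0):
    """Re-rank (page, score) hits, scaling each score by its inlink count.

    The log damps the effect so a heavily linked page cannot swamp the
    text relevance score entirely. The formula is an illustrative
    assumption, not something taken from the search.debian.org setup.
    """
    boosted = [
        (page, score * (1.0 + weight * math.log1p(inlinks.get(page, 0))))
        for page, score in hits
    ]
    return sorted(boosted, key=lambda hit: hit[1], reverse=True)


# Hypothetical site-internal link graph: source page -> pages it links to.
link_graph = {
    "/index": ["/social_contract", "/debconf22"],
    "/intro/about": ["/social_contract"],
    "/devel/": ["/social_contract", "/debconf22"],
    "/wnpp/report-1": ["/index"],
    "/debconf10": ["/debconf22"],
}

# Hypothetical text-relevance scores for a "debconf" query, where an
# auto-generated report and an old page outrank the current page.
hits = [("/wnpp/report-1", 2.0), ("/debconf10", 1.8), ("/debconf22", 1.5)]

inlinks = count_inlinks(link_graph)
ranked = boost_by_inlinks(hits, inlinks)

for page, score in ranked:
    print(f"{score:6.3f}  {page}")
```

With this made-up data, `/debconf22` has three incoming links while the other two hits have none, so it moves to the top even though its raw score was the lowest. In a real setup the boost would presumably be computed at index time and stored with each document rather than recomputed per query.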