Thanks Trey!
This will certainly improve the intitle keyword a great deal, since it uses
the stemmed field for filtering, and it should hopefully find pages that are
currently missed because of the filter ordering (e.g. intitle:louys can't
find User:Louÿs currently).
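
To illustrate why the filter order matters, here is a minimal Python sketch.
It is not the actual Cirrus/Elasticsearch analysis chain: toy_stem is a
made-up stand-in for a real stemmer (it only knows two ASCII suffix rules),
and ascii_fold is a rough approximation of asciifolding built on Unicode
decomposition. All function names are hypothetical.

```python
import unicodedata


def ascii_fold(token):
    """Rough stand-in for asciifolding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_stem(token):
    """Toy stemmer (an assumption, not a real one): it only matches ASCII
    suffixes, so an accented final letter slips past its rules."""
    if len(token) > 3 and token.endswith("s"):
        token = token[:-1]              # drop a plural-looking "s"
    if len(token) > 3 and token.endswith("y"):
        token = token[:-1] + "i"        # final "y" becomes "i"
    return token


def stem_then_fold(token):
    """Current order: the stemmer sees the accented form first."""
    return ascii_fold(toy_stem(token.lower()))


def fold_then_stem(token):
    """Proposed order: fold first, so the stemmer sees plain ASCII."""
    return toy_stem(ascii_fold(token.lower()))


title = "Lou\u00ffs"   # "Louÿs", written with an escape to keep it precomposed
query = "louys"

# Stem first: the title and the query end up as different terms.
print(stem_then_fold(title), stem_then_fold(query))
# Fold first: both end up as the same term.
print(fold_then_stem(title), fold_then_stem(query))
```

With the stemmer first, the "ÿ" hides the final "y" from the second rule, so
the title and the query no longer meet; folding first removes that mismatch.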
I think I'll do the same for French, which suffers from the same problem.
IMO we should continue to work on this for other languages while we try
to switch from asciifolding (Latin letters only) to ICU folding.
We may need some guidance for languages where removing diacritics can be
counterproductive, and maybe blacklist some letters (e.g. for Finnish:
is it appropriate to fold Ä or Ö?)
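
As a rough idea of what blacklisting letters could look like, here is a
small Python sketch that folds diacritics except for a protected set. The
function name fold_except and the FINNISH_PROTECTED set are hypothetical,
and this is plain Python, not the Cirrus or ICU configuration; which letters
belong in the set is exactly the guidance we would need per language.

```python
import unicodedata

# Hypothetical per-language blacklist: letters that must survive folding.
FINNISH_PROTECTED = frozenset({"\u00e4", "\u00f6", "\u00c4", "\u00d6"})  # ä ö Ä Ö


def fold_except(text, protected=FINNISH_PROTECTED):
    """Strip diacritics from every character not listed in `protected`."""
    out = []
    # Compose first so that "a" + combining diaeresis is seen as one "ä".
    for ch in unicodedata.normalize("NFC", text):
        if ch in protected:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


sample = "\u00c4l\u00e4 caf\u00e9"        # "Älä café"
print(fold_except(sample))                # keeps Ä and ä, folds é
print(fold_except(sample, frozenset()))   # folds everything
```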
Note on accent folding: Cirrus always tries to prefer exact matches.
Searching for élément should always rank élément above element. Users
who want only exact matches can always force Cirrus to discard stems by
wrapping the word in double quotes, e.g. "élément".
On 10/08/2016 at 16:07, Trey Jones wrote:
David and I had a discussion about moving ascii-folding to come before
stemming on English Wikipedia. It seemed like a good idea, but we
decided we should run some tests before implementing it, just to be sure.
Turns out it is a good idea!
Much more detail:
https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Re-Ordering_Stemming_and_Ascii-Folding_on_English_Wikipedia
We won't deploy it until we deploy BM25 later in the year, since it
requires a full re-index of English Wikipedia, as does BM25. That's
something we should only do once.
—Trey
Trey Jones
Software Engineer, Discovery
Wikimedia Foundation
_______________________________________________
discovery mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/discovery