[ https://issues.apache.org/jira/browse/LUCENE-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-9191:
---------------------------------------
    Attachment: LUCENE-9191.patch
        Status: Open  (was: Open)

Another iteration of the patch, improving the previous hackity word boundary detection to use Python's regex {{\W}} with {{re.UNICODE}} flag (thanks [~rcmuir]!).

I then regenerated all three line files + seek files, and confirmed Lucene tests pass with the 20 MB file.

> Fix linefiledocs compression or replace in tests
> ------------------------------------------------
>
>                 Key: LUCENE-9191
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9191
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Assignee: Michael McCandless
>            Priority: Major
>         Attachments: LUCENE-9191.patch, LUCENE-9191.patch
>
>
> LineFileDocs(random) is very slow, even to open. It does a very slow "random
> skip" through a gzip compressed file.
> For the analyzers tests, in LUCENE-9186 I simply removed its usage, since
> TestUtil.randomAnalysisString is superior, and fast. But we should address
> other tests using it, since LineFileDocs(random) is slow!
> I think it is also the case that every lucene test has probably tested every
> LineFileDocs line many times now, whereas randomAnalysisString will invent
> new ones.
> Alternatively, we could "fix" LineFileDocs(random), e.g. special compression
> options (in blocks)... deflate supports such stuff. But it would make it even
> hairier than it is now.
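
The "compression in blocks" idea from the description can be illustrated with a minimal Python sketch. This is not the code in the attached patch, and it does not reproduce the actual layout of the line files or seek files. It only assumes that lines are grouped into fixed-size blocks, that each block is written as an independent gzip member, and that the byte offset of each member is recorded as a seek point. The names write_blocks and read_block and the block_size parameter are hypothetical.

```python
import gzip
import io
import zlib


def write_blocks(lines, block_size=3):
    """Compress `lines` as independent gzip members.

    Returns (data, offsets), where offsets[i] is the byte position at which
    the i-th block starts. Both names are hypothetical, for illustration only.
    """
    out = io.BytesIO()
    offsets = []
    for start in range(0, len(lines), block_size):
        offsets.append(out.tell())  # seek point for this block
        chunk = "".join(line + "\n" for line in lines[start:start + block_size])
        out.write(gzip.compress(chunk.encode("utf-8")))
    return out.getvalue(), offsets


def read_block(f, offset, chunk_size=8192):
    """Seek to `offset` and decompress only the gzip member that starts there."""
    f.seek(offset)
    d = zlib.decompressobj(wbits=31)  # wbits=31: expect a gzip header and trailer
    parts = []
    while not d.eof:  # eof becomes True once this member's trailer is consumed
        buf = f.read(chunk_size)
        if not buf:
            break
        parts.append(d.decompress(buf))
    return b"".join(parts).decode("utf-8").splitlines()


if __name__ == "__main__":
    lines = ["doc %d" % i for i in range(10)]
    data, offsets = write_blocks(lines, block_size=3)
    # Jump straight to the third block without decompressing the first two.
    print(read_block(io.BytesIO(data), offsets[2]))
```

Because concatenated gzip members still form a valid gzip stream, a file written this way can also be read sequentially from the start by an ordinary gzip reader. A random starting point then costs one seek plus the decompression of a single block, instead of a skip through the whole compressed stream.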