[ 
https://issues.apache.org/jira/browse/LUCENE-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-9191:
---------------------------------------
    Attachment: LUCENE-9191.patch
        Status: Open  (was: Open)

Another iteration of the patch, improving the previous hackity word boundary 
detection to use Python's regex {{\W}} with the {{re.UNICODE}} flag (thanks 
[~rcmuir]!).  I then regenerated all three line files + seek files, and 
confirmed that Lucene tests pass with the 20 MB file.
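
For readers unfamiliar with the approach, here is a minimal sketch of what {{\W}}-based word boundary detection can look like in Python. It is not taken from the patch: the helper names {{split_words}} and {{truncate_at_boundary}} are hypothetical, and the truncation use case is an assumption for illustration. Note that in Python 3 {{\W}} is already Unicode-aware for {{str}} patterns, so {{re.UNICODE}} is redundant there but harmless; it matters for Python 2 unicode strings.

```python
import re

# Hypothetical helpers for illustration only; not the code in the patch.
# \W matches any character that is not a Unicode letter, digit or underscore.
_NON_WORD = re.compile(r"\W+", re.UNICODE)


def split_words(text):
    """Split text on runs of non-word characters, dropping empty pieces."""
    return [w for w in _NON_WORD.split(text) if w]


def truncate_at_boundary(text, limit):
    """Cut text at the last word boundary at or before limit characters."""
    if len(text) <= limit:
        return text
    last = 0
    for m in _NON_WORD.finditer(text):
        if m.start() > limit:
            break
        last = m.start()
    # Fall back to a hard cut if no boundary was found before the limit.
    return text[:last] if last else text[:limit]
```

With this sketch, accented letters such as the ones in "naïve" count as word characters, so a line is never cut in the middle of a non-ASCII word.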

> Fix linefiledocs compression or replace in tests
> ------------------------------------------------
>
>                 Key: LUCENE-9191
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9191
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Assignee: Michael McCandless
>            Priority: Major
>         Attachments: LUCENE-9191.patch, LUCENE-9191.patch
>
>
> LineFileDocs(random) is very slow, even to open. It does a very slow "random 
> skip" through a gzip compressed file.
> For the analyzers tests, in LUCENE-9186 I simply removed its usage, since 
> TestUtil.randomAnalysisString is superior, and fast. But we should address 
> other tests using it, since LineFileDocs(random) is slow!
> I think it is also the case that every lucene test has probably tested every 
> LineFileDocs line many times now, whereas randomAnalysisString will invent 
> new ones.
> Alternatively, we could "fix" LineFileDocs(random), e.g. special compression 
> options (in blocks)... deflate supports such stuff. But it would make it even 
> hairier than it is now.
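
The "compression in blocks" idea from the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not what LineFileDocs or the patch does: the function names, the in-memory layout and the block size are all hypothetical. Each block is compressed independently with zlib and its starting byte offset is recorded, so a random skip only has to decompress one block instead of everything before it.

```python
import io
import zlib

# Hypothetical sketch; names and layout are assumptions, not Lucene's format.


def write_blocks(lines, lines_per_block=2):
    """Compress lines in independent zlib blocks; return (data, offsets)."""
    out = io.BytesIO()
    offsets = []  # starting byte offset of each compressed block
    for i in range(0, len(lines), lines_per_block):
        offsets.append(out.tell())
        chunk = "".join(line + "\n" for line in lines[i:i + lines_per_block])
        out.write(zlib.compress(chunk.encode("utf-8")))
    return out.getvalue(), offsets


def read_block(data, offsets, block_index):
    """Decompress only the requested block and return its lines."""
    start = offsets[block_index]
    if block_index + 1 < len(offsets):
        end = offsets[block_index + 1]
    else:
        end = len(data)
    return zlib.decompress(data[start:end]).decode("utf-8").splitlines()
```

The trade-off is a somewhat worse compression ratio, since each block starts with an empty dictionary, in exchange for a cheap seek to any block.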



--
This message was sent by Atlassian Jira
(v8.3.4#803005)