[jira] [Updated] (LUCENE-9191) Fix linefiledocs compression or replace in tests

Michael McCandless (Jira) Mon, 17 Feb 2020 07:36:08 -0800


     [ 
https://issues.apache.org/jira/browse/LUCENE-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael McCandless updated LUCENE-9191:
---------------------------------------
    Attachment: LUCENE-9191.patch
        Status: Open  (was: Open)

First cut at this, I think it's close!

I created a simple Python tool, {{dev-tools/scripts/create_line_file_docs.py}}, 
that downloads the Europarl v7 corpus, un-tgzs it into a temp dir, extracts all 
body text + titles + dates, breaks documents into pieces of "approximately" 1 K 
characters (trying to split at the next word boundary) according to a clipped 
normal distribution, shuffles the resulting large full line docs file, and then 
spits out ~20, ~200, ~2000 MB sized files (before compression).
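
To make the splitting step concrete, here is a minimal sketch of cutting a 
document into roughly 1 K character pieces at word boundaries. It is not the 
code from the patch: the function name {{split_doc}} and the mean, standard 
deviation, and clip bounds are assumptions chosen only for illustration.

```python
import random


def split_doc(text, rng, mean=1024, stddev=256, lo=256, hi=4096):
    """Split text into pieces of roughly `mean` characters.

    All numeric defaults are illustrative assumptions, not values
    taken from the actual tool.
    """
    pieces = []
    pos = 0
    while pos < len(text):
        # Draw a target length from a normal distribution, clipped to [lo, hi].
        target = int(min(hi, max(lo, rng.gauss(mean, stddev))))
        end = pos + target
        if end >= len(text):
            # Not enough text left for a full piece: take the remainder.
            pieces.append(text[pos:].strip())
            break
        # Advance to the next space so a word is never cut in half.
        nxt = text.find(" ", end)
        if nxt == -1:
            nxt = len(text)
        pieces.append(text[pos:nxt].strip())
        pos = nxt + 1
    # Drop empty pieces (e.g. when the text ends in whitespace).
    return [p for p in pieces if p]
```

Passing an explicit {{random.Random(seed)}} keeps the split reproducible, which 
matters if the generated files need to be rebuilt identically.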

Then, it compresses these files in chunks, saving the valid seek points. It 
turns out you can concatenate multiple gzip'd chunks into a single file and the 
resulting file is also a valid {{.gz}} file.
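
A minimal sketch of the chunked compression idea, assuming in-memory data and a 
hypothetical helper name {{write_chunked_gzip}} (the real tool's chunk size and 
layout may differ): each chunk is written as an independent gzip member, and 
the byte offset where each member starts is recorded as a seek point.

```python
import gzip
import io


def write_chunked_gzip(lines, lines_per_chunk):
    """Return (gz_bytes, seek_points).

    Each group of `lines_per_chunk` lines becomes its own gzip member;
    seek_points holds the byte offset at which each member begins.
    """
    out = io.BytesIO()
    seek_points = []
    for i in range(0, len(lines), lines_per_chunk):
        seek_points.append(out.tell())
        chunk = "".join(line + "\n" for line in lines[i:i + lines_per_chunk])
        # gzip.compress emits a complete member (header, data, trailer).
        out.write(gzip.compress(chunk.encode("utf-8")))
    return out.getvalue(), seek_points
```

Because the gzip format allows multiple members back to back, 
{{gzip.decompress}} on the whole byte string yields all lines in order, while a 
reader that starts at any recorded offset sees a valid gzip stream from there.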

Then I fixed {{LineFileDocs.java}} to also load the valid seek points (I stored 
them in a separate {{.seek}} file side-by-side with the gzip'd line docs file), 
and finally, to randomly seek to one of those points on init/reopen.
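
The reader side can be sketched the same way. This is Python rather than the 
Java in {{LineFileDocs.java}}, and it assumes the seek points are already 
loaded as a list of byte offsets (the on-disk format of the {{.seek}} file is 
not described here); {{open_at_random_seek_point}} is a hypothetical name.

```python
import gzip
import io
import random


def open_at_random_seek_point(gz_file, seek_points, rng):
    """Seek a binary file to a random gzip member and return a line reader.

    gz_file must be a seekable binary file object holding concatenated
    gzip members; seek_points are the byte offsets where members start.
    """
    offset = rng.choice(seek_points)
    # Jumping on the compressed file is cheap: nothing before the
    # chosen offset has to be decompressed.
    gz_file.seek(offset)
    reader = io.TextIOWrapper(
        gzip.GzipFile(fileobj=gz_file, mode="rb"), encoding="utf-8"
    )
    return offset, reader
```

The returned reader decodes from the chosen member through the end of the file, 
so the cost of opening no longer depends on how far into the file the random 
starting point lies.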

I confirmed Lucene core tests pass ({{./gradlew lucene:core:test}}) with each of 
the three line doc + seek files. I plan to commit the smallest one, and make the 
medium and large ones available at {{home.apache.org}} after pushing.

Note that this is a hard break – after this change, {{LineFileDocs.java}} is no 
longer able to skip (scan) slowly through a line docs file, as it does today. I 
thought this was safest so people don't accidentally fall into an otherwise 
silent performance trap.  I think it's intensely unlikely that anyone is using 
a different line docs file than what's committed :)  I'm probably the only 
person on the planet that ever does this (for better test coverage)!

> Fix linefiledocs compression or replace in tests
> ------------------------------------------------
>
>                 Key: LUCENE-9191
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9191
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Assignee: Michael McCandless
>            Priority: Major
>         Attachments: LUCENE-9191.patch
>
>
> LineFileDocs(random) is very slow, even to open. It does a very slow "random 
> skip" through a gzip compressed file.
> For the analyzers tests, in LUCENE-9186 I simply removed its usage, since 
> TestUtil.randomAnalysisString is superior, and fast. But we should address 
> other tests using it, since LineFileDocs(random) is slow!
> I think it is also the case that every lucene test has probably tested every 
> LineFileDocs line many times now, whereas randomAnalysisString will invent 
> new ones.
> Alternatively, we could "fix" LineFileDocs(random), e.g. special compression 
> options (in blocks)... deflate supports such stuff. But it would make it even 
> hairier than it is now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
