[ https://issues.apache.org/jira/browse/LUCENE-9191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael McCandless updated LUCENE-9191:
---------------------------------------
    Attachment: LUCENE-9191.patch
        Status: Open  (was: Open)

First cut at this, I think it's close!

I created a simple Python tool, {{dev-tools/scripts/create_line_file_docs.py}}, that downloads the Europarl v7 corpus, un-tgzs it into a temp dir, extracts all body text + titles + dates, breaks documents into chunks of "approximately" 1 K characters (trying to split at the next word boundary) according to a clipped normal distribution, shuffles the resulting large full line docs file, and then spits out ~20, ~200, and ~2000 MB sized files (before compression).

Then, it compresses these files in chunks, saving the valid seek points. It turns out you can concatenate multiple gzip'd chunks into a single file and the resulting file is also a valid {{.gz}} file.

Then I fixed {{LineFileDocs.java}} to also load the valid seek points (which I stored in a separate {{.seek}} file side-by-side with the gzip'd line docs file), and finally, to randomly seek to one of those points on init/reopen.

I confirmed Lucene core tests pass ({{./gradlew lucene:core:test}}) on each of the three line docs + seek files. I plan to commit the smallest one, and make the medium and large ones available at {{home.apache.org}} after pushing.

Note that this is a hard break: after this change, {{LineFileDocs.java}} is no longer able to skip (scan) slowly through a line docs file, as it does today. I thought this was safest so people don't accidentally fall into an otherwise silent performance trap. I think it's intensely unlikely that anyone is using a different line docs file than what's committed :) I'm probably the only person on the planet that ever does this (for better test coverage)!
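
For readers unfamiliar with the chunked-gzip trick mentioned above, here is a minimal Python sketch of the general idea: compress groups of lines as independent gzip members, concatenate them, and record the byte offset where each member starts. It is not the code from the attached patch. The function names, the in-memory buffer, and the choice of a fixed number of lines per chunk are assumptions made for illustration, and the real {{.seek}} file format is not described here.

```python
import gzip
import io
import random


def write_chunked_gzip(lines, lines_per_chunk):
    """Compress lines as independent gzip members.

    Returns (data, seek_points), where seek_points holds the byte offset
    at which each gzip member starts. Hypothetical helper, not from the patch.
    """
    out = io.BytesIO()
    seek_points = []
    for start in range(0, len(lines), lines_per_chunk):
        seek_points.append(out.tell())  # a valid place to start decompressing
        chunk = "".join(line + "\n" for line in lines[start:start + lines_per_chunk])
        out.write(gzip.compress(chunk.encode("utf-8")))
    return out.getvalue(), seek_points


def read_from_seek_point(data, offset):
    """Decompress from a member boundary through the end of the data."""
    raw = io.BytesIO(data)
    raw.seek(offset)
    # GzipFile reads from the current position and continues across members.
    with gzip.GzipFile(fileobj=raw, mode="rb") as f:
        return f.read().decode("utf-8").splitlines()


def read_from_random_seek_point(data, seek_points, rng=random):
    """Pick one recorded offset at random and start reading there."""
    return read_from_seek_point(data, rng.choice(seek_points))
```

Because every recorded offset is the start of a complete gzip member, a reader can jump straight to it without decompressing anything that comes before, which is what avoids the slow scan. The concatenated data also still decompresses as one ordinary gzip stream from offset 0.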
> Fix linefiledocs compression or replace in tests
> ------------------------------------------------
>
>                 Key: LUCENE-9191
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9191
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Assignee: Michael McCandless
>            Priority: Major
>         Attachments: LUCENE-9191.patch
>
>
> LineFileDocs(random) is very slow, even to open. It does a very slow "random
> skip" through a gzip compressed file.
> For the analyzers tests, in LUCENE-9186 I simply removed its usage, since
> TestUtil.randomAnalysisString is superior, and fast. But we should address
> other tests using it, since LineFileDocs(random) is slow!
> I think it is also the case that every lucene test has probably tested every
> LineFileDocs line many times now, whereas randomAnalysisString will invent
> new ones.
> Alternatively, we could "fix" LineFileDocs(random), e.g. special compression
> options (in blocks)... deflate supports such stuff. But it would make it even
> hairier than it is now.