[jira] [Commented] (LUCENE-9747) Missing package-info.java causes NPE in MissingDoclet.java
[ https://issues.apache.org/jira/browse/LUCENE-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281602#comment-17281602 ] Dawid Weiss commented on LUCENE-9747:
--------------------------------------

I get this with openjdk 14.0.1+7:

{code}
javadoc: error - org.apache.lucene.util (package): javadocs are missing
C:\Work\apache\lucene-solr.master\lucene\core\src\java\org\apache\lucene\analysis\standard\StandardAnalyzer.java:84: javadoc empty but @Override declared, skipping.
... [lots more warnings]
{code}

But no NPE. Which Java version are you using?

> Missing package-info.java causes NPE in MissingDoclet.java
> ----------------------------------------------------------
>
>                 Key: LUCENE-9747
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9747
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: general/javadocs
>    Affects Versions: master (9.0)
>            Reporter: David Eric Pugh
>            Priority: Minor
>         Attachments: LUCENE-9747.patch
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> When running {{./gradlew :solr:core:javadoc}} I discovered that if a package
> directory is missing the {{package-info.java}} file, you get a VERY cryptic error:
>
> {{javadoc: error - fatal error encountered: java.lang.NullPointerException}}
> {{javadoc: error - Please file a bug against the javadoc tool via the Java bug reporting page}}
>
> I poked around and found that the {{MissingDoclet.java}} call to
> {{reporter.print(Diagnostic.Kind.ERROR, element, fullMessage.toString());}}
> was failing because the element contained some sort of null. I am attaching a patch and a PR.
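[Editor's sketch] The attached patch is not inlined in this thread. A minimal sketch of the kind of guard that would avoid the crash, assuming the failure is the quoted reporter.print(...) call receiving a null element for packages without package-info.java (the error(...) wrapper below is a hypothetical name, not necessarily what the patch does):

{code}
import javax.lang.model.element.Element;
import javax.tools.Diagnostic;
import jdk.javadoc.doclet.Reporter;

// Hypothetical guard: when javadoc hands us no Element to attach the
// diagnostic to (e.g. a package lacking package-info.java), fall back to
// Reporter's position-less overload instead of passing null through.
class SafeReporting {
  private final Reporter reporter;

  SafeReporting(Reporter reporter) {
    this.reporter = reporter;
  }

  void error(Element element, String fullMessage) {
    if (element == null) {
      reporter.print(Diagnostic.Kind.ERROR, fullMessage);
    } else {
      reporter.print(Diagnostic.Kind.ERROR, element, fullMessage);
    }
  }
}
{code}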
[jira] [Created] (LUCENE-9748) Hunspell: suggest inflected dictionary entries similar to the misspelled word
Peter Gromov created LUCENE-9748:
------------------------------------

             Summary: Hunspell: suggest inflected dictionary entries similar to the misspelled word
                 Key: LUCENE-9748
                 URL: https://issues.apache.org/jira/browse/LUCENE-9748
             Project: Lucene - Core
          Issue Type: Sub-task
            Reporter: Peter Gromov
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
donnerpeter commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572666204

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -70,6 +70,9 @@
 /** In-memory structure for the dictionary (.dic) and affix (.aff) data of a hunspell dictionary. */
 public class Dictionary {
+  // Derived from woorm/ openoffice dictionaries.

Review comment:
LibreOffice?
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
donnerpeter commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572666977

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -224,26 +227,37 @@ public Dictionary(
     this.needsInputCleaning = ignoreCase;
     this.needsOutputCleaning = false; // set if we have an OCONV
 
-    Path tempPath = getDefaultTempDir(); // TODO: make this configurable?
-    Path aff = Files.createTempFile(tempPath, "affix", "aff");
-
-    BufferedInputStream aff1 = null;
-    InputStream aff2 = null;
-    boolean success = false;
-    try {
-      // Copy contents of the affix stream to a temp file.
-      try (OutputStream os = Files.newOutputStream(aff)) {
-        affix.transferTo(os);
+    try (BufferedInputStream affixStream =
+        new BufferedInputStream(affix, MAX_PROLOGUE_SCAN_WINDOW) {
+          @Override
+          public void close() throws IOException {
+            // TODO: maybe we should consume and close it? Why does it need to stay open?

Review comment:
Probably so that the callers who opened the streams can close them safely using try-with-resources.
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
donnerpeter commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572667748

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -346,6 +352,7 @@ private void readAffixFile(InputStream affixStream, CharsetDecoder decoder, Flag
       if (line.isEmpty()) continue;
 
       String firstWord = line.split("\\s")[0];
+      // TODO: convert to a switch?

Review comment:
I thought about that. Maybe a switch expression, when the language level allows it. A switch with break statements would be too verbose for my taste.
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
donnerpeter commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572668532

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -778,31 +791,36 @@ char affixData(int affixIndex, int offset) {
   private static final byte[] BOM_UTF8 = {(byte) 0xef, (byte) 0xbb, (byte) 0xbf};
 
   /** Parses the encoding and flag format specified in the provided InputStream */
-  private void readConfig(BufferedInputStream stream) throws IOException, ParseException {
-    // I assume we don't support other BOMs (utf16, etc.)? We trivially could,
-    // by adding maybeConsume() with a proper bom... but I don't see hunspell repo to have
-    // any such exotic examples.
-    Charset streamCharset;
-    if (maybeConsume(stream, BOM_UTF8)) {
-      streamCharset = StandardCharsets.UTF_8;
-    } else {
-      streamCharset = DEFAULT_CHARSET;
-    }
-
-    // TODO: can these flags change throughout the file? If not then we can abort sooner. And
-    // then we wouldn't even need to create a temp file for the affix stream - a large enough
-    // leading buffer (BufferedInputStream) would be sufficient?
+  private void readConfig(InputStream stream, Charset streamCharset)
+      throws IOException, ParseException {
     LineNumberReader reader = new LineNumberReader(new InputStreamReader(stream, streamCharset));
     String line;
+    String flagLine = null;
+    boolean charsetFound = false;
+    boolean flagFound = false;
     while ((line = reader.readLine()) != null) {
       if (line.isBlank()) continue;
 
       String firstWord = line.split("\\s")[0];
       if ("SET".equals(firstWord)) {
         decoder = getDecoder(singleArgument(reader, line));
+        charsetFound = true;
       } else if ("FLAG".equals(firstWord)) {
-        flagParsingStrategy = getFlagParsingStrategy(line, decoder.charset());
+        // Preserve the flag line for parsing later since we need the decoder's charset
+        // and just in case they come out of order.
+        flagLine = line;
+        flagFound = true;
+      } else {
+        continue;
       }
+
+      if (charsetFound && flagFound) {
+        break;
+      }
+    }
+
+    if (flagFound) {

Review comment:
flagLine != null?
[jira] [Commented] (LUCENE-9740) Avoid buffering and double-scan of flags in *.aff file
[ https://issues.apache.org/jira/browse/LUCENE-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281609#comment-17281609 ] Peter Gromov commented on LUCENE-9740:
--------------------------------------

Very nice, thanks! I think this can be merged, and additional checks can be added later.

> Avoid buffering and double-scan of flags in *.aff file
> ------------------------------------------------------
>
>                 Key: LUCENE-9740
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9740
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> I wrote a small utility test to scan through all the *.aff files from
> openoffice and woorm - no file has duplicate flags (SET or FLAG), and the maximum
> leading offsets until these flags appear are roughly:
> {code}
> Flag SET at maximum offset 10753
> Flag FLAG at maximum offset 4559
> {code}
> I think we could just assume that, say, affix files are read with a 20kB
> buffered reader, and that this provides the maximum leading window for
> scanning for those flags. The dictionary parsing could also fail if any of
> these flags occurs more than once in the input file?
> This would avoid having to read the file twice and perhaps simplify the API
> (no need for a temporary spill).
> I'll piggyback this test as part of LUCENE-9727 if you'd like to re-run it locally.
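[Editor's sketch] For illustration, the scheme the description proposes maps onto the standard mark/reset contract of BufferedInputStream. A sketch under those assumptions (MAX_PROLOGUE_SCAN_WINDOW is the constant name the committed patch uses, visible in the PR diff earlier in this digest; the actual SET/FLAG scanning is elided):

{code}
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

class PrologueScan {
  // 20kB upper bound derived from the measured maximum offsets (10753 / 4559).
  static final int MAX_PROLOGUE_SCAN_WINDOW = 20 * 1024;

  // Sketch: scan the leading window for SET/FLAG, then rewind and parse the
  // whole stream once - no temp-file spill needed.
  static BufferedInputStream scanPrologue(InputStream affix) throws IOException {
    BufferedInputStream in = new BufferedInputStream(affix, MAX_PROLOGUE_SCAN_WINDOW);
    in.mark(MAX_PROLOGUE_SCAN_WINDOW); // remember the stream start
    byte[] window = in.readNBytes(MAX_PROLOGUE_SCAN_WINDOW);
    // ... locate the SET and FLAG lines inside `window` here ...
    in.reset(); // rewind: the regular single-pass parse starts from byte 0
    return in;
  }
}
{code}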
[jira] [Created] (LUCENE-9749) Hunspell: apply output conversion (OCONV) to the suggestions
Peter Gromov created LUCENE-9749:
------------------------------------

             Summary: Hunspell: apply output conversion (OCONV) to the suggestions
                 Key: LUCENE-9749
                 URL: https://issues.apache.org/jira/browse/LUCENE-9749
             Project: Lucene - Core
          Issue Type: Sub-task
            Reporter: Peter Gromov
[GitHub] [lucene-solr] donnerpeter opened a new pull request #2329: LUCENE-9749: Hunspell: apply output conversion (OCONV) to the suggestions
donnerpeter opened a new pull request #2329:
URL: https://github.com/apache/lucene-solr/pull/2329

# Description

OCONV should be applied not only to stems, but also to suggestions.

# Solution

Call the method that applies it :)

# Tests

`oconv` from the Hunspell repo

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `master` branch.
- [x] I have run `./gradlew check`.
- [x] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
[GitHub] [lucene-solr] donnerpeter opened a new pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…
donnerpeter opened a new pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330

…o the misspelled word

# Description

A follow-up of the "ngram" suggestion support that adds single prefixes and suffixes to dictionary entries to get better suggestions.

# Solution

Copy Hunspell's logic, extract some common code for FST traversal.

# Tests

`allcaps.sug` from the Hunspell repo

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `master` branch.
- [x] I have run `./gradlew check`.
- [x] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
[GitHub] [lucene-solr] iverase merged pull request #2268: LUCENE-9705: Move Lucene50CompoundFormat to Lucene90CompoundFormat
iverase merged pull request #2268:
URL: https://github.com/apache/lucene-solr/pull/2268
[jira] [Commented] (LUCENE-9705) Move all codec formats to the o.a.l.codecs.Lucene90 package
[ https://issues.apache.org/jira/browse/LUCENE-9705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281617#comment-17281617 ] ASF subversion and git services commented on LUCENE-9705:
----------------------------------------------------------

Commit eafeb6643408e7e978f2fcb8d456b5eb3ca9c187 in lucene-solr's branch refs/heads/master from Ignacio Vera
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=eafeb66 ]

LUCENE-9705: Move Lucene50CompoundFormat to Lucene90CompoundFormat (#2268)

> Move all codec formats to the o.a.l.codecs.Lucene90 package
> ------------------------------------------------------------
>
>                 Key: LUCENE-9705
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9705
>             Project: Lucene - Core
>          Issue Type: Wish
>            Reporter: Ignacio Vera
>            Priority: Major
>          Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> Current formats are distributed across different packages, prefixed with the
> Lucene version in which they were created. With the upcoming release of Lucene 9.0,
> it would be nice to move all those formats to just the o.a.l.codecs.Lucene90
> package (and of course move the current ones to the backwards-codecs).
> This issue would actually facilitate moving the directory API to little
> endian (LUCENE-9047), as the only codecs that would need to handle backwards
> compatibility would be those in backwards-codecs.
> In addition, it can help formalize the use of internal versions vs. format
> versioning (LUCENE-9616).
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…
donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572676276

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
@@ -33,44 +40,59 @@
  */
 class GeneratingSuggester {
   private static final int MAX_ROOTS = 100;
-  private static final int MAX_GUESSES = 100;
+  private static final int MAX_WORDS = 100;
+  private static final int MAX_GUESSES = 200;
   private final Dictionary dictionary;
+  private final SpellChecker speller;
 
-  GeneratingSuggester(Dictionary dictionary) {
-    this.dictionary = dictionary;
+  GeneratingSuggester(SpellChecker speller) {
+    this.dictionary = speller.dictionary;
+    this.speller = speller;
   }
 
   List<String> suggest(String word, WordCase originalCase, Set<String> prevSuggestions) {
-    List<String> roots = findSimilarDictionaryEntries(word, originalCase);
-    List<String> expanded = expandRoots(word, roots);
-    TreeSet<WeightedWord> bySimilarity = rankBySimilarity(word, expanded);
+    List<Weighted<DictEntry>> roots = findSimilarDictionaryEntries(word, originalCase);

Review comment:
Just renamed a parameterized `WeightedWord`
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…
donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572676594

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
@@ -33,44 +40,59 @@
[...]
   List<String> suggest(String word, WordCase originalCase, Set<String> prevSuggestions) {
-    List<String> roots = findSimilarDictionaryEntries(word, originalCase);
-    List<String> expanded = expandRoots(word, roots);
-    TreeSet<WeightedWord> bySimilarity = rankBySimilarity(word, expanded);
+    List<Weighted<DictEntry>> roots = findSimilarDictionaryEntries(word, originalCase);
+    List<Weighted<DictEntry>> expanded = expandRoots(word, roots);
+    TreeSet<Weighted<DictEntry>> bySimilarity = rankBySimilarity(word, expanded);
     return getMostRelevantSuggestions(bySimilarity, prevSuggestions);
   }
 
-  private List<String> findSimilarDictionaryEntries(String word, WordCase originalCase) {
-    try {
-      IntsRefFSTEnum<IntsRef> fstEnum = new IntsRefFSTEnum<>(dictionary.words);
-      TreeSet<WeightedWord> roots = new TreeSet<>();
+  private List<Weighted<DictEntry>> findSimilarDictionaryEntries(
+      String word, WordCase originalCase) {
+    TreeSet<Weighted<DictEntry>> roots = new TreeSet<>();
+    processFST(

Review comment:
extracted FST traversal into a separate method
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…
donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572676998

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
@@ -84,33 +106,34 @@ private static String toString(IntsRef key) {
     return new String(chars);
   }
 
-  private boolean isSuitableRoot(IntsRef forms) {
+  private List<DictEntry> filterSuitableEntries(String word, IntsRef forms) {
+    List<DictEntry> result = new ArrayList<>();
     for (int i = 0; i < forms.length; i += dictionary.formStep()) {
       int entryId = forms.ints[forms.offset + i];
-      if (dictionary.hasFlag(entryId, dictionary.needaffix)

Review comment:
needaffix check is moved into `expandRoot`
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…
donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572677318

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
@@ -132,14 +155,105 @@ private static int calcThreshold(String word) {
     return thresh / 3 - 1;
   }
 
-  private TreeSet<WeightedWord> rankBySimilarity(String word, List<String> expanded) {
+  private List<Weighted<DictEntry>> expandRoot(DictEntry root, String misspelled) {

Review comment:
Main change here
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…
donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572677980

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java
@@ -798,8 +798,4 @@ private boolean isFlagAppendedByAffix(int affixId, char flag) {
     int appendId = dictionary.affixData(affixId, Dictionary.AFFIX_APPEND);
     return dictionary.hasFlag(appendId, flag);
   }
-
-  private boolean isCrossProduct(int affix) {

Review comment:
moved to Dictionary
[GitHub] [lucene-solr] iverase merged pull request #2269: LUCENE-9322: Add TestLucene90FieldInfosFormat
iverase merged pull request #2269:
URL: https://github.com/apache/lucene-solr/pull/2269
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…
donnerpeter commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572696542

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
@@ -33,44 +40,59 @@
[...]
+    TreeSet<Weighted<DictEntry>> roots = new TreeSet<>();
+    processFST(
+        dictionary.words,
+        (key, forms) -> {
+          if (Math.abs(key.length - word.length()) > 4) return;
+
+          String root = toString(key);
+          List<DictEntry> entries = filterSuitableEntries(root, forms);
+          if (entries.isEmpty()) return;
+
+          if (originalCase == WordCase.LOWER
+              && WordCase.caseOf(root) == WordCase.TITLE
+              && !dictionary.hasLanguage("de")) {
+            return;
+          }
 
-      IntsRefFSTEnum.InputOutput<IntsRef> mapping;
-      while ((mapping = fstEnum.next()) != null) {
-        IntsRef key = mapping.input;
-        if (Math.abs(key.length - word.length()) > 4 || !isSuitableRoot(mapping.output)) continue;
-
-        String root = toString(key);
-        if (originalCase == WordCase.LOWER
-            && WordCase.caseOf(root) == WordCase.TITLE
-            && !dictionary.hasLanguage("de")) {
-          continue;
-        }
+          String lower = dictionary.toLowerCase(root);
+          int sc =
+              ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE))
+                  + commonPrefix(word, root);
 
-        String lower = dictionary.toLowerCase(root);
-        int sc =
-            ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + commonPrefix(word, root);
+          entries.forEach(e -> roots.add(new Weighted<>(e, sc)));
+        });
+    return roots.stream().limit(MAX_ROOTS).collect(Collectors.toList());
+  }
 
-        roots.add(new WeightedWord(root, sc));
+  private void processFST(FST<IntsRef> fst, BiConsumer<IntsRef, IntsRef> keyValueConsumer) {

Review comment:
This might be worth moving to some util, e.g. `IntsRefFSTEnum`
[GitHub] [lucene-solr] iverase opened a new pull request #2331: LUCENE-9322: Lucene90VectorWriter can leak open files
iverase opened a new pull request #2331:
URL: https://github.com/apache/lucene-solr/pull/2331

While trying to add a base test class for vectors based on `BaseIndexFileFormatTestCase`, a bug surfaced in the Lucene90VectorWriter constructor: if an exception is thrown in the middle of it, files might not be closed properly and can therefore leak.

Here is the proposal: move the current `TestVectorValues` to a `BaseVectorFormatTestCase` which extends `BaseIndexFileFormatTestCase`, and fix the constructor so it handles closing files on error properly.
[GitHub] [lucene-solr] dweiss commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
dweiss commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572701976

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -778,31 +791,36 @@ char affixData(int affixIndex, int offset) {
[...]
+    if (flagFound) {

Review comment:
If flagFound is true then flagLine had to be != null, otherwise you'd get an NPE earlier on line.split?
[GitHub] [lucene-solr] dweiss merged pull request #2327: LUCENE-9740: scan affix stream once.
dweiss merged pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327
[jira] [Resolved] (LUCENE-9740) Avoid buffering and double-scan of flags in *.aff file
[ https://issues.apache.org/jira/browse/LUCENE-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-9740.
---------------------------------
    Fix Version/s: master (9.0)
       Resolution: Fixed

> Avoid buffering and double-scan of flags in *.aff file
> ------------------------------------------------------
>
>                 Key: LUCENE-9740
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9740
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: master (9.0)
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I wrote a small utility test to scan through all the *.aff files from
> openoffice and woorm - no file has duplicate flags (SET or FLAG), and the maximum
> leading offsets until these flags appear are roughly:
> {code}
> Flag SET at maximum offset 10753
> Flag FLAG at maximum offset 4559
> {code}
> I think we could just assume that, say, affix files are read with a 20kB
> buffered reader, and that this provides the maximum leading window for
> scanning for those flags. The dictionary parsing could also fail if any of
> these flags occurs more than once in the input file?
> This would avoid having to read the file twice and perhaps simplify the API
> (no need for a temporary spill).
> I'll piggyback this test as part of LUCENE-9727 if you'd like to re-run it locally.
[jira] [Commented] (LUCENE-9740) Avoid buffering and double-scan of flags in *.aff file
[ https://issues.apache.org/jira/browse/LUCENE-9740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281640#comment-17281640 ] ASF subversion and git services commented on LUCENE-9740:
----------------------------------------------------------

Commit 061b3f29c99cf4070677eeaf4525ff6f9fff0a56 in lucene-solr's branch refs/heads/master from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=061b3f2 ]

LUCENE-9740: scan affix stream once. (#2327)

> Avoid buffering and double-scan of flags in *.aff file
> ------------------------------------------------------
>
>                 Key: LUCENE-9740
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9740
>             Project: Lucene - Core
>          Issue Type: Sub-task
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>            Priority: Minor
>             Fix For: master (9.0)
>
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> I wrote a small utility test to scan through all the *.aff files from
> openoffice and woorm - no file has duplicate flags (SET or FLAG), and the maximum
> leading offsets until these flags appear are roughly:
> {code}
> Flag SET at maximum offset 10753
> Flag FLAG at maximum offset 4559
> {code}
> I think we could just assume that, say, affix files are read with a 20kB
> buffered reader, and that this provides the maximum leading window for
> scanning for those flags. The dictionary parsing could also fail if any of
> these flags occurs more than once in the input file?
> This would avoid having to read the file twice and perhaps simplify the API
> (no need for a temporary spill).
> I'll piggyback this test as part of LUCENE-9727 if you'd like to re-run it locally.
[GitHub] [lucene-solr] dweiss commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
dweiss commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572702955

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -70,6 +70,9 @@
 /** In-memory structure for the dictionary (.dic) and affix (.aff) data of a hunspell dictionary. */
 public class Dictionary {
+  // Derived from woorm/ openoffice dictionaries.

Review comment:
Ouch. Can you correct it for me and piggyback it on any subsequent patch? I overlooked this one.
[jira] [Created] (LUCENE-9750) Hunspell: improve suggestions for mixed-case misspelled words
Peter Gromov created LUCENE-9750:
------------------------------------

             Summary: Hunspell: improve suggestions for mixed-case misspelled words
                 Key: LUCENE-9750
                 URL: https://issues.apache.org/jira/browse/LUCENE-9750
             Project: Lucene - Core
          Issue Type: Sub-task
            Reporter: Peter Gromov
[GitHub] [lucene-solr] dweiss commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
dweiss commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572703555

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -224,26 +227,37 @@ public Dictionary(
[...]
+            // TODO: maybe we should consume and close it? Why does it need to stay open?

Review comment:
Closeable.close() can be invoked any number of times without side-effects - this is a contract stated in the javadoc.
[GitHub] [lucene-solr] dweiss commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
dweiss commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572704827

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -346,6 +352,7 @@ private void readAffixFile(InputStream affixStream, CharsetDecoder decoder, Flag
[...]
+      // TODO: convert to a switch?

Review comment:
Ok. My taste tells me it'd be cleaner than that multi-level if, especially since not all statements are identical there (one compares against the entire line, I believe). It may be worth considering the switch for performance reasons too, but I don't know if you'd see the difference.
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
donnerpeter commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572706742

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -224,26 +227,37 @@ public Dictionary(
[...]
+            // TODO: maybe we should consume and close it? Why does it need to stay open?

Review comment:
True. It feels a bit more right to me when the ones creating the streams close them. But in fact I don't like the whole idea of passing streams to the constructor. I believe most clients would be happier passing paths (except in the rare(?) cases when the content is created in memory or loaded from the classpath).
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
donnerpeter commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572707863

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -778,31 +791,36 @@ char affixData(int affixIndex, int offset) {
[...]
+    if (flagFound) {

Review comment:
Yes. I mean that `flagFound` is redundant since we already have `flagLine`.
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
donnerpeter commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572710743

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -346,6 +352,7 @@ private void readAffixFile(InputStream affixStream, CharsetDecoder decoder, Flag
[...]
+      // TODO: convert to a switch?

Review comment:
A bit cleaner, yes, but verbosity and the risk of forgetting a `break` outweigh that for me. I also considered creating a map from first word to parsing lambdas, but decided it'd be quite verbose as well. The performance difference should be negligible here: last time I checked, parsing was dominated by writing/sorting/reading dic entries and building FSTs.
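[Editor's sketch] For reference, the two shapes being weighed, the existing if/else chain versus a statement switch with breaks, look roughly like this (handler bodies elided; an illustration, not the actual Dictionary code):

```java
// Existing shape: chained comparisons, one of which (per dweiss's remark)
// matches against more than just the first word.
if ("SET".equals(firstWord)) {
  // handle SET
} else if ("FLAG".equals(firstWord)) {
  // handle FLAG
}

// Statement-switch shape: each case needs its own break, which is the
// verbosity (and forgotten-break risk) donnerpeter objects to.
switch (firstWord) {
  case "SET":
    // handle SET
    break;
  case "FLAG":
    // handle FLAG
    break;
  default:
    break;
}
```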
[GitHub] [lucene-solr] donnerpeter opened a new pull request #2332: LUCENE-9750: Hunspell: improve suggestions for mixed-case misspelled words
donnerpeter opened a new pull request #2332:
URL: https://github.com/apache/lucene-solr/pull/2332

# Description

Fix a failing Hunspell repo test.

# Solution

Replicate Hunspell's logic around suggestion casing, especially mixed-case ones.

# Tests

`i58202` from the Hunspell repo, whatever that means

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `master` branch.
- [x] I have run `./gradlew check`.
- [x] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).
[jira] [Created] (LUCENE-9751) Assertion error (int overflow) in ByteSliceReader
Dawid Weiss created LUCENE-9751:
-----------------------------------

             Summary: Assertion error (int overflow) in ByteSliceReader
                 Key: LUCENE-9751
                 URL: https://issues.apache.org/jira/browse/LUCENE-9751
             Project: Lucene - Core
          Issue Type: Bug
    Affects Versions: 8.7
            Reporter: Dawid Weiss

New computers come with insane amounts of RAM, and heaps can get pretty big. If you adjust per-thread buffers to larger values, strange things start happening. This happened to us today:

{code}
Caused by: java.lang.AssertionError
	at org.apache.lucene.index.ByteSliceReader.init(ByteSliceReader.java:44) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.index.TermsHashPerField.initReader(TermsHashPerField.java:88) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.index.FreqProxFields$FreqProxPostingsEnum.reset(FreqProxFields.java:430) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.index.FreqProxFields$FreqProxTermsEnum.postings(FreqProxFields.java:247) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:127) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:907) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:264) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:480) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:394) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:440) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471) ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - atrisharma - 2020-10-29 19:35:28]
	... 7 more
{code}

Likely an int overflow in TermsHashPerField:

{code}
reader.init(
    bytePool,
    postingsArray.byteStarts[termID] + stream * ByteBlockPool.FIRST_LEVEL_SIZE,
    streamAddressBuffer[offsetInAddressBuffer + stream]);
{code}

Don't know if this can be prevented somehow.
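[Editor's sketch] The arithmetic in that line is plain int math, so the overflow is easy to reproduce in isolation. A self-contained demonstration with hypothetical values (ByteBlockPool.FIRST_LEVEL_SIZE is 5):

{code}
public class ByteStartOverflow {
  public static void main(String[] args) {
    // Hypothetical values: with multi-GB per-thread buffers, a term's start
    // offset in the shared byte pool can approach Integer.MAX_VALUE.
    int byteStart = Integer.MAX_VALUE - 2; // postingsArray.byteStarts[termID]
    int stream = 1;
    int firstLevelSize = 5;                // ByteBlockPool.FIRST_LEVEL_SIZE

    int start = byteStart + stream * firstLevelSize; // wraps around silently
    System.out.println(start); // negative -> trips the assert in ByteSliceReader.init
  }
}
{code}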
[GitHub] [lucene-solr] dweiss commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
dweiss commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572717985

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -224,26 +227,37 @@ public Dictionary(
[...]
+            // TODO: maybe we should consume and close it? Why does it need to stay open?

Review comment:
Leave it a stream, it's hard to beat its flexibility.
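[Editor's sketch] From the caller's side, the ownership convention being defended looks like this. A sketch only: the Dictionary constructor arguments are approximated from the PR, and the point is that the code that opens the streams also closes them:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.lucene.analysis.hunspell.Dictionary;
import org.apache.lucene.store.Directory;

class DictionaryLoading {
  // Constructor shape approximated; the point is the ownership convention.
  static void load(Directory tempDir) throws Exception {
    try (InputStream affix = Files.newInputStream(Path.of("en_US.aff"));
         InputStream dic = Files.newInputStream(Path.of("en_US.dic"))) {
      // Dictionary reads from the streams but never takes ownership, so
      // try-with-resources above stays the single point of close(); an extra
      // close() from an internal wrapper is harmless because close() is
      // idempotent by contract.
      Dictionary dictionary = new Dictionary(tempDir, "hunspell", affix, dic);
    }
  }
}
```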
[GitHub] [lucene-solr] dweiss commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
dweiss commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572718170

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -346,6 +352,7 @@ private void readAffixFile(InputStream affixStream, CharsetDecoder decoder, Flag
[...]
+      // TODO: convert to a switch?

Review comment:
ok.
[GitHub] [lucene-solr] dweiss commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
dweiss commented on a change in pull request #2327:
URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572718487

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java
@@ -778,31 +791,36 @@ char affixData(int affixIndex, int offset) {
[...]
+    if (flagFound) {

Review comment:
Yeah, I left that intentionally, for parity with charsetFound...
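[Editor's sketch] The simplification donnerpeter is pointing at, letting the reference double as the boolean, would look like this (a sketch of the merged code, not the committed version):

```java
// flagLine is non-null exactly when a FLAG line was seen, so the separate
// flagFound boolean could be dropped without changing behavior.
while ((line = reader.readLine()) != null) {
  // ... SET / FLAG handling as in the hunk above, minus flagFound ...
  if (charsetFound && flagLine != null) {
    break;
  }
}
if (flagLine != null) {
  flagParsingStrategy = getFlagParsingStrategy(flagLine, decoder.charset());
}
```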
[jira] [Created] (SOLR-15147) Hide jdbc credentials in data-config.xml
Ajay G created SOLR-15147:
-----------------------------

             Summary: Hide jdbc credentials in data-config.xml
                 Key: SOLR-15147
                 URL: https://issues.apache.org/jira/browse/SOLR-15147
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Admin UI, SolrCloud
    Affects Versions: 7.1.1
            Reporter: Ajay G

Team, is there any way to hide the data-config files in the Solr 7.x version?
[GitHub] [lucene-solr] dweiss commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…
dweiss commented on a change in pull request #2330:
URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572744570

## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java
@@ -33,44 +40,59 @@
[...]
+  private void processFST(FST<IntsRef> fst, BiConsumer<IntsRef, IntsRef> keyValueConsumer) {

Review comment:
Add a "forEach" method to fstenum, maybe? It'd correspond to Java collections then.
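[Editor's sketch] Dawid's "forEach" idea, sketched as a hypothetical addition to `IntsRefFSTEnum` (mirroring `Map#forEach`; this method does not exist yet):

```java
// Hypothetical convenience method inside IntsRefFSTEnum<T>: hides the
// next()/InputOutput plumbing that processFST currently re-implements.
public void forEach(BiConsumer<IntsRef, T> action) throws IOException {
  InputOutput<T> mapping;
  while ((mapping = next()) != null) {
    action.accept(mapping.input, mapping.output);
  }
}
```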
[jira] [Commented] (SOLR-9854) Collect metrics for index merges and index store IO
[ https://issues.apache.org/jira/browse/SOLR-9854?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281682#comment-17281682 ] Andrzej Bialecki commented on SOLR-9854: Metrics Counter can only go forward but these integers must be able to go both ways because they represent the number of *currently* running merges (and the current number of docs / segments involved in the running merges), which naturally may vary from 0 to N. > Collect metrics for index merges and index store IO > --- > > Key: SOLR-9854 > URL: https://issues.apache.org/jira/browse/SOLR-9854 > Project: Solr > Issue Type: Improvement > Components: metrics >Affects Versions: 6.4, 7.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki >Priority: Minor > Fix For: 6.4, 7.0 > > Attachments: SOLR-9854.patch, SOLR-9854.patch > > > Using API for metrics management developed in SOLR-4735 we should also start > collecting metrics for major aspects of {{IndexWriter}} operation, such as > read / write IO rates, number of minor and major merges and IO during these > operations, etc. > This will provide a better insight into resource consumption and load at the > IO level. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
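To illustrate the distinction: a minimal sketch of tracking the number of currently running merges, assuming Dropwizard-style metrics (which Solr's metrics API builds on); the registry name and class below are illustrative, not the actual SOLR-9854 code:

```java
import com.codahale.metrics.Gauge;
import com.codahale.metrics.MetricRegistry;
import java.util.concurrent.atomic.AtomicInteger;

public class RunningMergesMetric {
  private final AtomicInteger runningMerges = new AtomicInteger();

  public RunningMergesMetric(MetricRegistry registry) {
    // A Gauge reports the *current* value each time it is read, so the number
    // it exposes can move both up and down as merges start and finish.
    registry.register("INDEX.merge.running", (Gauge<Integer>) runningMerges::get);
  }

  public void onMergeStart() {
    runningMerges.incrementAndGet();
  }

  public void onMergeFinish() {
    runningMerges.decrementAndGet();
  }
}
```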
[jira] [Commented] (LUCENE-9741) Add optimization for sequential access of stored fields
[ https://issues.apache.org/jira/browse/LUCENE-9741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281684#comment-17281684 ] Adrien Grand commented on LUCENE-9741: -- I've fallen into the trap of not optimizing merging for stored fields a couple times, typically by forgetting to override {{getMergeInstance()}} when passing a FilterCodecReader to {{IndexWriter#addIndexes}}, so I'd be supportive of making sequential access more a first-class citizen of stored fields. However the proposed API feels a bit too complex to me. I wonder if we could achieve the same benefits by changing the StoredFieldsReader API to return an iterator over stored fields that would keep state in order to avoid decompressing the same data over and over again? > Add optimization for sequential access of stored fields > --- > > Key: LUCENE-9741 > URL: https://issues.apache.org/jira/browse/LUCENE-9741 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Nhat Nguyen >Assignee: Nhat Nguyen >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > If we are reading the stored-fields of document ids (25, 27, 28, 26, 99), and > doc-25 triggers the stored-fields reader to decompress a block containing > document ids [10-50], then we can tell the reader to read not only 25, but > 26, 27, and 28 to avoid decompressing that block multiple times. > This issue proposes adding a new optimized instance of stored-fields reader > that allows users to select the preferred fetching range. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9741) Add optimization for sequential access of stored fields
[ https://issues.apache.org/jira/browse/LUCENE-9741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281687#comment-17281687 ] Adrien Grand commented on LUCENE-9741: -- To be clear, I'm thinking of only updating the {{StoredFieldsReader}} API, the {{LeafReaderAPI}} could remain the same and {{CodecReader#document}} could be implemented by creating an iterator, advancing it to the desired doc, doing what it has to do, and then throwing away the iterator immediately to allow the JVM to garbage-collect memory that is needed for the internal state of the iterator. > Add optimization for sequential access of stored fields > --- > > Key: LUCENE-9741 > URL: https://issues.apache.org/jira/browse/LUCENE-9741 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Nhat Nguyen >Assignee: Nhat Nguyen >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > If we are reading the stored-fields of document ids (25, 27, 28, 26, 99), and > doc-25 triggers the stored-fields reader to decompress a block containing > document ids [10-50], then we can tell the reader to read not only 25, but > 26, 27, and 28 to avoid decompressing that block multiple times. > This issue proposes adding a new optimized instance of stored-fields reader > that allows users to select the preferred fetching range. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
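To make the iterator idea concrete, here is one hypothetical shape for it; the class and method names below are invented for illustration and are not an actual Lucene API:

```java
import java.io.IOException;

// Hypothetical sketch: a stateful stored-fields cursor that caches the last
// decompressed block, so reads of nearby doc IDs (25, 26, 27, 28) reuse it.
abstract class StoredFieldsCursor {
  private int cachedBlockStart = -1; // first docID covered by the cached block
  private int cachedBlockEnd = -1;   // last docID covered by the cached block

  /** Positions the cursor on {@code docID}, decompressing a block only on a cache miss. */
  final void advanceTo(int docID) throws IOException {
    if (docID < cachedBlockStart || docID > cachedBlockEnd) {
      int[] range = decompressBlockContaining(docID); // e.g. returns {10, 50}
      cachedBlockStart = range[0];
      cachedBlockEnd = range[1];
    }
    positionWithinBlock(docID);
  }

  /** Decompresses the block holding {@code docID} and returns its [firstDoc, lastDoc] range. */
  protected abstract int[] decompressBlockContaining(int docID) throws IOException;

  /** Seeks to {@code docID} inside the currently cached block. */
  protected abstract void positionWithinBlock(int docID) throws IOException;
}
```

A `CodecReader#document`-style caller would then create a cursor, `advanceTo` the requested doc, read it, and discard the cursor, exactly as described in the comment above.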
[jira] [Commented] (LUCENE-9741) Add optimization for sequential access of stored fields
[ https://issues.apache.org/jira/browse/LUCENE-9741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281690#comment-17281690 ] Robert Muir commented on LUCENE-9741: - The getMergeInstance() is already optimized for this case though, why do we need additional apis? > Add optimization for sequential access of stored fields > --- > > Key: LUCENE-9741 > URL: https://issues.apache.org/jira/browse/LUCENE-9741 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Nhat Nguyen >Assignee: Nhat Nguyen >Priority: Major > Time Spent: 0.5h > Remaining Estimate: 0h > > If we are reading the stored-fields of document ids (25, 27, 28, 26, 99), and > doc-25 triggers the stored-fields reader to decompress a block containing > document ids [10-50], then we can tell the reader to read not only 25, but > 26, 27, and 28 to avoid decompressing that block multiple times. > This issue proposes adding a new optimized instance of stored-fields reader > that allows users to select the preferred fetching range. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…
donnerpeter commented on a change in pull request #2330: URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572760859 ## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java ## @@ -33,44 +40,59 @@ */ class GeneratingSuggester { private static final int MAX_ROOTS = 100; - private static final int MAX_GUESSES = 100; + private static final int MAX_WORDS = 100; + private static final int MAX_GUESSES = 200; private final Dictionary dictionary; + private final SpellChecker speller; - GeneratingSuggester(Dictionary dictionary) { -this.dictionary = dictionary; + GeneratingSuggester(SpellChecker speller) { +this.dictionary = speller.dictionary; +this.speller = speller; } List suggest(String word, WordCase originalCase, Set prevSuggestions) { -List roots = findSimilarDictionaryEntries(word, originalCase); -List expanded = expandRoots(word, roots); -TreeSet bySimilarity = rankBySimilarity(word, expanded); +List> roots = findSimilarDictionaryEntries(word, originalCase); +List> expanded = expandRoots(word, roots); +TreeSet> bySimilarity = rankBySimilarity(word, expanded); return getMostRelevantSuggestions(bySimilarity, prevSuggestions); } - private List findSimilarDictionaryEntries(String word, WordCase originalCase) { -try { - IntsRefFSTEnum fstEnum = new IntsRefFSTEnum<>(dictionary.words); - TreeSet roots = new TreeSet<>(); + private List> findSimilarDictionaryEntries( + String word, WordCase originalCase) { +TreeSet> roots = new TreeSet<>(); +processFST( +dictionary.words, +(key, forms) -> { + if (Math.abs(key.length - word.length()) > 4) return; + + String root = toString(key); + List entries = filterSuitableEntries(root, forms); + if (entries.isEmpty()) return; + + if (originalCase == WordCase.LOWER + && WordCase.caseOf(root) == WordCase.TITLE + && !dictionary.hasLanguage("de")) { +return; + } - IntsRefFSTEnum.InputOutput mapping; - while ((mapping = fstEnum.next()) != null) { -IntsRef key = mapping.input; -if (Math.abs(key.length - word.length()) > 4 || !isSuitableRoot(mapping.output)) continue; - -String root = toString(key); -if (originalCase == WordCase.LOWER -&& WordCase.caseOf(root) == WordCase.TITLE -&& !dictionary.hasLanguage("de")) { - continue; -} + String lower = dictionary.toLowerCase(root); + int sc = + ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + + commonPrefix(word, root); -String lower = dictionary.toLowerCase(root); -int sc = -ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + commonPrefix(word, root); + entries.forEach(e -> roots.add(new Weighted<>(e, sc))); +}); +return roots.stream().limit(MAX_ROOTS).collect(Collectors.toList()); + } -roots.add(new WeightedWord(root, sc)); + private void processFST(FST fst, BiConsumer keyValueConsumer) { Review comment: I wonder if it makes sense to add something breakable in the middle, e.g. accepting some processor (unfortunately neither BiFunction nor BiPredicate convey that semantics for me :( ). OTOH I don't need it right now, and breakability can be added later. Or, it could be made a `Stream` or `Iterable`. One complication though: here I wrap all `IOException`s, but that's probably not a good idea in a general FST case. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. 
For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…
dweiss commented on a change in pull request #2330: URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572764206 ## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java ## @@ -33,44 +40,59 @@ */ class GeneratingSuggester { private static final int MAX_ROOTS = 100; - private static final int MAX_GUESSES = 100; + private static final int MAX_WORDS = 100; + private static final int MAX_GUESSES = 200; private final Dictionary dictionary; + private final SpellChecker speller; - GeneratingSuggester(Dictionary dictionary) { -this.dictionary = dictionary; + GeneratingSuggester(SpellChecker speller) { +this.dictionary = speller.dictionary; +this.speller = speller; } List suggest(String word, WordCase originalCase, Set prevSuggestions) { -List roots = findSimilarDictionaryEntries(word, originalCase); -List expanded = expandRoots(word, roots); -TreeSet bySimilarity = rankBySimilarity(word, expanded); +List> roots = findSimilarDictionaryEntries(word, originalCase); +List> expanded = expandRoots(word, roots); +TreeSet> bySimilarity = rankBySimilarity(word, expanded); return getMostRelevantSuggestions(bySimilarity, prevSuggestions); } - private List findSimilarDictionaryEntries(String word, WordCase originalCase) { -try { - IntsRefFSTEnum fstEnum = new IntsRefFSTEnum<>(dictionary.words); - TreeSet roots = new TreeSet<>(); + private List> findSimilarDictionaryEntries( + String word, WordCase originalCase) { +TreeSet> roots = new TreeSet<>(); +processFST( +dictionary.words, +(key, forms) -> { + if (Math.abs(key.length - word.length()) > 4) return; + + String root = toString(key); + List entries = filterSuitableEntries(root, forms); + if (entries.isEmpty()) return; + + if (originalCase == WordCase.LOWER + && WordCase.caseOf(root) == WordCase.TITLE + && !dictionary.hasLanguage("de")) { +return; + } - IntsRefFSTEnum.InputOutput mapping; - while ((mapping = fstEnum.next()) != null) { -IntsRef key = mapping.input; -if (Math.abs(key.length - word.length()) > 4 || !isSuitableRoot(mapping.output)) continue; - -String root = toString(key); -if (originalCase == WordCase.LOWER -&& WordCase.caseOf(root) == WordCase.TITLE -&& !dictionary.hasLanguage("de")) { - continue; -} + String lower = dictionary.toLowerCase(root); + int sc = + ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + + commonPrefix(word, root); -String lower = dictionary.toLowerCase(root); -int sc = -ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + commonPrefix(word, root); + entries.forEach(e -> roots.add(new Weighted<>(e, sc))); +}); +return roots.stream().limit(MAX_ROOTS).collect(Collectors.toList()); + } -roots.add(new WeightedWord(root, sc)); + private void processFST(FST fst, BiConsumer keyValueConsumer) { Review comment: A BiPredicate sounds good to me, actually... But if IOExceptions are to be allowed then you'd need a custom visitor interface anyway. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9752) Hunspell Stemmer: reduce parameter count
Peter Gromov created LUCENE-9752: Summary: Hunspell Stemmer: reduce parameter count Key: LUCENE-9752 URL: https://issues.apache.org/jira/browse/LUCENE-9752 Project: Lucene - Core Issue Type: Sub-task Reporter: Peter Gromov -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2330: LUCENE-9748: Hunspell: suggest inflected dictionary entries similar t…
donnerpeter commented on a change in pull request #2330: URL: https://github.com/apache/lucene-solr/pull/2330#discussion_r572766983 ## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/GeneratingSuggester.java ## @@ -33,44 +40,59 @@ */ class GeneratingSuggester { private static final int MAX_ROOTS = 100; - private static final int MAX_GUESSES = 100; + private static final int MAX_WORDS = 100; + private static final int MAX_GUESSES = 200; private final Dictionary dictionary; + private final SpellChecker speller; - GeneratingSuggester(Dictionary dictionary) { -this.dictionary = dictionary; + GeneratingSuggester(SpellChecker speller) { +this.dictionary = speller.dictionary; +this.speller = speller; } List suggest(String word, WordCase originalCase, Set prevSuggestions) { -List roots = findSimilarDictionaryEntries(word, originalCase); -List expanded = expandRoots(word, roots); -TreeSet bySimilarity = rankBySimilarity(word, expanded); +List> roots = findSimilarDictionaryEntries(word, originalCase); +List> expanded = expandRoots(word, roots); +TreeSet> bySimilarity = rankBySimilarity(word, expanded); return getMostRelevantSuggestions(bySimilarity, prevSuggestions); } - private List findSimilarDictionaryEntries(String word, WordCase originalCase) { -try { - IntsRefFSTEnum fstEnum = new IntsRefFSTEnum<>(dictionary.words); - TreeSet roots = new TreeSet<>(); + private List> findSimilarDictionaryEntries( + String word, WordCase originalCase) { +TreeSet> roots = new TreeSet<>(); +processFST( +dictionary.words, +(key, forms) -> { + if (Math.abs(key.length - word.length()) > 4) return; + + String root = toString(key); + List entries = filterSuitableEntries(root, forms); + if (entries.isEmpty()) return; + + if (originalCase == WordCase.LOWER + && WordCase.caseOf(root) == WordCase.TITLE + && !dictionary.hasLanguage("de")) { +return; + } - IntsRefFSTEnum.InputOutput mapping; - while ((mapping = fstEnum.next()) != null) { -IntsRef key = mapping.input; -if (Math.abs(key.length - word.length()) > 4 || !isSuitableRoot(mapping.output)) continue; - -String root = toString(key); -if (originalCase == WordCase.LOWER -&& WordCase.caseOf(root) == WordCase.TITLE -&& !dictionary.hasLanguage("de")) { - continue; -} + String lower = dictionary.toLowerCase(root); + int sc = + ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + + commonPrefix(word, root); -String lower = dictionary.toLowerCase(root); -int sc = -ngram(3, word, lower, EnumSet.of(NGramOptions.LONGER_WORSE)) + commonPrefix(word, root); + entries.forEach(e -> roots.add(new Weighted<>(e, sc))); +}); +return roots.stream().limit(MAX_ROOTS).collect(Collectors.toList()); + } -roots.add(new WeightedWord(root, sc)); + private void processFST(FST fst, BiConsumer keyValueConsumer) { Review comment: BiPredicate sounds pure to me, while this processing can have side effects. It's not in the javadoc, just in the name: predicates are something stateless. `IOException`s would be in the FST walking, the processing code itself doesn't necessarily need them (but can also have them). Maybe given all that it's just easier to leave the walking here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
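A sketch of the custom visitor idea from this thread; `FSTVisitor` and the `forEach` helper are names of my own choosing, not existing Lucene API:

```java
import java.io.IOException;
import org.apache.lucene.util.IntsRef;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.IntsRefFSTEnum;

// A visitor that, unlike BiConsumer/BiPredicate, may throw IOException
// and may stop the traversal early by returning false.
interface FSTVisitor<T> {
  boolean visit(IntsRef key, T output) throws IOException;
}

final class FSTTraversal {
  static <T> void forEach(FST<T> fst, FSTVisitor<T> visitor) throws IOException {
    IntsRefFSTEnum<T> fstEnum = new IntsRefFSTEnum<>(fst);
    IntsRefFSTEnum.InputOutput<T> mapping;
    while ((mapping = fstEnum.next()) != null) {
      if (visitor.visit(mapping.input, mapping.output) == false) {
        return; // visitor requested early termination
      }
    }
  }
}
```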
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2327: LUCENE-9740: scan affix stream once.
donnerpeter commented on a change in pull request #2327: URL: https://github.com/apache/lucene-solr/pull/2327#discussion_r572769064 ## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Dictionary.java ## @@ -778,31 +791,36 @@ char affixData(int affixIndex, int offset) { private static final byte[] BOM_UTF8 = {(byte) 0xef, (byte) 0xbb, (byte) 0xbf}; /** Parses the encoding and flag format specified in the provided InputStream */ - private void readConfig(BufferedInputStream stream) throws IOException, ParseException { -// I assume we don't support other BOMs (utf16, etc.)? We trivially could, -// by adding maybeConsume() with a proper bom... but I don't see hunspell repo to have -// any such exotic examples. -Charset streamCharset; -if (maybeConsume(stream, BOM_UTF8)) { - streamCharset = StandardCharsets.UTF_8; -} else { - streamCharset = DEFAULT_CHARSET; -} - -// TODO: can these flags change throughout the file? If not then we can abort sooner. And -// then we wouldn't even need to create a temp file for the affix stream - a large enough -// leading buffer (BufferedInputStream) would be sufficient? + private void readConfig(InputStream stream, Charset streamCharset) + throws IOException, ParseException { LineNumberReader reader = new LineNumberReader(new InputStreamReader(stream, streamCharset)); String line; +String flagLine = null; +boolean charsetFound = false; +boolean flagFound = false; while ((line = reader.readLine()) != null) { if (line.isBlank()) continue; String firstWord = line.split("\\s")[0]; if ("SET".equals(firstWord)) { decoder = getDecoder(singleArgument(reader, line)); +charsetFound = true; } else if ("FLAG".equals(firstWord)) { -flagParsingStrategy = getFlagParsingStrategy(line, decoder.charset()); +// Preserve the flag line for parsing later since we need the decoder's charset +// and just in case they come out of order. +flagLine = line; +flagFound = true; + } else { +continue; } + + if (charsetFound && flagFound) { +break; + } +} + +if (flagFound) { Review comment: It could be paired by using something nullable encoding-related :) Anyway it's very minor. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
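For readers following the "scan the affix stream once" change: a simplified sketch of the bounded prologue scan, using mark/reset on a `BufferedInputStream` (the window size and helper names here are illustrative, not the exact `Dictionary` code):

```java
import java.io.BufferedInputStream;
import java.io.IOException;

final class AffixPrologue {
  // Assumed scan window; the real constant is Dictionary.MAX_PROLOGUE_SCAN_WINDOW.
  static final int SCAN_WINDOW = 64 * 1024;

  /**
   * Reads just far enough to find the SET and FLAG directives, then rewinds so the
   * caller can re-parse the same stream from the start with the right charset.
   * Returns {charsetName, flagFormat}; either element may be null if not found.
   */
  static String[] readSetAndFlag(BufferedInputStream in) throws IOException {
    in.mark(SCAN_WINDOW); // reset() below is only valid within this window
    String set = null, flag = null;
    StringBuilder line = new StringBuilder();
    int b, scanned = 0;
    while ((set == null || flag == null) && scanned < SCAN_WINDOW && (b = in.read()) != -1) {
      scanned++;
      if (b == '\n') {
        String s = line.toString().trim();
        if (s.startsWith("SET ")) set = s.substring(4).trim();
        else if (s.startsWith("FLAG ")) flag = s.substring(5).trim();
        line.setLength(0);
      } else {
        line.append((char) b); // the directives themselves are plain ASCII
      }
    }
    in.reset(); // rewind to the marked start
    return new String[] {set, flag};
  }
}
```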
[GitHub] [lucene-solr] donnerpeter opened a new pull request #2333: LUCENE-9752: Hunspell Stemmer: reduce parameter count
donnerpeter opened a new pull request #2333: URL: https://github.com/apache/lucene-solr/pull/2333

# Description

There are too many parameters, some of them avoidable.

# Solution

`doSuffix` is always true; `circumfix` can be calculated at the usage site (and once, not for every homonym).

# Tests

Unaffected

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `master` branch.
- [x] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] donnerpeter commented on a change in pull request #2333: LUCENE-9752: Hunspell Stemmer: reduce parameter count
donnerpeter commented on a change in pull request #2333: URL: https://github.com/apache/lucene-solr/pull/2333#discussion_r572772417 ## File path: lucene/analysis/common/src/java/org/apache/lucene/analysis/hunspell/Stemmer.java ## @@ -698,15 +688,6 @@ private boolean applyAffix( } } - // if circumfix was previously set by a prefix, we must check this suffix, - // to ensure it has it, and vice versa - if (dictionary.circumfix != Dictionary.FLAG_UNSET) { Review comment: moved into `skipLookup`, as this check is independent of the loop variable This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] murblanc commented on pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json
murblanc commented on pull request #2318: URL: https://github.com/apache/lucene-solr/pull/2318#issuecomment-775852524

> Error is a timeout from `CollectionsHandler` having waited 45 seconds

I take that back, @noblepaul. I did the test wrong and just did it again, and it passes. Timing for the collection creation (11x11=121 replicas on 3 nodes) is similar with or without PRS, at about 45 seconds. I can do more testing later (more concurrent threads, more and smaller collections). Note I did put out a few numbers on PRS (not with the patch in this PR though); see [this comment](https://issues.apache.org/jira/browse/SOLR-15146?focusedCommentId=17281460&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17281460).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] iverase opened a new pull request #2334: LUCENE-9705: Create Lucene90TermVectorsFormat
iverase opened a new pull request #2334: URL: https://github.com/apache/lucene-solr/pull/2334 For now this is just a copy of Lucene50TermVectorsFormat. The existing Lucene50TermVectorsFormat was moved to backwards-codecs, along with its utility classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9751) Assertion error (int overflow) in ByteSliceReader
[ https://issues.apache.org/jira/browse/LUCENE-9751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281751#comment-17281751 ] Michael McCandless commented on LUCENE-9751: Hmm I thought we long ago added a best effort to detect/prevent too large a DWPT RAM buffer? Were you maybe indexing rather large individual documents? > Assertion error (int overflow) in ByteSliceReader > - > > Key: LUCENE-9751 > URL: https://issues.apache.org/jira/browse/LUCENE-9751 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.7 >Reporter: Dawid Weiss >Priority: Major > > New computers come with insane amounts of ram and heaps can get pretty big. > If you adjust per-thread buffers to larger values strange things start > happening. This happened to us today: > {code} > Caused by: java.lang.AssertionError > at > org.apache.lucene.index.ByteSliceReader.init(ByteSliceReader.java:44) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.TermsHashPerField.initReader(TermsHashPerField.java:88) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.FreqProxFields$FreqProxPostingsEnum.reset(FreqProxFields.java:430) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.FreqProxFields$FreqProxTermsEnum.postings(FreqProxFields.java:247) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:127) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:907) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:264) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:480) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:394) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > 
org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:440) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > ... 7 more > {code} > Likely an int overflow in TermsHashPerField: > {code} > reader.init(bytePool, > > postingsArray.byteStarts[termID]+stream*ByteBlockPool.FIRST_LEVEL_SIZE, > streamAddressBuffer[offsetInAddressBuffer+stream]); > {code} > Don't know if this can be prevented somehow. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9751) Assertion error (int overflow) in ByteSliceReader
[ https://issues.apache.org/jira/browse/LUCENE-9751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281753#comment-17281753 ] Dawid Weiss commented on LUCENE-9751: - Everything passes with flying colors on lower heap settings (which result in smaller per-thread buffers). Lower means ~20GB. This failure occurred with max heap of 32GB. It's a highly concurrent and job-stealing setup so I doubt I can easily reproduce... > Assertion error (int overflow) in ByteSliceReader > - > > Key: LUCENE-9751 > URL: https://issues.apache.org/jira/browse/LUCENE-9751 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.7 >Reporter: Dawid Weiss >Priority: Major > > New computers come with insane amounts of ram and heaps can get pretty big. > If you adjust per-thread buffers to larger values strange things start > happening. This happened to us today: > {code} > Caused by: java.lang.AssertionError > at > org.apache.lucene.index.ByteSliceReader.init(ByteSliceReader.java:44) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.TermsHashPerField.initReader(TermsHashPerField.java:88) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.FreqProxFields$FreqProxPostingsEnum.reset(FreqProxFields.java:430) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.FreqProxFields$FreqProxTermsEnum.postings(FreqProxFields.java:247) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:127) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:907) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:264) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:480) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:394) > ~[lucene-core-8.7.0.jar:8.7.0 
2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:440) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > ... 7 more > {code} > Likely an int overflow in TermsHashPerField: > {code} > reader.init(bytePool, > > postingsArray.byteStarts[termID]+stream*ByteBlockPool.FIRST_LEVEL_SIZE, > streamAddressBuffer[offsetInAddressBuffer+stream]); > {code} > Don't know if this can be prevented somehow. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
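The wraparound itself is easy to reproduce in isolation; a self-contained illustration in which only the arithmetic mirrors the quoted expression (all constants made up):

```java
public class IntOverflowDemo {
  public static void main(String[] args) {
    // With ~32GB heaps and very large DWPT buffers, byte-pool offsets can
    // approach Integer.MAX_VALUE (2_147_483_647).
    int byteStart = 2_147_483_640;  // offset of a term's postings in the byte pool
    int stream = 2;                 // stream index within the term
    int firstLevelSize = 5;         // stands in for ByteBlockPool.FIRST_LEVEL_SIZE

    int offset = byteStart + stream * firstLevelSize; // silently wraps negative
    System.out.println(offset);                       // prints -2147483646

    // Math.addExact would fail fast instead of wrapping:
    // int safe = Math.addExact(byteStart, stream * firstLevelSize); // ArithmeticException
  }
}
```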
[GitHub] [lucene-solr-operator] krishnachalla-hv opened a new issue #212: Solr cloud never gets deleted when reclaimPolicy is set to Delete
krishnachalla-hv opened a new issue #212: URL: https://github.com/apache/lucene-solr-operator/issues/212

I am facing this issue after upgrading to **v0.2.8**. Earlier I was using **v0.2.6** of the operator and things worked as expected, except for the deletion of the **pvc**s created by the SolrCloud, since that feature was not available in **v0.2.6**. Now I am testing my application with the latest version of the solr-operator (v0.2.8) and facing a weird issue: when I set **reclaimPolicy: Delete** for Solr as well as the provided Zookeeper instance in the **solrcloud** yaml file and try to uninstall, only the solr-operator and zk-operator get uninstalled, but the SolrCloud and Zookeeper pods never get terminated. I created one sample test chart like the one below and tested these things, but the issue is still the same.

```
apiVersion: v2
name: test
description: A Helm chart for intializing multi node solr cloud.
type: application
version: 1.0.0
appVersion: 1.0.0
dependencies:
  - name: solr-operator
    version: 0.2.8
    repository: "https://apache.github.io/lucene-solr-operator/charts"
    condition: solr-operator.enabled
  - name: zookeeper-operator
    version: 0.3.0
    repository: "https://kubernetes-charts.banzaicloud.com"
    condition: zookeeper-operator.enabled
```

And my solr cloud yaml configuration is:

```
apiVersion: solr.bloomberg.com/v1beta1
kind: SolrCloud
metadata:
  name: {{ .Release.Name }}
spec:
  dataStorage:
    persistent:
      reclaimPolicy: Delete
      pvcTemplate:
        spec:
          resources:
            requests:
              storage: "5Gi"
  replicas: 2
  solrImage:
    tag: 8.7.0
  solrJavaMem: "-Xms1g -Xmx3g"
  customSolrKubeOptions:
    podOptions:
      resources:
        limits:
          memory: "1G"
        requests:
          cpu: "65m"
          memory: "156Mi"
  zookeeperRef:
    provided:
      chroot: "/solr"
      persistence:
        reclaimPolicy: Delete
        spec:
          resources:
            requests:
              storage: "5Gi"
      replicas: 3
      zookeeperPodPolicy:
        resources:
          limits:
            memory: "1G"
          requests:
            cpu: "65m"
            memory: "156Mi"
  solrOpts: "-Dsolr.autoSoftCommit.maxTime=1"
  solrGCTune: "-XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8"
```

If I set the **reclaimPolicy** to **Retain** and uninstall the chart, the SolrCloud also uninstalls properly.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9663) Adding compression to terms dict from SortedSet/Sorted DocValues
[ https://issues.apache.org/jira/browse/LUCENE-9663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281779#comment-17281779 ] Bruno Roustant commented on LUCENE-9663: I'm ready to merge. I think it could go to 8.9 branch but I'd like to have confirmation. This change adds compression to Lucene80DocValuesFormat if the Mode.BEST_COMPRESSION is used and is backward compatible. [~jpountz] any suggestion? Thanks > Adding compression to terms dict from SortedSet/Sorted DocValues > > > Key: LUCENE-9663 > URL: https://issues.apache.org/jira/browse/LUCENE-9663 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Jaison.Bi >Priority: Trivial > Fix For: master (9.0) > > Time Spent: 11h > Remaining Estimate: 0h > > Elasticsearch keyword field uses SortedSet DocValues. In our applications, > “keyword” is the most frequently used field type. > LUCENE-7081 has done prefix-compression for docvalues terms dict. We can do > better by replacing prefix-compression with LZ4. In one of our application, > the dvd files were ~41% smaller with this change(from 1.95 GB to 1.15 GB). > I've done simple tests based on the real application data, comparing the > write/merge time cost, and the on-disk *.dvd file size(after merge into 1 > segment). > || ||Before||After|| > |Write time cost(ms)|591972|618200| > |Merge time cost(ms)|270661|294663| > |*.dvd file size(GB)|1.95|1.15| > This feature is only for the high-cardinality fields. > I'm doing the benchmark test based on luceneutil. Will attach the report and > patch after the test. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
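For anyone wanting to opt in once this merges: based on the description above, selecting the compressing variant should look roughly like the following subclassed codec. Treat the exact constructor and override points as assumptions until the change lands:

```java
import org.apache.lucene.codecs.DocValuesFormat;
import org.apache.lucene.codecs.lucene80.Lucene80DocValuesFormat;
import org.apache.lucene.codecs.lucene87.Lucene87Codec;

// Sketch: route every doc-values field to the BEST_COMPRESSION variant
// described in this issue.
public class CompressedDocValuesCodec extends Lucene87Codec {
  private final DocValuesFormat dvFormat =
      new Lucene80DocValuesFormat(Lucene80DocValuesFormat.Mode.BEST_COMPRESSION);

  @Override
  public DocValuesFormat getDocValuesFormatForField(String field) {
    return dvFormat;
  }
}
```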
[GitHub] [lucene-solr] muse-dev[bot] commented on a change in pull request #2334: LUCENE-9705: Create Lucene90TermVectorsFormat
muse-dev[bot] commented on a change in pull request #2334: URL: https://github.com/apache/lucene-solr/pull/2334#discussion_r572873185 ## File path: lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/compressing/Lucene50CompressingTermVectorsReader.java ## @@ -0,0 +1,1367 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.backward_codecs.compressing; + +import java.io.Closeable; +import java.io.IOException; +import java.util.Collection; +import java.util.Collections; +import java.util.Iterator; +import java.util.NoSuchElementException; +import org.apache.lucene.codecs.CodecUtil; +import org.apache.lucene.codecs.TermVectorsReader; +import org.apache.lucene.codecs.compressing.CompressionMode; +import org.apache.lucene.codecs.compressing.Decompressor; +import org.apache.lucene.index.BaseTermsEnum; +import org.apache.lucene.index.CorruptIndexException; +import org.apache.lucene.index.FieldInfo; +import org.apache.lucene.index.FieldInfos; +import org.apache.lucene.index.Fields; +import org.apache.lucene.index.ImpactsEnum; +import org.apache.lucene.index.IndexFileNames; +import org.apache.lucene.index.PostingsEnum; +import org.apache.lucene.index.SegmentInfo; +import org.apache.lucene.index.SlowImpactsEnum; +import org.apache.lucene.index.Terms; +import org.apache.lucene.index.TermsEnum; +import org.apache.lucene.store.AlreadyClosedException; +import org.apache.lucene.store.ByteArrayDataInput; +import org.apache.lucene.store.ChecksumIndexInput; +import org.apache.lucene.store.Directory; +import org.apache.lucene.store.IOContext; +import org.apache.lucene.store.IndexInput; +import org.apache.lucene.util.Accountable; +import org.apache.lucene.util.Accountables; +import org.apache.lucene.util.ArrayUtil; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.IOUtils; +import org.apache.lucene.util.LongsRef; +import org.apache.lucene.util.packed.BlockPackedReaderIterator; +import org.apache.lucene.util.packed.PackedInts; + +/** + * {@link TermVectorsReader} for {@link Lucene50CompressingTermVectorsFormat}. + * + * @lucene.experimental + */ +public final class Lucene50CompressingTermVectorsReader extends TermVectorsReader +implements Closeable { + + // hard limit on the maximum number of documents per chunk + static final int MAX_DOCUMENTS_PER_CHUNK = 128; + + static final String VECTORS_EXTENSION = "tvd"; + static final String VECTORS_INDEX_EXTENSION = "tvx"; + static final String VECTORS_META_EXTENSION = "tvm"; + static final String VECTORS_INDEX_CODEC_NAME = "Lucene85TermVectorsIndex"; + + static final int VERSION_START = 1; + static final int VERSION_OFFHEAP_INDEX = 2; + /** Version where all metadata were moved to the meta file. 
*/ + static final int VERSION_META = 3; + + static final int VERSION_CURRENT = VERSION_META; + static final int META_VERSION_START = 0; + + static final int PACKED_BLOCK_SIZE = 64; + + static final int POSITIONS = 0x01; + static final int OFFSETS = 0x02; + static final int PAYLOADS = 0x04; + static final int FLAGS_BITS = PackedInts.bitsRequired(POSITIONS | OFFSETS | PAYLOADS); + + private final FieldInfos fieldInfos; + final FieldsIndex indexReader; + final IndexInput vectorsStream; + private final int version; + private final int packedIntsVersion; + private final CompressionMode compressionMode; + private final Decompressor decompressor; + private final int chunkSize; + private final int numDocs; + private boolean closed; + private final BlockPackedReaderIterator reader; + private final long numDirtyChunks; // number of incomplete compressed blocks written + private final long numDirtyDocs; // cumulative number of missing docs in incomplete chunks + private final long maxPointer; // end of the data section + + // used by clone + private Lucene50CompressingTermVectorsReader(Lucene50CompressingTermVectorsReader reader) { +this.fieldInfos = reader.fieldInfos; +this.vectorsStream = reader.vectorsStream.clone(); +this.indexReader = reader.indexReader.clone(); +this.packedIntsVersion = reader.packedIntsVersion; +this.compressionMode = reader.compressionMode; +this.decompressor = reader.decompr
[GitHub] [lucene-solr] mikemccand commented on a change in pull request #2247: LUCENE-9476 Add getBulkPath API for the Taxonomy index
mikemccand commented on a change in pull request #2247: URL: https://github.com/apache/lucene-solr/pull/2247#discussion_r572845064 ## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java ## @@ -31,7 +33,7 @@ import org.apache.lucene.facet.taxonomy.ParallelTaxonomyArrays; import org.apache.lucene.facet.taxonomy.TaxonomyReader; import org.apache.lucene.index.BinaryDocValues; -import org.apache.lucene.index.CorruptIndexException; // javadocs +import org.apache.lucene.index.CorruptIndexException; Review comment: Hmm, did we remove the `// javadocs` comment on purpose? ## File path: lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyReader.java ## @@ -353,12 +349,137 @@ public FacetLabel getPath(int ordinal) throws IOException { } synchronized (categoryCache) { - categoryCache.put(catIDInteger, ret); + categoryCache.put(ordinal, ret); } return ret; } + private FacetLabel getPathFromCache(int ordinal) { +// TODO: can we use an int-based hash impl, such as IntToObjectMap, +// wrapped as LRU? +synchronized (categoryCache) { + return categoryCache.get(ordinal); +} + } + + private void checkOrdinalBounds(int ordinal, int indexReaderMaxDoc) + throws IllegalArgumentException { +if (ordinal < 0 || ordinal >= indexReaderMaxDoc) { + throw new IllegalArgumentException( + "ordinal " + + ordinal + + " is out of the range of the indexReader " + + indexReader.toString()); +} + } + + /** + * Returns an array of FacetLabels for a given array of ordinals. + * + * This API is generally faster than iteratively calling {@link #getPath(int)} over an array of + * ordinals. It uses the {@link #getPath(int)} method iteratively when it detects that the index + * was created using StoredFields (with no performance gains) and uses DocValues based iteration + * when the index is based on DocValues. + * + * @param ordinals Array of ordinals that are assigned to categories inserted into the taxonomy + * index + */ + public FacetLabel[] getBulkPath(int... 
ordinals) throws IOException { +ensureOpen(); + +int ordinalsLength = ordinals.length; +FacetLabel[] bulkPath = new FacetLabel[ordinalsLength]; +// remember the original positions of ordinals before they are sorted +int originalPosition[] = new int[ordinalsLength]; +Arrays.setAll(originalPosition, IntUnaryOperator.identity()); +int indexReaderMaxDoc = indexReader.maxDoc(); + +for (int i = 0; i < ordinalsLength; i++) { + // check whether the ordinal is valid before accessing the cache + checkOrdinalBounds(ordinals[i], indexReaderMaxDoc); + // check the cache before trying to find it in the index + FacetLabel ordinalPath = getPathFromCache(ordinals[i]); + if (ordinalPath != null) { +bulkPath[i] = ordinalPath; + } +} + +// parallel sort the ordinals and originalPosition array based on the values in the ordinals +// array +new InPlaceMergeSorter() { + @Override + protected void swap(int i, int j) { +int x = ordinals[i]; +ordinals[i] = ordinals[j]; +ordinals[j] = x; + +x = originalPosition[i]; +originalPosition[i] = originalPosition[j]; +originalPosition[j] = x; + } + ; + + @Override + public int compare(int i, int j) { +return Integer.compare(ordinals[i], ordinals[j]); + } +}.sort(0, ordinalsLength); + +int readerIndex; +int leafReaderMaxDoc = 0; +int leafReaderDocBase = 0; +LeafReader leafReader; +LeafReaderContext leafReaderContext; +BinaryDocValues values = null; + +for (int i = 0; i < ordinalsLength; i++) { + if (bulkPath[originalPosition[i]] == null) { +if (values == null || ordinals[i] >= leafReaderMaxDoc) { + + readerIndex = ReaderUtil.subIndex(ordinals[i], indexReader.leaves()); + leafReaderContext = indexReader.leaves().get(readerIndex); + leafReader = leafReaderContext.reader(); + leafReaderMaxDoc = leafReader.maxDoc(); + leafReaderDocBase = leafReaderContext.docBase; + values = leafReader.getBinaryDocValues(Consts.FULL); + + // this check is only needed once to confirm that the index uses BinaryDocValues + boolean success = values.advanceExact(ordinals[i] - leafReaderDocBase); + if (success == false) { +return getBulkPathForOlderIndexes(ordinals); Review comment: Hmm, I'm confused -- wouldn't an older index have no `BinaryDocValues` field? So, `values` would be null, and we should fallback then? This code should hit `NullPointerException` on an old index I think? How come our backwards compatibility test didn't expose this? ## File path: lucene/facet/src/
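A usage sketch of the API under review in this thread, assuming the signature shown in the diff (`FacetLabel[] getBulkPath(int... ordinals)`):

```java
import org.apache.lucene.facet.taxonomy.FacetLabel;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.store.Directory;

public class BulkPathExample {
  // Resolve many ordinals in one call instead of looping over getPath(int);
  // per the diff, the implementation sorts the ordinals so each segment's
  // BinaryDocValues is only advanced forward.
  static FacetLabel[] resolve(Directory taxoDir, int[] ordinals) throws Exception {
    try (DirectoryTaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir)) {
      return taxoReader.getBulkPath(ordinals);
    }
  }
}
```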
[jira] [Created] (LUCENE-9753) Hunspell: disallow compounds with parts present in dictionary space-separated
Peter Gromov created LUCENE-9753: Summary: Hunspell: disallow compounds with parts present in dictionary space-separated Key: LUCENE-9753 URL: https://issues.apache.org/jira/browse/LUCENE-9753 Project: Lucene - Core Issue Type: Sub-task Reporter: Peter Gromov -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9406) Make it simpler to track IndexWriter's events
[ https://issues.apache.org/jira/browse/LUCENE-9406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281787#comment-17281787 ] Michael McCandless commented on LUCENE-9406: +1 for [~zacharymorn]'s proposed plan! > Make it simpler to track IndexWriter's events > - > > Key: LUCENE-9406 > URL: https://issues.apache.org/jira/browse/LUCENE-9406 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > > This is the second spinoff from a [controversial PR to add a new index-time > feature to Lucene to merge small segments during > commit|https://github.com/apache/lucene-solr/pull/1552]. That change can > substantially reduce the number of small index segments to search. > In that PR, there was a new proposed interface, {{IndexWriterEvents}}, giving > the application a chance to track when {{IndexWriter}} kicked off merges > during commit, how many, how long it waited, how often it gave up waiting, > etc. > Such telemetry from production usage is really helpful when tuning settings > like which merges (e.g. a size threshold) to attempt on commit, and how long > to wait during commit, etc. > I am splitting out this issue to explore possible approaches to do this. > E.g. [~simonw] proposed using a statistics class instead, but if I understood > that correctly, I think that would put the role of aggregation inside > {{IndexWriter}}, which is not ideal. > Many interesting events, e.g. how many merges are being requested, how large > are they, how long did they take to complete or fail, etc., can be gleaned by > wrapping expert Lucene classes like {{MergePolicy}} and {{MergeScheduler}}. > But for those events that cannot (e.g. {{IndexWriter}} stopped waiting for > merges during commit), it would be very helpful to have some simple way to > track so applications can better tune. > It is also possible to subclass {{IndexWriter}} and override key methods, but > I think that is inherently risky as {{IndexWriter}}'s protected methods are > not considered to be a stable API, and the synchronization used by > {{IndexWriter}} is confusing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
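As one concrete form of the "wrap expert classes" approach mentioned in this issue, a sketch of a delegating merge policy that counts requested merges (`FilterMergePolicy` is real Lucene API; the telemetry shape around it is invented):

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentInfos;

public class CountingMergePolicy extends FilterMergePolicy {
  private final AtomicLong requestedMerges = new AtomicLong();

  public CountingMergePolicy(MergePolicy in) {
    super(in);
  }

  @Override
  public MergeSpecification findMerges(
      MergeTrigger mergeTrigger, SegmentInfos segmentInfos, MergeContext mergeContext)
      throws IOException {
    MergeSpecification spec = super.findMerges(mergeTrigger, segmentInfos, mergeContext);
    if (spec != null) {
      // Each OneMerge in the specification is a merge the wrapped policy asked for.
      requestedMerges.addAndGet(spec.merges.size());
    }
    return spec;
  }

  public long getRequestedMerges() {
    return requestedMerges.get();
  }
}
```

Note this only observes merge *requests*; as the issue says, events like "IndexWriter stopped waiting for merges during commit" are not visible this way.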
[GitHub] [lucene-solr] donnerpeter opened a new pull request #2335: LUCENE-9753: Hunspell: disallow compounds with parts present in dicti…
donnerpeter opened a new pull request #2335: URL: https://github.com/apache/lucene-solr/pull/2335 …onary, space-separated

# Description

Don't accept `compoundword` when there's `compound word` in the dictionary.

# Solution

Like Hunspell, handle this near the CHECKCOMPOUNDREP pattern check.

# Tests

`wordpair` from the Hunspell repo.

# Checklist

Please review the following and check all that apply:

- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `master` branch.
- [x] I have run `./gradlew check`.
- [x] I have added tests for my changes.
- [ ] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only).

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
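A rough sketch of the wordpair check this PR adds, assuming a plain set-based lookup (the real code consults the FST-backed `Dictionary` near the CHECKCOMPOUNDREP handling):

```java
import java.util.Set;

final class WordPairCheck {
  private final Set<String> dictionaryEntries; // stand-in for the real dictionary lookup

  WordPairCheck(Set<String> dictionaryEntries) {
    this.dictionaryEntries = dictionaryEntries;
  }

  /** True if the compound's two parts appear in the dictionary as "part1 part2". */
  boolean isForbiddenCompound(String word, int breakPos) {
    String spaceSeparated = word.substring(0, breakPos) + " " + word.substring(breakPos);
    return dictionaryEntries.contains(spaceSeparated);
  }
}
```

With `compound word` present as a dictionary entry, `isForbiddenCompound("compoundword", 8)` returns true, so the compound candidate is rejected.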
[GitHub] [lucene-solr-operator] krishnachalla-hv opened a new issue #213: Solr cloud provided zookeeper cluster in unhealthy state
krishnachalla-hv opened a new issue #213: URL: https://github.com/apache/lucene-solr-operator/issues/213

I have installed **v0.2.8** of the operator. Sometimes one of the provided Zookeeper instances goes into an unhealthy state. I have used the default values provided in the documentation to initialize the Zookeeper instance. Here is my configuration:

```
apiVersion: v2
name: test
description: A Helm chart for intializing multi node solr cloud.
appVersion: 1.0.0
dependencies:
  - name: solr-operator
    version: 0.2.8
    repository: "https://apache.github.io/lucene-solr-operator/charts"
    condition: solr-operator.enabled
  - name: zookeeper-operator
    version: 0.3.0
    repository: "https://kubernetes-charts.banzaicloud.com"
    condition: zookeeper-operator.enabled
```

And the solr cloud yaml file:

```
apiVersion: solr.bloomberg.com/v1beta1
kind: SolrCloud
metadata:
  name: {{ .Release.Name }}
spec:
  dataStorage:
    persistent:
      reclaimPolicy: Retain
      pvcTemplate:
        spec:
          resources:
            requests:
              storage: "5Gi"
  replicas: 2
  solrImage:
    tag: 8.7.0
  solrJavaMem: "-Xms1g -Xmx3g"
  customSolrKubeOptions:
    podOptions:
      resources:
        limits:
          memory: "1G"
        requests:
          cpu: "65m"
          memory: "156Mi"
  zookeeperRef:
    provided:
      chroot: "/solr"
      persistence:
        reclaimPolicy: Retain
        spec:
          resources:
            requests:
              storage: "5Gi"
      replicas: 3
      zookeeperPodPolicy:
        resources:
          limits:
            memory: "1G"
          requests:
            cpu: "65m"
            memory: "156Mi"
  solrOpts: "-Dsolr.autoSoftCommit.maxTime=1"
  solrGCTune: "-XX:SurvivorRatio=4 -XX:TargetSurvivorRatio=90 -XX:MaxTenuringThreshold=8"
```

After applying the above configuration, the 2nd ZK pod always ends up in an unhealthy state ([ZK pod error.log](https://github.com/apache/lucene-solr-operator/files/5951776/ZK.pod.error.log)), while the remaining 2 ZK pods function properly.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] gerlowskija opened a new pull request #2336: SOLR-15101: Add list/delete APIs for incremental backups
gerlowskija opened a new pull request #2336: URL: https://github.com/apache/lucene-solr/pull/2336 # Description SOLR-13608 introduced support in Solr for an "incremental" backup file structure, which allows storing multiple backup points for the same collection at a given location. With the ability to store multiple backups at the same place, users will need to be able to list and clean up these backups. # Solution This PR introduces two new APIs: one for listing the backups at a given location (along with associated metadata), and one for deleting or cleaning up these backups. The APIs are offered in both v1 and v2 flavors. # Tests Manual testing, along with new automated tests in `PurgeGraphTest` (reference checking for detecting index files to delete), `V2CollectionBackupsAPIMappingTest` (v1<->v2 mapping), and `AbstractIncrementalBackupTest` (integration test for the list and delete functionality). # Checklist Please review the following and check all that apply: - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/solr/HowToContribute) and my code conforms to the standards described there to the best of my ability. - [x] I have created a Jira issue and added the issue ID to my pull request title. - [x] I have given Solr maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [x] I have developed this patch against the `master` branch. - [ ] I have run `./gradlew check`. - [x] I have added tests for my changes. - [x] I have added documentation for the [Ref Guide](https://github.com/apache/lucene-solr/tree/master/solr/solr-ref-guide) (for Solr changes only). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
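As a rough illustration of the v1 flavor, the request shapes might look like the following (the action names are taken from the PR description; the exact parameter names here are assumptions, not the final API):

```
# hypothetical v1 calls; parameter names are illustrative
http://localhost:8983/solr/admin/collections?action=LISTBACKUP&name=myBackup&location=file:///var/backups
http://localhost:8983/solr/admin/collections?action=DELETEBACKUP&name=myBackup&backupId=3&location=file:///var/backups
```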
[jira] [Commented] (SOLR-15101) Add list-backups and delete-backups APIs
[ https://issues.apache.org/jira/browse/SOLR-15101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281802#comment-17281802 ] Jason Gerlowski commented on SOLR-15101: I've pushed up a PR for this. I'm hoping to add some additional tests, but otherwise the code and docs should be ready to go. I'll plan on merging this in 4-5 days or so, and backporting to branch_8x afterwards. > Add list-backups and delete-backups APIs > > > Key: SOLR-15101 > URL: https://issues.apache.org/jira/browse/SOLR-15101 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: master (9.0) >Reporter: Jason Gerlowski >Assignee: Jason Gerlowski >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > The accepted SIP-12 outlines a plan for changing Solr's backup file structure > in a way that supports storing multiple backups within a single "location" > URI. With this comes a need for APIs that can list out and delete backups > within that single location. > SIP-12 has v1 and v2 API specs for these APIs. This ticket covers > implementing them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9747) Missing package-info.java causes NPE in MissingDoclet.java
[ https://issues.apache.org/jira/browse/LUCENE-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281817#comment-17281817 ] David Eric Pugh commented on LUCENE-9747: - I have Java 11.0.3+7: ➜ lucene-solr-epugh git:(SOLR-15121) ✗ java --version openjdk 11.0.3 2019-04-16 And here is the stack trace (take 2): {noformat} javadoc: error - fatal error encountered: java.lang.NullPointerException javadoc: error - Please file a bug against the javadoc tool via the Java bug reporting page (http://bugreport.java.com) after checking the Bug Database (http://bugs.java.com) for duplicates. Include error messages and the following diagnostic in your report. Thank you. java.lang.NullPointerException at jdk.javadoc/jdk.javadoc.internal.tool.Messager.getDiagSource(Messager.java:206) at jdk.javadoc/jdk.javadoc.internal.tool.Messager.printError(Messager.java:234) at jdk.javadoc/jdk.javadoc.internal.tool.Messager.print(Messager.java:121) at org.apache.lucene.missingdoclet.MissingDoclet.error(MissingDoclet.java:434) at org.apache.lucene.missingdoclet.MissingDoclet.checkComment(MissingDoclet.java:309) at org.apache.lucene.missingdoclet.MissingDoclet.check(MissingDoclet.java:237) at org.apache.lucene.missingdoclet.MissingDoclet.run(MissingDoclet.java:205) at jdk.javadoc/jdk.javadoc.internal.tool.Start.parseAndExecute(Start.java:582) at jdk.javadoc/jdk.javadoc.internal.tool.Start.begin(Start.java:431) at jdk.javadoc/jdk.javadoc.internal.tool.Start.begin(Start.java:344) at jdk.javadoc/jdk.javadoc.internal.tool.Main.execute(Main.java:63) at jdk.javadoc/jdk.javadoc.internal.tool.Main.main(Main.java:52) {noformat} > Missing package-info.java causes NPE in MissingDoclet.java > -- > > Key: LUCENE-9747 > URL: https://issues.apache.org/jira/browse/LUCENE-9747 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: David Eric Pugh >Priority: Minor > Attachments: LUCENE-9747.patch > > Time Spent: 1h 50m > Remaining Estimate: 0h > > When running {{./gradlew :solr:core:javadoc}} discovered that if a package > directory is missing the {{package-info.java}} file you get a VERY cryptic > error: > > {{javadoc: error - fatal error encountered: java.lang.NullPointerException}} > {{javadoc: error - Please file a bug against the javadoc tool via the Java > bug reporting page}} > > I poked around and that the {{MissingDoclet.java}} method call to > \{{reporter.print(Diagnostic.Kind.ERROR, element, fullMessage.toString());}} > was failing, due to the element having some sort of null in it. I am > attaching a patch and a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jimczi commented on a change in pull request #2256: LUCENE-9507 Custom order for leaves in IndexReader and IndexWriter
jimczi commented on a change in pull request #2256: URL: https://github.com/apache/lucene-solr/pull/2256#discussion_r572984156 ## File path: lucene/core/src/java/org/apache/lucene/index/StandardDirectoryReader.java ## @@ -39,33 +40,47 @@ final IndexWriter writer; final SegmentInfos segmentInfos; + private final Comparator leafSorter; private final boolean applyAllDeletes; private final boolean writeAllDeletes; - /** called only from static open() methods */ + /** package private constructor, called only from static open() methods. */ StandardDirectoryReader( Directory directory, LeafReader[] readers, IndexWriter writer, SegmentInfos sis, + Comparator leafSorter, boolean applyAllDeletes, boolean writeAllDeletes) throws IOException { -super(directory, readers); +super(directory, sortLeaves(readers, leafSorter)); Review comment: I wonder if the leafSorter should be declared and executed in the base class `BaseCompositeReader` ? That would expose the feature explicitly to `MultiReader` and friends. ## File path: lucene/core/src/java/org/apache/lucene/index/DirectoryReader.java ## @@ -56,7 +57,24 @@ * @throws IOException if there is a low-level IO error */ public static DirectoryReader open(final Directory directory) throws IOException { -return StandardDirectoryReader.open(directory, null); +return StandardDirectoryReader.open(directory, null, null); + } + + /** + * Returns a IndexReader for the the index in the given Directory + * + * @param directory the index directory + * @param leafSorter a comparator for sorting leaf readers. Providing leafSorter is useful for + * indices on which it is expected to run many queries with particular sort criteria (e.g. for + * time-based indices this is usually a descending sort on timestamp). In this case {@code + * leafSorter} should sort leaves according to this sort criteria. Providing leafSorter allows + * to speed up this particular type of sort queries by early terminating while iterating + * though segments and segments' documents. Review comment: nit: s/though/through/ ## File path: lucene/core/src/test/org/apache/lucene/index/TestIndexWriterReader.java ## @@ -169,7 +176,7 @@ public void testUpdateDocument() throws Exception { // writer.close wrote a new commit assertFalse(r2.isCurrent()); -DirectoryReader r3 = DirectoryReader.open(dir1); +DirectoryReader r3 = open(dir1); Review comment: Can you keep the explicit version ? ## File path: lucene/core/src/java/org/apache/lucene/index/IndexWriterConfig.java ## @@ -478,6 +479,18 @@ public IndexWriterConfig setIndexSort(Sort sort) { return this; } + /** + * Set the comparator for sorting leaf readers. A DirectoryReader opened from a IndexWriter with + * this configuration will have its leaf readers sorted with the provided leaf sorter. + * + * @param leafSorter – a comparator for sorting leaf readers + * @return IndexWriterConfig with leafSorter set. + */ + public IndexWriterConfig setLeafSorter(Comparator leafSorter) { Review comment: You added a specific unit test for this feature but we could also set a random value in `LuceneTestCase#newIndexWriterConfig` to improve the coverage. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
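For readers following along, here is a minimal sketch of how the API under review would be used once merged; the comparator (descending `maxDoc`) is just a stand-in for a real per-segment sort key such as a max timestamp, and the `DirectoryReader.open(Directory, Comparator)` overload is the one added in this PR:

```java
import java.io.IOException;
import java.util.Comparator;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.store.Directory;

class LeafSorterSketch {
  // Open a reader whose leaves are pre-sorted, so sorted queries that match
  // the leaf order can terminate early while iterating segments.
  static DirectoryReader openSorted(Directory dir) throws IOException {
    Comparator<LeafReader> leafSorter =
        Comparator.comparingInt(LeafReader::maxDoc).reversed();
    return DirectoryReader.open(dir, leafSorter);
  }
}
```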
[jira] [Created] (LUCENE-9754) ICU Tokenizer: letter-space-number-letter tokenized inconsistently
Trey Jones created LUCENE-9754: -- Summary: ICU Tokenizer: letter-space-number-letter tokenized inconsistently Key: LUCENE-9754 URL: https://issues.apache.org/jira/browse/LUCENE-9754 Project: Lucene - Core Issue Type: Bug Components: core/search Affects Versions: 7.5 Environment: Tested most recently on Elasticsearch 6.5.4. Reporter: Trey Jones The tokenization of strings like _14th_ with the ICU tokenizer is affected by the character that comes before the preceding whitespace. For example, _x 14th_ is tokenized as x | 14th; _ァ 14th_ is tokenized as ァ | 14 | th. In general, in a letter-space-number-letter sequence, if the writing system before the space is the same as the writing system after the number, then you get two tokens. If the writing systems differ, you get three tokens. If the conditions are just right, the chunking that the ICU tokenizer does (trying to split on spaces to create <4k chunks) can create an artificial boundary between the tokens (e.g., between _ァ_ and _14th_) and prevent the unexpected split of the second token (_14th_). Because chunking changes can ripple through a long document, editing text or the effects of a character filter can cause changes in tokenization thousands of lines later in a document. My guess is that some "previous character set" flag is not reset at the space, and numbers are not in a character set, so _t_ is compared to _ァ_ and they are not the same—causing a token split at the character set change—but I'm not sure. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
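A minimal sketch of the reported behavior, assuming the lucene-analyzers-icu module is on the classpath (the expected-vs-observed splits in the comments are taken from the report above):

{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.icu.segmentation.ICUTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class IcuTokenizeDemo {
  static void show(String text) throws Exception {
    try (Tokenizer tok = new ICUTokenizer()) {
      tok.setReader(new StringReader(text));
      CharTermAttribute term = tok.addAttribute(CharTermAttribute.class);
      tok.reset();
      StringBuilder out = new StringBuilder(text + " ->");
      while (tok.incrementToken()) {
        out.append(" [").append(term).append(']');
      }
      tok.end();
      System.out.println(out);
    }
  }

  public static void main(String[] args) throws Exception {
    show("x 14th");  // two tokens: [x] [14th]
    show("ァ 14th"); // three tokens per the report: [ァ] [14] [th]
  }
}
{code}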
[jira] [Commented] (LUCENE-9673) The level of IntBlockPool slice is always 1
[ https://issues.apache.org/jira/browse/LUCENE-9673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281862#comment-17281862 ] Michael McCandless commented on LUCENE-9673: I was curious if this impacted indexing throughput and ran `luceneutil` three times with this change (confusingly, named `trunkN.txt` below) and three times without this change (`baseN.log`): ``` [mike@beast3 trunk]$ grep "indexing done" /l/logs/trunk?.txt /l/logs/trunk1.txt:Indexer: indexing done (89114 msec); total 27624170 docs /l/logs/trunk2.txt:Indexer: indexing done (89974 msec); total 27624192 docs /l/logs/trunk3.txt:Indexer: indexing done (90614 msec); total 27624409 docs [mike@beast3 trunk]$ grep "indexing done" /l/logs/base?.log /l/logs/base1.log:Indexer: indexing done (89271 msec); total 27623915 docs /l/logs/base2.log:Indexer: indexing done (91676 msec); total 27624107 docs /l/logs/base3.log:Indexer: indexing done (93120 msec); total 27624268 docs ``` Possibly a small speedup, but within the noise/variance of the test. Plus, the precise doc count indexed changes each time, which is not right! I opened [https://github.com/mikemccand/luceneutil/issues/106] to get to the bottom of that ... > The level of IntBlockPool slice is always 1 > > > Key: LUCENE-9673 > URL: https://issues.apache.org/jira/browse/LUCENE-9673 > Project: Lucene - Core > Issue Type: Bug > Components: core/other >Reporter: mashudong >Priority: Minor > Attachments: LUCENE-9673.patch > > > First slice is allocated by IntBlockPoo.newSlice(), and its level is 1, > > {code:java} > private int newSlice(final int size) { > if (intUpto > INT_BLOCK_SIZE-size) { > nextBuffer(); > assert assertSliceBuffer(buffer); > } > > final int upto = intUpto; > intUpto += size; > buffer[intUpto-1] = 1; > return upto; > }{code} > > > If one slice is not enough, IntBlockPoo.allocSlice() is called to allocate > more slices, > as the following code shows, level is 1, newLevel is NEXT_LEVEL_ARRAY[0] > which is also 1. > > The result is the level of IntBlockPool slice is always 1, the first slice is > 2 bytes long, and all subsequent slices are 4 bytes long. > > {code:java} > private static final int[] NEXT_LEVEL_ARRAY = {1, 2, 3, 4, 5, 6, 7, 8, 9, 9}; > private int allocSlice(final int[] slice, final int sliceOffset) { > final int level = slice[sliceOffset]; > final int newLevel = NEXT_LEVEL_ARRAY[level - 1]; > final int newSize = LEVEL_SIZE_ARRAY[newLevel]; > // Maybe allocate another block > if (intUpto > INT_BLOCK_SIZE - newSize) { > nextBuffer(); > assert assertSliceBuffer(buffer); > } > final int newUpto = intUpto; > final int offset = newUpto + intOffset; > intUpto += newSize; > // Write forwarding address at end of last slice: > slice[sliceOffset] = offset; > // Write new level: > buffer[intUpto - 1] = newLevel; > return newUpto; > } > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15133) Document how to eliminate Failed to reserve shared memory warning
[ https://issues.apache.org/jira/browse/SOLR-15133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281872#comment-17281872 ] Mike Drob commented on SOLR-15133: -- I think [https://shipilev.net/jvm/anatomy-quarks/2-transparent-huge-pages/] is a better explainer of what is going on, including the difference between {{UseLargePages}} and {{UseTransparentHugePages}}. Keeping this enabled for heaps beyond 1G (which is most Solr heaps IME) appears to be beneficial when the system supports it. Now for the wrinkle... I believe both hugetlbfs and THP are reliant on kernel settings/parameters, and docker images don't have kernels themselves. MacOS doesn't support Large Pages ([https://bugs.openjdk.java.net/browse/JDK-8233062]) which suggests that processes running in Docker for Mac wouldn't either. I don't know if this holds true for Windows/Linux as well, or if the docker engines there are able to delegate that memory management request. > Document how to eliminate Failed to reserve shared memory warning > - > > Key: SOLR-15133 > URL: https://issues.apache.org/jira/browse/SOLR-15133 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: Docker, documentation >Affects Versions: 8.7 >Reporter: David Eric Pugh >Assignee: David Eric Pugh >Priority: Minor > Fix For: master (9.0) > > Time Spent: 1h 10m > Remaining Estimate: 0h > > inspired by a conversation on > [https://github.com/docker-solr/docker-solr/issues/273,] it would be good to > document how to get rid of shared memory warning in Docker setups. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
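For concreteness, a sketch of the two options being contrasted (the flags are standard HotSpot options, and the sysfs/procfs paths are the usual Linux locations; whether a container can see them is exactly the open question above, and the SOLR_OPTS hook assumes the standard solr.in.sh mechanism):

{noformat}
# hugetlbfs-style large pages: pages must be reserved up front on the host,
# e.g. echo 1024 > /proc/sys/vm/nr_hugepages
SOLR_OPTS="$SOLR_OPTS -XX:+UseLargePages"

# transparent huge pages: no reservation needed, but THP must be enabled
# ("always" or "madvise") in /sys/kernel/mm/transparent_hugepage/enabled
SOLR_OPTS="$SOLR_OPTS -XX:+UseTransparentHugePages"
{noformat}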
[jira] [Commented] (LUCENE-9751) Assertion error (int overflow) in ByteSliceReader
[ https://issues.apache.org/jira/browse/LUCENE-9751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281874#comment-17281874 ] Michael McCandless commented on LUCENE-9751: Thanks [~dweiss], clearly we have a bug here! We need better testing of "large" indexing RAM buffers. This is the assert that tripped, I think: {noformat} public void init(ByteBlockPool pool, int startIndex, int endIndex) { assert endIndex-startIndex >= 0; {noformat} So I think most likely the {{endIndex}} overflowed int and became negative elsewhere. We do know that our "best effort" will fail to catch indexing a gigantic document that pushes the indexing buffer over 2.1 GB. We default to a 1945 MB cutoff ({{IWC.setRAMPerThreadHardLimitMB}} can be used to change that), but if that one gigantic document takes the RAM usage from e.g. 1944 MB up beyond 2048 MB then it can lead to exceptions like this. > Assertion error (int overflow) in ByteSliceReader > - > > Key: LUCENE-9751 > URL: https://issues.apache.org/jira/browse/LUCENE-9751 > Project: Lucene - Core > Issue Type: Bug >Affects Versions: 8.7 >Reporter: Dawid Weiss >Priority: Major > > New computers come with insane amounts of ram and heaps can get pretty big. > If you adjust per-thread buffers to larger values strange things start > happening. This happened to us today: > {code} > Caused by: java.lang.AssertionError > at > org.apache.lucene.index.ByteSliceReader.init(ByteSliceReader.java:44) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.TermsHashPerField.initReader(TermsHashPerField.java:88) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.FreqProxFields$FreqProxPostingsEnum.reset(FreqProxFields.java:430) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.FreqProxFields$FreqProxTermsEnum.postings(FreqProxFields.java:247) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:127) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:907) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:318) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.write(PerFieldPostingsFormat.java:170) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.FreqProxTermsWriter.flush(FreqProxTermsWriter.java:120) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DefaultIndexingChain.flush(DefaultIndexingChain.java:264) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > 
org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:350) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:480) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DocumentsWriter.postUpdate(DocumentsWriter.java:394) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:440) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > at > org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471) > ~[lucene-core-8.7.0.jar:8.7.0 2dc63e901c60cda27ef3b744bc554f1481b3b067 - > atrisharma - 2020-10-29 19:35:28] > ... 7 more > {code} > Likely an int overflow in TermsHashPerField: > {code} > reader.init(bytePool, > > postingsArray.byteStarts[termID]+stream*ByteBlockPool.FIRST_LEVEL_SI
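A minimal sketch of the knobs in question, assuming the Lucene 8.x API (the values shown are the defaults under discussion, not recommendations):

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;

public class RamLimitSketch {
  public static void main(String[] args) {
    IndexWriterConfig iwc = new IndexWriterConfig(new StandardAnalyzer());
    // Global flush trigger shared across indexing threads (16 MB by default).
    iwc.setRAMBufferSizeMB(IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB);
    // Per-thread (DWPT) hard cutoff; 1945 MB is the default mentioned above,
    // chosen to stay safely below the 2048 MB int-overflow boundary.
    iwc.setRAMPerThreadHardLimitMB(1945);
    System.out.println("per-thread hard limit: " + iwc.getRAMPerThreadHardLimitMB() + " MB");
  }
}
{code}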
[jira] [Created] (LUCENE-9755) Index Segment without DocValues May Cause Search to Fail
Thomas Hecker created LUCENE-9755: - Summary: Index Segment without DocValues May Cause Search to Fail Key: LUCENE-9755 URL: https://issues.apache.org/jira/browse/LUCENE-9755 Project: Lucene - Core Issue Type: Bug Components: core/search Affects Versions: 8.3.1, 8.x, 8.8 Reporter: Thomas Hecker Attachments: DocValuesTest.java Not sure if this can be considered a bug, but it is certainly a caveat that may slip through testing due to its nature. Consider the following scenario: * all documents in the index have a field "numfield" indexed as IntPoint * in addition, SOME of those documents are also indexed with a SortedNumericDocValuesField using the same "numfield" name The documents without the DocValues cannot be matched by any queries that involve sorting, so we save some space by omitting the DocValues for those documents. This works perfectly fine, unless * the index contains a segment that only contains documents without the DocValues In this case, running a query that sorts by "numfield" will throw the following exception: {noformat} java.lang.IllegalStateException: unexpected docvalues type NONE for field 'numfield' (expected one of [SORTED_NUMERIC, NUMERIC]). Re-index with correct docvalues type. at org.apache.lucene.index.DocValues.checkField(DocValues.java:317) at org.apache.lucene.index.DocValues.getSortedNumeric(DocValues.java:389) at org.apache.lucene.search.SortedNumericSortField$3.getNumericDocValues(SortedNumericSortField.java:159) at org.apache.lucene.search.FieldComparator$NumericComparator.doSetNextReader(FieldComparator.java:155){noformat} I have included a minimal example program that demonstrates the issue. This will * create an index with two documents, each having "numfield" indexed * add a DocValuesField "numfield" only for the first document * force the two documents into separate index segments * run a query that matches only the first document and sorts by "numfield" This results in the aforementioned exception. When removing the following lines from the code: {code:java} if (i==docCount/2) { iw.commit(); } {code} both documents get added to the same segment. When re-running the code, which now creates a single index segment, the query works fine. Tested with Lucene 8.3.1 and 8.8.0. Like I said, this may not be considered a bug. But it has slipped through our testing because the existence of such a DocValues-free segment is such a rare and short-lived event. We can avoid this issue in the future by using a different field name for the DocValuesField. But for our production systems we have to patch DocValues.checkField() to suppress the IllegalStateException as reindexing is not an option right now. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
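For readers who don't want to download the attachment, here is a condensed sketch of the scenario (assuming Lucene 8.x; this is an illustrative reconstruction, not the attached DocValuesTest.java):

{code:java}
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.SortedNumericSortField;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class MixedDocValuesRepro {
  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter iw = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document withDv = new Document();
      withDv.add(new IntPoint("numfield", 1));
      withDv.add(new SortedNumericDocValuesField("numfield", 1));
      iw.addDocument(withDv);
      iw.commit(); // forces the next document into its own, DocValues-free segment

      Document withoutDv = new Document();
      withoutDv.add(new IntPoint("numfield", 2));
      iw.addDocument(withoutDv);
    }
    try (IndexReader reader = DirectoryReader.open(dir)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      Sort sort = new Sort(new SortedNumericSortField("numfield", SortField.Type.INT));
      // Matches only the doc that has DocValues, but the sort comparator still
      // visits the DocValues-free segment and trips the IllegalStateException.
      searcher.search(IntPoint.newExactQuery("numfield", 1), 10, sort);
    }
  }
}
{code}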
[GitHub] [lucene-solr] anshumg commented on a change in pull request #2328: SOLR-15145: System property to control whether base_url is stored in state.json to enable back-compat with older SolrJ versi
anshumg commented on a change in pull request #2328: URL: https://github.com/apache/lucene-solr/pull/2328#discussion_r573145918 ## File path: solr/solrj/src/java/org/apache/solr/common/cloud/ZkNodeProps.java ## @@ -118,14 +120,9 @@ public static ZkNodeProps load(byte[] bytes) { @Override public void write(JSONWriter jsonWriter) { // don't write out the base_url if we have a node_name Review comment: Perhaps also extend the comment to mention the `STORE_BASE_URL` flag? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr-operator] madrob opened a new issue #214: extensions/v1beta1 Ingress is deprecated
madrob opened a new issue #214: URL: https://github.com/apache/lucene-solr-operator/issues/214 After deploying a cluster using this operator, I wanted to get the Ingress, but the one we currently use is deprecated. ```$ k get ing Warning: extensions/v1beta1 Ingress is deprecated in v1.14+, unavailable in v1.22+; use networking.k8s.io/v1 Ingress``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
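For reference, the equivalent resource under the non-deprecated API looks roughly like this (all names here are illustrative, not what the operator currently templates):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-solrcloud-ingress
spec:
  rules:
    - host: solr.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-solrcloud-common
                port:
                  number: 80
```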
[GitHub] [lucene-solr] madrob commented on a change in pull request #2318: SOLR-15138: PerReplicaStates does not scale to large collections as well as state.json
madrob commented on a change in pull request #2318: URL: https://github.com/apache/lucene-solr/pull/2318#discussion_r573116634 ## File path: solr/solrj/src/java/org/apache/solr/common/cloud/PerReplicaStates.java ## @@ -92,6 +94,17 @@ public PerReplicaStates(String path, int cversion, List states) { } + /** Check and return if all replicas are ACTIVE + */ + public boolean allActive() { +if (this.allActive != null) return allActive; +boolean[] result = new boolean[]{true}; Review comment: Agree. ## File path: solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java ## @@ -264,8 +264,11 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin log.info("Cleaned up artifacts for failed create collection for [{}]", collectionName); throw new SolrException(ErrorCode.BAD_REQUEST, "Underlying core creation failed while creating collection: " + collectionName); } else { +//we want to wait till all the replicas are ACTIVE for PRS collections because + ocmh.zkStateReader.waitForState(collectionName, 30, TimeUnit.SECONDS, (liveNodes, c) -> + c.getPerReplicaStates() == null || // this is not a PRS collection Review comment: I agree with Ilan here, let's skip the extra watcher and call to ZK. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15138) PerReplicaStates does not scale to large collections as well as state.json
[ https://issues.apache.org/jira/browse/SOLR-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17281996#comment-17281996 ] Mike Drob commented on SOLR-15138: -- Added a couple of comments where I agreed with Ilan's review. More generally, I don't know if that is the right place to be blocking. Why not in AddReplicaCmd, where we already have a check for the user-provided {{waitForFinalState}} parameter? Similarly, do we need to consider PRS state in MoveReplicaCmd (or maybe other places)? > PerReplicaStates does not scale to large collections as well as state.json > -- > > Key: SOLR-15138 > URL: https://issues.apache.org/jira/browse/SOLR-15138 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 8.8 >Reporter: Mike Drob >Assignee: Noble Paul >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > I was testing PRS collection creation with larger collections today > (previously I had tested with many small collections) and it seemed to be > having trouble keeping up. > > I was running a 4 node instance, each JVM with 4G Heap in k8s, and a single > zookeeper. > > With this cluster configuration, I am able to create several (at least 10) > collections with 11 shards and 11 replicas using the "old way" of keeping > state. These collections are created serially, waiting for all replicas to be > active before proceeding. > However, when attempting to do the same with PRS, the creation stalls on > collection 2 or 3, with several replicas stuck in a "down" state. Further, > when attempting to delete these collections using the regular API it > sometimes takes several attempts after getting stuck a few times as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9747) Missing package-info.java causes NPE in MissingDoclet.java
[ https://issues.apache.org/jira/browse/LUCENE-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282011#comment-17282011 ] Dawid Weiss commented on LUCENE-9747: - Indeed, I can reproduce this with JDK11 too. > Missing package-info.java causes NPE in MissingDoclet.java > -- > > Key: LUCENE-9747 > URL: https://issues.apache.org/jira/browse/LUCENE-9747 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: David Eric Pugh >Priority: Minor > Attachments: LUCENE-9747.patch > > Time Spent: 1h 50m > Remaining Estimate: 0h > > When running {{./gradlew :solr:core:javadoc}} discovered that if a package > directory is missing the {{package-info.java}} file you get a VERY cryptic > error: > > {{javadoc: error - fatal error encountered: java.lang.NullPointerException}} > {{javadoc: error - Please file a bug against the javadoc tool via the Java > bug reporting page}} > > I poked around and that the {{MissingDoclet.java}} method call to > \{{reporter.print(Diagnostic.Kind.ERROR, element, fullMessage.toString());}} > was failing, due to the element having some sort of null in it. I am > attaching a patch and a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9747) Missing package-info.java causes NPE in MissingDoclet.java
[ https://issues.apache.org/jira/browse/LUCENE-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282012#comment-17282012 ] Dawid Weiss commented on LUCENE-9747: - https://bugs.openjdk.java.net/browse/JDK-8224082 > Missing package-info.java causes NPE in MissingDoclet.java > -- > > Key: LUCENE-9747 > URL: https://issues.apache.org/jira/browse/LUCENE-9747 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: David Eric Pugh >Priority: Minor > Attachments: LUCENE-9747.patch > > Time Spent: 1h 50m > Remaining Estimate: 0h > > When running {{./gradlew :solr:core:javadoc}} discovered that if a package > directory is missing the {{package-info.java}} file you get a VERY cryptic > error: > > {{javadoc: error - fatal error encountered: java.lang.NullPointerException}} > {{javadoc: error - Please file a bug against the javadoc tool via the Java > bug reporting page}} > > I poked around and that the {{MissingDoclet.java}} method call to > \{{reporter.print(Diagnostic.Kind.ERROR, element, fullMessage.toString());}} > was failing, due to the element having some sort of null in it. I am > attaching a patch and a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15129) Use the Solr TGZ artifact as Docker context
[ https://issues.apache.org/jira/browse/SOLR-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282025#comment-17282025 ] David Smiley commented on SOLR-15129: - See my comment [on PR 1769|https://github.com/apache/lucene-solr/pull/1769#issuecomment-729210262]; I will copy: bq. Wouldn't it be simpler for the release manager to build the docker image, examine the sha256 hash of the image, and publish that to the download location, making it official? Someone who wants to use the official Solr docker image who is ultra-paranoid can reference the image by hash like so: bq. bq. docker run --rm solr@sha256:02fe5f1ac04c28291fba23a18cd8765dd62c7a98538f07f2f7d8504ba217284d bq. That runs Solr 8.7, the official one. It's compact and can even be broadcasted easily in the release announcement for future Solr releases for people to get and run the latest release immediately, and be assured it's the correct one. bq. bq. I wonder what other major Apache projects do. CC [~janhoy] RE asking official images folks -- thanks for the reminder > Use the Solr TGZ artifact as Docker context > --- > > Key: SOLR-15129 > URL: https://issues.apache.org/jira/browse/SOLR-15129 > Project: Solr > Issue Type: Sub-task > Security Level: Public(Default Security Level. Issues are Public) >Affects Versions: master (9.0) >Reporter: Houston Putman >Priority: Major > > As discussed in SOLR-15127, there is a need for a unified Dockerfile that > allows for release and local builds. > This ticket is an attempt to achieve this by using the Solr distribution TGZ > as the docker context to build from. > Therefore release images would be completely reproducible by running: > {{docker build -f solr-9.0.0/Dockerfile > https://www.apache.org/dyn/closer.lua/lucene/solr/9.0.0/solr-9.0.0.tgz}} > The changes to the Solr distribution would include adding a Dockerfile at > {{solr-/Dockerfile}}, adding the docker scripts under > {{solr-/docker}}, and adding a version file at > {{solr-/VERSION.txt}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9747) Missing package-info.java causes NPE in MissingDoclet.java
[ https://issues.apache.org/jira/browse/LUCENE-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282028#comment-17282028 ] David Eric Pugh commented on LUCENE-9747: - :P > Missing package-info.java causes NPE in MissingDoclet.java > -- > > Key: LUCENE-9747 > URL: https://issues.apache.org/jira/browse/LUCENE-9747 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: David Eric Pugh >Priority: Minor > Attachments: LUCENE-9747.patch > > Time Spent: 1h 50m > Remaining Estimate: 0h > > When running {{./gradlew :solr:core:javadoc}} discovered that if a package > directory is missing the {{package-info.java}} file you get a VERY cryptic > error: > > {{javadoc: error - fatal error encountered: java.lang.NullPointerException}} > {{javadoc: error - Please file a bug against the javadoc tool via the Java > bug reporting page}} > > I poked around and that the {{MissingDoclet.java}} method call to > \{{reporter.print(Diagnostic.Kind.ERROR, element, fullMessage.toString());}} > was failing, due to the element having some sort of null in it. I am > attaching a patch and a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss opened a new pull request #2337: LUCENE-9747: dodge javadoc reporter NPE bug on Java 11.
dweiss opened a new pull request #2337: URL: https://github.com/apache/lucene-solr/pull/2337 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9747) Missing package-info.java causes NPE in MissingDoclet.java
[ https://issues.apache.org/jira/browse/LUCENE-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282029#comment-17282029 ] Dawid Weiss commented on LUCENE-9747: - Filed a slightly smaller PR. Passes for me. > Missing package-info.java causes NPE in MissingDoclet.java > -- > > Key: LUCENE-9747 > URL: https://issues.apache.org/jira/browse/LUCENE-9747 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: David Eric Pugh >Priority: Minor > Attachments: LUCENE-9747.patch > > Time Spent: 2h > Remaining Estimate: 0h > > When running {{./gradlew :solr:core:javadoc}} discovered that if a package > directory is missing the {{package-info.java}} file you get a VERY cryptic > error: > > {{javadoc: error - fatal error encountered: java.lang.NullPointerException}} > {{javadoc: error - Please file a bug against the javadoc tool via the Java > bug reporting page}} > > I poked around and that the {{MissingDoclet.java}} method call to > \{{reporter.print(Diagnostic.Kind.ERROR, element, fullMessage.toString());}} > was failing, due to the element having some sort of null in it. I am > attaching a patch and a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9747) Missing package-info.java causes NPE in MissingDoclet.java
[ https://issues.apache.org/jira/browse/LUCENE-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282031#comment-17282031 ] David Eric Pugh commented on LUCENE-9747: - LGTM. > Missing package-info.java causes NPE in MissingDoclet.java > -- > > Key: LUCENE-9747 > URL: https://issues.apache.org/jira/browse/LUCENE-9747 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: David Eric Pugh >Priority: Minor > Attachments: LUCENE-9747.patch > > Time Spent: 2h > Remaining Estimate: 0h > > When running {{./gradlew :solr:core:javadoc}} discovered that if a package > directory is missing the {{package-info.java}} file you get a VERY cryptic > error: > > {{javadoc: error - fatal error encountered: java.lang.NullPointerException}} > {{javadoc: error - Please file a bug against the javadoc tool via the Java > bug reporting page}} > > I poked around and that the {{MissingDoclet.java}} method call to > \{{reporter.print(Diagnostic.Kind.ERROR, element, fullMessage.toString());}} > was failing, due to the element having some sort of null in it. I am > attaching a patch and a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss merged pull request #2337: LUCENE-9747: dodge javadoc reporter NPE bug on Java 11.
dweiss merged pull request #2337: URL: https://github.com/apache/lucene-solr/pull/2337 This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9747) Missing package-info.java causes NPE in MissingDoclet.java
[ https://issues.apache.org/jira/browse/LUCENE-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282032#comment-17282032 ] Dawid Weiss commented on LUCENE-9747: - Thanks for reporting! > Missing package-info.java causes NPE in MissingDoclet.java > -- > > Key: LUCENE-9747 > URL: https://issues.apache.org/jira/browse/LUCENE-9747 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: David Eric Pugh >Priority: Minor > Attachments: LUCENE-9747.patch > > Time Spent: 2h > Remaining Estimate: 0h > > When running {{./gradlew :solr:core:javadoc}} discovered that if a package > directory is missing the {{package-info.java}} file you get a VERY cryptic > error: > > {{javadoc: error - fatal error encountered: java.lang.NullPointerException}} > {{javadoc: error - Please file a bug against the javadoc tool via the Java > bug reporting page}} > > I poked around and that the {{MissingDoclet.java}} method call to > \{{reporter.print(Diagnostic.Kind.ERROR, element, fullMessage.toString());}} > was failing, due to the element having some sort of null in it. I am > attaching a patch and a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-9747) Missing package-info.java causes NPE in MissingDoclet.java
[ https://issues.apache.org/jira/browse/LUCENE-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-9747. - Fix Version/s: master (9.0) Resolution: Fixed > Missing package-info.java causes NPE in MissingDoclet.java > -- > > Key: LUCENE-9747 > URL: https://issues.apache.org/jira/browse/LUCENE-9747 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: David Eric Pugh >Assignee: Dawid Weiss >Priority: Minor > Fix For: master (9.0) > > Attachments: LUCENE-9747.patch > > Time Spent: 2h > Remaining Estimate: 0h > > When running {{./gradlew :solr:core:javadoc}} discovered that if a package > directory is missing the {{package-info.java}} file you get a VERY cryptic > error: > > {{javadoc: error - fatal error encountered: java.lang.NullPointerException}} > {{javadoc: error - Please file a bug against the javadoc tool via the Java > bug reporting page}} > > I poked around and that the {{MissingDoclet.java}} method call to > \{{reporter.print(Diagnostic.Kind.ERROR, element, fullMessage.toString());}} > was failing, due to the element having some sort of null in it. I am > attaching a patch and a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-9747) Missing package-info.java causes NPE in MissingDoclet.java
[ https://issues.apache.org/jira/browse/LUCENE-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss reassigned LUCENE-9747: --- Assignee: Dawid Weiss > Missing package-info.java causes NPE in MissingDoclet.java > -- > > Key: LUCENE-9747 > URL: https://issues.apache.org/jira/browse/LUCENE-9747 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: David Eric Pugh >Assignee: Dawid Weiss >Priority: Minor > Attachments: LUCENE-9747.patch > > Time Spent: 2h > Remaining Estimate: 0h > > When running {{./gradlew :solr:core:javadoc}} discovered that if a package > directory is missing the {{package-info.java}} file you get a VERY cryptic > error: > > {{javadoc: error - fatal error encountered: java.lang.NullPointerException}} > {{javadoc: error - Please file a bug against the javadoc tool via the Java > bug reporting page}} > > I poked around and that the {{MissingDoclet.java}} method call to > \{{reporter.print(Diagnostic.Kind.ERROR, element, fullMessage.toString());}} > was failing, due to the element having some sort of null in it. I am > attaching a patch and a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9747) Missing package-info.java causes NPE in MissingDoclet.java
[ https://issues.apache.org/jira/browse/LUCENE-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282033#comment-17282033 ] ASF subversion and git services commented on LUCENE-9747: - Commit 1f5b37f299206b0d82d2105a0472b417898fc29f in lucene-solr's branch refs/heads/master from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=1f5b37f ] LUCENE-9747: dodge javadoc reporter NPE bug on Java 11. (#2337) > Missing package-info.java causes NPE in MissingDoclet.java > -- > > Key: LUCENE-9747 > URL: https://issues.apache.org/jira/browse/LUCENE-9747 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: David Eric Pugh >Assignee: Dawid Weiss >Priority: Minor > Fix For: master (9.0) > > Attachments: LUCENE-9747.patch > > Time Spent: 2h 10m > Remaining Estimate: 0h > > When running {{./gradlew :solr:core:javadoc}} discovered that if a package > directory is missing the {{package-info.java}} file you get a VERY cryptic > error: > > {{javadoc: error - fatal error encountered: java.lang.NullPointerException}} > {{javadoc: error - Please file a bug against the javadoc tool via the Java > bug reporting page}} > > I poked around and that the {{MissingDoclet.java}} method call to > \{{reporter.print(Diagnostic.Kind.ERROR, element, fullMessage.toString());}} > was failing, due to the element having some sort of null in it. I am > attaching a patch and a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15138) PerReplicaStates does not scale to large collections as well as state.json
[ https://issues.apache.org/jira/browse/SOLR-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282040#comment-17282040 ] Mike Drob commented on SOLR-15138: -- This patch is an improvement over what we had previously, but I don't think it takes care of the situation completely. I was able to create 5 collections in my cluster, but #6 timed out. Interestingly, though, #7 was created just fine. Maybe there's a race condition somewhere, then, if it isn't simply related to the number of existing or outstanding watches as subsequent collections are created. > PerReplicaStates does not scale to large collections as well as state.json > -- > > Key: SOLR-15138 > URL: https://issues.apache.org/jira/browse/SOLR-15138 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 8.8 >Reporter: Mike Drob >Assignee: Noble Paul >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > I was testing PRS collection creation with larger collections today > (previously I had tested with many small collections) and it seemed to be > having trouble keeping up. > > I was running a 4 node instance, each JVM with 4G Heap in k8s, and a single > zookeeper. > > With this cluster configuration, I am able to create several (at least 10) > collections with 11 shards and 11 replicas using the "old way" of keeping > state. These collections are created serially, waiting for all replicas to be > active before proceeding. > However, when attempting to do the same with PRS, the creation stalls on > collection 2 or 3, with several replicas stuck in a "down" state. Further, > when attempting to delete these collections using the regular API it > sometimes takes several attempts after getting stuck a few times as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-15138) PerReplicaStates does not scale to large collections as well as state.json
[ https://issues.apache.org/jira/browse/SOLR-15138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282042#comment-17282042 ] Mike Drob commented on SOLR-15138: -- For the shards that did not come up, I noticed that not all replicas had registered for leader election, and that there was no leader present for those shards. Maybe our creation timeouts need to take leaderVoteWait into account? > PerReplicaStates does not scale to large collections as well as state.json > -- > > Key: SOLR-15138 > URL: https://issues.apache.org/jira/browse/SOLR-15138 > Project: Solr > Issue Type: Bug > Components: SolrCloud >Affects Versions: 8.8 >Reporter: Mike Drob >Assignee: Noble Paul >Priority: Major > Time Spent: 1h > Remaining Estimate: 0h > > I was testing PRS collection creation with larger collections today > (previously I had tested with many small collections) and it seemed to be > having trouble keeping up. > > I was running a 4 node instance, each JVM with 4G Heap in k8s, and a single > zookeeper. > > With this cluster configuration, I am able to create several (at least 10) > collections with 11 shards and 11 replicas using the "old way" of keeping > state. These collections are created serially, waiting for all replicas to be > active before proceeding. > However, when attempting to do the same with PRS, the creation stalls on > collection 2 or 3, with several replicas stuck in a "down" state. Further, > when attempting to delete these collections using the regular API it > sometimes takes several attempts after getting stuck a few times as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] epugh commented on pull request #2306: SOLR-15121: Move XSLT (tr param) to scripting contrib
epugh commented on pull request #2306: URL: https://github.com/apache/lucene-solr/pull/2306#issuecomment-776276696 Okay, I've done a bunch of tweaking (banging my head?) against the ref guide docs, and they are working, and I think all the tests are passing. I don't like that the SolrJ tests depend on the sample_techproducts_configs directory, but I think adding some `startup=lazy` settings for the XSLT classes means that the links in the ref guide will work, and the solrj tests don't blow up on the missing xslt classes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (SOLR-14561) Validate parameters to CoreAdminAPI
[ https://issues.apache.org/jira/browse/SOLR-14561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282082#comment-17282082 ] Jan Høydahl commented on SOLR-14561: Did you try allowPaths? https://lucene.apache.org/solr/guide/8_6/solr-upgrade-notes.html > Validate parameters to CoreAdminAPI > --- > > Key: SOLR-14561 > URL: https://issues.apache.org/jira/browse/SOLR-14561 > Project: Solr > Issue Type: Improvement >Reporter: Jan Høydahl >Assignee: Jan Høydahl >Priority: Major > Fix For: 8.6 > > Time Spent: 4h 40m > Remaining Estimate: 0h > > CoreAdminAPI does not validate parameter input. We should limit what users > can specify for at least {{instanceDir and dataDir}} params, perhaps restrict > them to be relative to SOLR_HOME or SOLR_DATA_HOME. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] thelabdude commented on pull request #2328: SOLR-15145: System property to control whether base_url is stored in state.json to enable back-compat with older SolrJ versions
thelabdude commented on pull request #2328: URL: https://github.com/apache/lucene-solr/pull/2328#issuecomment-776287188 I ran a back-compat test with a client app built with SolrJ 8.7.0 and a server from this branch and it works as expected. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (SOLR-15148) Include a backcompat utility class with a main in solrj that we can use to test older SolrJ against a release candidate
Timothy Potter created SOLR-15148: - Summary: Include a backcompat utility class with a main in solrj that we can use to test older SolrJ against a release candidate Key: SOLR-15148 URL: https://issues.apache.org/jira/browse/SOLR-15148 Project: Solr Issue Type: New Feature Security Level: Public (Default Security Level. Issues are Public) Components: SolrJ Reporter: Timothy Potter Changes to SolrJ in 8.8.0 (SOLR-12182) broke backcompat (the fix is SOLR-15145); this should have been caught during RC smoke testing. A simple utility class that we can run during RC smoke testing to catch back-compat breaks like this would be useful. To keep things simple, the smoke tester can download the previous version of SolrJ from Maven central and invoke this Backcompat app (embedded in the SolrJ JAR) against the new Solr server in the RC. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
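A hypothetical shape for such a utility, sketched with the existing SolrJ API (every name here is illustrative; nothing like this class exists in SolrJ yet, and the target collection is assumed to already exist):

{code:java}
import java.util.Collections;
import java.util.Optional;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class BackcompatCheck {
  public static void main(String[] args) throws Exception {
    String zkHost = args.length > 0 ? args[0] : "localhost:9983";
    String collection = args.length > 1 ? args[1] : "backcompat"; // assumed to exist
    try (CloudSolrClient client =
        new CloudSolrClient.Builder(Collections.singletonList(zkHost), Optional.empty()).build()) {
      // Index a document, commit, and query it back through the old client.
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "backcompat-1");
      client.add(collection, doc);
      client.commit(collection);
      long found = client.query(collection, new SolrQuery("id:backcompat-1"))
          .getResults().getNumFound();
      System.out.println(found == 1 ? "backcompat OK" : "backcompat FAIL");
    }
  }
}
{code}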
[jira] [Commented] (SOLR-15011) /admin/logging handler should be able to configure logs on all nodes
[ https://issues.apache.org/jira/browse/SOLR-15011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282092#comment-17282092 ] Chris M. Hostetter commented on SOLR-15011: --- BadApple'ing the test just prevents it from being a noisy failure on jenkins ... BadApple tests are still run on developer boxes by default, so this test is still causing lots of failures. {quote}I tried something else that occurred to me... I merely commented out the substance of the issue (LoggingHandler calling into AdminHandlersProxy) and... the test still passed. I'm not surprised; this is an embedded test and thus all nodes share the same logging state. Hmm. I wonder if we can't realistically test this until we have Docker based test infrastructure with fully isolated Solr nodes. {quote} If that's the case then I would suggest the test just be deleted – or explicitly marked @AwaitsFix – but if it's going to stick around in a disabled state it should probably point at a new Jira to track if/how/when we might be able to adequately test it, so that the current Jira can be re-resolved and correctly track when this functionality was added. > /admin/logging handler should be able to configure logs on all nodes > > > Key: SOLR-15011 > URL: https://issues.apache.org/jira/browse/SOLR-15011 > Project: Solr > Issue Type: Improvement > Security Level: Public(Default Security Level. Issues are Public) > Components: logging >Reporter: David Smiley >Assignee: David Smiley >Priority: Major > Fix For: master (9.0) > > Time Spent: 3.5h > Remaining Estimate: 0h > > The LoggingHandler registered at /admin/logging can configure log levels for > the current node. This is nice but in SolrCloud, what's needed is an ability > to change the level for _all_ nodes in the cluster. I propose that this be a > parameter name "distrib" defaulting to SolrCloud mode's status. An admin UI > could have a checkbox for it. I don't propose that the read operations be > changed -- they can continue to just look at the node you are hitting. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Created] (LUCENE-9756) Extend FieldInfosFormat tests to cover points and vectors
Julie Tibshirani created LUCENE-9756: Summary: Extend FieldInfosFormat tests to cover points and vectors Key: LUCENE-9756 URL: https://issues.apache.org/jira/browse/LUCENE-9756 Project: Lucene - Core Issue Type: Test Reporter: Julie Tibshirani Currently {{BaseFieldInfoFormatTestCase}} doesn't exercise points, vectors, or the soft deletes field. We should make sure the test covers these options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
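As a rough illustration of the gap (standalone code, not the test case itself), a point field can be round-tripped through the default codec like this; the field name and dimensions are arbitrary, and some accessor names vary slightly across Lucene versions:

{code}
// Illustrative only: index a point field and read it back, the kind of
// codec round trip the extended test should assert on.
import org.apache.lucene.document.Document;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.PointValues;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class PointRoundTrip {
  public static void main(String[] args) throws Exception {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig())) {
        Document doc = new Document();
        doc.add(new IntPoint("pt", 1, 2)); // a two-dimensional int point
        w.addDocument(doc);
      }
      try (DirectoryReader r = DirectoryReader.open(dir)) {
        PointValues pv = r.leaves().get(0).reader().getPointValues("pt");
        System.out.println("points=" + pv.size()
            + " bytesPerDim=" + pv.getBytesPerDimension());
      }
    }
  }
}
{code}

The extended test case would do the analogous round trip for points, vectors, and the soft deletes field through each FieldInfosFormat under test, asserting that the reopened FieldInfo metadata matches what was written.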
[GitHub] [lucene-solr] jtibshirani commented on pull request #2269: LUCENE-9322: Add TestLucene90FieldInfosFormat
jtibshirani commented on pull request #2269: URL: https://github.com/apache/lucene-solr/pull/2269#issuecomment-776361691 I opened https://issues.apache.org/jira/browse/LUCENE-9756 to add tests for points and vectors. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jtibshirani opened a new pull request #2338: LUCENE-9756: Extend FieldInfosFormat tests to cover points and vectors
jtibshirani opened a new pull request #2338: URL: https://github.com/apache/lucene-solr/pull/2338 This commit adds coverage to `BaseFieldInfoFormatTestCase` for points, vectors, and the soft deletes field. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jtibshirani opened a new pull request #2339: LUCENE-9705: Reset internal version in Lucene90FieldInfosFormat.
jtibshirani opened a new pull request #2339: URL: https://github.com/apache/lucene-solr/pull/2339 Since this is a fresh format, we can remove older version logic and reset the internal version to 0. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
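Sketched in isolation, the version-reset pattern looks like this; the constant names are assumed for illustration, and the real `Lucene90FieldInfosFormat` fields may differ:

```java
// Hypothetical sketch (names assumed) of resetting a codec format's internal
// versioning: a brand-new format has no older on-disk versions to decode, so
// its numbering restarts at zero and version-dependent reader branches can
// simply be deleted.
final class FreshFormatVersions {
  static final int VERSION_START = 0;               // first version of the new format
  static final int VERSION_CURRENT = VERSION_START; // nothing older to support

  private FreshFormatVersions() {}
}
```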
[jira] [Commented] (LUCENE-9747) Missing package-info.java causes NPE in MissingDoclet.java
[ https://issues.apache.org/jira/browse/LUCENE-9747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17282163#comment-17282163 ] Robert Muir commented on LUCENE-9747: - whoah, nice work [~epugh] [~dweiss]. thanks for debugging through it! > Missing package-info.java causes NPE in MissingDoclet.java > -- > > Key: LUCENE-9747 > URL: https://issues.apache.org/jira/browse/LUCENE-9747 > Project: Lucene - Core > Issue Type: Improvement > Components: general/javadocs >Affects Versions: master (9.0) >Reporter: David Eric Pugh >Assignee: Dawid Weiss >Priority: Minor > Fix For: master (9.0) > > Attachments: LUCENE-9747.patch > > Time Spent: 2h 10m > Remaining Estimate: 0h > > When running {{./gradlew :solr:core:javadoc}} discovered that if a package > directory is missing the {{package-info.java}} file you get a VERY cryptic > error: > > {{javadoc: error - fatal error encountered: java.lang.NullPointerException}} > {{javadoc: error - Please file a bug against the javadoc tool via the Java > bug reporting page}} > > I poked around and found that the {{MissingDoclet.java}} method call to > \{{reporter.print(Diagnostic.Kind.ERROR, element, fullMessage.toString());}} > was failing, due to the element having some sort of null in it. I am > attaching a patch and a PR. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
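For readers following along, a hedged sketch of the defensive-reporting idea described above (not necessarily the committed patch): when the doclet has no usable {{Element}} to attach the diagnostic to, it can fall back to the {{Reporter}} overload that takes only a message, avoiding the NPE inside javadoc:

{code}
// Hypothetical defensive variant; the real MissingDoclet fix may differ.
import javax.lang.model.element.Element;
import javax.tools.Diagnostic;
import jdk.javadoc.doclet.Reporter;

final class SafeErrors {
  private final Reporter reporter;

  SafeErrors(Reporter reporter) {
    this.reporter = reporter;
  }

  // Report an error, tolerating a missing Element instead of letting the
  // javadoc tool crash with an opaque NullPointerException.
  void error(Element element, String message) {
    if (element == null) {
      reporter.print(Diagnostic.Kind.ERROR, message);
    } else {
      reporter.print(Diagnostic.Kind.ERROR, element, message);
    }
  }
}
{code}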