[GitHub] [lucene] SevenCss commented on issue #7820: CheckIndex cannot "fix" indexes that have individual segments with missing or corrupt .si files because sanity checks will fail trying to read the
SevenCss commented on issue #7820: URL: https://github.com/apache/lucene/issues/7820#issuecomment-1683471448

Hi @asfimport, I recently encountered a similar issue: there are two segments files in my index (I'm using Lucene 8.10):
- segments_a7
- segments_a8

segments_a7 refers to the already-deleted file `_be.si`. It seems that segments_a7 was left behind and not deleted successfully (we use KeepOnlyLastCommitDeletionPolicy). When I tried to examine the index with CheckIndex, it could neither detect the error nor fix the problem, so I would like to ask for some help here. A few questions:
- Are you planning to fix this open issue soon?
- If you happen to have experience with this, could you share some ideas on why segments_a7 might be left behind?
- Is there any solution other than manually deleting the corrupted segments file?

Below is the call stack:
```
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:293)
at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:165)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1089)
...
Caused by: java.nio.file.NoSuchFileException: Z:\BIN\x64\SearchData2\index\9d8a49d54e04e0be62c877acc18a5a0a\DossierContent\_be.si
	at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:85)
	at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
	at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:108)
	at java.base/sun.nio.fs.WindowsFileSystemProvider.newFileChannel(WindowsFileSystemProvider.java:120)
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:292)
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:345)
	at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
	at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:157)
	at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:91)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357)
	at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291)
	... 4 more
	Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (55e61d41). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(MMapIndexInput(path="Z:\BIN\x64\SearchData2\index\9d8a49d54e04e0be62c877acc18a5a0a\DossierContent\segments_a7")))
```

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format
mikemccand commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1683667646 > it took me days to digest [Lucene90BlockTreeTermsWriter](https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.html) and I'm still not sure I got every bits correct Sorry! This is my fault :) It'd be awesome to simplify the block-tree terms dictionary if you have ideas ... it is truly hairy. Yet it is fast and compactish and paged memory friendly (hot stuff localized together so OS can clearly cache that, large pages of cold stuff can be mostly left on disk, for indices that do not entirely fit in RAM). Also, thank you @Tony-X for creating the open source licensed (ASL2) combination of Tantivy's and Lucene's benchmark in [your repo](https://github.com/Tony-X/search-benchmark-game), enabling us to isolate/understand the performance and functional differences. This has already led to some nice cross-fertilization gains in Lucene, such as [optimizing `count()` for disjunctive queries](https://github.com/apache/lucene/pull/12415) -- see the [new nightly chart for `count(OrHighHigh)`](https://home.apache.org/~mikemccand/lucenebench/CountOrHighHigh.html) -- thank you @jpountz and @fulmicoton (Tantivy creator!) for [the idea](https://github.com/Tony-X/search-benchmark-game/issues/30#issuecomment-1579761787). The added cost of G1GC memory barriers ([separate issue](https://github.com/Tony-X/search-benchmark-game/issues/45#issuecomment-1682165680), a 4.9% latency hit to `AndHighHigh`, thanks @slow-J and @uschindler for suggesting we test/isolate GC effects), was surprising to me. +1 to explore a terms dictionary format similar to Tantivy's. I think the experimental (no backwards compatibility!) `FSTPostingsFormat` is close? 
It holds all terms in a single FST (for each segment), and maps to a `byte[]` blob holding all metadata (corpus statistics, maybe pulsed posting if the term appears only once in all docs, else pointers to the `.doc`/`.pos`/etc. postings files) for this term. To match Tantivy's approach we would change that to dereference through a `long` (there can be > 2.1 B terms in one segment) ordinal instead of inlining all metadata in a single `byte[]`, so that the FST only stores this ordinal and then looks up all the term metadata in a different data structure? But, that FST can get quite large, and take quite a bit of time to create during indexing, though FSTs are off-heap now, so perhaps letting the OS decide the hot vs warm pages will be fine at search time. Term dictionary heavy queries, e.g. `FuzzyQuery` or `RegexpQuery`, might become faster? Maybe this eventually becomes Lucene's default terms dictionary!
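The ordinal indirection described above can be sketched with a toy, self-contained example (hypothetical illustration, not Lucene's actual `FSTPostingsFormat` code): a sorted term array stands in for the FST, the lookup returns a monotonically increasing ordinal, and per-term metadata lives in a separate parallel store indexed by that ordinal, rather than being inlined as a variable-length `byte[]` per term.

```java
import java.util.Arrays;

/**
 * Toy sketch of a term dictionary that maps terms to ordinals, with
 * metadata dereferenced through the ordinal in a second, parallel
 * structure. A real implementation would use an FST (e.g. built with
 * positive-int outputs) for the term -> ordinal step.
 */
class OrdinalTermDict {
    private final String[] sortedTerms;  // stand-in for the FST keys
    private final long[] docFreqByOrd;   // per-term metadata, ordinal-indexed

    OrdinalTermDict(String[] sortedTerms, long[] docFreqByOrd) {
        this.sortedTerms = sortedTerms;
        this.docFreqByOrd = docFreqByOrd;
    }

    /** Returns the term's ordinal, or -1 if the term does not exist. */
    long lookupOrd(String term) {
        int idx = Arrays.binarySearch(sortedTerms, term);
        return idx >= 0 ? idx : -1;
    }

    /** Second hop: dereference metadata through the ordinal. */
    long docFreq(long ord) {
        return docFreqByOrd[(int) ord];
    }
}
```

Because ordinals are dense and monotonic, the metadata store can be a flat columnar structure, which is the compression-friendly property discussed in this thread.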
[GitHub] [lucene] almogtavor commented on issue #12406: Register nested queries (ToParentBlockJoinQuery) to Lucene Monitor
almogtavor commented on issue #12406: URL: https://github.com/apache/lucene/issues/12406#issuecomment-1683918288 @romseygeek @dweiss @uschindler @dsmiley @gsmiller I'd love to get feedback from you on the subject.
[GitHub] [lucene] easyice opened a new issue, #12514: Could we add more index for BKD LeafNode?
easyice opened a new issue, #12514: URL: https://github.com/apache/lucene/issues/12514 ### Description Currently in a BKD leaf node, we scan all 512 values and call `visitor.visit` in the `CELL_CROSSES_QUERY` case. This is usually not an issue for range queries, but for a point query, such as `PointInSetQuery`, maybe only one value hits in a leaf node, yet we still scan the whole block. In the 1D case, values are visited in increasing order, so maybe we can create a small index structure over these 512 values, for instance writing the point value/offset for every 32 values, to skip point values that cannot match. I wrote some POC code; it improves query performance by about 20% at high cardinality (10 million), but shows no improvement at lower cardinality.
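The proposed per-leaf skip structure could look roughly like this toy sketch (hypothetical, self-contained code, not Lucene's actual BKD implementation; the name `LeafSkipIndex` is made up): for a sorted 1D leaf of up to 512 values, record the first value of each 32-value run, binary-search the run starts, and scan at most one run for a point lookup instead of all 512 values.

```java
/**
 * Toy sketch of a skip index over one sorted BKD leaf block:
 * the first value of every 32-value run is stored separately, so a
 * point lookup binary-searches the run starts and then scans a
 * single run.
 */
class LeafSkipIndex {
    static final int RUN = 32;
    private final long[] values;     // sorted leaf values (up to 512)
    private final long[] runStarts;  // first value of each 32-value run

    LeafSkipIndex(long[] sortedValues) {
        this.values = sortedValues;
        int runs = (sortedValues.length + RUN - 1) / RUN;
        this.runStarts = new long[runs];
        for (int r = 0; r < runs; r++) {
            runStarts[r] = sortedValues[r * RUN];
        }
    }

    /** Returns true if the leaf contains the point, scanning one run only. */
    boolean contains(long point) {
        // find the last run whose first value is <= point
        int lo = 0, hi = runStarts.length - 1, run = -1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (runStarts[mid] <= point) { run = mid; lo = mid + 1; }
            else hi = mid - 1;
        }
        if (run < 0) return false;  // point is smaller than every value
        int end = Math.min(values.length, (run + 1) * RUN);
        for (int i = run * RUN; i < end; i++) {
            if (values[i] == point) return true;
            if (values[i] > point) return false;  // sorted: can stop early
        }
        return false;
    }
}
```

A real version would store file offsets alongside the run starts and plug into the existing visitor machinery; this sketch only shows the skipping logic that trades 16 extra stored values per leaf for scanning at most 32 values per point lookup.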
[GitHub] [lucene] Tony-X commented on issue #12513: Try out a tantivy's term dictionary format
Tony-X commented on issue #12513: URL: https://github.com/apache/lucene/issues/12513#issuecomment-1684251100 Thanks @mikemccand for bringing in the context. I should've done that part better :) > FSTPostingsFormat is close? It holds all terms in a single FST (for each segment), and maps to a byte[] blob holding all metadata (corpus statistics, maybe pulsed posting if the term appears only once in all docs, else pointers to the .doc/.pos/etc. postings files) for this term. Yes, I actually tried to use `FSTPostingsFormat` in the benchmark game and had to increase the heap size from 4g to 32g to work around the on-heap memory demand. Search-wise, the performance got slightly worse. So I set out to dig deeper and realized what you pointed out -- the FST maps each term to a byte[] blob (the term's postings metadata). I have not gone through the full details of the [paper](https://citeseerx.ist.psu.edu/doc/10.1.1.24.3698) that underpins the FSTCompiler implementation, but I believe mapping to 8-byte, monotonically increasing ordinals is much easier than mapping to variable-length, unordered byte[] blobs. Also, compression-wise, the FST may do a great job compressing the keys, but not so for the blobs. > But, that FST can get quite large, and take quite a bit of time to create during indexing I think if we move the values out of the FST we could balance the size. Time-wise, I'm not sure; hopefully the simplified value space makes building the FST easier. This requires some experimentation. > Term dictionary heavy queries, e.g. FuzzyQuery or RegexpQuery, might become faster? Maybe this eventually becomes Lucene's default terms dictionary! Yes, this could be very promising :) The fact that it is an FST containing all terms makes it efficient to skip non-existent terms.
[GitHub] [lucene] stefanvodita commented on a diff in pull request #12337: Index arbitrary fields in taxonomy docs
stefanvodita commented on code in PR #12337: URL: https://github.com/apache/lucene/pull/12337#discussion_r1298912817

## lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyIndexReader.java: ##
```
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import java.io.IOException;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.store.Directory;
+
+/**
+ * This is like a {@link DirectoryTaxonomyReader}, except it provides access to the underlying
+ * {@link DirectoryReader} and full path field name.
+ */
+public class DirectoryTaxonomyIndexReader extends DirectoryTaxonomyReader {
```

Review Comment: Great point @epotyom! I’ve thought a bit more about this and I’d like to consider exposing the `IndexReader` of `DirectoryTaxonomyReader` by making `getInternalIndexReader` public instead of protected. I actually like this solution better than what I coded up previously. It’s cleaner, it’s backwards compatible, and a user could have already gotten the `IndexReader` anyway by extending `DirectoryTaxonomyReader`. I’m curious if anyone has other ideas though.