[GitHub] [lucene] SevenCss commented on issue #7820: CheckIndex cannot "fix" indexes that have individual segments with missing or corrupt .si files because sanity checks will fail trying to read the

2023-08-18 Thread via GitHub


SevenCss commented on issue #7820:
URL: https://github.com/apache/lucene/issues/7820#issuecomment-1683471448

   Hi @asfimport,
   Recently I encountered a similar issue: there are two segments files in my index (I'm using Lucene 8.10):
   - segments_a7
   - segments_a8

   segments_a7 refers to the already-deleted file `_be.si`. It seems that segments_a7 was left behind and not deleted successfully. (We use KeepOnlyLastCommitDeletionPolicy.)
   
   When I tried to examine the index with CheckIndex, it could neither detect the error nor fix the problem.
   
   So I would like to ask for some help here. A few questions:
   - Are you planning to fix this open issue soon?
   - Could you share any ideas on why segments_a7 was left behind, if you happen to have experience with this?
   - Is there any solution other than manually deleting the corrupted segments file?
   
   Below is the call stack:
   
   ```
   at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:293)
   at org.apache.lucene.index.IndexFileDeleter.<init>(IndexFileDeleter.java:165)
   at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1089)
   ...
   Caused by: java.nio.file.NoSuchFileException: Z:\BIN\x64\SearchData2\index\9d8a49d54e04e0be62c877acc18a5a0a\DossierContent\_be.si
   at java.base/sun.nio.fs.WindowsException.translateToIOException(WindowsException.java:85)
   at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:103)
   at java.base/sun.nio.fs.WindowsException.rethrowAsIOException(WindowsException.java:108)
   at java.base/sun.nio.fs.WindowsFileSystemProvider.newFileChannel(WindowsFileSystemProvider.java:120)
   at java.base/java.nio.channels.FileChannel.open(FileChannel.java:292)
   at java.base/java.nio.channels.FileChannel.open(FileChannel.java:345)
   at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:238)
   at org.apache.lucene.store.Directory.openChecksumInput(Directory.java:157)
   at org.apache.lucene.codecs.lucene86.Lucene86SegmentInfoFormat.read(Lucene86SegmentInfoFormat.java:91)
   at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:357)
   at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291)
   ... 4 more
   Suppressed: org.apache.lucene.index.CorruptIndexException: checksum passed (55e61d41). possibly transient resource issue, or a Lucene or JVM bug (resource=BufferedChecksumIndexInput(MMapIndexInput(path="Z:\BIN\x64\SearchData2\index\9d8a49d54e04e0be62c877acc18a5a0a\DossierContent\segments_a7")))
   ```
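
   For reference, the CheckIndex command-line invocation I ran was along these lines (the jar name is a placeholder for my classpath); note that `-exorcise` removes segments that fail the check, losing their documents, rather than repairing them:

   ```shell
   # Illustrative command only; lucene-core jar path is a placeholder.
   # -exorcise drops unreferenced/broken segments instead of repairing them.
   java -cp lucene-core-8.10.1.jar org.apache.lucene.index.CheckIndex \
     "Z:\BIN\x64\SearchData2\index\9d8a49d54e04e0be62c877acc18a5a0a\DossierContent" -exorcise
   ```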
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] mikemccand commented on issue #12513: Try out a tantivy's term dictionary format

2023-08-18 Thread via GitHub


mikemccand commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1683667646

   > it took me days to digest 
[Lucene90BlockTreeTermsWriter](https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/codecs/lucene90/blocktree/Lucene90BlockTreeTermsWriter.html)
 and I'm still not sure I got every bits correct
   
   Sorry!  This is my fault :)  It'd be awesome to simplify the block-tree 
terms dictionary if you have ideas ... it is truly hairy.  Yet it is fast and 
compactish and paged memory friendly (hot stuff localized together so OS can 
clearly cache that, large pages of cold stuff can be mostly left on disk, for 
indices that do not entirely fit in RAM).
   
   Also, thank you @Tony-X for creating the open source licensed (ASL2) 
combination of Tantivy's and Lucene's benchmark in [your 
repo](https://github.com/Tony-X/search-benchmark-game), enabling us to 
isolate/understand the performance and functional differences.
   
   This has already led to some nice cross-fertilization gains in Lucene, such 
as [optimizing `count()` for disjunctive 
queries](https://github.com/apache/lucene/pull/12415) -- see the [new nightly 
chart for 
`count(OrHighHigh)`](https://home.apache.org/~mikemccand/lucenebench/CountOrHighHigh.html)
 -- thank you @jpountz and @fulmicoton (Tantivy creator!) for [the 
idea](https://github.com/Tony-X/search-benchmark-game/issues/30#issuecomment-1579761787).
  The added cost of G1GC memory barriers ([separate 
issue](https://github.com/Tony-X/search-benchmark-game/issues/45#issuecomment-1682165680),
 a 4.9% latency hit to `AndHighHigh`, thanks @slow-J and @uschindler for 
suggesting we test/isolate GC effects), was surprising to me.
   
   +1 to explore a terms dictionary format similar to Tantivy's.  I think the experimental (no backwards compatibility!) `FSTPostingsFormat` is close?  It holds all terms in a single FST (for each segment), and maps each term to a `byte[]` blob holding all of its metadata (corpus statistics, maybe a pulsed posting if the term appears only once in all docs, else pointers into the `.doc`/`.pos`/etc. postings files).  To match Tantivy's approach we would change that to dereference through a `long` ordinal (there can be > 2.1 B terms in one segment) instead of inlining all metadata in a single `byte[]`, so that the FST only stores this ordinal and the term metadata is looked up in a different data structure?  But that FST can get quite large, and take quite a bit of time to create during indexing, though FSTs are off-heap now, so perhaps letting the OS decide the hot vs warm pages will be fine at search time.  Term-dictionary-heavy queries, e.g. `FuzzyQuery` or `RegexpQuery`, might become faster?  Maybe this eventually becomes Lucene's default terms dictionary!
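
   A minimal sketch of the ordinal indirection described above (a sorted map stands in for the FST, which in Lucene lives in `org.apache.lucene.util.fst`; all names here are hypothetical):

   ```java
   import java.util.ArrayList;
   import java.util.List;
   import java.util.OptionalLong;
   import java.util.TreeMap;

   class OrdinalTermsDictSketch {
       // Stand-in for the FST: maps each term to a monotonically increasing ordinal.
       private final TreeMap<String, Long> termToOrd = new TreeMap<>();
       // Term metadata lives outside the "FST", in structures indexed by ordinal.
       private final List<Integer> docFreqs = new ArrayList<>();
       private final List<Long> postingsPointers = new ArrayList<>();

       void addTerm(String term, int docFreq, long postingsPointer) {
           // Terms must arrive in sorted order so ordinals are monotonic,
           // mirroring how an FST is built from sorted input.
           termToOrd.put(term, (long) docFreqs.size());
           docFreqs.add(docFreq);
           postingsPointers.add(postingsPointer);
       }

       OptionalLong ord(String term) {
           Long ord = termToOrd.get(term);
           return ord == null ? OptionalLong.empty() : OptionalLong.of(ord);
       }

       int docFreq(long ord) { return docFreqs.get((int) ord); }

       long postingsPointer(long ord) { return postingsPointers.get((int) ord); }
   }
   ```

   The FST then only needs to encode small, monotonic long outputs, and a term-not-found is decided entirely inside the FST without touching the metadata structure.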




[GitHub] [lucene] almogtavor commented on issue #12406: Register nested queries (ToParentBlockJoinQuery) to Lucene Monitor

2023-08-18 Thread via GitHub


almogtavor commented on issue #12406:
URL: https://github.com/apache/lucene/issues/12406#issuecomment-1683918288

   @romseygeek @dweiss @uschindler @dsmiley @gsmiller I'd love to get feedback from you on the subject.





[GitHub] [lucene] easyice opened a new issue, #12514: Could we add more index for BKD LeafNode?

2023-08-18 Thread via GitHub


easyice opened a new issue, #12514:
URL: https://github.com/apache/lucene/issues/12514

   ### Description
   
   Currently, in a BKD leaf node we scan all 512 values and call `visitor.visit` in the `CELL_CROSSES_QUERY` case. This is usually not an issue for range queries, but for a point query such as `PointInSetQuery`, maybe only one value in a leaf node matches, yet we still scan the whole block. In the 1D case, values are visited in increasing order, so maybe we can create a small index structure over these 512 values, for instance writing the point value/offset every 32 values, to skip point values that cannot match. I wrote some POC code; it improves query performance by about 20% at high cardinality (10 million) but shows no improvement at lower cardinality.
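
   The per-32-values skip structure described above could look roughly like this (pure-Java sketch outside of Lucene, names hypothetical; duplicates spanning run boundaries are ignored for brevity):

   ```java
   import java.util.Arrays;
   import java.util.function.LongConsumer;

   class LeafBlockSkipIndex {
       private static final int STRIDE = 32;
       private final long[] values;   // the sorted 1D leaf block, e.g. 512 values
       private final long[] samples;  // first value of every 32-value run

       LeafBlockSkipIndex(long[] sortedValues) {
           this.values = sortedValues;
           int runs = (sortedValues.length + STRIDE - 1) / STRIDE;
           this.samples = new long[runs];
           for (int i = 0; i < runs; i++) {
               samples[i] = sortedValues[i * STRIDE];
           }
       }

       /** Scans only the run that can contain {@code target}; returns the number of values visited. */
       int find(long target, LongConsumer visitor) {
           int run = Arrays.binarySearch(samples, target);
           if (run < 0) run = -run - 2;   // run whose first value is <= target
           if (run < 0) return 0;         // target is smaller than every value
           int start = run * STRIDE;
           int end = Math.min(start + STRIDE, values.length);
           int visited = 0;
           for (int i = start; i < end; i++) {
               visited++;
               if (values[i] == target) visitor.accept(values[i]);
               if (values[i] > target) break; // sorted: no later match in this run
           }
           return visited;
       }
   }
   ```

   Instead of visiting all 512 values, a point lookup touches at most one 32-value run, which matches the intuition that the gain shows up mostly at high cardinality, where each leaf contains few matches.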




[GitHub] [lucene] Tony-X commented on issue #12513: Try out a tantivy's term dictionary format

2023-08-18 Thread via GitHub


Tony-X commented on issue #12513:
URL: https://github.com/apache/lucene/issues/12513#issuecomment-1684251100

   Thanks @mikemccand for bringing in the context. I should've done that part 
better :) 
   
   > FSTPostingsFormat is close? It holds all terms in a single FST (for each 
segment), and maps to a byte[] blob holding all metadata (corpus statistics, 
maybe pulsed posting if the term appears only once in all docs, else pointers 
to the .doc/.pos/etc. postings files) for this term.
   
   Yes, I actually tried `FSTPostingsFormat` in the benchmark game, and I had to increase the heap size from 4g to 32g to work around the on-heap memory demand. Search-wise, performance got slightly worse. So I set out to dig deeper and realized what you pointed out: the FST maps each term to a byte[] blob (the term's postings metadata). I have not gone through the full details of the [paper](https://citeseerx.ist.psu.edu/doc/10.1.1.24.3698) that underpins the FSTCompiler implementation, but I believe mapping to 8-byte, monotonically increasing ordinals is much easier than mapping to variable-length, unordered byte[] blobs. Also, compression-wise, the FST may do a great job compressing the keys, but not so for the blobs.
   
   > But, that FST can get quite large, and take quite a bit of time to create 
during indexing
   
   I think if we move the values out of the FST we could balance the size. Time-wise, I'm not sure; hopefully the simplified value space makes building the FST easier. This requires some experimentation.
   
   > Term dictionary heavy queries, e.g. FuzzyQuery or RegexpQuery, might 
become faster? Maybe this eventually becomes Lucene's default terms dictionary!
   
   Yes, this can be very promising :) The fact that it is an FST containing all terms makes it efficient to skip non-existent terms.
   




[GitHub] [lucene] stefanvodita commented on a diff in pull request #12337: Index arbitrary fields in taxonomy docs

2023-08-18 Thread via GitHub


stefanvodita commented on code in PR #12337:
URL: https://github.com/apache/lucene/pull/12337#discussion_r1298912817


##
lucene/facet/src/java/org/apache/lucene/facet/taxonomy/directory/DirectoryTaxonomyIndexReader.java:
##
@@ -0,0 +1,43 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.facet.taxonomy.directory;
+
+import java.io.IOException;
+import org.apache.lucene.index.DirectoryReader;
+import org.apache.lucene.store.Directory;
+
+/**
+ * This is like a {@link DirectoryTaxonomyReader}, except it provides access 
to the underlying
+ * {@link DirectoryReader} and full path field name.
+ */
+public class DirectoryTaxonomyIndexReader extends DirectoryTaxonomyReader {

Review Comment:
   Great point @epotyom! I’ve thought a bit more about this and I’d like to 
consider exposing the `IndexReader` of `DirectoryTaxonomyReader` by making 
`getInternalIndexReader` public instead of protected. I actually like this 
solution better than what I coded up previously. It’s cleaner, it’s backwards 
compatible, and a user could have already gotten the `IndexReader` anyway by 
extending `DirectoryTaxonomyReader`. I’m curious if anyone has other ideas 
though.


