bq: I am still getting "suggestions" (from spellcheck.q)

OK, this is actually expected behavior. The spellcheck is done from
the _indexed_ terms. Documents deleted from the index are only marked
as deleted; the associated terms are not purged from the index until
the segment is merged. When just reading the terms for spellcheck,
there's no good way to figure out that a term belongs only to deleted
docs.
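
To make that concrete, here is a rough SolrJ sketch (untested; the core
URL, handler name, document id and term are all made up) of how a
suggestion can survive a deleteById plus hard commit, because the term
is still sitting in a not-yet-merged segment:

import java.util.Arrays;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DeletedTermStillSuggested {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build()) {

            // Delete the only doc containing the term and hard-commit.
            client.deleteById(Arrays.asList("doc-with-rare-term"));
            client.commit();

            // Regular search no longer finds the doc...
            QueryResponse search = client.query(new SolrQuery("rareterm"));
            System.out.println("hits: " + search.getResults().getNumFound()); // 0

            // ...but spellcheck reads the indexed terms, and the terms of the
            // (merely marked-deleted) doc are still there until segments merge.
            SolrQuery spell = new SolrQuery();
            spell.setRequestHandler("/spell");   // assumed handler name
            spell.set("spellcheck", "true");
            spell.set("spellcheck.q", "rareterm");
            QueryResponse qr = client.query(spell);
            System.out.println(qr.getSpellCheckResponse().getSuggestions());
        }
    }
}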

Your expungeDeletes "fix" wouldn't actually have fixed your problem in
any kind of production environment. ExpungeDeletes only merges
segments with more than n% deleted docs. It fixed your test case only
because, I suspect, you have very few documents (perhaps only one) in
your segment, so it was merged away. In a situation where you had,
say, 10,000 docs in a segment and you deleted the one (and only)
document with some term, expungeDeletes would skip merging that
segment and spellcheck would still return the suggestion.

Optimize, on the other hand, unconditionally rewrites all segments
into a single segment, so that is what removed the indexed term. As
discussed, optimize is a _very_ expensive operation and, unless you're
able to optimize after every indexing session, it will not scale. The
situations where I've seen this be acceptable are ones in which the
index changes rarely, for example when you update your index once a
day. If you continually update your index, optimizing will actually
make this problem worse between optimizations; see:
https://lucidworks.com/2017/10/13/segment-merging-deleted-documents-optimize-may-bad/
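
For reference, the two operations look roughly like this from SolrJ
(just a sketch; expungeDeletes is the standard commit parameter, but
double-check the setAction/setParam variants in your SolrJ version):

import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
import org.apache.solr.client.solrj.request.UpdateRequest;

class MergeVsOptimize {
    // Commit with expungeDeletes=true: only segments whose share of deleted
    // docs exceeds the merge policy's threshold get merged, so a lightly
    // deleted segment (and its "dead" terms) can survive untouched.
    static void commitWithExpungeDeletes(SolrClient client)
            throws SolrServerException, IOException {
        UpdateRequest req = new UpdateRequest();
        req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
        req.setParam("expungeDeletes", "true");  // standard commit parameter
        req.process(client);
    }

    // Optimize: unconditionally rewrites the whole index into one segment,
    // which does purge the deleted terms, but is very expensive.
    static void forceMergeToOneSegment(SolrClient client)
            throws SolrServerException, IOException {
        client.optimize();
    }
}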

At a higher level, you're expending a lot of effort to handle the case
where you delete the last "live" doc in your entire corpus that
contains a term. For a decent-sized corpus this will be quite rare, so
people often simply don't worry about it. The scenario in your test
case is somewhat artificial and makes it seem more likely than it
probably will be "in the real world".

Consider setting spellcheck's thresholdTokenFrequency to some value.
That parameter's primary purpose is to handle situations where words
are misspelled in the documents, so that you don't suggest those
misspelled words, but I think it would cover this situation too.
Unfortunately it will not work very well in a simple test setup
either. Let's say you set it to 2%. You index 100 documents and 3 of
them contain the term; the term clears the threshold and shows up in
your spellcheck test. Now you delete two of those three (without
merging any segments). Because the deleted documents' terms still
count until their segments are merged, the term frequency is _still_
3%, so the term may still be suggested after you delete and commit.

I suppose you could structure your test this way (a rough SolrJ sketch
follows the list):
Set your threshold to 2%.
Index 100 docs, 3 of them containing a specific term.
Check that the term is suggested.
Index 100 more docs.
Check that the term is _not_ suggested.
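
Something like this, very roughly (untested; the handler name, ids and
term are made up, it uses the _my_suggest_phrase field from your
config, and it assumes the dictionary is built with a 2% threshold,
e.g. thresholdTokenFrequency=0.02):

import java.io.IOException;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.common.SolrInputDocument;

class ThresholdSuggestScenario {

    // Index `count` docs; the first `withTerm` of them contain the term.
    static void indexDocs(SolrClient client, int count, int withTerm, String term)
            throws SolrServerException, IOException {
        for (int i = 0; i < count; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "batch-" + System.nanoTime() + "-" + i);
            doc.addField("_my_suggest_phrase", i < withTerm ? term : "filler text");
            client.add(doc);
        }
        client.commit();  // buildOnCommit=true rebuilds the dictionary here
    }

    static boolean isSuggested(SolrClient client, String misspelled)
            throws SolrServerException, IOException {
        SolrQuery q = new SolrQuery();
        q.setRequestHandler("/spell");  // assumed handler name
        q.set("spellcheck", "true");
        q.set("spellcheck.q", misspelled);
        return !client.query(q).getSpellCheckResponse().getSuggestions().isEmpty();
    }

    static void runScenario(SolrClient client) throws Exception {
        indexDocs(client, 100, 3, "flibbertigibbet");               // 3/100 = 3% >= 2%
        System.out.println(isSuggested(client, "flibbertigibet"));  // expect true
        indexDocs(client, 100, 0, "unused");                        // 3/200 = 1.5% < 2%
        System.out.println(isSuggested(client, "flibbertigibet"));  // expect false
    }
}

Note that this sidesteps deletions entirely: the term drops below the
threshold because the total document count grows, not because anything
was purged from the index.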

Best,
Erick

On Sun, Jan 28, 2018 at 7:24 AM, Clemens Wyss DEV <clemens...@mysign.ch> wrote:
> I must clarify a few things:
> The unit test I mentioned does not perform a DBQ but a "simple" deleteById.
> The deleted document is no longer found (as expected), BUT I am still getting
> "suggestions" (from spellcheck.q). So my problem is not that I find deleted
> documents, but that I get suggestions derived from the deleted document.
>
> The suggestion configuration is as follows:
>
> <searchComponent name="suggest_phrase" class="solr.SpellCheckComponent">
>     <lst name="spellchecker">
>         <str name="name">suggest_phrase_fuzzy</str>
>         <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
>         <str name="lookupImpl">org.apache.solr.spelling.suggest.fst.FuzzyLookupFactory</str>
>         <str name="allTermsRequired">true</str>
>         <str name="maxEdits">2</str>
>         <str name="ignoreCase">true</str>
>         <str name="field">_my_suggest_phrase</str>
>         <str name="suggestAnalyzerFieldType">string</str> <!-- suggest_phrase -->
>         <!-- <str name="storeDir">suggest_phrase_fuzzy</str> -->
>         <str name="buildOnOptimize">false</str>
>         <str name="buildOnStartup">false</str> <!-- ?? -->
>         <str name="buildOnCommit">true</str>
>     </lst>
> </searchComponent>
>
> Most importantly: "buildOnCommit"->true.
>
> The question hence is: what (which commit?) do I need to do after
>
>     solrClient.deleteById( toBeDeletedDocumentIDs );
>
> for the suggestions to be up to date too (without a heavy commit/optimize)?
>
> thx and sorry for the misunderstandings
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: Saturday, 27 January 2018 18:20
> To: solr-user <solr-user@lucene.apache.org>
> Subject: Re: AW: AW: SolrClient#updateByQuery?
>
> Clemens:
>
> Let's not raise a JIRA quite yet. I am 99% sure your test is not doing what
> you think, or you have some invalid expectations. This is such a fundamental
> feature that it'd surprise me a _lot_ if it were a bug. Also, there are a
> bunch of DeleteByQuery tests in the JUnit suite that run all the time.
>
> Wait, are you issuing an explicit commit or not? I saw the phrase
> "...brutely by forcing a commit (with "expunge deletes")..." and saw the word
> "commit" and assumed you were issuing a commit, but re-reading, that's not
> clear at all. The code should look something like:
>
> solrClient.deleteByQuery( ... );  // your update-via-delete-by-query
> solrClient.commit();              // explicit hard commit
> // query to see if the doc is gone
>
> So here's what I'd try next:
>
> 1> Issue an explicit commit command (SolrClient.commit()) after the
> DBQ. The defaults there are openSearcher=true and waitSearcher=true. When
> that returns, _then_ issue your query.
> 2> If that doesn't work, try (just for information gathering) waiting
> several seconds after the commit before trying your request. This should
> _not_ be necessary, but it'll give us a clue about what's going on.
> 3> Show us the code if you can.
>
> Best,
> Erick
>
>
> On Sat, Jan 27, 2018 at 6:55 AM, Clemens Wyss DEV <clemens...@mysign.ch> 
> wrote:
>> Erick said/wrote:
>>> If you commit after docs are deleted and _still_ see them in search
>>> results, that's a JIRA
>> should I JIRA it?
>>
>> -----Original Message-----
>> From: Shawn Heisey [mailto:apa...@elyograg.org]
>> Sent: Saturday, 27 January 2018 12:05
>> To: solr-user@lucene.apache.org
>> Subject: Re: AW: AW: SolrClient#updateByQuery?
>>
>> On 1/27/2018 12:49 AM, Clemens Wyss DEV wrote:
>>> Thanks for all these (main contributor's 😉) valuable inputs!
>>>
>>> First thing I did was getting rid of "expungeDeletes". My
>>> "single-deletion" unit test failed until I added the optimize param
>>>> updateReques.setParam( "optimize", "true" );
>>> Does this make sense, or should I JIRA it?
>>> How expensive is this "optimization"?
>>
>> An optimize operation is a complete rewrite of the entire index to one
>> segment.  It will typically double the size of the index on disk while it
>> runs.  The rewritten index will not contain any of the deleted documents.
>> It's slow and extremely expensive.  If the index is one gigabyte, expect an
>> optimize to take at least half an hour, possibly longer, to complete.  The
>> CPU and disk I/O are going to take a beating while the optimize is
>> occurring.
>>
>> Thanks,
>> Shawn
