Re: Synonym-time Reindexing Issues

Erick Erickson Thu, 07 Apr 2011 05:45:27 -0700

OK, see below.

On Wed, Apr 6, 2011 at 6:22 PM, Preston Marshall <pres...@synergyeoc.com> wrote:

> Reply Inline:
> On Apr 6, 2011, at 8:12 AM, Erick Erickson wrote:
>
> > Hmmm, this should work just fine. Here are my questions.
> >
> > 1> are you absolutely sure that the new synonym file
> >     is available when reindexing?
> Not sure what you mean here, solr is running as root, and the file is never
> moved around or anything crazy.
>

Just a sanity check that you're changing the indexing file you think you're
changing. I've sometimes managed to be in the wrong directory, on the wrong
machine, etc.

Hmmm, what happens if you just stop/start the server instead of deleting the
index? I'm wondering if the old file is still being used (assuming *nix here).
I have no evidence this could be the case, but it's an idea.

> > 2> does the sunspot program do anything wonky with
> >     the ids? The documents
> >     will only be replaced if the IDs are identical.
> Is there a way I can add debugging to show what it's doing with the IDs or
> something to view the index?  I tried using Luke, but I can't get it to
> actually show me the actual data of the objects, only the name and some
> other basic info.
>

The issue is seeing whatever has been defined as the <uniqueKey> field. In
the default schema, it's defined as "id". I'm NOT talking about the internal
Lucene ID, it's entirely about what's defined in your schema. Set
stored="true" for fields to see them easily. The point here is that Solr
updates documents based on <uniqueKey>. If there is no such field,
reindexing your documents will simply add another copy, and the original
will still be searchable.
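
One way to look at the <uniqueKey> values is to ask Solr for just that
field. What follows is only a minimal Python sketch. It assumes a default
single-core install at http://localhost:8983/solr/select and that "id" is
stored; the helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: base URL of a default single-core Solr install; adjust to yours.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_id_query(rows=20, key_field="id"):
    """Return a URL asking Solr for only the uniqueKey field of all documents."""
    params = {"q": "*:*", "fl": key_field, "rows": rows, "wt": "json"}
    return SOLR_SELECT + "?" + urlencode(params)


def extract_ids(response, key_field="id"):
    """Pull the key values out of a parsed JSON select response."""
    return [doc[key_field] for doc in response["response"]["docs"]]


def fetch_ids(rows=20):
    """Run the query against a live server (needs a running Solr)."""
    with urlopen(build_id_query(rows)) as resp:
        return extract_ids(json.load(resp))
```

Calling fetch_ids() before and after a reindex and comparing the two lists
should show whether the same record keeps the same id.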

> > 3> are you sure that a commit is done at the end?
> It appears that it commits a few times during reindexing.
> > 4> What happens if you optimize? At that point, maxdocs
> >     and numdocs should be the same, and should be the count
> >     of documents. if they differ by a factor of 2, I'd suspect your
> >     id field isn't being used correctly.
> I'm unaware of what you mean by optimizing, or even viewing maxdocs and
> numdocs, but I will RTFM to find out.  I did notice something strange
> earlier though that may relate to this.  When I ran a search there were
> duplicate results.
>

OK, see the <uniqueKey> discussion above. It really sounds like
re-indexing the data is merely adding documents again and again
and again, not replacing the first copy with the second. If this
is true, your numDocs and maxDocs should be nearly equal the
first time and grow by the number of documents you index every
time you reindex. If/when your <uniqueKey> is working, you should
see numDocs stay constant and maxDocs go up by the number of
documents you re-index.
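
The two cases can be illustrated with a toy model. This is only a sketch
of the bookkeeping described above, not how Lucene actually stores
documents; the class and function names are hypothetical.

```python
import itertools


class ToyIndex:
    """Toy model of add/replace bookkeeping; not real Lucene behavior."""

    def __init__(self, unique_key=None):
        self.unique_key = unique_key
        self.slots = []  # each slot is [doc, is_live]

    def add(self, doc):
        # With a unique key, an older live doc with the same key is marked deleted.
        if self.unique_key is not None:
            for slot in self.slots:
                if slot[1] and slot[0].get(self.unique_key) == doc.get(self.unique_key):
                    slot[1] = False
        self.slots.append([doc, True])

    @property
    def num_docs(self):
        # Live (searchable) documents only.
        return sum(1 for _, live in self.slots if live)

    @property
    def max_docs(self):
        # Live plus deleted-but-not-yet-reclaimed documents.
        return len(self.slots)


def run_passes(index, records, make_id, passes=2):
    """Index the same records several times; return (numDocs, maxDocs)."""
    for _ in range(passes):
        for rec in records:
            index.add({"id": make_id(rec), "name": rec})
    return index.num_docs, index.max_docs


if __name__ == "__main__":
    records = ["a", "b", "c"]
    # Same id on every pass: old copies are replaced.
    print(run_passes(ToyIndex("id"), records, lambda rec: rec))
    # Fresh id on every pass: every copy stays searchable.
    counter = itertools.count()
    print(run_passes(ToyIndex("id"), records, lambda rec: next(counter)))
```

With stable ids the toy reports (3, 6) after two passes; with a fresh id
per pass it reports (6, 6), which is the duplicate-results case.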

Sending an optimize command to the indexer will reclaim all
unused resources and bring numDocs and maxDocs back
to the same value, but this is probably not your problem.
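
For reference, an optimize is just an XML message posted to the update
handler. A minimal sketch, assuming the default URL
http://localhost:8983/solr/update (adjust for your install); the function
names are made up for illustration.

```python
from urllib.request import Request, urlopen

# Assumption: update handler of a default single-core Solr install.
SOLR_UPDATE = "http://localhost:8983/solr/update"


def build_optimize_request(update_url=SOLR_UPDATE):
    """Build (but do not send) a POST carrying an <optimize/> message."""
    return Request(
        update_url,
        data=b"<optimize/>",
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )


def send_optimize():
    """Send the optimize to a live server and return the HTTP status."""
    with urlopen(build_optimize_request()) as resp:
        return resp.status
```

Optimizing a large index can take a while, so this is something to try
on a test copy first.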

I do see that "id" is the <uniqueKey> in your schema. So I'm
guessing, especially because the comment says that this
field is used by sunspot, that the sunspot stuff is creating
a new id for each document when you re-index. If all this is
true, it's an issue with sunspot. So here's what I predict. If you
look at the id field you'll see some sunspot-generated id that's
unique for every added document even if it's a new copy
of an old document, so Solr sees two separate, entirely
unrelated documents. The old one has the old synonyms and
the new one the new list.

The maxDocs/numDocs are available on the admin page, click
the "statistics" link.
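
If you note the two numbers before and after a reindex, the reasoning
above boils down to a small check. This is only a sketch; the function
name and the labels it returns are hypothetical, and it assumes nothing
else is writing to the index in between.

```python
def diagnose(before, after, reindexed):
    """Compare (numDocs, maxDocs) readings taken before and after a reindex.

    Returns a verdict plus the number of deleted-but-unreclaimed documents.
    """
    num_before, _max_before = before
    num_after, max_after = after
    grew_by = num_after - num_before
    if grew_by == 0:
        verdict = "replaced"   # ids matched, old copies were overwritten
    elif grew_by == reindexed:
        verdict = "appended"   # every document was added as a new one
    else:
        verdict = "unclear"    # partial overlap or other writes in between
    return verdict, max_after - num_after
```

For example, diagnose((100, 100), (200, 200), 100) gives
("appended", 0), which would point at the ids changing on each reindex.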

Best
Erick

> >
> > If the hypothesis that your id field isn't working correctly holds, your
> > number of hits should be going up after re-indexing...
> >
> > If none of that is relevant, let us know what you find and we'll
> > try something else....
> >
> > Best
> > Erick
> >
> > On Tue, Apr 5, 2011 at 10:46 PM, Preston Marshall <pres...@synergyeoc.com> wrote:
> >
> >> Hello all, I am having an issue with Solr and the SynonymFilterFactory. I
> >> am using a library to interface with Solr called "sunspot." I realize that
> >> is not what this list is for, but I believe this may be an issue with Solr,
> >> not the library (plus the lib author doesn't know the answer). I am using
> >> the SynonymFilterFactory in my index-time analyzer, and it works great. My
> >> only problem is when it comes to changing the synonyms file. I would expect
> >> to be able to edit the file, run a reindex (this is through the library),
> >> and have the new synonyms function when the reindex is complete.
> >> Unfortunately this is not the case, as changing the synonyms file doesn't
> >> actually affect the search results. What DOES work is deleting the existing
> >> index, and starting from scratch. This is unacceptable for my usage though,
> >> because I need the old index to remain online while the new one is being
> >> built, so there is no downtime.
> >>
> >> Here's my schema in case anyone needs it:
> >> https://gist.github.com/88f8fb763e99abe4d5b8
> >>
> >> Thanks,
> >> Preston
> >>
> >> P.S. Sorry if this dupes, first post and I didn't see it show up in the
> >> archives.
> >>
>
>
