Search for ISBN-like identifiers

Sebastian Riemer Thu, 05 Jan 2017 02:09:07 -0800

Hi folks,


TL;DR: Is there an easy way to copy ISBNs with hyphens to the general text 
field, or to configure the analyzer on that field, so that a search for 
the hyphenated ISBN returns exactly the matching document?

Long version:
I've defined a field "text" of type "text_general", to which I copy all my 
other fields, so that I can do a "quick search" against it (q=text).

The definition of the type text_general is like this:



<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>


I now face the problem that searching for a book with 
text:978-3-8052-5094-8* does not return the single result I expect, whereas 
searching for text:9783805250948* does. Note that I append a wildcard 
automatically to broaden the result set. Note also that it does not seem to 
matter whether I put a backslash in front of each hyphen or not. (To be 
exact, when sending the query via SolrJ from my application I do add the 
backslashes, but I see no difference when using SolrAdmin; I guess SolrAdmin 
automatically inserts backslashes if needed?)

When storing ISBNs, I store them twice: once with hyphens 
(978-3-8052-5094-8) and once without (9783805250948). A pure phrase search on 
either of those values also returns the single document.

I learned that the StandardTokenizer splits up field values at index time, 
and that I can use the SolrAdmin analysis screen and debugQuery to help 
understand what is going on. On the analysis screen I see that the value 
9783805250948 at index time and 9783805250948* at query time both end up as 
the unchanged value 9783805250948.
When I enter 978-3-8052-5094-8 for "Field Value (Index)" and 
978-3-8052-5094-8* for "Field Value (Query)", I can see how the ISBN is 
tokenized into 5 parts. Again, the values match on both sides (Index and Query).
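
To make the split concrete, here is a rough Python sketch of what the analysis 
screen shows me for this input. It is only an approximation under my own 
assumptions: the helper name approx_tokenize is made up, and splitting on 
non-alphanumeric characters merely mimics the result for this ISBN. The real 
StandardTokenizer follows Unicode word-break rules and is not implemented 
like this.

```python
import re


def approx_tokenize(value):
    """Crude stand-in for StandardTokenizer + LowerCaseFilter.

    Assumption: every run of non-alphanumeric characters is a token
    boundary. That reproduces the analysis-screen output for the ISBN
    example, but it is not Solr's actual algorithm.
    """
    return [tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", value) if tok]


# Index side and query side (with the trailing wildcard) give the same tokens.
print(approx_tokenize("978-3-8052-5094-8"))   # ['978', '3', '8052', '5094', '8']
print(approx_tokenize("978-3-8052-5094-8*"))  # ['978', '3', '8052', '5094', '8']
print(approx_tokenize("9783805250948"))       # ['9783805250948']
```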

How does the left side correlate with the right side? My guess: the left side 
means "values stored in the field text will be tokenized while indexing as 
shown here on the left". The right side means "when querying on the field 
text, I'll tokenize the entered value like this and see if I find something 
in the index". Is this correct?

Another question: when querying and inspecting the single document in 
SolrAdmin, the contents I see in the column text represent the _stored_ value 
of the field text, right?
And am I correct that this has nothing to do with what is actually stored in 
the index for searching?

When storing the value 978-3-8052-5094-8, are only the tokenized values stored 
for search, or is the "whole word" also stored? Is there a way to actually see 
all the values which are stored for search?
When searching text:" 978-3-8052-5094-8" I get the single result, so I guess 
the value as a whole must also be stored in the index for searching?

One more thing which confuses me:
Searching for text: 978-3-8052-5094-8 gives me 72 results, because it leads to 
searching for "parsedquery_toString":"text:978 text:3 text:8052 text:5094 
text:8".
But searching for text: 978-3-8052-5094-8* gives me 0 results; this leads to 
"parsedquery_toString":"text:978-3-8052-5094-8*".

Why does the appended wildcard change the behaviour so radically? I would 
have expected something like "parsedquery_toString":"text:978 text:3 text:8052 
text:5094 text:8*", and thus even more results.
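
A small Python sketch of how I picture the 0 results, assuming the wildcard 
term is compared as-is (not tokenized) against the indexed tokens, which is 
what the parsedquery_toString above suggests. The helper names are 
hypothetical, and the tokenization is a crude approximation of my analyzer 
chain, not Solr's actual code.

```python
import re


def approx_index_terms(values):
    """Collect the tokens I assume end up in the index for the given values.

    Assumption: split on non-alphanumeric runs and lowercase, which
    matches what the analysis screen shows for these ISBNs.
    """
    terms = set()
    for value in values:
        terms.update(t.lower() for t in re.split(r"[^0-9A-Za-z]+", value) if t)
    return terms


def prefix_matches(prefix, terms):
    """Return the indexed terms that start with the untokenized prefix."""
    return sorted(t for t in terms if t.startswith(prefix))


# Both forms of the ISBN are copied into the field "text".
terms = approx_index_terms(["978-3-8052-5094-8", "9783805250948"])

# No single indexed token starts with the hyphenated string.
print(prefix_matches("978-3-8052-5094-8", terms))  # []
# The unhyphenated form survives as one token, so the prefix matches it.
print(prefix_matches("9783805250948", terms))      # ['9783805250948']
```

If that assumption holds, it would also explain why text:9783805250948* works 
while the hyphenated wildcard query does not.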

By the way, I've found and read an interesting blog post about storing ISBNs 
and similar identifiers here: 
http://robotlibrarian.billdueber.com/2012/03/solr-field-type-for-numericish-ids/
However, I already store my ISBN in a separate field of type string as well, 
which works fine when I use that field for searching.

Best regards, sorry for the enormously long question and thank you for 
listening.

Sebastian
