Sebastian -
There’s some precedent out there for ISBNs. Bill Dueber and the
UMICH/code4lib folks have done amazing work; check it out here:
https://github.com/mlibrary/umich_solr_library_filters
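
If you'd rather not touch the schema right away, one client-side option is to
strip the hyphens from ISBN-looking input before you build the query. Below is
a minimal sketch of that idea, not something taken from the filters above. The
names normalize_isbn_term and quick_search_query are made up for illustration,
and the "looks like an ISBN" rule (10 or 13 characters once hyphens are
removed) is an assumption you may want to tighten.

```python
import re

# Assumption: after removing hyphens, an ISBN is either 13 digits, or
# 9 digits followed by a digit or an X check character (ISBN-10).
_ISBN_COMPACT = re.compile(r"^(?:\d{9}[\dXx]|\d{13})$")


def normalize_isbn_term(term: str) -> str:
    """Return the term without hyphens if it looks like an ISBN, else unchanged."""
    compact = term.replace("-", "")
    if _ISBN_COMPACT.match(compact):
        return compact
    return term


def quick_search_query(user_input: str, field: str = "text") -> str:
    """Build a 'field:term*' query string for a single-term quick search.

    Hypothetical helper: it does not escape Solr special characters, so
    non-ISBN input still needs the escaping your application already does.
    """
    term = normalize_isbn_term(user_input.strip())
    return f"{field}:{term}*"


print(quick_search_query("978-3-8052-5094-8"))
```

This relies on the unhyphenated form being present in the "text" field, which
your mail says is already the case.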
- Erik
> On Jan 5, 2017, at 5:08 AM, Sebastian Riemer <[email protected]> wrote:
>
> Hi folks,
>
>
> TL;DR: Is there an easy way to copy ISBNs with hyphens to the general text
> field, or to configure the analyzer on that field, so that a search
> for the hyphenated ISBN returns exactly the matching document?
>
> Long version:
> I've defined a field "text" of type "text_general", to which I copy all my
> other fields, so that I can do a "quick search" where I set q=text
>
> The definition of the type text_general is like this:
>
> <fieldType name="text_general" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" />
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true"
>             words="stopwords.txt" />
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>             ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> I now face the problem that searching for a book with
> text:978-3-8052-5094-8* does not return the single result I expect. However,
> searching for text:9783805250948* does return it. Note that I am
> adding a wildcard at the end automatically, to further broaden the result
> set. Note also that it does not seem to matter whether I put backslashes in
> front of the hyphens or not (to be exact, when sending via SolrJ from my
> application, I put in the backslashes, but I don't see a difference when
> using SolrAdmin, as I guess SolrAdmin automatically inserts backslashes if
> needed?)
>
> When storing ISBNs, I store them twice, once with hyphens
> (978-3-8052-5094-8) and once without (9783805250948). A pure phrase search on
> either of those values also returns the single document.
>
> I learned that the StandardTokenizer splits up values from fields at index
> time, and I've also learned that I can use the solrAdmin analysis and the
> debugQuery to help understand what is going on. From the analysis screen I
> see that the value 9783805250948 at index time and 9783805250948* at
> query time both lead to an unchanged value 9783805250948 at the end.
> When given the value 978-3-8052-5094-8 for "Field Value (Index)" and
> 978-3-8052-5094-8* for "Field Value (Query)", I can see how the ISBN is
> tokenized into 5 parts. Again, the values match on both sides (Index and
> Query).
>
> How does the left side correlate with the right side? My guess: The left side
> means, "Values stored in field text will be tokenized while indexing as shown
> here on the left". The right side means, "When querying on the field text,
> I'll tokenize the entered value like this, and see if I find something in the
> index". Is this correct?
>
> Another question: when querying and investigating the single document in
> solrAdmin, the contents I see in the column text represent the _stored_
> value of the field text, right?
> And am I correct that this has nothing to do with what is actually
> stored in the index for searching?
>
> When storing the value 978-3-8052-5094-8, are only the tokenized values
> stored for search, or is the "whole word" also stored? Is there a way to
> actually see all the values which are stored for search?
> When searching text:" 978-3-8052-5094-8" I get the single result, so I guess
> the value as a whole must also be stored in the index for searching?
>
> One more thing which confuses me:
> Searching for text: 978-3-8052-5094-8 gives me 72 results, because it leads
> to searching for "parsedquery_toString":"text:978 text:3 text:8052 text:5094
> text:8",
> but searching for text: 978-3-8052-5094-8* gives me 0 results, because it
> leads to "parsedquery_toString":"text:978-3-8052-5094-8*".
>
> Why is the appended wildcard changing the behaviour so radically? I'd rather
> expect to get something like "parsedquery_toString":"text:978 text:3
> text:8052 text:5094 text:8*", and thus even more results.
>
> Btw. I've found and read an interesting blog post about storing ISBNs and
> the like here:
> http://robotlibrarian.billdueber.com/2012/03/solr-field-type-for-numericish-ids/
> However, I already store my ISBN also in a separate field, of type string,
> which works fine when I use this field for searching.
>
> Best regards, sorry for the enormously long question and thank you for
> listening.
>
> Sebastian
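
On the wildcard question above: the parsedquery output quoted in your mail
shows that the wildcard term reaches the index as one untokenized string,
text:978-3-8052-5094-8*, while the indexed terms are the pieces the tokenizer
produced. The toy model below is only meant to illustrate that mismatch. It is
not Solr code: toy_tokenize is a crude stand-in for StandardTokenizer that
splits on anything non-alphanumeric, and wildcard_hits uses Python's fnmatch
as an assumed approximation of wildcard matching against single terms.

```python
import re
from fnmatch import fnmatchcase


def toy_tokenize(value: str) -> list:
    """Crude stand-in for the index-time analyzer: lowercase, split on non-alphanumerics."""
    return re.findall(r"[0-9a-z]+", value.lower())


def wildcard_hits(pattern: str, indexed_terms: list) -> list:
    """Compare the raw wildcard pattern with each indexed term; the pattern is not tokenized."""
    return [term for term in indexed_terms if fnmatchcase(term, pattern)]


# Both stored forms of the ISBN end up as terms in the same field.
indexed = toy_tokenize("978-3-8052-5094-8") + toy_tokenize("9783805250948")

print(indexed)
print(wildcard_hits("978-3-8052-5094-8*", indexed))  # no single term contains hyphens
print(wildcard_hits("9783805250948*", indexed))      # matches the unhyphenated term
```

In this model no indexed term contains a hyphen, so the hyphenated wildcard
pattern has nothing to match, while the unhyphenated pattern matches the term
produced from the second stored value.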