Re: Search for ISBN-like identifiers

Sebastian Riemer Thu, 05 Jan 2017 10:41:27 -0800

Thank you very much for taking the time to help me!

I'll definitely have a look at the link you've posted.


@ShawnHeisey Thanks too for shedding light on the wildcard behaviour!

Allow me one further question:
- Assuming that I define a separate field for storing the ISBNs, using the 
awesome analyzer provided by Mr. Bill Dueber: how do I get that field copied 
into my general text field, which is used by my QuickSearch input? Won't that 
field be processed again by the analyzer defined on the text field?
- Should I alternatively add more fields to the q parameter? For now, I 
always set q=text:<whatever_I_want_to_search_here>, but I guess one could 
try something like 
q=text:<whatever_i_want_to_search>+isbnspeciallookupfield:<whatever_i_want_to_search>

I don't really know about that last idea though, since the searches are 
probably OR-combined, which is not what I would like to have.
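
A minimal Python sketch of how such a two-field query string could be built with an explicit operator, so that the combination does not depend on the parser's default. The helper names escape_term and build_query are hypothetical, the field names are the ones from my message, and the escaped character set is my reading of the Lucene query syntax, so please treat it as an assumption rather than a reference:

```python
def escape_term(value: str) -> str:
    """Backslash-escape characters that are special in Lucene/Solr query syntax.

    Assumption: this character set follows the classic query parser docs.
    """
    special = set('+-&|!(){}[]^"~*?:\\/')
    return "".join("\\" + ch if ch in special else ch for ch in value)


def build_query(user_input: str, operator: str = "OR") -> str:
    """Search the same input in both fields, joined by an explicit operator."""
    term = escape_term(user_input.strip())
    # Parentheses keep multi-word input attached to the field name.
    return f"text:({term}) {operator} isbnspeciallookupfield:({term})"


if __name__ == "__main__":
    print(build_query("978-3-8052-5094-8"))
    print(build_query("978-3-8052-5094-8", operator="AND"))
```

Spelling out OR or AND in the query makes the intent visible in debugQuery output instead of relying on whatever the default operator happens to be.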

A third option would, of course, be to decide in my application which field 
to query in Solr. I.e., everything matching a regex of only digits and 
hyphens with length 13 -> don't query the field text; instead use the field 
isbnspeciallookupfield.
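
The third option could look roughly like this in Python. This is only a sketch: the function names looks_like_isbn13 and choose_field are made up for illustration, and I am assuming that "length 13" means 13 digits once the hyphens are removed (the hyphenated form is 17 characters long):

```python
import re

# Only digits and hyphens are allowed in an ISBN-like input.
ISBN_LIKE = re.compile(r"[0-9-]+")


def looks_like_isbn13(value: str) -> bool:
    """Return True if value consists of digits/hyphens and has 13 digits."""
    value = value.strip()
    if not ISBN_LIKE.fullmatch(value):
        return False
    digits = value.replace("-", "")
    return len(digits) == 13


def choose_field(user_input: str) -> str:
    """Pick the Solr field to query for this input."""
    if looks_like_isbn13(user_input):
        return "isbnspeciallookupfield"
    return "text"


if __name__ == "__main__":
    for example in ["978-3-8052-5094-8", "9783805250948", "harry potter"]:
        print(example, "->", choose_field(example))
```

The check deliberately does not validate the ISBN checksum; it only decides where to look.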


Many thanks again, and have a nice day!
Sebastian


-----Original Message-----
From: Erik Hatcher [mailto:[email protected]] 
Sent: Thursday, January 5, 2017 19:10
To: [email protected]
Subject: Re: Search for ISBN-like identifiers

Sebastian -

There’s some precedent out there for ISBNs.  Bill Dueber and the 
UMICH/code4lib folks have done amazing work; check it out here:

        https://github.com/mlibrary/umich_solr_library_filters

  - Erik


> On Jan 5, 2017, at 5:08 AM, Sebastian Riemer <[email protected]> wrote:
> 
> Hi folks,
> 
> 
> TL;DR: Is there an easy way to copy ISBNs with hyphens to the general text 
> field, or to configure the analyzer on that field, so that a search 
> for the hyphenated ISBN returns exactly the matching document?
> 
> Long version:
> I've defined a field "text" of type "text_general", into which I copy all 
> my other fields, to be able to do a "quick search" where I set 
> q=text
> 
> The definition of the type text_general is like this:
> 
> <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
>     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
> 
> I now face the problem that searching for a book with 
> text:978-3-8052-5094-8* does not return the single result I expect. 
> However, searching for text:9783805250948* does return a result. 
> Note that I am adding a wildcard at the end automatically, to further 
> broaden the result set. Note also that it does not seem to matter 
> whether I put backslashes in front of the hyphens or not (to be exact, 
> when sending via SolrJ from my application, I put in the backslashes, 
> but I don't see a difference when using SolrAdmin, as I guess SolrAdmin 
> automatically inserts backslashes if needed?)
> 
> When storing ISBNs, I store them twice, once with hyphens 
> (978-3-8052-5094-8) and once without (9783805250948). A pure phrase search on 
> either of those values also returns the single document.
> 
> I learned that the StandardTokenizer splits up values from fields at index 
> time, and I've also learned that I can use the SolrAdmin analysis screen and 
> debugQuery to help understand what is going on. From the analysis screen I 
> see that the value 9783805250948 at index time and 9783805250948* at 
> query time both lead to an unchanged value 9783805250948 at the end.
> When given the value 978-3-8052-5094-8 for "Field Value (Index)" and 
> 978-3-8052-5094-8* for "Field Value (Query)", I can see how the ISBN is 
> tokenized into 5 parts. Again, the values match on both sides (Index and 
> Query).
> 
> How does the left side correlate with the right side? My guess: The left side 
> means, "Values stored in field text will be tokenized while indexing as shown 
> here on the left". The right side means, "When querying on the field text, 
> I'll tokenize the entered value like this, and see if I find something in the 
> index". Is this correct?
> 
> Another question: when querying and investigating the single document in 
> SolrAdmin, the contents I see in the column text represent the _stored_ 
> value of the field text, right?
> And am I correct that this has nothing to do with what is actually 
> stored in the index for searching?
> 
> When storing the value 978-3-8052-5094-8, are only the tokenized values 
> stored for search, or is the "whole word" also stored? Is there a way to 
> actually see all the values which are stored for search?
> When searching text:" 978-3-8052-5094-8" I get the single result, so I guess 
> the value as a whole must also be stored in the index for searching?
> 
> One more thing which confuses me:
> Searching for text: 978-3-8052-5094-8 gives me 72 results, because it 
> leads to searching for "parsedquery_toString":"text:978 text:3 
> text:8052 text:5094 text:8", but searching for text: 
> 978-3-8052-5094-8* gives me 0 results; this leads to 
> "parsedquery_toString":"text:978-3-8052-5094-8*".
> 
> Why is the appended wildcard changing the behaviour so radically? I'd rather 
> expect to get something like "parsedquery_toString":"text:978 text:3 
> text:8052 text:5094 text:8*", and thus even more results.
> 
> By the way, I've found and read an interesting blog post about storing ISBNs 
> and the like here: 
> http://robotlibrarian.billdueber.com/2012/03/solr-field-type-for-numericish-ids/
> However, I already store my ISBN in a separate field of type string as well, 
> which works fine when I use this field for searching.
> 
> Best regards, sorry for the enormously long question and thank you for 
> listening.
> 
> Sebastian
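
A rough Python simulation of the wildcard behaviour asked about in the quoted message. It assumes two things that the analysis screen and the debugQuery output above suggest: the hyphenated ISBN is split into five tokens at index time, and a term with a trailing wildcard is not tokenized at query time but compared as one prefix against the indexed terms. The tokenize function is a crude stand-in, not the real StandardTokenizer rules, and all function names are hypothetical:

```python
import re


def tokenize(value: str) -> list[str]:
    """Crude stand-in for the index-time analysis: split on non-alphanumerics."""
    return [token.lower() for token in re.findall(r"[0-9A-Za-z]+", value)]


def index_terms(values: list[str]) -> set[str]:
    """Collect the terms that would end up in the index for these values."""
    terms: set[str] = set()
    for value in values:
        terms.update(tokenize(value))
    return terms


def prefix_matches(terms: set[str], query: str) -> list[str]:
    """Treat the wildcard query as a single untokenized prefix."""
    prefix = query.rstrip("*")
    return sorted(term for term in terms if term.startswith(prefix))


if __name__ == "__main__":
    terms = index_terms(["978-3-8052-5094-8", "9783805250948"])
    print(prefix_matches(terms, "978-3-8052-5094-8*"))
    print(prefix_matches(terms, "9783805250948*"))
```

Under these assumptions no indexed term starts with the hyphenated string, which would explain the 0 results, while the unhyphenated prefix matches the one term that was indexed whole.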
