This all depends on how the tokenizers take your URLs apart. To quickly
see what ended up in the index, go to a core in the UI, select Schema
Browser, select the field containing your URLs, click on "Load Term Info".
In your case, for the field holding the URL, you could try switching to a
tokenizer that defines a token as a sequence of alphanumeric characters,
roughly [a-z0-9]+ plus diacritics. In particular, punctuation and
separator characters like dash, underscore, slash, dot and the like
would never be part of a token, i.e. they make no difference to matching.
Then you can search the url parts with a phrase query
(https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-SpecifyingTermsfortheStandardQueryParser)
like
url:"IAE-UPC-0001"
In the same way as during indexing, the dashes are removed at query time,
leaving three tokens, namely IAE, UPC and 0001. Furthermore, the tokens have
to appear in that order. Naturally this will then match anything like:
"IAE_UPC_0001"
"IAE UPC 0001"
"IAE/UPC+0001"
"IAE\UPC\0001"
"IAE.UPC,0001"
Depending on how your URLs are structured, there is the chance for false
positives, of course.
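To make the matching behaviour concrete, here is a minimal Python sketch. It
is not Solr or Lucene code, only a model of the idea: the regular expression
stands in for a tokenizer that keeps runs of letters and digits, and the
function names (tokenize, phrase_match) as well as the example.com URLs are
made up for illustration.

```python
import re

# Assumption: a token is a run of letters/digits; "_" and all punctuation
# act as separators. This only models the tokenizer described above.
TOKEN = re.compile(r"[^\W_]+")


def tokenize(text):
    """Split text into alphanumeric tokens, dropping every separator."""
    return TOKEN.findall(text)


def phrase_match(query, document):
    """True if the query tokens occur consecutively and in order in document."""
    q, d = tokenize(query), tokenize(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


if __name__ == "__main__":
    urls = [
        "http://example.com/docs/IAE-UPC-0001.pdf",   # hypothetical URLs
        "http://example.com/docs/IAE_UPC_0001.pdf",
        "http://example.com/IAE/UPC/0001/index.html",  # possible false positive
        "http://example.com/docs/0001-UPC-IAE.pdf",    # wrong token order
    ]
    for url in urls:
        print(url, phrase_match("IAE-UPC-0001", url))
```

The third URL shows the kind of false positive mentioned above: the three
tokens are adjacent and in order, even though they come from separate path
segments rather than from one document reference.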
The Really Good Thing here is that you don't need to use wildcards.
I have not yet looked at the wildcard query implementation in
Solr/Lucene, but in the commercial search engines I know, wildcards are a
great way to lose the confidence of your users, because they just don't
work as expected by anyone who doesn't know the implementation. Either they
deliver only partial results, or they kill performance, or they even
run out of memory. Unless the Solr committers have done something really
ingenious, Solr/Lucene has the same problems.
Harald.
On 31.07.2014 18:31, Paul Rogers wrote:
Hi Guys
I have a Solr application searching on data uploaded by Nutch. The search
I wish to carry out is for a particular document reference contained within
the "url" field, e.g. IAE-UPC-0001.
The problem is that the file names that make up the URLs are not
consistent, so a URL might contain the reference as IAE-UPC-0001 or
IAE_UPC_0001 (i.e. using either the minus or the underscore as the
delimiter), but not both.
I have created the query (in the solr admin interface):
url:"IAE-UPC-0001"
which works (returning the single expected document), as do:
url:"IAE*UPC*0001"
url:"IAE?UPC?0001"
when the doc ref is in the format IAE-UPC-0001 (ie using the minus sign as
a delimiter).
However:
url:"IAE_UPC_0001"
url:"IAE*UPC*0001"
url:"IAE?UPC?0001"
do not work (returning zero documents) when the doc ref is in the format
IAE_UPC_0001 (ie using the underscore character as the delimiter).
I'm assuming the underscore is a special character, but I have looked
at the Solr wiki and can't find anything that says what the problem is.
Also, the minus sign has a specific meaning, but that is nullified by
adding the quotes.
Can anyone suggest what I'm doing wrong?
Many thanks
Paul
--
Harald Kirsch
Raytion GmbH
Kaiser-Friedrich-Ring 74
40547 Duesseldorf
Fon +49 211 53883-216
Fax +49-211-550266-19
http://www.raytion.com