Hi Guys

Thanks for the replies. I've had a look at the WordDelimiterFilterFactory and the Term Info for the url field. It seems that all the terms exist, and I now understand that each url is being broken up using the delimiters specified. But I think I'm still missing something.

Am I correct in assuming the minus sign (-) is also a delimiter? If so, why does url:"IAE-UPC-0001" return a result (when the url contains the substring IAE-UPC-0001) whereas url:"IAE_UPC_0001" doesn't (when the url contains the substring IAE_UPC_0001)?

Secondly, if the url has indeed been broken into the terms IAE, UPC and 0001, why do all the searches suggested or tried succeed when the delimiter is a minus sign (-) but return zero matches when the delimiter is an underscore (_)?

Finally, shouldn't the query url:"IAE UPC 0001"~1 work, since all it is looking for is the three terms?

Many thanks for any enlightenment.

P

On 4 August 2014 01:33, Harald Kirsch <harald.kir...@raytion.com> wrote:

> This all depends on how the tokenizers take your URLs apart. To quickly
> see what ended up in the index, go to a core in the UI, select Schema
> Browser, select the field containing your URLs, click on "Load Term Info".
>
> In your case, for the field holding the URL you could try to switch to a
> tokenizer that defines tokens as a sequence of alphanumeric characters,
> roughly [a-z0-9]+ plus diacritics. In particular punctuation and separation
> characters like dash, underscore, slash, dot and the like would never be
> part of a token, i.e. they don't make a difference.
>
> Then you can search the url parts with a phrase query
> (https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-SpecifyingTermsfortheStandardQueryParserwhich)
> like
>
> url:"IAE-UPC-0001"
>
> In the same way as during indexing, the dashes are removed to end up with
> three tokens, namely IAE, UPC and 0001. Further they have to be in that
> order. Naturally this will then match anything like:
>
> "IAE_UPC_0001"
> "IAE UPC 0001"
> "IAE/UPC+0001"
> "IAE\UPC\0001"
> "IAE.UPC,0001"
>
> Depending on how your URLs are structured, there is the chance for false
> positives, of course.
>
> The Really Good Thing here is that you don't need to use wildcards.
>
> I have not yet looked at the wildcard-queries implementation in
> Solr/Lucene, but with the commercial search engines I know, they are a
> great way to lose the confidence of your users, because they just don't
> work as expected by anyone not knowing the implementation. Either they
> deliver only partial results or they kill the performance or they even go
> OOM. If Solr committers have not done something really ingenious,
> Solr/Lucene does have the same problems.
>
> Harald.
>
> On 31.07.2014 18:31, Paul Rogers wrote:
>
>> Hi Guys
>>
>> I have a Solr application searching on data uploaded by Nutch. The search
>> I wish to carry out is for a particular document reference contained
>> within the "url" field, e.g. IAE-UPC-0001.
>>
>> The problem is that the file names that comprise the URLs are not
>> consistent, so a url might contain the reference as IAE-UPC-0001 or
>> IAE_UPC_0001 (i.e. using either the minus or underscore as the delimiter)
>> but not both.
>>
>> I have created the query (in the Solr admin interface):
>>
>> url:"IAE-UPC-0001"
>>
>> which works (returning the single expected document), as do:
>>
>> url:"IAE*UPC*0001"
>> url:"IAE?UPC?0001"
>>
>> when the doc ref is in the format IAE-UPC-0001 (i.e. using the minus sign
>> as a delimiter).
>>
>> However:
>>
>> url:"IAE_UPC_0001"
>> url:"IAE*UPC*0001"
>> url:"IAE?UPC?0001"
>>
>> do not work (returning zero documents) when the doc ref is in the format
>> IAE_UPC_0001 (i.e. using the underscore character as the delimiter).
>>
>> I'm assuming the underscore is a special character, and I have tried
>> looking at the Solr wiki, but I can't find anything to say what the
>> problem is. Also, the minus sign has a specific meaning, but that is
>> nullified by adding the quotes.
>>
>> Can anyone suggest what I'm doing wrong?
>>
>> Many thanks
>>
>> Paul
>>
>
> --
> Harald Kirsch
> Raytion GmbH
> Kaiser-Friedrich-Ring 74
> 40547 Duesseldorf
> Fon +49 211 53883-216
> Fax +49-211-550266-19
> http://www.raytion.com
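
For reference, the tokenizer Harald describes (tokens as runs of [a-z0-9], with punctuation never part of a token) and the ordered phrase match can be sketched outside Solr. This is only a rough Python illustration under stated assumptions, not Solr's actual analysis chain: the names tokenize and phrase_match and the example URL are hypothetical, diacritics are left out, and lowercasing is assumed.

```python
import re

# Assumption: a token is a run of ASCII letters and digits, lowercased.
# The character class is spelled out because Python's \w also matches "_".
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Split text into lowercase alphanumeric tokens, dropping punctuation."""
    return TOKEN_RE.findall(text.lower())


def phrase_match(query, document):
    """Return True if the query tokens appear consecutively and in order."""
    q = tokenize(query)
    d = tokenize(document)
    if not q:
        return False
    # Slide a window of len(q) tokens over the document tokens.
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


if __name__ == "__main__":
    url = "http://example.com/docs/IAE_UPC_0001.pdf"  # hypothetical URL
    for query in ("IAE-UPC-0001", "IAE_UPC_0001", "IAE UPC 0001"):
        print(query, phrase_match(query, url))
```

With a tokenizer of this kind, the dash and underscore forms of the reference reduce to the same three tokens, so one phrase query would cover both. Whether the url field in a given schema behaves this way depends on the analyzers configured for it.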