Hi Erick

Thanks for the assist. Did as you suggested (tho' I used Nutch). Cleared out Solr's index and Nutch's crawl DB and then emptied all the documents out of the web server bar 10 of each type (IAE-UPC-#### and IAE_UPC_####). Then crawled the site using Nutch.
Then confirmed that all 20 docs had been uploaded and that a *:* search returned all 20 docs. Now when I do a url search on either (for example) q=url:"IAE-UPC-220" or q="IAE_UPC_0001" I get a result returned for each, ie it now works as expected.

So it seems I now need to figure out why Nutch isn't crawling the documents. Again many thanks.

P

On 18 August 2014 11:22, Erick Erickson <erickerick...@gmail.com> wrote:

> I'd pull Nutch out of the mix here as a test. Create
> some test docs (use the exampleDocs directory?) and
> go from there at least long enough to ensure that Solr
> does what you expect if the data gets there properly.
>
> You can set this up in about 10 minutes, and test it
> in about 15 more. May save you endless hours.
>
> Because you're conflating two issues here:
> 1> whether Nutch is sending the data
> 2> whether Solr is indexing and searching as you expect.
>
> Some of the Solr/Lucene analysis chains do transformations
> that may not be what you assume, particularly things
> like StandardTokenizer and WordDelimiterFilterFactory.
>
> So I'd take the time to see that the values you're dealing
> with are behaving as you expect. The admin/analysis page
> will help you a _lot_ here.
>
> Best,
> Erick
>
> On Mon, Aug 18, 2014 at 7:16 AM, Paul Rogers <paul.roge...@gmail.com> wrote:
> > Hi Guys
> >
> > I've been checking into this further and have deleted the index a couple of
> > times and rebuilt it with the suggestions you've supplied.
> >
> > I had a bit of an epiphany last week and decided to check if the document I
> > was searching for was actually in the index (did this by doing a *:* query
> > to a file and grep'ing for the 'IAE_UPC_0001' string). It seems it isn't!!
> > Not sure if it was in the original index or not, tho' I suspect not.
> >
> > As far as I can see anything with the reference in the form IAE_UPC_####
> > has not been indexed while those with the reference in the form
> > IAE-UPC-#### have.
> > Not sure if that's a coincidence or not.
> >
> > Need to see if I can get the docs into the index and then check if the
> > search works or not. Will see if the guys on the Nutch list can shed any
> > light.
> >
> > All the best.
> >
> > P
> >
> > On 4 August 2014 17:09, Jack Krupansky <j...@basetechnology.com> wrote:
> >
> >> The standard tokenizer treats underscore as a valid token character, not a
> >> delimiter.
> >>
> >> The word delimiter filter will treat underscore as a delimiter though.
> >>
> >> Make sure your query-time WDF does not have preserveOriginal="1" - but the
> >> index-time WDF should have preserveOriginal="1". Otherwise, the query
> >> phrase will generate an extra token which will participate in the matching
> >> and might cause a mismatch.
> >>
> >> -- Jack Krupansky
> >>
> >> -----Original Message----- From: Paul Rogers
> >> Sent: Monday, August 4, 2014 5:55 PM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: How to search for phrase "IAE_UPC_0001"
> >>
> >> Hi Guys
> >>
> >> Thanks for the replies. I've had a look at the WordDelimiterFilterFactory
> >> and the Term Info for the url field. It seems that all the terms exist and
> >> I now understand that each url is being broken up using the delimiters
> >> specified. But I think I'm still missing something.
> >>
> >> Am I correct in assuming the minus sign (-) is also a delimiter?
> >>
> >> If so why then does url:"IAE-UPC-0001" return a result (when the url
> >> contains the substring IAE-UPC-0001) whereas url:"IAE_UPC_0001" doesn't
> >> (when the url contains the substring IAE_UPC_0001)?
> >>
> >> Secondly if the url has indeed been broken into the terms IAE UPC and 0001
> >> why do all the searches suggested or tried succeed when the delimiter is a
> >> minus sign (-) but not when the delimiter is an underscore (_), returning
> >> zero matches?
> >>
> >> Finally, shouldn't the query url:"IAE UPC 0001"~1 work since all it is
> >> looking for is the three terms?
> >>
> >> Many thanks for any enlightenment.
> >>
> >> P
> >>
> >> On 4 August 2014 01:33, Harald Kirsch <harald.kir...@raytion.com> wrote:
> >>
> >>> This all depends on how the tokenizers take your URLs apart. To quickly
> >>> see what ended up in the index, go to a core in the UI, select Schema
> >>> Browser, select the field containing your URLs, click on "Load Term Info".
> >>>
> >>> In your case, for the field holding the URL you could try to switch to a
> >>> tokenizer that defines tokens as a sequence of alphanumeric characters,
> >>> roughly [a-z0-9]+ plus diacritics. In particular punctuation and
> >>> separation characters like dash, underscore, slash, dot and the like would
> >>> never be part of a token, i.e. they don't make a difference.
> >>>
> >>> Then you can search the url parts with a phrase query (
> >>> https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-SpecifyingTermsfortheStandardQueryParserwhich)
> >>> like
> >>>
> >>> url:"IAE-UPC-0001"
> >>>
> >>> In the same way as during indexing, the dashes are removed to end up with
> >>> three tokens, namely IAE, UPC and 0001. Further they have to be in that
> >>> order. Naturally this will then match anything like:
> >>>
> >>> "IAE_UPC_0001"
> >>> "IAE UPC 0001"
> >>> "IAE/UPC+0001"
> >>> "IAE\UPC\0001"
> >>> "IAE.UPC,0001"
> >>>
> >>> Depending on how your URLs are structured, there is the chance for false
> >>> positives, of course.
> >>>
> >>> The Really Good Thing here is, that you don't need to use wildcards.
> >>>
> >>> I have not yet looked at the wildcard-queries implementation in
> >>> Solr/Lucene, but with the commercial search engines I know, they are a
> >>> great way to lose the confidence of your users, because they just don't
> >>> work as expected by anyone not knowing the implementation. Either they
> >>> deliver only partial results or they kill the performance or they even go
> >>> OOM. If Solr committers have not done something really ingenious,
> >>> Solr/Lucene does have the same problems.
> >>>
> >>> Harald.
> >>>
> >>> On 31.07.2014 18:31, Paul Rogers wrote:
> >>>
> >>>> Hi Guys
> >>>>
> >>>> I have a Solr application searching on data uploaded by Nutch. The search
> >>>> I wish to carry out is for a particular document reference contained within
> >>>> the "url" field, e.g. IAE-UPC-0001.
> >>>>
> >>>> The problem is that the file names that comprise the urls are not
> >>>> consistent, so a url might contain the reference as IAE-UPC-0001 or
> >>>> IAE_UPC_0001 (ie using either the minus or underscore as the delimiter)
> >>>> but not both.
> >>>>
> >>>> I have created the query (in the solr admin interface):
> >>>>
> >>>> url:"IAE-UPC-0001"
> >>>>
> >>>> which works (returning the single expected document), as do:
> >>>>
> >>>> url:"IAE*UPC*0001"
> >>>> url:"IAE?UPC?0001"
> >>>>
> >>>> when the doc ref is in the format IAE-UPC-0001 (ie using the minus sign as
> >>>> a delimiter).
> >>>>
> >>>> However:
> >>>>
> >>>> url:"IAE_UPC_0001"
> >>>> url:"IAE*UPC*0001"
> >>>> url:"IAE?UPC?0001"
> >>>>
> >>>> do not work (returning zero documents) when the doc ref is in the format
> >>>> IAE_UPC_0001 (ie using the underscore character as the delimiter).
> >>>>
> >>>> I'm assuming the underscore is a special character but have tried looking
> >>>> at the solr wiki but can't find anything to say what the problem is.
> >>>> Also the minus sign has a specific meaning but is nullified by adding the
> >>>> quotes.
> >>>>
> >>>> Can anyone suggest what I'm doing wrong?
> >>>>
> >>>> Many thanks
> >>>>
> >>>> Paul
> >>>>
> >>> --
> >>> Harald Kirsch
> >>> Raytion GmbH
> >>> Kaiser-Friedrich-Ring 74
> >>> 40547 Duesseldorf
> >>> Fon +49 211 53883-216
> >>> Fax +49-211-550266-19
> >>> http://www.raytion.com
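
For anyone following the tokenizer discussion in this thread, here is a minimal Python sketch of the two behaviours described: a tokenizer that keeps only runs of letters and digits (Harald's suggestion), and one that keeps the underscore inside a token (as Jack describes for the standard tokenizer). This is not Solr or Lucene code. The function names and the example.com URLs are made up for illustration, and the `\w+` regex is only a rough stand-in for a tokenizer that treats underscore as a token character.

```python
import re


def alnum_tokens(text):
    """Tokens are runs of ASCII letters/digits; every other character is a delimiter."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def underscore_keeping_tokens(text):
    """Rough stand-in (assumption): '_' stays inside a token, '-' still splits."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def phrase_match(query_tokens, doc_tokens):
    """True if query_tokens appear contiguously and in order within doc_tokens."""
    n = len(query_tokens)
    return any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


# Hypothetical URLs, one per naming style mentioned in the thread.
url_dash = "http://example.com/docs/IAE-UPC-0001.pdf"
url_under = "http://example.com/docs/IAE_UPC_0001.pdf"

# With alphanumeric-only tokens, either delimiter gives IAE, UPC, 0001,
# so a phrase query in one style matches a URL in the other style.
print(phrase_match(alnum_tokens("IAE-UPC-0001"), alnum_tokens(url_under)))  # True
print(phrase_match(alnum_tokens("IAE_UPC_0001"), alnum_tokens(url_dash)))   # True

# If '_' is kept inside tokens, the underscore form is one single token,
# so a dash-style query no longer matches an underscore-style URL.
print(underscore_keeping_tokens(url_under))
print(phrase_match(underscore_keeping_tokens("IAE-UPC-0001"),
                   underscore_keeping_tokens(url_under)))  # False
```

This only illustrates why the choice of tokenizer and filters matters for mixed delimiters. As the top of the thread shows, the zero results Paul saw came from the underscore-named documents not being in the index at all, which is why checking the index contents first (Schema Browser term info or the admin/analysis page) is worth doing before changing the analysis chain.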