NP, glad you're making forward progress!

Erick

On Mon, Aug 18, 2014 at 12:31 PM, Paul Rogers <paul.roge...@gmail.com> wrote:
> Hi Erick
>
> Thanks for the assist. Did as you suggested (tho' I used Nutch). Cleared
> out solr's index and Nutch's crawl DB and then emptied all the documents
> out of the web server bar 10 of each type (IAE-UPC-#### and IAE_UPC_####).
> Then crawled the site using Nutch.
>
> Then confirmed that all 20 docs had been uploaded and that *:* search
> returned all 20 docs.
>
> Now when I do a url search on either (for example) q=url:"IAE-UPC-220" or
> q="IAE_UPC_0001" I get a result returned for each as expected, ie it now
> works as expected.
>
> So seems I now need to figure out why Nutch isn't crawling the documents.
>
> Again many thanks.
>
> P
>
> On 18 August 2014 11:22, Erick Erickson <erickerick...@gmail.com> wrote:
>
>> I'd pull Nutch out of the mix here as a test. Create
>> some test docs (use the exampleDocs directory?) and
>> go from there at least long enough to ensure that Solr
>> does what you expect if the data gets there properly.
>>
>> You can set this up in about 10 minutes, and test it
>> in about 15 more. May save you endless hours.
>>
>> Because you're conflating two issues here:
>> 1> whether Nutch is sending the data
>> 2> whether Solr is indexing and searching as you expect.
>>
>> Some of the Solr/Lucene analysis chains do transformations
>> that may not be what you assume, particularly things
>> like StandardTokenizer and WordDelimiterFilterFactory.
>>
>> So I'd take the time to see that the values you're dealing
>> with are behaving as you expect. The admin/analysis page
>> will help you a _lot_ here.
>>
>> Best,
>> Erick
>>
>> On Mon, Aug 18, 2014 at 7:16 AM, Paul Rogers <paul.roge...@gmail.com> wrote:
>> > Hi Guys
>> >
>> > I've been checking into this further and have deleted the index a couple of
>> > times and rebuilt it with the suggestions you've supplied.
>> >
>> > I had a bit of an epiphany last week and decided to check if the document I
>> > was searching for was actually in the index (did this by doing a *:* query
>> > to a file and grep'ing for the 'IAE_UPC_0001' string). It seems it isn't!!
>> > Not sure if it was in the original index or not, tho' I suspect not.
>> >
>> > As far as I can see anything with the reference in the form IAE_UPC_####
>> > has not been indexed while those with the reference in the form
>> > IAE-UPC-#### have. Not sure if that's a coincidence or not.
>> >
>> > Need to see if I can get the docs into the index and then check if the
>> > search works or not. Will see if the guys on the Nutch list can shed any
>> > light.
>> >
>> > All the best.
>> >
>> > P
>> >
>> > On 4 August 2014 17:09, Jack Krupansky <j...@basetechnology.com> wrote:
>> >
>> >> The standard tokenizer treats underscore as a valid token character, not a
>> >> delimiter.
>> >>
>> >> The word delimiter filter will treat underscore as a delimiter though.
>> >>
>> >> Make sure your query-time WDF does not have preserveOriginal="1" - but the
>> >> index-time WDF should have preserveOriginal="1". Otherwise, the query
>> >> phrase will generate an extra token which will participate in the matching
>> >> and might cause a mismatch.
>> >>
>> >> -- Jack Krupansky
>> >>
>> >> -----Original Message----- From: Paul Rogers
>> >> Sent: Monday, August 4, 2014 5:55 PM
>> >> To: solr-user@lucene.apache.org
>> >> Subject: Re: How to search for phrase "IAE_UPC_0001"
>> >>
>> >> Hi Guys
>> >>
>> >> Thanks for the replies. I've had a look at the WordDelimiterFilterFactory
>> >> and the Term Info for the url field. It seems that all the terms exist and
>> >> I now understand that each url is being broken up using the delimiters
>> >> specified. But I think I'm still missing something.
>> >>
>> >> Am I correct in assuming the minus sign (-) is also a delimiter?
>> >>
>> >> If so why then does url:"IAE-UPC-0001" return a result (when the url
>> >> contains the substring IAE-UPC-0001) whereas url:"IAE_UPC_0001" doesn't
>> >> (when the url contains the substring IAE_UPC_0001)?
>> >>
>> >> Secondly if the url has indeed been broken into the terms IAE UPC and 0001
>> >> why do all the searches suggested or tried succeed when the delimiter is a
>> >> minus sign (-) but not when the delimiter is an underscore (_), returning
>> >> zero matches?
>> >>
>> >> Finally, shouldn't the query url:"IAE UPC 0001"~1 work since all it is
>> >> looking for is the three terms?
>> >>
>> >> Many thanks for any enlightenment.
>> >>
>> >> P
>> >>
>> >> On 4 August 2014 01:33, Harald Kirsch <harald.kir...@raytion.com> wrote:
>> >>
>> >>> This all depends on how the tokenizers take your URLs apart. To quickly
>> >>> see what ended up in the index, go to a core in the UI, select Schema
>> >>> Browser, select the field containing your URLs, click on "Load Term Info".
>> >>>
>> >>> In your case, for the field holding the URL you could try to switch to a
>> >>> tokenizer that defines tokens as a sequence of alphanumeric characters,
>> >>> roughly [a-z0-9]+ plus diacritics. In particular punctuation and separation
>> >>> characters like dash, underscore, slash, dot and the like would never be
>> >>> part of a token, i.e. they don't make a difference.
>> >>>
>> >>> Then you can search the url parts with a phrase query (
>> >>> https://cwiki.apache.org/confluence/display/solr/The+Standard+Query+Parser#TheStandardQueryParser-SpecifyingTermsfortheStandardQueryParserwhich) like
>> >>>
>> >>> url:"IAE-UPC-0001"
>> >>>
>> >>> In the same way as during indexing, the dashes are removed to end up with
>> >>> three tokens, namely IAE, UPC and 0001. Further they have to be in that
>> >>> order.
>> >>> Naturally this will then match anything like:
>> >>>
>> >>> "IAE_UPC_0001"
>> >>> "IAE UPC 0001"
>> >>> "IAE/UPC+0001"
>> >>> "IAE\UPC\0001"
>> >>> "IAE.UPC,0001"
>> >>>
>> >>> Depending on how your URLs are structured, there is a chance of false
>> >>> positives, of course.
>> >>>
>> >>> The Really Good Thing here is that you don't need to use wildcards.
>> >>>
>> >>> I have not yet looked at the wildcard-queries implementation in
>> >>> Solr/Lucene, but with the commercial search engines I know, they are a
>> >>> great way to lose the confidence of your users, because they just don't
>> >>> work as expected by anyone not knowing the implementation. Either they
>> >>> deliver only partial results or they kill the performance or they even go
>> >>> OOM. If Solr committers have not done something really ingenious,
>> >>> Solr/Lucene does have the same problems.
>> >>>
>> >>> Harald.
>> >>>
>> >>> On 31.07.2014 18:31, Paul Rogers wrote:
>> >>>
>> >>>> Hi Guys
>> >>>>
>> >>>> I have a Solr application searching on data uploaded by Nutch. The search
>> >>>> I wish to carry out is for a particular document reference contained within
>> >>>> the "url" field, e.g. IAE-UPC-0001.
>> >>>>
>> >>>> The problem is that the file names that comprise the URLs are not
>> >>>> consistent, so a url might contain the reference as IAE-UPC-0001 or
>> >>>> IAE_UPC_0001 (ie using either the minus or underscore as the delimiter) but
>> >>>> not both.
>> >>>>
>> >>>> I have created the query (in the solr admin interface):
>> >>>>
>> >>>> url:"IAE-UPC-0001"
>> >>>>
>> >>>> which works (returning the single expected document), as do:
>> >>>>
>> >>>> url:"IAE*UPC*0001"
>> >>>> url:"IAE?UPC?0001"
>> >>>>
>> >>>> when the doc ref is in the format IAE-UPC-0001 (ie using the minus sign as
>> >>>> a delimiter).
>> >>>>
>> >>>> However:
>> >>>>
>> >>>> url:"IAE_UPC_0001"
>> >>>> url:"IAE*UPC*0001"
>> >>>> url:"IAE?UPC?0001"
>> >>>>
>> >>>> do not work (returning zero documents) when the doc ref is in the format
>> >>>> IAE_UPC_0001 (ie using the underscore character as the delimiter).
>> >>>>
>> >>>> I'm assuming the underscore is a special character. I have tried looking
>> >>>> at the solr wiki but can't find anything to say what the problem is. The
>> >>>> minus sign also has a specific meaning, but that is nullified by adding the
>> >>>> quotes.
>> >>>>
>> >>>> Can anyone suggest what I'm doing wrong?
>> >>>>
>> >>>> Many thanks
>> >>>>
>> >>>> Paul
>> >>>>
>> >>> --
>> >>> Harald Kirsch
>> >>> Raytion GmbH
>> >>> Kaiser-Friedrich-Ring 74
>> >>> 40547 Duesseldorf
>> >>> Fon +49 211 53883-216
>> >>> Fax +49-211-550266-19
>> >>> http://www.raytion.com
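
For anyone following the thread later: Harald's suggestion (treat only runs of alphanumeric characters as tokens, then match the reference as a phrase) can be sketched in plain Python. This is not Solr or Lucene code and does not reproduce any real analyzer. The function names `tokenize` and `phrase_match`, the lowercasing step, the ASCII-only character class (no diacritics), and the example.com URLs are all assumptions made for illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase runs of ASCII letters and digits.

    Assumption: ASCII only; a real analyzer would also handle diacritics.
    Dash, underscore, slash, dot etc. never become part of a token.
    """
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]


def phrase_match(query, document):
    """Return True if the query tokens occur consecutively, in order,
    somewhere in the document tokens (a simple phrase match)."""
    q = tokenize(query)
    d = tokenize(document)
    if not q:
        return False
    return any(d[i:i + len(q)] == q for i in range(len(d) - len(q) + 1))


# Hypothetical URLs, one per delimiter style discussed in the thread.
urls = [
    "http://example.com/docs/IAE-UPC-0001.pdf",
    "http://example.com/docs/IAE_UPC_0001.pdf",
]

for url in urls:
    # The same query string is used for both delimiter styles.
    print(url, phrase_match("IAE-UPC-0001", url))
```

Both URLs print `True` here, because the dash and the underscore are dropped in the same way at "index" and "query" time. As Harald notes, the price is possible false positives: any URL containing the tokens iae, upc and 0001 in that order will match, whatever separates them.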