Re: Tika: url issue

2014-06-05 Thread Paul Rogers
Hi

Can you not split it using oracle's string functions (as part of your
select statement)?

Something along the lines of:

SELECT ...

SUBSTR(d.doc_name, 2, INSTR(d.doc_name, '#') - 2)  as Name,
 ^- (starting at position 2 strips the asterisk from the front)
...
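
To sanity-check the split before wiring it into the query, the same logic can be tried outside the database. This is only a rough Python illustration, not part of the DIH config: the function name first_path is made up here, and the sample value is the one quoted from the original post (assuming the asterisks really are part of the stored value).

```python
def first_path(raw: str) -> str:
    """Return the first path from a value stored as '*path#path#*'."""
    # Drop the surrounding asterisks, if present.
    cleaned = raw.strip("*")
    # Keep everything before the first '#'; partition() returns the
    # whole string unchanged when there is no '#'.
    head, _sep, _rest = cleaned.partition("#")
    return head


stored = r"*D:\Games\CS2\setup.doc#D:\Games\CS2\setup.doc#*"
print(first_path(stored))
```

If the asterisks turn out to be mail formatting rather than stored data, strip("*") is simply a no-op.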

Regards

P

On 4 June 2014 06:46, harshrossi  wrote:

> Hi,
>
>I am working on Solr using DataImportHandler for indexing rich documents
> like pdf, word, image etc.
> I am using TikaEntityProcessor for extracting contents from the files.
>
> I have one small issue regarding setting value to 'url' entry.
>
> My data-config.xml file is like so:
>
> 
>  driver="oracle.jdbc.OracleDriver"
> url="jdbc:oracle:thin:@KOR308051.bmh.apac.bosch.com:1521:xe"
> user="ezbdb"
> password="ezbdb"/>
>
> 
>
>
>
> 
>  query="SELECT
> d.doc_url as Link,
> d.doc_name as Name,
>
> cast(trunc(d.last_modified) as date) as Last_modified
> FROM doc_data d
> dataSource="db_ds"
> transformer="DateFormatTransformer,script:getFilePath">
> 
> 
>  xpath="/RDF/item/date"
> dateTimeFormat="-MM-dd HH:mm:ss"/>
>
>  processor="TikaEntityProcessor"
>   url="${db_link.LINK}" format="text"
> onError="skip">
>  
>
>
>
> 
> 
>
> The thing is, the file path is stored in a different pattern in the
> database:
> "doc_url" is the field in db which stores the url or file path. The file
> path is stored in this way:
>  *D:\Games\CS2\setup.doc#D:\Games\CS2\setup.doc#*
> i.e. the path is stored twice, separated by a '#'. I am not sure why it is
> done. It has been done by our client.
>
> All I need is only the one file path i.e. D:\Games\CS2\setup.doc
> I am passing the url value to tika as * url="${db_link.LINK}"
> *
> But the *${db_link.LINK}* contains the path coming from database directly.
> I have tried using script transformer and splitting the path string to
> parts
> by '#' and taking the first path using the method *getFilePath(row)* but no
> luck.
>
> I am still getting the path as stored in db. This gives a *FileNotFound*
> exception while trying to index it and that is obvious because the path is
> incorrect.
>
> What can be done to get only the path, leaving out the rest of the
> string from the '#' onwards?
>
> Help would be much appreciated :)
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Tika-url-issue-tp4139781.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


How to search for phrase "IAE_UPC_0001"

2014-07-31 Thread Paul Rogers
Hi Guys

I have a Solr application searching on data uploaded by Nutch.  The search
I wish to carry out is for a particular document reference contained within
the "url" field, e.g. IAE-UPC-0001.

The problem is that the file names that comprise the URLs are not
consistent, so a URL might contain the reference as IAE-UPC-0001 or
IAE_UPC_0001 (i.e. using either the minus or the underscore as the
delimiter) but not both.

I have created the query (in the solr admin interface):

url:"IAE-UPC-0001"

which works (returning the single expected document), as do:

url:"IAE*UPC*0001"
url:"IAE?UPC?0001"

when the doc ref is in the format IAE-UPC-0001 (ie using the minus sign as
a delimiter).

However:

url:"IAE_UPC_0001"
url:"IAE*UPC*0001"
url:"IAE?UPC?0001"

do not work (returning zero documents) when the doc ref is in the format
IAE_UPC_0001 (ie using the underscore character as the delimiter).

I'm assuming the underscore is a special character, but I have looked
at the Solr wiki and can't find anything to say what the problem is.  The
minus sign also has a specific meaning, but that is nullified by adding the
quotes.

Can anyone suggest what I'm doing wrong?

Many thanks

Paul
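
While the analysis of the url field is being looked into, one workaround is to search for both spellings explicitly. A minimal Python sketch follows; both_delimiters is a hypothetical helper, not a Solr API, and it assumes the reference only ever uses '-' or '_' as the separator.

```python
import re


def both_delimiters(reference: str, field: str = "url") -> str:
    """Build a query matching the reference with '-' or '_' delimiters."""
    # Split the reference on either delimiter, e.g. IAE-UPC-0001 -> 3 parts.
    parts = re.split(r"[-_]", reference)
    # Re-join with each delimiter, preserving order and dropping duplicates
    # (a reference without any delimiter yields a single clause).
    variants = list(dict.fromkeys(sep.join(parts) for sep in ("-", "_")))
    # Quote each variant so '-' is not parsed as an operator.
    return " OR ".join(f'{field}:"{v}"' for v in variants)


print(both_delimiters("IAE_UPC_0001"))
```

This only builds the query string; whether each clause matches still depends on how the field is analysed at index and query time.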


Re: How to search for phrase "IAE_UPC_0001"

2014-07-31 Thread Paul Rogers
Hi Erick

Thanks for the reply.  I'll have a look and see if it is any help.  Again
thanks for pointing me in the right direction.

regards

Paul


On 31 July 2014 11:58, Erick Erickson  wrote:

> Take a look at WordDelimiterFilterFactory. It has a bunch of
> options to allow this kind of thing to be indexed and searched.
>
> Note that in the default schema, the definition in the index part
> of the fieldType definition has slightly different parameters than
> the query time WordDelimiterFilterFactory, that's a good place
> to start.
>
> WARNING: WDFF is a bit complex, you _really_ would be well
> served by spending some time with the Admin/Analysis page to
> understand the effects of these parameters...
>
> Best,
> Erick


Re: How to search for phrase "IAE_UPC_0001"

2014-07-31 Thread Paul Rogers
Hi Jack

Thanks for the info. I'll take a look and see if I can figure it out (just
purchased the book).

P


On 31 July 2014 17:16, Jack Krupansky  wrote:

> And I have a lot more explanation and examples for word delimiter filter
> in my e-book:
> http://www.lulu.com/us/en/shop/jack-krupansky/solr-4x-deep-dive-early-access-release-7/ebook/product-21203548.html
>
> -- Jack Krupansky


Re: How to search for phrase "IAE_UPC_0001"

2014-08-04 Thread Paul Rogers
Hi Guys

Thanks for the replies.  I've had a look at the WordDelimiterFilterFactory
and the Term Info for the url field.  It seems that all the terms exist and
I now understand that each url is being broken up using the delimiters
specified.  But I think I'm still missing something.

Am I correct in assuming the minus sign (-) is also a delimiter?

If so why then does  url:"IAE-UPC-0001" return a result (when the url
contains the substring IAE-UPC-0001) whereas  url:"IAE_UPC_0001" doesn't
(when the url contains the substring IAE_UPC_0001)?

Secondly if the url has indeed been broken into the terms IAE UPC and 0001
why do all the searches suggested or tried succeed when the delimiter is a
minus sign (-) but not when the delimiter is an underscore (_), returning
zero matches?

Finally, shouldn't the query url:"IAE UPC 0001"~1 work since all it is
looking for is the three terms?

Many thanks for any enlightenment.

P




On 4 August 2014 01:33, Harald Kirsch  wrote:

> This all depends on how the tokenizers take your URLs apart. To quickly
> see what ended up in the index, go to a core in the UI, select Schema
> Browser, select the field containing your URLs, click on "Load Term Info".
>
> In your case, for the field holding the URL you could try to switch to a
> tokenizer that defines tokens as a sequence of alphanumeric characters,
> roughly [a-z0-9]+ plus diacritics. In particular punctuation and separation
> characters like dash, underscore, slash, dot and the like would never be
> part of a token, i.e. they don't make a difference.
>
> Then you can search the url parts with a phrase query (
> https://cwiki.apache.org/confluence/display/solr/The+
> Standard+Query+Parser#TheStandardQueryParser-
> SpecifyingTermsfortheStandardQueryParserwhich) like
>
>  url:"IAE-UPC-0001"
>
> In the same way as during indexing, the dashes are removed to end up with
> three tokens, namely IAE, UPC and 0001. Further they have to be in that
> order. Naturally this will then match anything like:
>
>   "IAE_UPC_0001"
>   "IAE UPC 0001"
>   "IAE/UPC+0001"
>   "IAE\UPC\0001"
>   "IAE.UPC,0001"
>
> Depending on how your URLs are structured, there is the chance for false
> positives, of course.
>
> The really good thing here is that you don't need to use wildcards.
>
> I have not yet looked at the wildcard-query implementation in
> Solr/Lucene, but with the commercial search engines I know, they are a
> great way to lose the confidence of your users, because they just don't
> work as expected by anyone not knowing the implementation. Either they
> deliver only partial results or they kill the performance or they even go
> OOM. Unless the Solr committers have done something really ingenious,
> Solr/Lucene has the same problems.
>
> Harald.
>
> --
> Harald Kirsch
> Raytion GmbH
> Kaiser-Friedrich-Ring 74
> 40547 Duesseldorf
> Fon +49 211 53883-216
> Fax +49-211-550266-19
> http://www.raytion.com
>
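
The tokenization Harald describes can be imitated in a few lines of Python to see why the delimiter stops mattering. This is only a sketch of the idea, not Solr's analysis chain: alnum_tokens and phrase_match are made-up names, the URL is hypothetical, and diacritics are ignored.

```python
import re


def alnum_tokens(text: str) -> list:
    """Split text into lowercase runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def phrase_match(query: str, indexed: str) -> bool:
    """True if the query tokens occur consecutively, in order, in indexed."""
    q, doc = alnum_tokens(query), alnum_tokens(indexed)
    if not q:
        return False
    # Slide a window of len(q) tokens over the indexed tokens.
    return any(doc[i:i + len(q)] == q for i in range(len(doc) - len(q) + 1))


url = "http://example.com/docs/IAE_UPC_0001.pdf"  # hypothetical URL
print(phrase_match("IAE-UPC-0001", url))
```

The same function also matches a path like /IAE/UPC/0001/, which is the kind of false positive mentioned above.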


Re: How to search for phrase "IAE_UPC_0001"

2014-08-18 Thread Paul Rogers
Hi Guys

I've been checking into this further and have deleted the index a couple of
times and rebuilt it with the suggestions you've supplied.

I had a bit of an epiphany last week and decided to check if the document I
was searching for was actually in the index (did this by writing the output
of a *:* query to a file and grep'ing for the 'IAE_UPC_0001' string).  It
seems it isn't!!  Not sure if it was in the original index or not, tho' I
suspect not.

As far as I can see anything with the reference in the form IAE_UPC_
has not been indexed while those with the reference in the form
IAE-UPC- have.  Not sure if that's a coincidence or not.

Need to see if I can get the docs into the index and then check if the
search works or not.  Will see if the guys on the Nutch list can shed any
light.

All the best.

P
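
For anyone wanting to repeat the check without grep, something along these lines would do the same job on a saved JSON response. It is a sketch only: urls_containing is a made-up helper, the sample data is invented, and it assumes the field is called url. Note that Solr returns only 10 rows by default, so a real dump needs a rows value large enough to cover the index.

```python
def urls_containing(solr_response: dict, needle: str, field: str = "url") -> list:
    """Return values of `field` that contain `needle` from a Solr JSON response."""
    docs = solr_response.get("response", {}).get("docs", [])
    hits = []
    for doc in docs:
        value = doc.get(field, "")
        # A multi-valued field arrives as a list; normalise to a list.
        values = value if isinstance(value, list) else [value]
        hits.extend(v for v in values if needle in str(v))
    return hits


# Hypothetical response, shaped like the output of q=*:*&fl=url&wt=json.
sample = {
    "response": {
        "numFound": 2,
        "start": 0,
        "docs": [
            {"url": "http://example.com/docs/IAE-UPC-0001.pdf"},
            {"url": "http://example.com/docs/IAE-UPC-0002.pdf"},
        ],
    }
}
print(urls_containing(sample, "IAE_UPC_0001"))
```

An empty list here means the document never reached the index, which is a crawl problem rather than a query problem.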


On 4 August 2014 17:09, Jack Krupansky  wrote:

> The standard tokenizer treats underscore as a valid token character, not a
> delimiter.
>
> The word delimiter filter will treat underscore as a delimiter though.
>
> Make sure your query-time WDF does not have preserveOriginal="1" - but the
> index-time WDF should have preserveOriginal="1". Otherwise, the query
> phrase will generate an extra token which will participate in the matching
> and might cause a mismatch.
>
> -- Jack Krupansky
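
Jack's point about the two stages can be pictured with a rough Python imitation. This is not Solr code and the function names are invented; the real filter has many more rules (case changes, letter/number boundaries, token positions) that are ignored here. It only shows where the underscore gets split and what preserveOriginal adds.

```python
import re


def standard_tokens(text: str) -> list:
    # Rough stand-in for the standard tokenizer: '_' stays inside a token
    # (Python's \w also includes '_'), while '-' ends one.
    return re.findall(r"\w+", text)


def word_delimiter(tokens: list, preserve_original: bool) -> list:
    # Rough stand-in for the word delimiter filter: split each token on '_'
    # and optionally keep the unsplit token as well.
    out = []
    for tok in tokens:
        parts = [p for p in tok.split("_") if p]
        if preserve_original and parts != [tok]:
            out.append(tok)
        out.extend(parts)
    return out


print(standard_tokens("IAE-UPC-0001"))
print(standard_tokens("IAE_UPC_0001"))
print(word_delimiter(standard_tokens("IAE_UPC_0001"), preserve_original=False))
print(word_delimiter(standard_tokens("IAE_UPC_0001"), preserve_original=True))
```

With preserve_original on at query time the phrase gains the extra unsplit token, which is the mismatch risk described above.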

Re: How to search for phrase "IAE_UPC_0001"

2014-08-18 Thread Paul Rogers
Hi Erick

Thanks for the assist.  Did as you suggested (tho' I used Nutch).  Cleared
out solr's index and Nutch's crawl DB and then emptied all the documents
out of the web server bar 10 of each type (IAE-UPC- and IAE_UPC_).
 Then crawled the site using Nutch.

Then confirmed that all 20 docs had been uploaded and that a *:* search
returned all 20 docs.

Now when I do a url search on either (for example) q=url:"IAE-UPC-220" or
q=url:"IAE_UPC_0001" I get a result returned for each, i.e. it now
works as expected.

So seems I now need to figure out why Nutch isn't crawling the documents.

Again many thanks.

P




On 18 August 2014 11:22, Erick Erickson  wrote:

> I'd pull Nutch out of the mix here as a test. Create
> some test docs (use the exampleDocs directory?) and
> go from there at least long enough to ensure that Solr
> does what you expect if the data gets there properly.
>
> You can set this up in about 10 minutes, and test it
> in about 15 more. May save you endless hours.
>
> Because you're conflating two issues here:
> 1> whether Nutch is sending the data
> 2> whether Solr is indexing and searching as you expect.
>
> Some of the Solr/Lucene analysis chains do transformations
> that may not be what you assume, particularly things
> like StandardTokenizer and WordDelimiterFilterFactory.
>
> So I'd take the time to see that the values you're dealing
> with are behaving as you expect. The admin/analysis page
> will help you a _lot_ here.
>
> Best,
> Erick
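
A small set of test documents for this kind of isolation test could be generated like this. It is a sketch under assumptions: build_add_xml is a made-up helper, the URLs are invented, and the field names id and url are guesses that need to match the schema actually in use. The output is Solr's XML update format (an add element containing doc elements).

```python
import xml.etree.ElementTree as ET


def build_add_xml(urls: list) -> str:
    """Build a Solr XML <add> message with one <doc> per URL."""
    add = ET.Element("add")
    for url in urls:
        doc = ET.SubElement(add, "doc")
        # 'id' and 'url' are assumed field names; adjust to the schema in use.
        for name in ("id", "url"):
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = url
    return ET.tostring(add, encoding="unicode")


test_urls = [
    "http://example.com/docs/IAE-UPC-0001.pdf",
    "http://example.com/docs/IAE_UPC_0001.pdf",
]
print(build_add_xml(test_urls))
```

Posting the result to the update handler and committing takes Nutch out of the picture, so any remaining mismatch is down to the Solr analysis chain.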

Problem with Solr and Nutch integration

2011-02-27 Thread Paul Rogers
Hi Guys

I'm trying to integrate solr and nutch as per
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/, using the
branch_3x from svn on Tomcat 6.  After adding the "nutch" requestHandler to
solrconfig.xml, the solr-example starts, but on accessing the admin page
I get the following error message:

HTTP Status 404 - missing core name in path
--

*type* Status report

*message* *missing core name in path*

*description* *The requested resource (missing core name in path) is not
available.*

After  googling I found the following:

I think you are seeing the effects of SOLR-1743 masking another error ...
have you checked your log for other errors/exceptions being logged when
you startup solr with that solrconfig.xml?

Chris Hostetter

My initial question is this, am I right in thinking that solr should be
using the Tomcat 6 logging and that logs should be found at
$CATALINA_HOME/logs/catalina.out?

Thanks

Paul


Re: Problem with Solr and Nutch integration

2011-02-28 Thread Paul Rogers
Hi Anurag

Sorry for missing that key piece of info out.  I'm running Linux (Centos
5.5).

Regards

Paul

On 28 February 2011 07:26, Anurag  wrote:

> Which OS are you using?
>


Re: Problem with Solr and Nutch integration

2011-02-28 Thread Paul Rogers
Hi Anurag

Thanks for the prompt reply.

I'm following the tutorial at

http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/

I have built solr and the example and added it to Tomcat as per

http://wiki.apache.org/solr/SolrTomcat

and this (solr-example) all appears to work fine (I can access the solr
admin page at 
http://localhost:8080/solr-example/admin/
and
search using the same).

I have copied the nutch schema.xml across and replaced the example one.
 Again everything seems to work fine.

However when I add the request handler:



dismax
explicit
0.01

content^0.5 anchor^1.0 title^1.2


content^0.5 anchor^1.5 title^1.2 site^1.5


url


2<-1 5<-2 6<90%

100

*:*
title url content
0
title
0
url
regex




and restart the solr-example app under tomcat I get the following error:

HTTP Status 404 - missing core name in path
--

*type* Status report

*message* *missing core name in path*

*description* *The requested resource (missing core name in path) is not
available.*

As soon as I comment out the request handler the example appears to work
again.

From the previously mentioned post I understand that this error is masking the
actual error and I need to check the logs.  However I'm unsure exactly where
these are located.

I was hoping that if I could post them it would allow you guys to suggest a
solution.

Many thanks


Paul

On 28 February 2011 11:37, Anurag  wrote:

> Solr uses the Jetty server by default, do you know that? You can run the
> Solr server without using Tomcat (using the Jetty server).
> Please describe the steps that led to the error. Which command did you
> execute?
>
>


Re: Problem with Solr and Nutch integration

2011-03-01 Thread Paul Rogers
Hi Anurag

The request handler has been added to the solrconfig file.

I'll try your attached requesthandler and see if that helps.

Interestingly enough the whole setup worked when I was using nutch 1.2/solr
1.4.1.  It is only since moving to nutch trunk/solr branch_3x that the
problem has occurred.  I assume that something has changed in between and
the tutorial's request handler is incorrect for the later solr version.
Which versions of solr/nutch are you using?

Assuming the catalina.out file is the correct log file, the output I get is
shown below.  This output occurs on restarting the solr-example after adding
the new requesthandler.  When I access the solr admin page no additional
logging occurs.  Can anyone see the problem?

Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
locateSolrHome

INFO: Using JNDI solr.home: /opt/solr/example/solr

Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader 

INFO: Solr home set to '/opt/solr/example/solr/'

Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
addToClassLoader

SEVERE: Can't find (or read) file to add to classloader:
/opt/solr/example/solr/./lib

Feb 28, 2011 6:28:59 PM org.apache.solr.servlet.SolrDispatchFilter init

INFO: SolrDispatchFilter.init()

Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
locateSolrHome

INFO: Using JNDI solr.home: /opt/solr/example/solr

Feb 28, 2011 6:28:59 PM org.apache.solr.core.CoreContainer$Initializer
initialize
INFO: looking for solr.xml: /opt/solr/example/solr/solr.xml

Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
locateSolrHome

INFO: Using JNDI solr.home: /opt/solr/example/solr

Feb 28, 2011 6:28:59 PM org.apache.solr.core.CoreContainer 

INFO: New CoreContainer: solrHome=/opt/solr/example/solr/ instance=6794958

Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader 

INFO: Solr home set to '/opt/solr/example/solr/'

Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
addToClassLoader

SEVERE: Can't find (or read) file to add to classloader:
/opt/solr/example/solr/./lib

Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader 

INFO: Solr home set to '/opt/solr/example/solr/./'

Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
addToClassLoader

SEVERE: Can't find (or read) file to add to classloader:
/opt/solr/example/solr/././lib

Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrConfig initLibs

INFO: Adding specified lib dirs to ClassLoader

Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding
'file:/opt/solr/contrib/extraction/lib/commons-compress-1.1.jar' to
classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding 'file:/opt/solr/contrib/extraction/lib/log4j-1.2.14.jar' to
classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding
'file:/opt/solr/contrib/extraction/lib/commons-logging-1.1.1.jar' to
classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding 'file:/opt/solr/contrib/extraction/lib/tika-parsers-0.8.jar' to
classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding 'file:/opt/solr/contrib/extraction/lib/asm-3.1.jar' to
classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding 'file:/opt/solr/contrib/extraction/lib/icu4j-4_6.jar' to
classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding 'file:/opt/solr/contrib/extraction/lib/xercesImpl-2.8.1.jar' to
classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding 'file:/opt/solr/contrib/extraction/lib/bcmail-jdk15-1.45.jar'
to classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding 'file:/opt/solr/contrib/extraction/lib/fontbox-1.3.1.jar' to
classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding 'file:/opt/solr/contrib/extraction/lib/poi-3.7.jar' to
classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding 'file:/opt/solr/contrib/extraction/lib/dom4j-1.6.1.jar' to
classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding
'file:/opt/solr/contrib/extraction/lib/geronimo-stax-api_1.0_spec-1.0.1.jar'
to classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding 'file:/opt/solr/contrib/extraction/lib/poi-ooxml-3.7.jar' to
classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding 'file:/opt/solr/contrib/extraction/lib/xml-apis-1.0.b2.jar' to
classloader
Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
replaceClassLoader

INFO: Adding
'file:/
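
With output this long it can help to pull out just the SEVERE entries. A possible Python sketch is below; severe_messages is a made-up helper, and it assumes the two-line java.util.logging layout seen above (a timestamp line, then a line starting with the level), including lines wrapped by the mail client.

```python
import re

LEVEL = re.compile(r"^(SEVERE|WARNING|INFO|CONFIG|FINE|FINER|FINEST):")
TIMESTAMP = re.compile(r"^[A-Z][a-z]{2} \d{1,2}, \d{4} ")


def severe_messages(log_text: str) -> list:
    """Collect SEVERE messages, joining wrapped continuation lines."""
    messages, current = [], None
    for line in log_text.splitlines():
        line = line.strip()
        if line.startswith("SEVERE:"):
            current = [line]
            messages.append(current)
        elif not line or LEVEL.match(line) or TIMESTAMP.match(line):
            current = None  # blank line or next entry: the message has ended
        elif current is not None:
            current.append(line)  # wrapped continuation of a SEVERE message
    return [" ".join(parts) for parts in messages]


sample = """Feb 28, 2011 6:28:59 PM org.apache.solr.core.SolrResourceLoader
addToClassLoader

SEVERE: Can't find (or read) file to add to classloader:
/opt/solr/example/solr/./lib

Feb 28, 2011 6:28:59 PM org.apache.solr.servlet.SolrDispatchFilter init

INFO: SolrDispatchFilter.init()
"""
for message in severe_messages(sample):
    print(message)
```

Reading the real file would just mean passing the contents of catalina.out to the same function.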

Problem adding new requesthandler to solr branch_3x

2011-03-04 Thread Paul Rogers
Dear All

Following on from:

http://lucene.472066.n3.nabble.com/Problem-with-Solr-and-Nutch-integration-tp2590334p2601915.html

I'm trying to add a new request handler to solr (the branch_3x checked
out from svn).  The request handler is as follows:

  
    
       dismax
       explicit
       0.01
       
          content^0.5 anchor^1.0 title^1.2
       
       
          content^0.5 anchor^1.5 title^1.2 site^1.5
       
       
          url
       
       
          2<-1 5<-2 6<90%
       
       100
       
       *:*
       title url content
       0
       title
       0
       url
       regex
     
  

This causes the solr-example (under Tomcat version 6) to fail on
startup and prevents me from accessing the solr admin screen.  The
same request handler works fine under the solr-example provided with
solr 1.4.1.  Through trial and error I have discovered the problem is
with the line:



If this is amended to read:

true

the solr-example starts fine.

Can anyone explain:

1.  Why the problem occurs (has something changed between 1.4.1 and 3x)?
2.  Is the amended statement (true) the same
(equivalent) to the original ()?

Many thanks

Regards

Paul


Re: Problem adding new requesthandler to solr branch_3x

2011-03-05 Thread Paul Rogers
Koji

many thanks for that.

regards

Paul

On 5 March 2011 00:12, Koji Sekiguchi  wrote:

> 
>>
>> If this amended to read:
>>
>> true
>>
>> the solr-example starts fine.
>>
>
> Paul,
>
> It should be true.
>
> Koji
> --
> http://www.rondhuit.com/en/
>


Re: Problem adding new requesthandler to solr branch_3x

2011-03-09 Thread Paul Rogers
Hoss

many thanks for the reply

Paul

On 8 March 2011 19:45, Chris Hostetter  wrote:
>
> : 1.  Why the problem occurs (has something changed between 1.4.1 and 3x)?
>
> Various pieces of code dealing with config parsing have changed since
> 1.4.1 to be better about verifying that configs are meaningful, and
> reporting errors when unexpected things are encountered.  I'm not sure of
> the specific change, but the underlying point is: if 1.4.1 wasn't giving
> you an error for that syntax, it's because it was completely ignoring it.
>
>
> -Hoss


Re: Trying to Post. Emails rejected as spam.

2011-04-07 Thread Paul Rogers
Hi Park

I had the same problem.  I noticed that one of the issues with the blocked
messages is that they are HTML/Rich Text:

(FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,FS_REPLICA,
HTML_MESSAGE,  <-
RCVD_IN_DNSWL_NONE,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL)

In GMail I can switch to plain text.  This fixed the problem for me.
If you can do the same in Yahoo you should find it reduces the spam
score sufficiently to allow the messages through.

Regards

Paul

On 7 April 2011 20:21, Ezequiel Calderara  wrote:
>
> Happened to me a couple of times, couldn't find a workaround...
>
> On Thu, Apr 7, 2011 at 4:14 PM, Parker Johnson  wrote:
>
> >
> > Hello everyone.  Does anyone else have problems posting to the list?  My
> > messages keep getting rejected with this response below.  I'll be surprised
> > if this one makes it through :)
> >
> > -Park
> >
> > Sorry, we were unable to deliver your message to the following address.
> >
> > :
> > Remote  host said: 552 spam score (8.0) exceeded threshold
> >
> > (FREEMAIL_ENVFROM_END_DIGIT,FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_NONE,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
> >  ) [BODY]
> >
> > --- Below this line is a copy of the message.
> >
>
>
>
> --
> __
> Ezequiel.
>
> Http://www.ironicnet.com