Re: SolrCloud Feedback

2011-02-10 Thread Thorsten Scherler

Hi Mark, hi all,

I just got a customer request to conduct an analysis on the state of
SolrCloud. 

He wants to see SolrCloud as part of the next Solr 1.5 release and is willing
to sponsor our dev time to close outstanding bugs and open issues that might
prevent the inclusion of SolrCloud in the next release. I need to give him a
list of issues and an estimate of how long it will take us to fix them.

I ran
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+SOLR+AND+(summary+~+cloud+OR+description+~+cloud+OR+comment+~+cloud)+AND+resolution+%3D+Unresolved
which returns 8 bugs. Do you consider this a comprehensive list of open
issues, or are some important ones missing?

I read http://wiki.apache.org/solr/SolrCloud and it talks about a
branch of its own; however, when I review
https://issues.apache.org/jira/browse/SOLR-1873 I get the impression that
the work has already been merged back into trunk, right?

So what is the best starting point for testing: the branch or trunk?

TIA for any information

salu2
-- 
Thorsten Scherler 
codeBusters S.L. - web based systems

http://www.codebusters.es/


SolrCloud - Example C not working

2011-02-14 Thread Thorsten Scherler
Hi all,

I followed http://wiki.apache.org/solr/SolrCloud and everything worked
fine till I tried "Example C:".

I start all 4 servers but all of them keep looping through:

"java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn
$SendThread.run(ClientCnxn.java:1078)
Feb 14, 2011 1:31:16 PM org.apache.log4j.Category info
INFO: Opening socket connection to server localhost/127.0.0.1:9983
Feb 14, 2011 1:31:16 PM org.apache.log4j.Category warn
WARNING: Session 0x0 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn
$SendThread.run(ClientCnxn.java:1078)
Feb 14, 2011 1:31:16 PM org.apache.log4j.Category info
INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9900
Feb 14, 2011 1:31:16 PM org.apache.log4j.Category warn
WARNING: Session 0x0 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn
$SendThread.run(ClientCnxn.java:1078)
Feb 14, 2011 1:31:17 PM org.apache.log4j.Category info
INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9983
Feb 14, 2011 1:31:17 PM org.apache.log4j.Category warn
WARNING: Session 0x0 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn
$SendThread.run(ClientCnxn.java:1078)
Feb 14, 2011 1:31:19 PM org.apache.log4j.Category info
INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:8574
Feb 14, 2011 1:31:19 PM org.apache.log4j.Category warn
WARNING: Session 0x0 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn
$SendThread.run(ClientCnxn.java:1078)
Feb 14, 2011 1:31:20 PM org.apache.log4j.Category info
INFO: Opening socket connection to server localhost/127.0.0.1:8574
Feb 14, 2011 1:31:20 PM org.apache.log4j.Category warn
WARNING: Session 0x0 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn
$SendThread.run(ClientCnxn.java:1078)

The problem seems to be that the ZooKeeper instances cannot connect to
each other, and so the ensemble never comes up at all.
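
In case it helps to narrow this down: a quick way to check which of the
embedded ZooKeeper servers is actually listening is to send the standard
"ruok" four-letter command to each client port seen in the log above. This
is just a diagnostic sketch I hacked together, not part of the wiki example:

import socket

# ZooKeeper client ports from Example C, as seen in the log loop above.
ZK_PORTS = [9983, 8574, 9900]

def ruok(host, port, timeout=2.0):
    """Send ZooKeeper's 'ruok' command; a live server answers 'imok'."""
    try:
        s = socket.create_connection((host, port), timeout)
    except socket.error as e:
        return "connection failed: %s" % e
    try:
        s.sendall(b"ruok")
        return s.recv(16).decode("ascii") or "no answer"
    finally:
        s.close()

for port in ZK_PORTS:
    print("localhost:%d -> %s" % (port, ruok("localhost", port)))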

I am using revision 1070473 for the tests. Does anybody have an idea?

salu2
-- 
Thorsten Scherler 
codeBusters S.L. - web based systems

http://www.codebusters.es/




Re: SolrCloud - Example C not working

2011-02-15 Thread Thorsten Scherler
Hmm, nobody has an idea? Does Example C work fine for everybody else?

salu2

On Mon, 2011-02-14 at 14:08 +0100, Thorsten Scherler wrote:
> Hi all,
> 
> I followed http://wiki.apache.org/solr/SolrCloud and everything worked
> fine till I tried "Example C:".
> 
> I start all 4 server but all of them keep looping through:
> 
> "java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at org.apache.zookeeper.ClientCnxn
> $SendThread.run(ClientCnxn.java:1078)
> Feb 14, 2011 1:31:16 PM org.apache.log4j.Category info
> INFO: Opening socket connection to server localhost/127.0.0.1:9983
> Feb 14, 2011 1:31:16 PM org.apache.log4j.Category warn
> WARNING: Session 0x0 for server null, unexpected error, closing socket
> connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at org.apache.zookeeper.ClientCnxn
> $SendThread.run(ClientCnxn.java:1078)
> Feb 14, 2011 1:31:16 PM org.apache.log4j.Category info
> INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9900
> Feb 14, 2011 1:31:16 PM org.apache.log4j.Category warn
> WARNING: Session 0x0 for server null, unexpected error, closing socket
> connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at org.apache.zookeeper.ClientCnxn
> $SendThread.run(ClientCnxn.java:1078)
> Feb 14, 2011 1:31:17 PM org.apache.log4j.Category info
> INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9983
> Feb 14, 2011 1:31:17 PM org.apache.log4j.Category warn
> WARNING: Session 0x0 for server null, unexpected error, closing socket
> connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at org.apache.zookeeper.ClientCnxn
> $SendThread.run(ClientCnxn.java:1078)
> Feb 14, 2011 1:31:19 PM org.apache.log4j.Category info
> INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:8574
> Feb 14, 2011 1:31:19 PM org.apache.log4j.Category warn
> WARNING: Session 0x0 for server null, unexpected error, closing socket
> connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at org.apache.zookeeper.ClientCnxn
> $SendThread.run(ClientCnxn.java:1078)
> Feb 14, 2011 1:31:20 PM org.apache.log4j.Category info
> INFO: Opening socket connection to server localhost/127.0.0.1:8574
> Feb 14, 2011 1:31:20 PM org.apache.log4j.Category warn
> WARNING: Session 0x0 for server null, unexpected error, closing socket
> connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at org.apache.zookeeper.ClientCnxn
> $SendThread.run(ClientCnxn.java:1078)
> 
> The problem seems that the zk instances can not connects to the
> different nodes and so do not get up at all.
> 
> I am using revision 1070473 for the tests. Anybody has an idea?
> 
> salu2

-- 
Thorsten Scherler 
codeBusters S.L. - web based systems

http://www.codebusters.es/




[solrCloud] Distributed IDF - scoring in the cloud

2011-02-18 Thread Thorsten Scherler
Hi all,

I am working through the solrCloud examples, and one thing I am not
clear about is the scoring in a distributed search.

I did a small test where I used the "Example A: Simple two shard
cluster" from wiki:SolrCloud and additionally added

java -Durl=http://localhost:7574/solr/collection1/update -jar post.jar
ipod_other.xml

java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar
monitor2.xml

Now requesting
http://localhost:8983/solr/collection1/select?distrib=true&q=electronics&fl=score&shards=localhost:8983/solr,localhost:7574/solr
on both hosts will return the same result. Here we get the score for
each hit based on the shard-specific score, and the hits are merged into
one result doc.

However, when I add monitor2.xml to 7574 as well, which previously did not
contain it, the scoring changes depending on the server I request.

The score returned for 8983 is always 0.09289607, whether distrib=true or false.

The score returned for 7574 is always 0.121383816, whether distrib=true or false.

So is it correct to assume that if a document is indexed in both shards,
the score that predominates is the one from the host that was queried?
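
For reference, a minimal script to reproduce the comparison (roughly what I
did by hand with the URLs above, using plain urllib; wt=json is assumed to be
enabled, as in the example solrconfig):

import json
import urllib2  # urllib.request on Python 3

SHARDS = "localhost:8983/solr,localhost:7574/solr"

def top_score(port, distrib):
    """Return the top hit's score for q=electronics on the given node."""
    url = ("http://localhost:%d/solr/collection1/select"
           "?q=electronics&fl=score&wt=json&distrib=%s&shards=%s"
           % (port, str(distrib).lower(), SHARDS))
    docs = json.load(urllib2.urlopen(url))["response"]["docs"]
    return docs[0]["score"] if docs else None

for port in (8983, 7574):
    for distrib in (True, False):
        print("%d distrib=%s -> %s" % (port, distrib, top_score(port, distrib)))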

My client plans to distribute the current index into different shards.
For example, each "Consejería" (regional ministry) should be hosted in its
own shard. The critical point for the client is that the scoring of a
distributed search stays the same as with the big single index they use
right now.

As I understand it, the current solrCloud implementation makes no attempt
to harmonize the score across shards.

In my research I came across
http://markmail.org/message/bhhfwymz5y7lvoj7
"The "IDF" part of the relevancy score is the only place that
distributed search scoring won't "match up" with no distributed
scoring because the document frequency used for the term is local to
every core instead of global.  If you distribute your documents fairly
randomly to the different shards, this won't matter.

There is a patch in the works to add global idf, but I think that even
when it's committed, it will default to off because of the higher cost
associated with it." the patch is
https://issues.apache.org/jira/browse/SOLR-1632

However, the last comment is from 26/Jul/10, reporting that the patch fails,
and a comment from Yonik gives the impression that it is not ready to use:

"It looks like the issue is this: rewrite() doesn't work for function
queries (there is no propagation mechanism to go through value sources).
This is a problem when real queries are embedded in function queries."

Is there general interest in bringing SOLR-1632 into trunk (especially
for solrCloud)?

Or might it be better to look into something that scales the index into
hbase, so the client does not lose the scoring?

TIA for your feedback
-- 
Thorsten Scherler 
codeBusters S.L. - web based systems

http://www.codebusters.es/





big index vs. lots of small ones

2010-01-20 Thread Thorsten Scherler
Hi all,

I have to do an analysis of the following use case.

I am working as a consultant for a public company. We are discussing
offering each public institution its own search server in the future,
(probably) based on Apache Solr. However, users of our portal should
be able to search all indexes.

The problematic part for our customer is that a meta search over various
indexes, which later merges the responses, will change the scoring.

Imagine you have the two indexes
- public health department (A)
- press relations department (B)

Now say you have 300 documents in A and only one in B about "influenza A".
The B server will return the only document in its index with a very high
score, since, being the only one, it gets a very high "base" (IDF) score,
correct?

On the other hand, A may have much more important documents, but they will
not get the same "base" score.

This means that on a merge the document from server B will most likely be
at the top of the list.
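
To make the effect concrete, here is a back-of-the-envelope calculation with
Lucene's classic IDF formula, idf(t) = 1 + ln(numDocs/(docFreq+1)); the
numbers are made up for illustration:

import math

def idf(num_docs, doc_freq):
    # Lucene DefaultSimilarity: idf(t) = 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / float(doc_freq + 1))

# Index A: the term is common there (300 matches out of 10,000 docs).
# Index B: the term is rare there (1 match out of 10,000 docs).
print("idf in A: %.2f" % idf(10000, 300))  # ~4.50
print("idf in B: %.2f" % idf(10000, 1))    # ~9.52

The lone matching document in B gets roughly twice the IDF weight of any of
A's documents, so it floats to the top of a naive merge.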

To prevent this phenomenon we are looking into merging all the
standalone indexes into one big index, but that leads to other
problems because it will become pretty big pretty fast.

So here my questions:

- What are other people doing to solve this problem?
- What is the best way with Solr to solve the problem of the "base"
scoring?
- What is the best way to have multiple indexes in solr?
- Is it possible to get rid of the "base" scoring in solr?

TIA for any information.

salu2
-- 
Thorsten Scherler 
Open Source Java 

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)






Re: big index vs. lots of small ones

2010-01-25 Thread Thorsten Scherler
On Wed, 2010-01-20 at 08:38 -0800, Marc Sturlese wrote:
> Check out this patch witch solve the distributed IDF's problem:
> https://issues.apache.org/jira/browse/SOLR-1632
> I think it fixes what you are explaining. The price you pay is that there
> are 2 requests per shard. If I am not worng the first is to get term
> frequencies and needed info and the second one is the proper search request.
> The patch also includes caching for terms in the first request.
> 

Nice!

Thank you very much, Mark.

How are things going in Barcelona?

salu2

> 
> Thorsten Scherler-3 wrote:
> > 
> > Hi all,
> > 
> > I have to do an analyses about following usecase.
> > 
> > I am working as consultant in a public company. We are talking about to
> > offer in the future each public institution its own search server
> > (probably) based on Apache Solr. However the user of our portal should
> > be able to search all indexes.
> > 
> > The problematic part for our customer is that a meta search on various
> > indexes which then later merges the response will change the scoring.
> > 
> > Imagine you have the two indexes
> > - public health department (A)
> > - press relations department (B)
> > 
> > Now you have 300 documents in A and only one in B about "influenza A".
> > The B server will return the only document in its index with a very high
> > score, since being the only one it gets a very high "base" score,
> > correct?
> > 
> > On the other hand A may have much more important documents but they will
> > not get the same "base" score.
> > 
> > Meaning on a merge most likely the document from Server B will be top of
> > the list.
> > 
> > To prevent this phenomenon we are looking into merging all the
> > standalone indexes in on big index but that will lead us in other
> > problems because it will become pretty big pretty fast.
> > 
> > So here my questions:
> > 
> > - What are other people doing to solve this problem?
> > - What is the best way with Solr to solve the problem of the "base"
> > scoring?
> > - What is the best way to have multiple indexes in solr?
> > - Is it possible to get rid of the "base" scoring in solr?
> > 
> > TIA for any informations.
> > 
> > salu2
> > -- 
> > Thorsten Scherler 
> > Open Source Java 
> > 
> > Sociedad Andaluza para el Desarrollo de la Sociedad 
> > de la Información, S.A.U. (SADESI)
> > 
> > 
> > 
> > 
> > 
> > 
> 
-- 
Thorsten Scherler 
Open Source Java 

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)






Re: The mechanism of data replication in Solr?

2007-09-05 Thread Thorsten Scherler
On Wed, 2007-09-05 at 15:56 +0800, Dong Wang wrote:
> Hello, everybody:-)
> I'm interested with the mechanism of data replciation in Solr, In the
> "Introduction to the solr enterprise Search Server", Replication is
> one of features of Solr, but I can't find anything about replication
> issues on the Web site and documents, including how to split the
> index, how to distribute the chunks of index, how to placement the
> replica, eager replicaton  or lazy replication..etc. I think  they are
> different from the problem in HDFS.
> Can anybody help me? Thank you in advance.

http://wiki.apache.org/solr/CollectionDistribution

HTH
> 
> Best Wishes.
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Indexing very large files.

2007-09-06 Thread Thorsten Scherler
On Thu, 2007-09-06 at 08:55 +0200, Brian Carmalt wrote:
> Hello again,
> 
> I run Solr on Tomcat under windows and use the tomcat monitor to start 
> the service. I have set the minimum heap
> size to be 512MB and then maximum to be 1024mb. The system has 2 Gigs of 
> ram. The error that I get after sending
> approximately 300 MB is:
> 
> java.lang.OutOfMemoryError: Java heap space
> at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2947)
> at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
> at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384)
> at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
> at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
> at 
> org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
> at 
> org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
> at 
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
> at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
> at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
> at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
> at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:261)
> at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
> at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:581)
> at 
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
> at java.lang.Thread.run(Thread.java:619)
> 
> After sleeping on the problem I see that it does not directly stem from 
> Solr, but from the
> module  org.xmlpull.mxp1.MXParser. Hmmm. I'm open to sugestions and ideas.

Which version of Solr do you use?

http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/handler/XmlUpdateRequestHandler.java?view=markup

The trunk version of the XmlUpdateRequestHandler is now based on StAX.
You may want to try whether that works better.

Please try and report back.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Tagging using SOLR

2007-09-06 Thread Thorsten Scherler
On Thu, 2007-09-06 at 12:59 +0530, Doss wrote:
> Dear all,
> 
> We are running an appalication built using SOLR, now we are trying to build
> a tagging system using the existing SOLR indexed field called
> "tag_keywords", this field has different keywords seperated by comma, please
> give suggestions on how can we build tagging system using this field?

http://wiki.apache.org/solr/ConfiguringSolr

http://wiki.apache.org/solr/SchemaXml
Define a new field named "keyword" and use "text_ws" as its type;
instead of commas, separate the keywords with whitespace. In schema.xml
that is roughly:

...
<field name="keyword" type="text_ws" indexed="true" stored="true"
       multiValued="true"/>
...


HTH

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Indexing very large files.

2007-09-06 Thread Thorsten Scherler
On Thu, 2007-09-06 at 11:26 +0200, Brian Carmalt wrote:
> Hallo again,
> 
> I checked out the solr source and built the 1.3-dev version and then I 
> tried to index the same file to the new server.
> I do get a different exception trace, but the result is the same.
> 
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2882)
> at 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)

It seems that you are reaching the limit because of the StringBuilder.

Did you try to raise the memory to the max, like:
java -Xms1536m -Xmx1788m -jar start.jar

Anyway, you will have to look into

SolrInputDocument readDoc(XMLStreamReader parser) throws XMLStreamException {
  ...
  StringBuilder text = new StringBuilder();
  ...
  case XMLStreamConstants.CHARACTERS:
    text.append( parser.getText() );
    break;
  ...
}

The problem is that the "text" object grows bigger than the heap; maybe
invoking garbage collection before will help.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



RSS syndication Plugin

2007-09-06 Thread Thorsten Scherler
Hi all,

I am curious whether somebody has written an RSS plugin for solr.

The idea is to provide an RSS syndication link for the current search.

It should be really easy to implement, since it would be just a
transformation from the solr XML to RSS, which can easily be done with a
simple xsl.

Has somebody already done this?

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: RSS syndication Plugin

2007-09-06 Thread Thorsten Scherler
On Thu, 2007-09-06 at 09:07 -0400, Ryan McKinley wrote:
> perhaps:
> https://issues.apache.org/jira/browse/SOLR-208
> 
> in http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/xslt/
> 
> check:
> example_atom.xsl
> example_rss.xsl

Awesome.

Thanks very much, Ryan, for pointing me in the right direction, and Brian
Whitman for his contribution.
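
For the archive: those stylesheets go through Solr's XSLT response writer,
so (if I read SOLR-208 right) the feed for a search is just a normal select
URL with the writer parameters added, e.g.

http://localhost:8983/solr/select?q=solr&wt=xslt&tr=example_rss.xsl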

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Strange behavior when searching with accents

2007-09-20 Thread Thorsten Scherler
On Thu, 2007-09-20 at 10:11 +0200, Thierry Collogne wrote:
> Hello,
> 
> We are experiencing some strange behavior while searching with words
> containing accents.
> We are using two examples "rené" and "matthé"
> 
> When we search for "rené" or for "rene", we get the same results, so that is
> ok.
> But when we search for "matthé" or for "matthe", we get two totally
> different results.
> 
> Can someone tell me why this happens? We would like the results to be the
> same.

That highly depends on your schema. Do you use the
solr.ISOLatin1AccentFilterFactory?

I am using the following and it works like a charm (the accent filter
sits right after the tokenizer, for both index and query):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    ...
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    ...
  </analyzer>
</fieldType>


HTH

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Strange behavior when searching with accents

2007-09-20 Thread Thorsten Scherler
On Thu, 2007-09-20 at 13:33 +0200, Thierry Collogne wrote:
> We are using this schema definition
> 


Thierry, try to move the solr.ISOLatin1AccentFilterFactory up the filter
chain, like:

...
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
...

for both indexing and query.

This way you make sure that all accents are gone before you do any
further filtering.

You may need to reindex all documents to make sure we are not going to
use the old index.

HTH

salu2

> 
>   
> 
> 
> 
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
> 
> 
> 
> 
>   
>   
> 
>  ignoreCase="true" expand="true"/>
> 
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0"/>
> 
> 
> 
> 
>   
> 
> 
> I will take a look at the analyzer took.
> 
> Thank you both for the quick response.
> 
> On 20/09/2007, Bertrand Delacretaz <[EMAIL PROTECTED]> wrote:
> >
> > On 9/20/07, Thierry Collogne <[EMAIL PROTECTED]> wrote:
> >
> > > ..when we search for "matthé" or for "matthe", we get two totally
> > > different results
> >
> > The analyzer admin tool should help you find out what's happening, see
> >
> > http://wiki.apache.org/solr/FAQ#head-b25df8c8393bbcca28f1f344c432975002e29ca9
> >
> > -Bertrand
> >
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Strange behavior when searching with accents

2007-09-20 Thread Thorsten Scherler
On Thu, 2007-09-20 at 14:01 +0200, Thierry Collogne wrote:
> I have entered the the matthé term in the the analyzer, but as far as I
> understand, it should be ok. I have made some screenshots with the results.
> 
> http://farm2.static.flickr.com/1407/1412619772_0b697789cd_o.jpg
> 
> http://farm2.static.flickr.com/1245/1412619774_3351b287bc_o.jpg
> 
> I find it strange that the second screenshost doesn"t give any matches.
> 
> Can someone take a look at them and perhaps clarify why it does not work?

See my other response, but in the 2nd screenshot the "query" field has
been changed to the non-accented form.

Also, you may want to use the "verbose output" option for a better
analysis.

salu2

> 
> Thank you.
> 
> 
> On 20/09/2007, Thierry Collogne < [EMAIL PROTECTED]> wrote:
> >
> > We are using this schema definition
> >
> > 
> >   
> > 
> > 
> > 
> >  > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="1" catenateAll="0"/>
> > 
> > 
> > 
> > 
> >   
> >   
> > 
> >  > ignoreCase="true" expand="true"/>
> > 
> >  > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > catenateNumbers="0" catenateAll="0"/>
> > 
> > 
> > 
> > 
> >   
> > 
> >
> > I will take a look at the analyzer took.
> >
> > Thank you both for the quick response.
> >
> > On 20/09/2007, Bertrand Delacretaz < [EMAIL PROTECTED] > wrote:
> > >
> > > On 9/20/07, Thierry Collogne < [EMAIL PROTECTED]> wrote:
> > >
> > > > ..when we search for "matthé" or for "matthe", we get two totally
> > > > different results
> > >
> > > The analyzer admin tool should help you find out what's happening, see
> > > http://wiki.apache.org/solr/FAQ#head-b25df8c8393bbcca28f1f344c432975002e29ca9
> > >
> > >
> > > -Bertrand
> > >
> >
> >
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Strange behavior when searching with accents

2007-09-20 Thread Thorsten Scherler
On Thu, 2007-09-20 at 15:27 +0200, Bertrand Delacretaz wrote:
> On 9/20/07, Thierry Collogne <[EMAIL PROTECTED]> wrote:
> 
> > ...Thank you very much. Moving the  up in the chain fixed it
> 
> Yes, the problem was the EnglishPorterFilterFactory before the accents
> removal: the stemmer doesn't know about accents, so no stemming
> occured on "matthé" whereas "matthe" was stemmed to "matth".
> 
> BTW, your "rené" example makes me think you're indexing french, if
> that's the case you might want to use a stemmer configured for that
> language, for example
> 
>class="Solr.SnowballPorterFilterFactory"
>   language="French"/>

Bertrand, does the French Snowball work fine?

A colleague of mine exchanged mails with Porter about the Spanish stemmer,
and he came to the conclusion that it does not really work well for
Spanish:

"So -orio on the whole changes meaning too much (acceso = access,
accessorio = accessory differ as much in Spanish as English; -atorio
similarly (aclarar to  rinse, clear (in a very general sense), brighten
up; aclaratorio = explanatory). 

Diminutives, augmentatives usually fall under (a) and (c). -illo, -ote,
-isimo are in this category. 

-al and -iz look like plausible candidates for ending removal, but,
unlike their English counterparts, removing them makes little difference
or improvement. Similarly with -ion removal after -s. 

There is a difficulty with pure vowel endings, and the stemmer can't
always get this right. So in English 'academic' is stemmed to 'academ'
but 'academy' does not lose the final -y (or -i). This explains the
residual vowels with -io, -ia 
endings etc."

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



RE: Strange behavior when searching with accents

2007-09-21 Thread Thorsten Scherler
On Thu, 2007-09-20 at 11:13 -0700, Lance Norskog wrote:
> English and French are messy, so heuristic methods are the only possible.
> Spanish is rigorously clean, and stemming should be done from the declension
> rules and irregular conjugation tables. This involves large (fast) tables in
> ram rather than small (slow) string-shuffling.
> 

Interesting. Do you have a link to some documentation on how to implement this?

salu2

> Lance Norskog
> 
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
> Bertrand Delacretaz
> Sent: Thursday, September 20, 2007 8:11 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Strange behavior when searching with accents
> 
> On 9/20/07, Thorsten Scherler <[EMAIL PROTECTED]>
> wrote:
> > ...Betrand, does the French Snowball work fine?...
> 
> I've seen some weirdnesses, like "tennis" and "tenir" (means to hold) both
> stemmed to "ten", but in all of our (simple) tests it was ok.
> 
> The application where we're using it does not require high precision though,
> so it looked good enough and we didn't do create very extensive tests for
> it.
> 
> -Bertrand
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Scripts not working on cron - always asking for password

2007-09-21 Thread Thorsten Scherler
>> 
> >>>>> Hi, there,
> >>>>> 
> >>>>> I used an absolute path for the "dir" param in the solrconfig.xml as
> >>>>> below:
> >>>>> 
> >>>>> 
> >>>>>   snapshooter
> >>>>>   /var/SolrHome/solr/bin
> >>>>>   true
> >>>>>arg1 arg2 
> >>>>>MYVAR=val1 
> >>>>> 
> >>>>> 
> >>>>> However, I got "snapshooter: not found"  exception thrown in
> >>>> catalina.out.
> >>>>> I don't see why this doesn't work. Anything I'm missing?
> >>>>> 
> >>>>> 
> >>>>> Many thanks,
> >>>>> 
> >>>>> -Hui
> >>>>> 
> >>>> 
> >>> 
> >>> 
> >>> 
> >>> --
> >>> Regards,
> >>> 
> >>> -Hui
> >>> 
> >> 
> > 
> > 
> 
> 
> http://www.bbc.co.uk/
> This e-mail (and any attachments) is confidential and may contain personal 
> views which are not the views of the BBC unless specifically stated.
> If you have received it in error, please delete it from your system.
> Do not use, copy or disclose the information in any way nor act in reliance 
> on it and notify the sender immediately.
> Please note that the BBC monitors e-mails sent or received.
> Further communication will signify your consent to this.
>   
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: How to get all the search results - python

2007-09-24 Thread Thorsten Scherler
On Mon, 2007-09-24 at 14:34 +0530, Roopesh P Raj wrote:
> Hi,
> 
> I am using solr setup in Tomcat 5.5 with python 2.4 using python client 
> solr.py. 
> 
> When I search, all the results are not returned. 
> 
> The method call for searching is as follows : rows specifies the number of 
> rows.
> data = c.search(q='query', fl='id score unique_id Message-ID To From 
> Subject',rows=50, wt='python')
> 
> I want to specify that I want all the rows. How can I do that ?

Hi Roopesh,

I am not sure whether I understand your problem. 

Is it the limitation of rows/pagination?
If so, why not use a really high number (like rows=100)?

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: How to get all the search results - python

2007-09-24 Thread Thorsten Scherler
On Mon, 2007-09-24 at 16:29 +0530, Roopesh P Raj wrote:
> > Hi Roopesh,
> 
> > I am not sure whether I understand your problem. 
> 
> > Is it the limitation of rows/pagination? 
> > If so why not using a real high number (like rows=100)?
> 
> > salu2
> 
> Hi,
> 
> Assigning a high number will solve my problem. (I thought that there will 
> something like rows='all' to do it).
> 
> Can I do pagination using the python client? 

I am not a python expert but I think so.

> How can I specify the starting position, offset etc for 
> pagination through the python client? 

http://wiki.apache.org/solr/CommonQueryParameters

It should work as described in the above document (with the start
parameter).

e.g. 
data = c.search(q='query', fl='id score unique_id Message-ID To From
Subject',rows=50, wt='python',start=50)

HTH
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: How to get all the search results - python

2007-09-25 Thread Thorsten Scherler
On Tue, 2007-09-25 at 10:03 +0530, Roopesh P Raj wrote:

DISCLAIMER:
Please, I am subscribed to the user list, so there is no need to write
me directly or cc me in your response. Moreover, since we are an open
source project, off-list communication is suboptimal and harmful to the
community. The community has many eyes which can spot possible problems
with a solution and propose better ones. Further, the mailing list has
an archive, where proven solutions can be searched. If we do everything
in off-list mails, no solutions go into the archive and we always have
to repeat the same mails.

PLEASE write to the ml!

> > http://wiki.apache.org/solr/CommonQueryParameters
> 
> > It should work as described in the above document (with the start
> > parameter.
> 
> > e.g. 
> > data = c.search(q='query', fl='id score unique_id Message-ID To From
> > Subject',rows=50, wt='python',start=50)
> 
> > HTH
> > --
> 
> Hi,
> 
> I my application there is a provision to copy the archive based on date 
> indexed. 
> In this case the number of search results may exceed the high number I have 
> assigned to rows, say rows=1000. I wanted to avoid this situation. In 
> this 
> situation I don't want paginated queries. 
> 
> Can you please tell me how to approach this particular situation.

I think the best way is to
1) get the first response page (rows=50, start=0),
2) parse the response to see how many results you have in total,
3) do a loop (rows=50, start=50*x) and call solr till you have all
results (see the sketch below).
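
A rough sketch of that loop with plain urllib and wt=json (untested; if you
stay with the solr.py client, adapt the parsing to whatever c.search
returns):

import json
import urllib
import urllib2  # urllib.request on Python 3

def fetch_all(base_url, query, page_size=50):
    """Page through a result set with rows/start until numFound is reached."""
    docs, start, num_found = [], 0, None
    while num_found is None or start < num_found:
        url = "%s/select?q=%s&wt=json&rows=%d&start=%d" % (
            base_url, urllib.quote(query), page_size, start)
        response = json.load(urllib2.urlopen(url))["response"]
        num_found = response["numFound"]
        docs.extend(response["docs"])
        start += page_size
    return docs

all_docs = fetch_all("http://localhost:8983/solr", "query")
print(len(all_docs))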

Like Jérôme stated:
On Mon, 2007-09-24 at 12:45 +0100, Jérôme Etévé wrote:
> By design, it's not very efficient to ask for a large number of
> results with solr/lucene. I think you will face performance and memory
> problems if you do that. 

HTH

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Problem with html code inside xml

2007-09-25 Thread Thorsten Scherler
On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
> If I understand, you want to keep the raw html code in solr like that
> (in your posting xml file):
> 
> 
>   
> 
> 
> I think you should encode your content to protect these xml entities:
> <  ->  <
> > -> >
> " -> "
> & -> &
> 
> If you use perl, have a look at HTML::Entities.

AFAIR you cannot use raw tags; they always get transformed to
entities. The solution is to apply an xsl transformation to the
response that transforms the entities back to tags.

Have a look at the thread 
http://marc.info/?t=11677583791&r=1&w=2
and especially at
http://marc.info/?l=solr-user&m=116782664828926&w=2
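
For completeness, Jérôme's encoding step in Python would be roughly (the
field name here is made up for the example):

from xml.sax.saxutils import escape

# escape() protects & < > by default; pass extra entities for quotes.
raw_html = '<div class="paragraph">Les débats</div>'
encoded = escape(raw_html, {'"': "&quot;"})

doc = '<add><doc><field name="content">%s</field></doc></add>' % encoded
print(doc)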

HTH

salu2

> 
> 
> On 9/25/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > I've got some problem with html code who is embedded in xml file:
> >
> > Sample source .
> >
> > 
> > 
> > 
> >  Les débats
> > 
> > 
> > Le premier tour des élections fédérales se 
> > déroulera le 21
> > octobre prochain. D'ici là, La 1ère vous propose plusieurs rendez-
> > vous, dont plusieurs grands débats à l'enseigne de Forums.
> > 
> > 
> > 
> > 
> > my para textehere
> > 
> > 
> > Vous trouverez sur cette page toutes les 
> > dates et les heures de
> > ces différents rendez-vous ainsi que le nom et les partis des
> > débatteurs. De plus, vous pourrez également écouter ou réécouter
> > l'ensemble de ces émissions.
> > 
> > 
> > 
> > -
> > When a make a query on solr I've got something like that in the
> > source code of the xml result:
> >
> > http://www.w3.org/1999/xhtml";>
> > <
> > div
> > class
> > =
> > "paragraph"
> > >
> > <
> > div
> > class
> > =
> > "paragraphTitle"
> > />
> > −
> > <
> > ...
> >
> > It is not exactly what I want. I want to keep the html tags, that all
> > without formatting.
> >
> > So the br tags and a tags are well formed in xml and json result, but
> > the div tags are not kept.
> > -
> > In the schema.xml I've got this for the html content
> >
> > 
> >
> >> stored="true" multiValued="true"/>
> >
> > -
> >
> > Any help would be appreciate.
> >
> > Thanks in advance.
> >
> > S. Christin
> >
> >
> >
> >
> >
> >
> 
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Converting German special characters / umlaute

2007-09-28 Thread Thorsten Scherler
On Thu, 2007-09-27 at 13:26 -0400, J.J. Larrea wrote:
> At 12:13 PM -0400 9/27/07, Steven Rowe wrote:
> >Chris Hostetter wrote:
...
> As for implementation, the first part could easily and flexibly accomplished 
> with the current PatternReplaceFilter, and I'm thinking the second could be 
> done with an extension to that or better yet a new Filter which allows 
> parsing synonymous tokens from a flat to overlaid format, e.g. something on 
> the order of:
> 
>   pattern="(.*)(ü|ue)(.*)"
>  replacement="$1ue$3|$1u$3"
>  tokensep="|"  
>  replace="first"/>
> 
> or perhaps better,
> 
>   pattern="(.*)(ü|ue)(.*)"
>  replacement="$1ue$3|$1u$3"
>  replace="first"/>
>   tokensep="|"/>   
> 
> which in my fantasy implementation would map:
> 
> Müller -> Mueller|Muller
> Mueller -> Mueller|Muller
> Muller -> Muller
> 
> and could be run at index-time and/or query-time as appropriate.
> 
> >Does anyone know if there are other (Latin-1-utilizing) languages
> >besides German with standardized diacritic substitutions that involve
> >something other than just stripping the diacritics?
> 
> I'm curious about this too.
> 

I am German, but working in Spain, so I have not faced the problem so
far. Anyhow, IMO

Müller -> Mueller
Mueller -> Mueller

is right; shortening the word further does not seem right, since it
changes the meaning too much.

Further:
groß -> gross
gross -> gross

ß is pronounced 'sz' but only replaced by 'ss'.

salu2

> - J.J.
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Search results problem

2007-10-17 Thread Thorsten Scherler
On Wed, 2007-10-17 at 20:44 +1000, Pieter Berkel wrote:
> There is a configuration option called "" in
> solrconfig.xmlwith the default value of 10,000.  You may need to
> increase this value if
> you are indexing fields that are longer.
> 

Is there a way to define an unlimited value? Like -1?

TIA

salu2

> 
> 
> On 17/10/2007, Maximilian Hütter <[EMAIL PROTECTED]> wrote:
> >
> > Daniel Naber schrieb:
> > > On Tuesday 16 October 2007 12:03, Maximilian Hütter wrote:
> > >
> > >> the content of one document is completely contained in another,
> > >> but search for a special word I only get one document as result.
> > >> I am absolutely sure it is contained in the other document, but I will
> > >> only get the "parent" doc if I add a word.
> > >
> > > You should try debugging the problem with Luke, e.g. use "reconstruct &
> > > edit" to see if the term is really indexed in both documents.
> > >
> > > Regards
> > >  Daniel
> > >
> >
> > Thank you for the tip, after using luke I can see that the term is
> > really missing in the other document.
> > Is there a size restriction for field content in Solr/Lucene? Because
> > from the "fulltext" field I use as default field (after luke
> > reconstruction) seem to be missing a lot strings I expected to find there.
> >
> > Best regards,
> >
> > Max
> >
> > --
> > Maximilian Hütter
> > blue elephant systems GmbH
> > Wollgrasweg 49
> > D-70599 Stuttgart
> >
> > Tel:  (+49) 0711 - 45 10 17 578
> > Fax:  (+49) 0711 - 45 10 17 573
> > e-mail :  [EMAIL PROTECTED]
> > Sitz   :  Stuttgart, Amtsgericht Stuttgart, HRB 24106
> > Geschäftsführer:  Joachim Hörnle, Thomas Gentsch, Holger Dietrich
> >
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Tagging in solr

2007-10-19 Thread Thorsten Scherler
On Fri, 2007-10-19 at 11:01 +0100, Spas Poptchev wrote:
> Hi,
>  
> what i want to do is to store tags that belong to products. Each tag should 
> also store information about how often it was used with a certain product.
> So for example:
>  
> product1
> cool 5=> product1 was tagged 5 times with cool
>  
> What would be the best way to implement this kind of stuff in solr?

There is a wiki page on some brainstorming on how to implement  
tagging within Solr: <http://wiki.apache.org/solr/UserTagDesign>

It's easy enough to have a tag_keywords field, but updating a single  
tag_keywords field is not so straightforward without sending the  
entire document to Solr every time it is tagged.  See SOLR-139's  
extensive comments and patches to see what you're getting into.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: escaping characters and security

2007-11-06 Thread Thorsten Scherler
On Tue, 2007-11-06 at 11:52 -0500, Micah Wedemeyer wrote:
> Are there any security risks to passing a query directly to Solr without
> doing any sort of escaping?  I am using URL encoding, so '&' and such
> are being encoded into their %XX equivalents.
> 
> Still, should I be doing anything else?  Is there such a thing as a
> Solr-injection attack?

http://wiki.apache.org/solr/mySolr

"Typically it's not recommended do have your front end users/clients
hitting Solr directly as part of an HTML form submit ... the more
conventional way to think of it is that Solr is a backend service, which
your application can talk to over HTTP -- if you were dealing with a
database, you wouldn't expect that you could generate an HTML form for
your clients and then have them submit that form in some way that
resulted in their browser using JDBC (or ODBC) to communicate directly
with your database, their client would communicate with your App, which
would validate their input, impose some security checks on the input,
and then execute the underlying query to your database -- working with
Solr should be very similar, it just so happens that instead of using
JDBC or some other binary protocol, Solr uses HTTP, and you *can* talk
to it directly from a web browser, but that's really more of a debugging
feature then anything else."
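
That said, if raw user input does end up in the q parameter, it is worth
neutralizing Lucene's query syntax characters before URL-encoding, e.g.
something like this (the character list is Lucene's documented set of
query special characters):

import re
import urllib  # urllib.parse.quote on Python 3

# Lucene special characters: + - && || ! ( ) { } [ ] ^ " ~ * ? : \
SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\]|&&|\|\|)')

def escape_query(user_input):
    """Backslash-escape Lucene operators so user text is matched literally."""
    return SPECIAL.sub(r'\\\1', user_input)

q = escape_query('foo (bar) -baz')
print("http://localhost:8983/solr/select?q=" + urllib.quote(q))

Note this does not neutralize the bare AND/OR/NOT keywords; handle those
in your application if they matter to you.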

HTH

salu2

> 
> Thanks,
> Micah
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Help with Debian solr/jetty install?

2007-11-21 Thread Thorsten Scherler
On Tue, 2007-11-20 at 22:50 -0800, Otis Gospodnetic wrote:
> Phillip,
> 
> I won't go into details, but I'll point out that the Java compiler is called 
> javac and if memory serves me well, it is defined in one of Jetty's XML 
> config files in its etc/ dir.  The java compiler is used to compile JSPs that 
> Solr uses for the admin UI.  So, make sure you have javac and make sure Jetty 
> can find it.
>  

e.g. 

cd ~
vim .bashrc

...
export JAVA_HOME=/home/thorsten/opt/java
export PATH=$JAVA_HOME/bin:$PATH

The important thing is that $JAVA_HOME points to the JDK and it is first
in your path!

salu2

> Otis
> 
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> - Original Message 
> From: Phillip Farber <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, November 20, 2007 5:55:27 PM
> Subject: Help with Debian solr/jetty install?
> 
> 
> Hi,
> 
> I've successfully run as far as the example admin page on Debian linux
>  2.6.
> 
> So I installed the solr-jetty packaged for Debian testing which gives
>  me 
> Jetty 5.1.14-1 and Solr 1.2.0+ds1-1.  Jetty starts fine and so does the
>  
> Solr home page at http://localhost:8280/solr
> 
> But I get an error when I try to run http://localhost:8280/solr/admin
> 
> HTTP ERROR: 500
> No Java compiler available
> 
> I have sun-java6-jre and sun-java6-jdk packages installed.  I'm new to 
> servlet containers and java webapps.  What should I be looking for to 
> fix this or what information could I provide the list to get me moving 
> forward from here?
> 
> I've included the trace from the Jetty log, and the java properties
>  dump 
> from the example below.
> 
> Thanks,
> Phil
> 
> ---
> 
> Java properties (from the example):
> --
> 
> sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386
> java.vm.version = 1.6.0-b105
> java.vm.name = Java HotSpot(TM) Client VM
> user.dir = /tmp/apache-solr-1.2.0/example
> java.runtime.version = 1.6.0-b105
> os.arch = i386
> java.io.tmpdir = /tmp
> 
> java.library.path = 
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
> java.class.version = 50.0
> jetty.home = /tmp/apache-solr-1.2.0/example
> sun.management.compiler = HotSpot Client Compiler
> os.version = 2.6.22-2-686
> java.class.path = 
> /tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar
> java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre
> java.version = 1.6.0
> java.ext.dirs = 
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext
> sun.boot.class.path = 
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes
> 
> 
> 
> 
> Jetty log (from the error under Debian Solr/Jetty):
> 
> 
> org.apache.jasper.JasperException: No Java compiler available
> at 
> org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460)
> at 
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367)
> at
>  org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
> at
>  org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at
>  org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
> at 
> org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
> at
>  org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286)
> at
>  org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171)
> at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302)
> at org.mortbay.jetty.servlet.Default.service(Default.java:223)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at
>  org.mo

Get last updated/committed document

2007-11-23 Thread Thorsten Scherler
Hi all,

I need to ask solr to return me the id of the last committed document.

Is there a way to achieve this via a standard lucene query, or do I need
a custom connector that gives me this information?

TIA for any information

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Get last updated/committed document

2007-11-26 Thread Thorsten Scherler
On Sat, 2007-11-24 at 00:17 +1100, climbingrose wrote:
> Assuming that you have the timestamp field defined:
> q=*:*&sort=timestamp desc
> 

Thanks.
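
For the archive, the concrete request (assuming a stored "timestamp" field
filled with default="NOW" at index time, and the JSON writer enabled):

import json
import urllib2  # urllib.request on Python 3

url = ("http://localhost:8983/solr/select"
       "?q=*:*&sort=timestamp%20desc&rows=1&fl=id&wt=json")
docs = json.load(urllib2.urlopen(url))["response"]["docs"]
print(docs[0]["id"] if docs else "index is empty")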

salu2

> On Nov 23, 2007 10:43 PM, Thorsten Scherler
> <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I need to ask solr to return me the id of the last committed document.
> >
> > Is there a way to archive this via a standard lucene query or do I need
> > a custom connector that gives me this information?
> >
> > TIA for any information
> >
> > salu2
> > --
> > Thorsten Scherler thorsten.at.apache.org
> > Open Source Java  consulting, training and solutions
> >
> >
> 
> 
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: solr to work for my web application

2008-02-13 Thread Thorsten Scherler
On Wed, 2008-02-13 at 00:06 -0800, newBea wrote:
> hi 
> 
> I am new to solr/lucene...I have installed solr nightly version..its working
> very fine.
> 
> But it is working for the exampledocs present in the example folder of the
> nightly version of solr. I need solr to work for my current web
> application...I am using tomcat5.5.23 for the application(Windows)...using
> jetty to start solr from outside of the webapps folder.
> 
> Is there any way to start the jetty using tomcat?
> 
> Help would be appreciated...

some links that you may get started:
http://wiki.apache.org/solr
http://wiki.apache.org/solr/mySolr
http://wiki.apache.org/solr/SolrTomcat

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: solr to work for my web application

2008-02-13 Thread Thorsten Scherler
On Wed, 2008-02-13 at 03:42 -0800, newBea wrote:
> Hi Thorsten,
> 
> I have my application running on 8080 port with tomcat 5.5.23I am
> starting solr on port 8983 with jetty server using command "java -jar
> start.jar".
> 
> Both the server gets started...now any search I make on tomcat application
> is interacting with solr very well. The problem is "schema.xml" and
> "solrconfig.xml" in the conf directory are default one. But after adding
> customized schema name parameter and required fields, solr is not working as
> required.

Can you post the modification you made to both files?

> 
> Customized code for parsing the xml generated from solr is working
> fine...but it is unable to find the uniquekey field which we set for all the
> documents in the schema documentand thus result is 0 means nothing.
> 

Hmm, what is your update command and your unique key?

We would need to see this modification to tell you what may be wrong.

Did you try http://YOUR_HOST:8983/solr/admin/luke?wt=xslt&tr=luke.xsl

What does this give?

salu2

> I am not able to find the solution for this one... any suggestions wud be
> appreciated...thanks in advance. 
> 
> Thorsten Scherler-3 wrote:
> > 
> > On Wed, 2008-02-13 at 00:06 -0800, newBea wrote:
> >> hi 
> >> 
> >> I am new to solr/lucene...I have installed solr nightly version..its
> >> working
> >> very fine.
> >> 
> >> But it is working for the exampledocs present in the example folder of
> >> the
> >> nightly version of solr. I need solr to work for my current web
> >> application...I am using tomcat5.5.23 for the
> >> application(Windows)...using
> >> jetty to start solr from outside of the webapps folder.
> >> 
> >> Is there any way to start the jetty using tomcat?
> >> 
> >> Help would be appreciated...
> > 
> > some links that you may get started:
> > http://wiki.apache.org/solr
> > http://wiki.apache.org/solr/mySolr
> > http://wiki.apache.org/solr/SolrTomcat
> > 
> > salu2
> > -- 
> > Thorsten Scherler thorsten.at.apache.org
> > Open Source Java  consulting, training and solutions
> > 
> > 
> > 
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: solr to work for my web application

2008-02-13 Thread Thorsten Scherler
On Wed, 2008-02-13 at 05:04 -0800, newBea wrote:
> I havnt used luke.xsl. Ya but the link provided by u gives me "Solr Luke
> Request Handler Response"...
> 
>  is simple string as: csid

So you have:

<uniqueKey>csid</uniqueKey>

and

<field name="csid" type="string" indexed="true" stored="true" required="true"/>

> 
> till now I am updating docs thru command prompt as : post.jar *.xml
> http://localhost:8983/update

What do the docs look like? I mean, since you changed the sample config,
you send changed documents as well, right? How do they look?

> 
> I am not clear on how do I post xml docs

Well, like you said, with the post.jar, which then sends your
modified docs; but there are many ways to trigger an add command to solr.
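
For example, mirroring the earlier posts in this archive (adjust the -Durl
value to wherever your solr instance lives; note the /solr/ context path):

java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml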

>  or wud xml docs be posted while I
> request solr thru tomcat at the time of searching text...

To search text from tomcat you will need a servlet or something
similar that contacts the solr server for the search result and then
handles the response (e.g. applies a custom xsl to the results).



> 
> This manually procedure when I update the xml docs on exampledocs folder
> inside distribution package restrict it to exampledocs itself

No, either copy the jar to the folder where you have your documents or
add it to the PATH.

> ...I am not
> getting a way where my sites text get searched by solr...Do I need to copy
> start.jar and relevant folders in my working directory for web application.

Hmm, it seems that you have not understood the second paragraph of
http://wiki.apache.org/solr/mySolr

"Typically it's not recommended to have your front end users/clients
hitting Solr directly as part of an HTML form submit ... the more
conventional way to think of it is that Solr is a backend service, which
your application can talk to over HTTP ..."

Meaning you have two different servers running. Alternatively you can run
solr in the same tomcat as your application. If you follow SolrTomcat
from the wiki it will be installed as a "solr" servlet. Your application
will then communicate with this servlet.
salu2

> 
> any help?
> 
> Thorsten Scherler-3 wrote:
> > 
> > On Wed, 2008-02-13 at 03:42 -0800, newBea wrote:
> >> Hi Thorsten,
> >> 
> >> I have my application running on 8080 port with tomcat 5.5.23I am
> >> starting solr on port 8983 with jetty server using command "java -jar
> >> start.jar".
> >> 
> >> Both the server gets started...now any search I make on tomcat
> >> application
> >> is interacting with solr very well. The problem is "schema.xml" and
> >> "solrconfig.xml" in the conf directory are default one. But after adding
> >> customized schema name parameter and required fields, solr is not working
> >> as
> >> required.
> > 
> > Can you post the modification you made to both files?
> > 
> >> 
> >> Customized code for parsing the xml generated from solr is working
> >> fine...but it is unable to find the uniquekey field which we set for all
> >> the
> >> documents in the schema documentand thus result is 0 means nothing.
> >> 
> > 
> > Hmm, what is your update command and your unique key?
> > 
> > We would need to see this modification to tell you what may be wrong.
> > 
> > Did you try http://YOUR_HOST:8983/solr/admin/luke?wt=xslt&tr=luke.xsl
> > 
> > What does this gives?
> > 
> > salu2
> > 
> >> I am not able to find the solution for this one... any suggestions wud be
> >> appreciated...thanks in advance. 
> >> 
> >> Thorsten Scherler-3 wrote:
> >> > 
> >> > On Wed, 2008-02-13 at 00:06 -0800, newBea wrote:
> >> >> hi 
> >> >> 
> >> >> I am new to solr/lucene...I have installed solr nightly version..its
> >> >> working
> >> >> very fine.
> >> >> 
> >> >> But it is working for the exampledocs present in the example folder of
> >> >> the
> >> >> nightly version of solr. I need solr to work for my current web
> >> >> application...I am using tomcat5.5.23 for the
> >> >> application(Windows)...using
> >> >> jetty to start solr from outside of the webapps folder.
> >> >> 
> >> >> Is there any way to start the jetty using tomcat?
> >> >> 
> >> >> Help would be appreciated...
> >> > 
> >> > some links that you may get started:
> >> > http://wiki.apache.org/solr
> >> > http://wiki.apache.org/solr/mySolr
> >> > http://wiki.apache.org/solr/SolrTomcat
> >> > 
> >> > salu2
> >> > -- 
> >> > Thorsten Scherler
> >> thorsten.at.apache.org
> >> > Open Source Java  consulting, training and
> >> solutions
> >> > 
> >> > 
> >> > 
> >> 
> > -- 
> > Thorsten Scherler thorsten.at.apache.org
> > Open Source Java  consulting, training and solutions
> > 
> > 
> > 
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: solr to work for my web application

2008-02-19 Thread Thorsten Scherler
On Thu, 2008-02-14 at 23:16 -0800, newBea wrote:
> Hi Thorsten...
> 
> SOrry for giving u much trouble but I need some answer regarding solr...plz
> help...
> 
> Question1
> I am using tomcat 5.5.23 so for JNDI setup of solr, adding solr.xml with
> context fragment as below in the tomcat5.5/...catalina/localhost.
> 
> <Context docBase="..." crossContext="true">
>   <Environment name="solr/home" type="java.lang.String"
>    value="D:/Projects/csdb/solr" override="true" />
> </Context>
> 
> Is it the correct way of doing it? 

Yes as I understand the wiki page.

> Or do I need to add context fragment in
> the server.xml of tomcat5.5?
> 
> Question2
> I am starting solr server using start.jar from another location on C:
> drive...whereas my home location indicated on D: drive. Is it the root coz I
> am not getting the search result?

Hmm, as I understand it you are starting two instances of solr! One in
tomcat and the other in jetty. Why do you want that? If you have solr on
tomcat you do not need the jetty anymore. It makes no sense under
normal circumstances to do this.

> 
> Question3
> I have added the <dataDir> parameter as C:\solr\data in
> solrconfig.xml...

That seems to be wrong. It should read
<dataDir>${solr.data.dir:C:\solr\data}</dataDir> but I am not using
windows so I am not sure whether you may need to escape the path.

salu2

> but the indexes are not getting stored there...indexes for
> search are getting stored in the default dir of solr...any suggestions
> 
> Thanks in advance...
> 
> 
> Thorsten Scherler wrote:
> > 
> > On Wed, 2008-02-13 at 05:04 -0800, newBea wrote:
> >> I havnt used luke.xsl. Ya but the link provided by u gives me "Solr Luke
> >> Request Handler Response"...
> >> 
> >> <uniqueKey> is simple string as: csid
> > 
> > So you have:
> > <uniqueKey>csid</uniqueKey>
> > 
> > and
> > <field name="csid" ... required="true" /> 
> > 
> > 
> >> 
> >> till now I am updating docs thru command prompt as : post.jar *.xml
> >> http://localhost:8983/update
> > 
> > how do the docs look like? I mean since you changed the sample config
> > you send changed documents as well, right? How do they look?
> > 
> >> 
> >> I am not clear on how do I post xml docs
> > 
> > Well like you said, with the post.jar and then you will send your
> > modified docs but there are many ways to trigger an add command to solr.
> > 
> >>  or wud xml docs be posted while I
> >> request solr thru tomcat at the time of searching text...
> > 
> > To search text from tomcat you will need to have a servlet or something
> > similar that contacts the solr server for the search result and the
> > handle the response (e.g. apply custom xsl to the results).
> > 
> > 
> > 
> >> 
> >> This manually procedure when I update the xml docs on exampledocs folder
> >> inside distribution package restrict it to exampledocs itself
> > 
> > No, either copy the jar to the folder where you have your documents or
> > add it to the PATH.
> > 
> >> ...I am not
> >> getting a way where my sites text get searched by solr...Do I need to
> >> copy
> >> start.jar and relevant folders in my working directory for web
> >> application.
> > 
> > Hmm, it seems that you not have understood the second paragraph of 
> > http://wiki.apache.org/solr/mySolr
> > 
> > "Typically it's not recommended to have your front end users/clients
> > hitting Solr directly as part of an HTML form submit ... the more
> > conventional way to think of it is that Solr is a backend service, which
> > your application can talk to over HTTP ..."
> > 
> > Meaning you have two different server running. Alternatively you can run
> > solr in the same tomcat as you application. If you follow SolrTomcat
> > from the wiki it will be install as "solr" servlet. Your application
> > will then communicate with this serlvet.
> > 
> > salu2
> > 
> >> 
> >> any help?
> >> 
> >> Thorsten Scherler-3 wrote:
> >> > 
> >> > On Wed, 2008-02-13 at 03:42 -0800, newBea wrote:
> >> >> Hi Thorsten,
> >> >> 
> >> >> I have my application running on 8080 port with tomcat 5.5.23I am
> >> >> starting solr on port 8983 with jetty server using command "java -jar
> >> >> start.jar".
> >> >> 
> >> >> Both the server gets started...now any search I make on tomcat
> >> >> application
> >> >> is interacting with solr very well. The problem is "schema.xml" and
> >> >> &q

Re: How do I secure solr server?

2008-02-21 Thread Thorsten Scherler
On Thu, 2008-02-21 at 01:46 -0500, Mel Brand wrote:
> Hi guys,
> 
> I run solr on a separate server from the application server and I'd
> like to know how to protect it. 

best with a firewall.

> I'd like to know how to prevent
> someone from communicating to the server and also prevent unauthorized
> access (through the web) to admin page.

I would not expose http://yourServer:8983 at all. I would use an Apache
httpd server as proxy and implement the ac there.

salu2

> 
> Any help is extremely appreciated!! :)
> 
> Thanks,
> 
> Mel
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: solr to work for my web application

2008-02-22 Thread Thorsten Scherler
On Fri, 2008-02-22 at 04:11 -0800, newBea wrote:
> Hi Thorsten,
> 
> Many thanks for ur replies so far...finally i set up correct environment for
> Solr. Its working:clap:

:)

Congrats, glad you got it running.

> 
> Solr Rocks!

Indeed. :)

salu2

> 
> Thorsten Scherler wrote:
> > 
> > On Thu, 2008-02-14 at 23:16 -0800, newBea wrote:
> >> Hi Thorsten...
> >> 
> >> SOrry for giving u much trouble but I need some answer regarding
> >> solr...plz
> >> help...
> >> 
> >> Question1
> >> I am using tomcat 5.5.23 so for JNDI setup of solr, adding solr.xml with
> >> context fragment as below in the tomcat5.5/...catalina/localhost.
> >> 
> >> <Context docBase="..." crossContext="true">
> >>   <Environment name="solr/home" type="java.lang.String"
> >>    value="D:/Projects/csdb/solr" override="true" />
> >> </Context>
> >> 
> >> Is it the correct way of doing it? 
> > 
> > Yes as I understand the wiki page.
> > 
> >> Or do I need to add context fragment in
> >> the server.xml of tomcat5.5?
> >> 
> >> Question2
> >> I am starting solr server using start.jar from another location on C:
> >> drive...whereas my home location indicated on D: drive. Is it the root
> >> coz I
> >> am not getting the search result?
> > 
> > Hmm, as I understand it you are starting two instance of solr! One as a
> > tomcat and the other as jetty. Why do you want that? If you have solr on
> > tomcat you do not need the jetty anymore. I does make 0 sense under
> > normal circumstances to do this.
> > 
> >> 
> >> Question3
> >> I have added the <dataDir> parameter as C:\solr\data in
> >> solrconfig.xml...
> > 
> > That seems to be wrong. It should read
> > <dataDir>${solr.data.dir:C:\solr\data}</dataDir> but I am not using
> > windows so I am not sure whether you may need to escape the path.
> > 
> > salu2
> > 
> >> but the indexes are not getting stored there...indexes for
> >> search are getting stored in the default dir of solr...any suggestions
> >> 
> >> Thanks in advance...
> >> 
> >> 
> >> Thorsten Scherler wrote:
> >> > 
> >> > On Wed, 2008-02-13 at 05:04 -0800, newBea wrote:
> >> >> I havnt used luke.xsl. Ya but the link provided by u gives me "Solr
> >> Luke
> >> >> Request Handler Response"...
> >> >> 
> >> >> <uniqueKey> is simple string as: csid
> >> > 
> >> > So you have:
> >> > <uniqueKey>csid</uniqueKey>
> >> > 
> >> > and
> >> > <field name="csid" ... required="true" /> 
> >> > 
> >> > 
> >> >> 
> >> >> till now I am updating docs thru command prompt as : post.jar *.xml
> >> >> http://localhost:8983/update
> >> > 
> >> > how do the docs look like? I mean since you changed the sample config
> >> > you send changed documents as well, right? How do they look?
> >> > 
> >> >> 
> >> >> I am not clear on how do I post xml docs
> >> > 
> >> > Well like you said, with the post.jar and then you will send your
> >> > modified docs but there are many ways to trigger an add command to
> >> solr.
> >> > 
> >> >>  or wud xml docs be posted while I
> >> >> request solr thru tomcat at the time of searching text...
> >> > 
> >> > To search text from tomcat you will need to have a servlet or something
> >> > similar that contacts the solr server for the search result and the
> >> > handle the response (e.g. apply custom xsl to the results).
> >> > 
> >> > 
> >> > 
> >> >> 
> >> >> This manually procedure when I update the xml docs on exampledocs
> >> folder
> >> >> inside distribution package restrict it to exampledocs itself
> >> > 
> >> > No, either copy the jar to the folder where you have your documents or
> >> > add it to the PATH.
> >> > 
> >> >> ...I am not
> >> >> getting a way where my sites text get searched by solr...Do I need to
> >> >> copy
> >> >> start.jar and relevant folders in my working directory for web
> >> >> application.
> >> > 
> >> > Hmm, it seems that you not have understood the second paragraph of 
> >> > http://wiki.apache.org/solr/mySolr
> >> > 
> >> > "Typically it's not recommended to have your

Re: out of memory every time

2008-03-03 Thread Thorsten Scherler
On Mon, 2008-03-03 at 21:43 +0200, Justin wrote:
> I'm indexing a large number of documents.
> 
> As a server I'm using the /solr/example/start.jar
> 
> No matter how much memory I allocate it fails around 7200 documents.

How do you allocate the memory?

Something like:
java -Xms512M -Xmx1500M -jar start.jar

You may have a closer look as well at
http://java.sun.com/j2se/1.5.0/docs/guide/vm/gc-ergonomics.html
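
If in doubt whether the flags were picked up, you can ask the VM what it
actually got (a quick check of my own, nothing solr specific):

public class MaxMem {
    public static void main(String[] args) {
        // prints the -Xmx ceiling the VM is really running with
        System.out.println("max heap: "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");
    }
}

Run it with the same flags you pass to start.jar and compare.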

HTH

salu2

> I am committing every 100 docs, and optimizing every 300.
> 
> all of my xml's contain on doc, and can range in size from 2k to 700k.
> 
> when I restart the start.jar it again reports out of memory.
> 
> 
> a sample document looks like this:
> 
> 
>  
>   1851
>   TRAJ20
>   12049
>name="ft:external_ids.SourceAccession:15532">ENSG0211869
>   28735
>   HUgn28735
>   TRA_
>   TRAJ20
>   9953837
>name="ft:external_ids.SourceAccession:15538">ENSG0211869
>   T cell receptor alpha
> joining 20
>   14q11.2
>   14q11
>   14q11.2
>   AE000662.1
>   M94081.1
>   CH471078.2
>   NC_14.7
>   NT_026437.11
>   NG_001332.2
>   8188290
>   The human T-cell receptor
> TCRAC/TCRDC (C alpha/C delta) region: organization,sequence, and evolution
> of 97.6 kb of DNA.
>   Koop B.F.
>   Rowen L.
>   Hood L.
>   Wang K.
>   Kuo C.L.
>   Seto D.
>   Lenstra J.A.
>   Howard S.
>   Shan W.
>   Deshpande P.
>   31311_at
>   
> 
> 
> 
> 
> the schema is (in summary):
> 
> <field name="..." ... multiValued="false" omitNorms="true"/>
> <field name="..." ... multiValued="true"  omitNorms="true"/>
> 
> <field name="..." ... stored="true"  omitNorms="true"/>
> <field name="..." ... omitNorms="true"/>
> 
> 
> 
> <uniqueKey>PK</uniqueKey>
> <defaultSearchField>text</defaultSearchField>
> 
> 
> 
> 
> 
> 
> and my conf is:
> <useCompoundFile>false</useCompoundFile>
> <mergeFactor>100</mergeFactor>
> <maxBufferedDocs>900</maxBufferedDocs>
> <maxMergeDocs>2147483647</maxMergeDocs>
> <maxFieldLength>1
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Beginner questions: Jetty and solr with utf-8 + cached page + dedup

2008-03-26 Thread Thorsten Scherler
On Tue, 2008-03-25 at 10:56 -0700, Vinci wrote:
> Hi,
> 
> Thank for your reply.
> Question for apply xslt: If I use saxon, where should the saxon.jar located
> if I using the example jetty server? lib/ inside example/ or outside the
> example/?

http://wiki.apache.org/solr/mySolr
"...
Typically it's not recommended to have your front end users/clients
hitting Solr directly as part of an HTML form submit
..."

In the above page there you find answers to many of your questions.

HTH

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



search engine for regional bulletins

2006-11-28 Thread Thorsten Scherler
Hi all,

I am developing a search engine for a governmental body. This search
engine has to index pure xml documents which follow a custom xml schema.
The xml documents contain information about laws and official
announcements for Andalusia.

I need to implement different filters for the search. The current search
engine, which can be found here [1], would need to be extended with
ranges on organizational bodies, kind of announcement (law,
resolution, ...), ...

I played a bit with Nutch 0.8 and asked myself whether it is the best
tool for the task. I got nutch to index the xml documents and I can
search the index as well, but I would need to add filter conditions for
the search. The alternative I see would be pure lucene, since I am not
really "crawling" the site: the documents are not linked with each
other; instead all the files (which have to be indexed) are put in the
urls/bulletin file. Then Zaheed pointed me to Solr and I played around
a wee bit. 

To give you a better impression of the underlying architecture and xml
documents, each weekday there is a new bulletin (containing approx. 100
- 200 pages) eg [2]. This bulletin is stored on the file system and needs
to be indexed. 

We have two different document types: summaries and dispositions. The
summary looks like:

  1. DISPOSICIONES GENERALES
  
 Decreto
  178/2006, de 10 de octubre, por el que se establecen normas de
  protección de la avifauna para las instalaciones eléctricas de
  alta tensión
  
  

  Resolución de 10 de octubre de 2006, de la Dirección General de
  Tesorería y Deuda Pública, por la que se realiza una
  convocatoria de subasta de carácter ordinario dentro del
  Programa de Emisión de Bonos y Obligaciones de la Junta de
  Andalucía.
  


Following the tutorial and looking at the examples it seems that solr
only supports one document type. 


<add><doc>
  <field name="id">3007WFP</field>
  <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
  ...
</doc></add>


The root element <add> is "just" the command telling the server that we
want to add the document. Does that mean I would need to stick with this
doctype and transform our internal format for adding the document
information?

Further, since the project is for a customer, I would need a released
version when I put my engine in production. When does this community
expect to make its first release, or better asked, what are the
blockers?

TIA for any information.

salu2

[1] http://andaluciajunta.es/portal/aj-bojaBuscador/0,22815,,00.html 
[2]
http://andaluciajunta.es/portal/boletines/2006/11/aj-bojaVerPagina-2006-11/0,23167,bi%253D693228039889,00.html



Re: search engine for regional bulletins

2006-11-28 Thread Thorsten Scherler
On Tue, 2006-11-28 at 10:00 +0100, Bertrand Delacretaz wrote:
> Hi Thorsten, good to see you here!

:)

Hi Bertrand, thanks very much for this warm welcome and I am as well
glad to meet you here.

> 
> On 11/28/06, Thorsten Scherler
> <[EMAIL PROTECTED]> wrote:
> 
> > ...Following the tutorial and looking at the examples it seems that solr
> > only supports one document type.
> >
> > <add><doc>
> >   <field name="id">3007WFP</field>
> >   <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
> >   ...
> > </doc></add>
> 
> That's right, to add documents to a Solr index you need to transform
> them to this model. You're basically creating fields to be indexed,
> and the Solr schema.xml allows you to define precisely how you want
> each field to be indexed, including strict data types, pluggable
> Lucene analyzers, etc.
> 
> This means some work in converting your content model to an "indexing
> model", but it's very worth it as it gives you very precise control
> about what you index and how.
> 

Yeah, I thought about it last night and I came to the same conclusion.
The "extra" work involved is "just" a xsl transformation in my use case,
so not really the biggest part of this project.
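
For what it's worth, that conversion step is only a few lines of JAXP.
A sketch of my own (bulletin2solr.xsl is an assumed stylesheet name; it
would map our bulletin schema to solr's <add><doc> format):

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class Bulletin2Solr {
    public static void main(String[] args) throws Exception {
        // compile the stylesheet that maps bulletin xml -> solr add command
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new File("bulletin2solr.xsl")));
        // args[0] is the bulletin document to convert
        t.transform(new StreamSource(new File(args[0])),
                    new StreamResult(new File(args[0] + ".solr.xml")));
    }
}

The resulting file can then be posted to solr's update url as usual.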

> > ...Further since the project is for a customer I would need a released
> > version when I put my engine in production. When does this community
> > expect to make its first release, or better asked which are the
> > blockers?...
> 
> I'm relatively new here so I'll let others complete this info, but
> IIUC the only work needed to do a first release is to make sure all
> source files are "clean" w.r.t required Apache license notices. I
> don't think there are any technical blockers for a release, many of us
> are happily using Solr on production sites.

That is good to hear. So if somebody (e.g. me) checked all files for
cleanliness, then we could release, right? Perfect.

> 
> You might want to look at these links for more info:
>   http://wiki.apache.org/solr/SolrResources
>   http://wiki.apache.org/solr/PublicServers

Thanks very much Bertrand, I will look at this information. I am still
evaluating what is best for this project, but solr sounds very
interesting ATM. 

salu2
> 
> -Bertrand



Re: search engine for regional bulletins

2006-11-28 Thread Thorsten Scherler
On Tue, 2006-11-28 at 11:30 -0500, Yonik Seeley wrote:
> On 11/28/06, Thorsten Scherler
> <[EMAIL PROTECTED]> wrote:
> > That is good to hear, so if somebody (e.g. me) would check all files for
> > cleanness then we could release, right? Perfect.
> 
> Correct.  All IP issues have been cleared, so It's just a matter of
> taking the time to put the release into a form that will be accepted
> by the incubator.  I expect we will be making a release candidate
> within a few weeks.  Of course the incubator guys always finds
> problems,  so getting an actual release out takes longer.
> 

Yeah, I have been in the incubator with lenya and we gained some valuable
experience back then. Further I see many committers here with experience
in different Apache PMCs, so hopefully we get it straight right away and
the incubator PMC does not find many issues.

I will try to help the best I can.

> -Yonik

Thanks Yonik.

salu2




solr index reusable with nutch?

2006-12-13 Thread Thorsten Scherler
Hi all,

is it possible to directly use the solr index in nutch?

My client is creating a portal search based on nutch. In this portal
there is my project as well, and ATM I prefer to go with solr instead of
nutch since it is much better for my use case.

Now the question is whether the portal search engine could use the solr
index for my part of the portal.

Can somebody point me to related documentation?

TIA

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)



Re: solr index reusable with nutch?

2006-12-13 Thread Thorsten Scherler
On Wed, 2006-12-13 at 07:45 -0800, Otis Gospodnetic wrote:
> Hi,
> 
> Solr should be able to search any Lucene index,

ok, good to know. :) 

So can I guess that the same is true for nutch? Meaning the index solr
is creating could be used by a nutch searcher.

>  not just those created by Solr itself, as long as you configure it properly 
> via schema.xml.  

http://wiki.apache.org/solr/SchemaXml?highlight=%28schema%29

> Thus, you should be able to use Solr to search an index created by Nutch. 

In my use case I need the reverse. Nutch searches the index created by
my solr application. The application is just one component in the portal
and the portal will provide a "global" search engine which should use
the index from solr.

>  Haven't tried it.  It would be nice if you could contribute the 
> configuration for doing this.
> 

As I figure it out I will keep you informed.

Thanks for the feedback.

salu2

> Otis
> 
> - Original Message 
> From: Thorsten Scherler <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, December 13, 2006 8:26:51 AM
> Subject: solr index reusable with nutch?
> 
> Hi all,
> 
> is it possible to directly use the solr index in nutch?
> 
> My client is creating a portal search based on nutch. In this portal
> there is as well my project and ATM I prefer to go with solr instead of
> nutch since it its much better for my use case.
> 
> Now the question is whether the portal search engine could use the solr
> index for my part of the portal.
> 
> Can somebody point me to related documentation?
> 
> TIA
> 
> salu2



Re: solr index reusable with nutch?

2006-12-15 Thread Thorsten Scherler
On Thu, 2006-12-14 at 11:14 -0800, Chris Hostetter wrote:
> : In my use case I need the reverse. Nutch searches the index created by
> : my solr application. The application is just one component in the portal
> : and the portal will provide a "global" search engine which should use
> : the index from solr.
> 
> If you have a compatible schema, then it should be possible ... but if
> your goal is to make an index with a biz object specific schema and then
> use it as a single collection/source in a nutch installation, that may not
> sork ... 

Yeah, that makes sense. 

> i'm not sure how flexible Nutch is about the indexes it can
> hanlde: it's probably a question best asked on the Nutch user list.
> 

Yeah, you are right.

Thanks for the feedback.

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)



Re: solr index reusable with nutch?

2006-12-20 Thread Thorsten Scherler
On Thu, 2006-12-14 at 11:14 -0800, Chris Hostetter wrote:
> : In my use case I need the reverse. Nutch searches the index created by
> : my solr application. The application is just one component in the portal
> : and the portal will provide a "global" search engine which should use
> : the index from solr.
> 
> If you have a compatible schema, then it should be possible ... but if
> your goal is to make an index with a biz object specific schema and then
> use it as a single collection/source in a nutch installation, that may not
> sork ... i'm not sure how flexible Nutch is about the indexes it can
> hanlde: it's probably a question best asked on the Nutch user list.

I did some testing with nutch searching over a solr index. Like Chris
said, a "compatible schema" is the only important point on this issue.

To put it in other words, nutch by default searches the "content" field
and returns some fields by default. So if you are not keen to
write your own nutch plugin for your custom solr schema, just make sure
that you use the field name="content" to store your main text. You can
further enhance the integration by using the "nutch" names for
"important" fields. 

Further I have a url field in my schema and it is the only field
that I see in the response of nutch.

sh bin/nutch org.apache.nutch.searcher.NutchBean presidencia
Total hits: 3
 0 null//2006/209/disposition/19923-a.html

 1 null//2006/209/disposition/20246-a.html

 2 null//2006/209/disposition/20034-a.html

This is good enough for my client and me since I can transform that
afterward. :)

Thanks Chris and Otis for your feedback.

salu2

> 
> 
> 
> 
> -Hoss
> 



Re: Realtime directory change...

2006-12-22 Thread Thorsten Scherler
On Thu, 2006-12-21 at 12:23 -0800, escher2k wrote:
> Hi,
>   We currently use Lucene to do index user data every couple of hours - the
> index is completely rebuilt,
> the old index is archived and the new one copied over to the directory.
> Example -
> 
> /bin/cp ${LOG_FILE} ${CRON_ROOT}/index/help/
> /bin/rm -rf ${INDEX_ROOT}/archive/help.${DATE}
> /bin/cp -R ${CRON_ROOT}/index/help ${INDEX_ROOT}/help.new
> /bin/mv ${INDEX_ROOT}/help ${INDEX_ROOT}/archive/help.${DATE}
> /bin/mv ${INDEX_ROOT}/help.new ${INDEX_ROOT}/help
> 
> This works fine since the index is retrieved every time from the disk. Is it
> possible to do the same with Solr ? 
> Assuming we also use caching to speed up the retrieval, is there a way to
> invalidate some/all caches when
> this done ?
> 

Did you look into 
http://wiki.apache.org/solr/CollectionDistribution
http://wiki.apache.org/solr/SolrCollectionDistributionScripts
http://wiki.apache.org/solr/SolrCollectionDistributionOperationsOutline

I am still very new to solr but it sounds like it is exactly what you
need (as others have said as well). 

HTH

salu2


> Thanks.
> 



Re: Help with spellchecker integration

2006-12-22 Thread Thorsten Scherler
On Thu, 2006-12-21 at 21:27 -0800, Otis Gospodnetic wrote: 
> Hi,
> I'm trying to integrate the Lucene-based spellchecker 
> (http://wiki.apache.org/jakarta-lucene/SpellChecker + contrib/spellchecker 
> under Lucene) with Solr (http://issues.apache.org/jira/browse/SOLR-81) in 
> order to provide a query spellchecking service (you enter Speers and it 
> suggest pant^H^H ... Spears).  I've created a generic NGramTokenizer (+ 
> NGramTokenizerFactory + unit test) that I'll attach to SOLR-81 shortly.
> 
> What I'm not yet sure about is:
> 1) integration of this generic n-grammer with that Lucene SpellChecker code - 
> SpellChecker & TRStringDistance classes in particular.

Hmm, reading SOLR-81, you actually have everything you need.

> 2) mapping n-gram Tokens that come out of my NGramTokenizer to specific field 
> names, like 3start, 4start, gram1, gram2, gram3 is there is scheme.xml 
> trick one can use to accomplish this?

It is in the issue:
...
[the schema.xml snippet from the issue was stripped by the list archive]

The above shows how to configure the second (spellcheck) index; however,
if you want to update both indexes at the same time you need to write
your own implementation of the update servlet.

> 3) once 2) is done, getting the request handler(?) to n-gram the query 
> appropriately and hit the SpellChecker index to try and find alternative 
> spelling suggestions.

hmm, not sure, actually IMHO that highly depends on how you plan to use
it in the end. I mean there is more than one way to use spell check.

In the issue they talked about AJAX suggestions, but that would be IMO
before the actual search request. If you want to have it in the request
handler then you need to decide how and when the spellchecker comes into
play.

Like whether it runs only if the normal search does not return a result,
or in parallel. Parallel would search in the spell check index for
alternatives, use these alternatives to dispatch the alternative word
query and later on parse the result directly into the output writer. Here
you have again different alternatives, e.g. you can attack the solr index
directly (losing all the cool features) 

Or you want the google thingy "Did you mean".

... in any form, 
start with:

public class NGramRequestHandler extends StandardRequestHandler
    implements SolrRequestHandler, SolrInfoMBean {
  public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
    // Depending on the use case do your processing here
  }
}

This way you just need to implement the class specific methods. 


> 
> Damn, that's a lot of unknowns... on top of that my computer started freezing 
> every half an hour.  Hi Murphy.
> 
> 
> 
> Any pointers will be greatly appreciated. Thanks,

HTH a wee bit.

salu2

> Otis
> 
> 
> 



Re: Solr 1.1 released

2006-12-25 Thread Thorsten Scherler
On Fri, 2006-12-22 at 17:07 -0500, Yonik Seeley wrote:
> Solr 1.1 is now available for download! 

Very nice. :)

Thanks a lot to this community and especially to Yonik who packed the
release.

salu2



Is there a BasicSummarizer for solr?

2007-01-02 Thread Thorsten Scherler
Hi all,

I need to implement a summary function with solr like there is in nutch.
Basically it returns x words before and after the query term to show the
content where the term is embedded (like google does).

In nutch this functionality is provided by 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary-basic/
and especially 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary-basic/src/java/org/apache/nutch/summary/basic/BasicSummarizer.java?view=markup

There is another similar plugin/class in
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary-lucene/

Is there something similar in solr?

If not which is the best way to implement this functionality?

TIA for any tips.

salu2



Re: Is there a BasicSummarizer for solr?

2007-01-02 Thread Thorsten Scherler
On Tue, 2007-01-02 at 08:14 -0500, Erik Hatcher wrote:
> Thorsten - there is support for the Lucene Highlighter built into  
> Solr.  You can see details of how to use it here:
> 
>   <http://wiki.apache.org/solr/HighlightingParameters>
> 
>Erik
> 

:)  

Cheers Erik, with this information and a small change in my schema
(I changed stored="false" to stored="true" on my main content field), I
get exactly what I needed.

Now I have to see the effect of storing the content in the index
regarding size and response time.

Thanks again.

salu2

> 
> On Jan 2, 2007, at 7:26 AM, Thorsten Scherler wrote:
> 
> > Hi all,
> >
> > I need to implement a summary function with solr like there is in  
> > nutch.
> > Basically it returns x words before and after the query term to  
> > show the
> > content where the term is embedded (like as google does).
> >
> > In nutch this functionality is provided by
> > http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary- 
> > basic/
> > and especially
> > http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary- 
> > basic/src/java/org/apache/nutch/summary/basic/BasicSummarizer.java? 
> > view=markup
> >
> > There is another similar plugin/class in
> > http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary- 
> > lucene/
> >
> > Is there something similar in solr?
> >
> > If not which is the best way to implement this functionality?
> >
> > TIA for any tips.
> >
> > salu2
> 



How to tell the highlighter not to escape?

2007-01-02 Thread Thorsten Scherler
Hi all,

I am playing around with the highlighter and found that all highlight
terms get escaped.

I mean solr will return 
 &lt;em&gt;TERM&lt;/em&gt; and not
 <em>TERM</em> 

I am not sure where this escaping is happening but I would need the
highlighting to NOT escape the hl.simple.pre and hl.simple.post tags
since it is horror to work with cdata sections in xsl.

I had a look in the lucene highlighter and it seems that it does not
escape the tags.

Can somebody point me to the code which is responsible for escaping and
maybe give me a tip on how I can patch it to make this configurable. 

TIA

salu2



Re: How to tell the highlighter not to escape?

2007-01-03 Thread Thorsten Scherler
On Wed, 2007-01-03 at 02:16 +, Edward Garrett wrote:
> thorsten,
> 
> see the following for discussion. your case is indeed an annoyance--the
> thread below discusses motivations for it and ways of working around it. (i
> too confess that i wish it were not so.)
> 
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg01483.html

Thanks Edward, the problem with the suggestion in the above thread is
that:
"just create an XSL that
generates XML and unescapes the fields you know will contain wellformed
XML data -- then apply your second transform client side"

Is not possible with xsl. See e.g. 
http://www.biglist.com/lists/xsl-list/archives/200109/msg00318.html
"> How can I match the Cdata Section?!?
>
You can't, the XPath data model regards CDATA as merely an input shortcut,
not as an information-bearing part of the XML content. In other words,
"" and "x" look exactly the same to the XSLT processor.

Mike Kay"

Michael Kay is the xsl guru and I can say as well from my own experience
that one would need to write a custom parser, since <em>TERM</em>
is equal to &lt;em&gt;TERM&lt;/em&gt; and this in xsl is a string (XPath
would match text()). 

IMO the highlighter should really return pure xml and not escape it. 
I will have a look in the XmlResponseWriter; maybe I find a way to change
this.

salu2


> 
> -edward
> 
> On 1/2/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> >
> > Hi Thorsten,
> >
> > The highlighter does not escape anything itself: you are seeing the
> > results of solr's automatic escaping of xml data within its xml
> > response.  This should be transparent (your xml decoder should
> > un-escape the values on the way out).  I'm not really familiar with
> > xslt so I'm unsure why that isn't so (perhaps it is automatically
> > html-escaping the values after un-xml-escaping them?)
> >
> > Be careful of documents containing html fragments natively.
> >
> > cheers,
> > -MIke
> >
> > On 1/2/07, Thorsten Scherler <[EMAIL PROTECTED]>
> > wrote:
> > > Hi all,
> > >
> > > I am playing around with the highlighter and found that all highlight
> > > terms get escaped.
> > >
> > > I mean solr will return
> > >  &lt;em&gt;TERM&lt;/em&gt; and not
> > >  <em>TERM</em> 
> > >
> > > I am not sure where this escaping is happening but I would need the
> > > highlighting to NOT escape the hl.simple.pre and hl.simple.post tag
> > > since it is horror to work with cdata sections in xsl.
> > >
> > > I had a look in the lucene highlighter and it seem that it does not
> > > escape the tags.
> > >
> > > Can somebody point me to code which is responsible for escaping and
> > > maybe give me a tip how I can patch to make it configurable.
> > >
> > > TIA
> > >
> > > salu2
> > >
> > >
> >
> 
> 
> 
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: How to tell the highlighter not to escape?

2007-01-03 Thread Thorsten Scherler
On Wed, 2007-01-03 at 12:06 +, Edward Garrett wrote:
> for what it's worth, i wrote a recursive template in xsl that replaces the
> escaped characters with actual elements. here, the variable $val would be
> the tag, e.g. "em". this has been working okay for me so far.

Yeah, many thanks for posting this template. This is actually
"imitating" a parser. 

However I still think the highlighter should return unescaped tags for
highlighting. There is IMO no benefit to the current behavior.

Thanks again Edward.

salu2

> 
> [the recursive xsl template was stripped by the list archive; only
> this fragment survives:
>  select="substring($insideEm, string-length($preEm)+5)"/> ]
> 
> On 1/3/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> >
> > On Wed, 2007-01-03 at 02:16 +, Edward Garrett wrote:
> > > thorsten,
> > >
> > > see the following for discussion. your case is indeed an annoyance--the
> > > thread below discusses motivations for it and ways of working around it.
> > (i
> > > too confess that i wish it were not so.)
> > >
> > > http://www.mail-archive.com/solr-user@lucene.apache.org/msg01483.html
> >
> > Thanks Edward, the problem is with the suggestion in the above thread is
> > that:
> > "just create an XSL that
> > generates XML and unescapes the fields you know will contain wellformed
> > XML data -- then apply your second transform client side"
> >
> > Is not possible with xsl. See e.g.
> > http://www.biglist.com/lists/xsl-list/archives/200109/msg00318.html
> > "> How can I match the Cdata Section?!?
> > >
> > You can't, the XPath data model regards CDATA as merely an input shortcut,
> > not as an information-bearing part of the XML content. In other words,
> > "" and "x" look exactly the same to the XSLT processor.
> >
> > Mike Kay"
> >
> > Michael Kay is the xsl guru and I can say as well from my own experience
> > one would need to write a custom parser since <em>TERM</em>
> > is equal to &lt;em&gt;TERM&lt;/em&gt; and this in xsl is a string (XPath
> > would match text()).
> >
> > IMO the highlighter should really return pure xml and not escape it.
> > I will have a look in the XmlResponseWriter maybe I find a way to change
> > this.
> >
> > salu2
> >
> >
> > >
> > > -edward
> > >
> > > On 1/2/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi Thorsten,
> > > >
> > > > The highlighter does not escape anything itself: you are seeing the
> > > > results of solr's automatic escaping of xml data within its xml
> > > > response.  This should be transparent (your xml decoder should
> > > > un-escape the values on the way out).  I'm not really familiar with
> > > > xslt so I'm unsure why that isn't so (perhaps it is automatically
> > > > html-escaping the values after un-xml-escaping them?)
> > > >
> > > > Be careful of documents containing html fragments natively.
> > > >
> > > > cheers,
> > > > -MIke
> > > >
> > > > On 1/2/07, Thorsten Scherler <
> > [EMAIL PROTECTED]>
> > > > wrote:
> > > > > Hi all,
> > > > >
> > > > > I am playing around with the highlighter and found that all
> > highlight
> > > > > terms get escaped.
> > > > >
> > > > > I mean solr will return
> > > > >  &lt;em&gt;TERM&lt;/em&gt; and not
> > > > >  <em>TERM</em> 
> > > > >
> > > > > I am not sure where this escaping is happening but I would need the
> > > > > highlighting to NOT escape the hl.simple.pre and hl.simple.post tag
> > > > > since it is horror to work with cdata sections in xsl.
> > > > >
> > > > > I had a look in the lucene highlighter and it seem that it does not
> > > > > escape the tags.
> > > > >
> > > > > Can somebody point me to code which is responsible for escaping and
> > > > > maybe give me a tip how I can patch to make it configurable.
> > > > >
> > > > > TIA
> > > > >
> > > > > salu2
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > --
> > thorsten
> >
> > "Together we stand, divided we fall!"
> > Hey you (Pink Floyd)
> >
> >
> >
> 
> 
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




[ANN] Apache Forrest/Cocoon based solr client plugin

2007-01-07 Thread Thorsten Scherler
Hi all,

I am happy to announce that I just added an Apache Forrest based Apache
Solr client plugin to the forrest whiteboard. It may be of interest
for those using Apache Cocoon based Apache Forrest and Apache Lucene
based Apache Solr.

org.apache.forrest.plugin.output.solr generates Apache Solr documents
from Apache Forrest xdos. Further when run with the Apache Forrest
Dispatcher it provides a GUI to manage your project in solr and a search
interface to search your solr server.

The documentation and a couple of screenshots can be found at
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/

The source code can be found at
http://svn.apache.org/viewvc/forrest/trunk/whiteboard/plugins/org.apache.forrest.plugin.output.solr/

Have fun with it and please provide feedback to this list.

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: Seeking FAQs

2007-01-08 Thread Thorsten Scherler
On Sat, 2007-01-06 at 10:25 -0500, David Halsted wrote:
> I wonder what would happen if we used a clustering engine like Carrot
> to categorize either the e-mails in the archive or the results of
> searches against them?  Perhaps we'd find some candidates for the FAQ
> that way.

Not sure about tools, but IMO this works fine when done by
users/committers. I think the one who asked the question on the list is
a likely candidate to add an entry in the FAQ.

The typical scenario should be:
user asks question -> user get answers from community -> user adds FAQ
entry with the solution that worked for her

This way the one asking the question can give a little something back to
the community.

If you follow the lists a bit one can identify some faq's right away:
- Searching multiple indices 
- Clustering solr (custom scorer, highlighter, ...)
- ...


> 
> Dave
> 
> On 1/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> > Hey everybody,
> >
> > I was lookin at the FAQ today, and I realized it hasn't really changed
> > much in the past year ... in fact, only two people besides myself have
> > added questions (thanks Thorsten and Darren) in the entire time Solr
> > has been in incubation -- which is not to say that Erik and Respaldo's
> > efforts to fix my typo's aren't equally helpful :)
> >
> > http://wiki.apache.org/solr/FAQ
> >
> > In my experience, FAQs are one of the few pieces of documentation that are
> > really hard for developers to write, because we are so use to dealing with
> > the systems we work on, we don't allways notice when a question has been
> > asked more then once or twice (unless it gets asked over and over and
> > *over*).  The best source of FAQ updates tend to come from users who have
> > a question, and either find the answer in the mailing list archives, or
> > notice the same question asked by someone else later.
> >

Yes, I totally agree. Sometimes the content for the solution can be
found in the wiki. One would just need to link to the wiki page from the
FAQ.

> > So If there are any "gotchas" you remember having when you first started
> > using Solr, or questions you've noticed asked more then once please feel
> > free to add them to the wiki.  The Convention is to only add a question if
> > you're also adding an answer, but even if you don't think a satisfactory
> > answer has ever been given, or you're not sure how to best summarize
> > multiple answers given in the past, just including links to
> > instances in the mailing list archives where the question was asked is
> > helpful -- both in the short term as pointers for people looking for help,
> > and in the long term as starter points for people who want to flesh out a
> > detailed answer.
> >

In the long run, wiki content that has proven to be a solution should
IMO go directly into the official documentation. 

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: newbie question on determining fieldtype

2007-01-08 Thread Thorsten Scherler
On Mon, 2007-01-08 at 10:29 -0300, mike topper wrote:
> Hi,
> 
> I have a question that I couldn't find the exact answer to. 
> 
> I have some fields that I want to add to my schema but will never be 
> searched on.  They are only used as additional information about a 
> document when retrieved.  They are integers, so should i just have the 
> field be:
> 
> <field name="..." type="integer" indexed="false" stored="true"/>
> 
> I'm pretty sure this is right, but I just wanted to check that I'm not 
> missing any speedups from using a different field
> or adding some other parameters.
> 

Seems pretty right to me.

Did you read 
http://wiki.apache.org/solr/SchemaXml

and see the comments there on the indexed and stored attributes?

HTH
salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: Performance tuning

2007-01-11 Thread Thorsten Scherler
On Thu, 2007-01-11 at 14:57 +, Stephanie Belton wrote:
> Hello,
> 
>  
> 
> Solr is now up and running on our production environment and working great. 
> However it is taking up a lot of extra CPU and memory (CPU usage has doubled 
> and memory is swapping). Is there any documentation on performance tuning? 
> There seems to be a lot of useful info in the server output but I don’t 
> understand it.
> 
>  
> 
> E.g.
> filterCache{lookups=0,hits=0,hitratio=0.00,inserts=537,evictions=0,size=337,cumulative_lookups=4723,cumulative_hits=3708,cumulative_hitratio=0.78,cumulative_inserts=4647,cumulative_evictions=72}
> 
> 
> queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=256,evictions=0,size=256,cumulative_lookups=3779,cumulative_hits=552,cumulative_hitratio=0.14,cumulative_inserts=3632,cumulative_evictions=0}
> 
> 
> documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=66005,cumulative_hits=2460,cumulative_hitratio=0.03,cumulative_inserts=63545,cumulative_evictions=4195}
> 
>  
> 
> etc. what should I be watching out for?
> 

Hi Stephanie,

did you see http://wiki.apache.org/solr/SolrPerformanceFactors?

Further you may consider balancing the load via
http://wiki.apache.org/solr/CollectionDistribution

HTH

salu2

>  
> 
> Thanks
> 
> Stephanie
> 



Re: How can I update a specific field of an existing document?

2007-01-11 Thread Thorsten Scherler
On Thu, 2007-01-11 at 10:19 -0600, Iris Soto wrote:
> Hello everybody,
> I want update a specific field in a document, but i don't find how do it 
> in the documentation of Solr.
> Is that posible?, I need to index only a field for a document, Do i have 
> to index all the document for this?
> The problem is that i have to transform a bizdata object to a file 
> content xml in java,  i should to build all the document xml step by 
> step, field by field, retrieving all the bizdata of database to be 
> passed to Solr.
> 

On Thu, 2007-01-11 at 06:43 -0500, Erik Hatcher wrote:
> In Lucene to update a document the operation is really a delete  
> followed by an add.  You will need to add the complete document as  
> there is no such "update only a field" semantics in Lucene. 

This is from a thread in the dev list.

So no, it is not possible to just update one field.

HTH

salu2

> Thanks in advance.
> 



Re: How can I update a specific field of an existing document?

2007-01-11 Thread Thorsten Scherler
On Thu, 2007-01-11 at 17:48 +0100, Thorsten Scherler wrote:
> On Thu, 2007-01-11 at 10:19 -0600, Iris Soto wrote:
> > Hello everybody,
> > I want update a specific field in a document, but i don't find how do it 
> > in the documentation of Solr.
> > Is that posible?, I need to index only a field for a document, Do i have 
> > to index all the document for this?

No, just the one document. Let's say you have a CMS and you edit one
document. You will need to re-index only this document, by using the
solr add statement for the whole document (not one field only).
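
A minimal sketch of such a re-index from java (my own illustration; the
field names and the id value are made up):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ReindexOneDoc {
    public static void main(String[] args) throws Exception {
        // the complete document, not just the changed field
        String doc = "<add><doc>"
            + "<field name=\"id\">42</field>"
            + "<field name=\"content\">the edited text</field>"
            + "</doc></add>";
        HttpURLConnection con = (HttpURLConnection)
            new URL("http://localhost:8983/solr/update").openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = con.getOutputStream();
        out.write(doc.getBytes("UTF-8"));
        out.close();
        System.out.println("update status: " + con.getResponseCode());
        // post a <commit/> the same way to make the change searchable
    }
}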

> > The problem is that i have to transform a bizdata object to a file 
> > content xml in java,  i should to build all the document xml step by 
> > step, field by field, retrieving all the bizdata of database to be 
> > passed to Solr.

see above, only for the document where the fields are changed. I wrote a
small cocoon based plugin in forrest doing the cms related example.

It adds a document related solr gui for a cms like system. Maybe that
gives you some ideas for your own app.


> > 
> 
> On Thu, 2007-01-11 at 06:43 -0500, Erik Hatcher wrote:
> > In Lucene to update a document the operation is really a delete  
> > followed by an add.  You will need to add the complete document as  
> > there is no such "update only a field" semantics in Lucene. 
> 
> This is from a thread in the dev list.

could not access the archive the first time:
http://www.nabble.com/forum/ViewPost.jtp?post=8275908&framed=y

HTH

salu2

> 
> So no it is not possible to just update one field.
> 
> HTH
> 
> salu2
> 
> > Thanks in advance.
> > 
> 
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: [ANN] Apache Forrest/Cocoon based solr client plugin

2007-01-10 Thread Thorsten Scherler
On Tue, 2007-01-09 at 22:50 -0500, Yonik Seeley wrote:
> Thanks Thorsten,
> 
> Knowing nothing about cocoon and little about forrest, I'm not sure
> exactly what this does :-)
> 

jeje, fair enough. 

You know forrest from the solr webpage. What I did is a small generic
way to access the solr server with cocoon/forrest. 

What it does is mainly solving (basic) SOLR-20 & SOLR-30 for cocoon. You
can update and select content from the solr server connecting to the
http interface. 

The nice thing is the power of cocoon that Bertrand is always talking
about. ;) We use the output of the solr server as is and use it in the
transformation pipeline. 

The update interface is 
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/images/gui-actionbar.png
and it returns a small success/error page (depending on the solr
response). This interface is halfway url specific (add and delete) and
you can execute the commit and optimize commands on every page.

It is based on the solr generator which is a wrapper of  
http://svn.apache.org/viewvc/forrest/trunk/whiteboard/plugins/org.apache.forrest.plugin.output.solr/src/java/org/apache/forrest/http/client/PostFile.java?view=markup

which is a simple class to post a file from one url to another. The
response body is provided as stream and as string. I wrote this simple
class since the patches of SOLR-20 & SOLR-30 are not yet applied. 

> I'll take a guess in non-cocoon/forrest speech: does it allow you to
> update a Solr server with the content of your website at the same time
> you generate (or change) the site?

Well, it is not working so far in the static build, meaning
"forrest" (not sure ATM why myself), which would do exactly what you say
regarding generating the site. In "forrest run", the dynamic mode of
forrest, however, it lets ...


>So it's a push model of web
> indexing instead of spidering? 

Exactly. 

To finish the above sentence: ... it lets you push update commands to
the server based on each selected page.

>  The search-box I understand, but
> presumably that needs to point to a running Solr server somewhere.

Yes.
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/index.html
"...
The host server urls can be configured by adding the following
properties to your project forrest.properties.xml in case you do not use
the default values.

<property name="..." value="http://localhost:8983/solr/select"/>
<property name="..." value="http://localhost:8983/solr/update"/> 
..."

The forrest.properties.xml is new in 0.8-dev.

The result will be transformed to something like:
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/images/result.png

I added a transformer that adds the paginator part to the solr select result. 
The paginator is the "Result pages" part of the above screenshot. 

Hmm, that makes me wonder whether that (the paginator) would be better off
directly in solr core. 


wdyt?

salu2
> 
> -Yonik
> 
> On 1/7/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I am happy to announce that I just add a Apache Forrest based Apache
> > Solr client plugin to the forrest whiteboard. It may be from interest
> > for the ones using Apache Cocoon based Apache Forrest and Apache Lucene
> > based Apache Solr.
> >
> > org.apache.forrest.plugin.output.solr generates Apache Solr documents
> > from Apache Forrest xdos. Further when run with the Apache Forrest
> > Dispatcher it provides a GUI to manage your project in solr and a search
> > interface to search your solr server.
> >
> > The documentation and a couple of screenshots can be found at
> > http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/
> >
> > The source code can be found at
> > http://svn.apache.org/viewvc/forrest/trunk/whiteboard/plugins/org.apache.forrest.plugin.output.solr/
> >
> > Have fun with it and please provide feedback to this list.
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: XML querying

2007-01-15 Thread Thorsten Scherler
On Mon, 2007-01-15 at 12:23 +, Luis Neves wrote:
> Hello.
> What I do now to index XML documents it's to use a Filter to strip the 
> markup, 
> this works but it's impossible to know where in the document is the match 
> located.
> What would it take to make possible to specify a filter query that accepts 
> xpath 
> expressions?... something like:
> 
> fq=xmlField:/book/content/text()
> 
> This way only the "/book/content/" element was searched.
> 
> Did I make sense? Is this possible?

AFAIK short answer: no.

The field is ALWAYS plain text. There is no xmlField type.

...but why don't you just add your text in multiple fields when indexing?

Instead of plain stripping the markup, do the above xpath on your document
and create different fields. Like
 <field name="..." select="/book/content/text()"/>
 ...

Makes sense?
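
A quick java sketch of that idea, purely illustrative (the xpath is from
your example, everything else is assumed):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathFields {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(new File(args[0]));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // one xpath per solr field instead of stripping all markup
        String content = xpath.evaluate("/book/content", doc);
        // real code would xml-escape the value before printing
        System.out.println("<field name=\"content\">" + content + "</field>");
    }
}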

HTH

salu2

> 
> --
> Luis Neves



Re: Calling Solr requests from java code - examples?

2007-01-16 Thread Thorsten Scherler
On Tue, 2007-01-16 at 12:52 +0100, [EMAIL PROTECTED] wrote:
> Thanks!
> 
> and how would you do it calling it from another web application, let's  
> say from a servlet or so? I need to do some stuff in my web java code,  
> then call the Solr service and do some more stuff afterwards
> 

Have a look at 
https://issues.apache.org/jira/browse/SOLR-86

HTH

salu2




Re: Converting Solr response back to pojo's, experiences?

2007-01-16 Thread Thorsten Scherler
On Tue, 2007-01-16 at 14:58 +0100, [EMAIL PROTECTED] wrote:
> Anyone having experience converting xml responses back to pojo's,  
> which technologies have you used?
> 
> Anyone doing json <-> pojo's?

Using pure xml myself but have a look at 
https://issues.apache.org/jira/browse/SOLR-20
and 
https://issues.apache.org/jira/secure/attachment/12348567/solr-client.zip

HTH
salu2

> 
> Grtz
> 



Re: solr + cocoon problem

2007-01-16 Thread Thorsten Scherler
On Tue, 2007-01-16 at 16:19 -0500, Walter Lewis wrote:
> [EMAIL PROTECTED] wrote:
> > Any ideas on how to implement a cocoon layer above solr?

I just finished a forrest plugin (in the whiteboard, our testing ground
in forrest) that does what you asked for, plus some pagination.
Forrest is cocoon based so you just have to build the plugin jar and add
it to your cocoon project. Please ask on the forrest list if you have
problems.

http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/

> You're far from the only one approaching solr via cocoon ... :)
> 
> The approach we took, passes the search parameters to a "solrsearch" 
> stylesheet, the heart of which is a  block that embeds the 
> solr results.  A further transformation prepares the results of the solr 
> query for display.

That was my first version for above plugin as well, but since forrest
makes use of the cocoon crawler I needed something with a default search
string for offline generation.

You should have a closer look at 
http://svn.apache.org/viewvc/forrest/trunk/whiteboard/plugins/org.apache.forrest.plugin.output.solr/output.xmap?view=markup
and 
http://svn.apache.org/viewvc/forrest/trunk/whiteboard/plugins/org.apache.forrest.plugin.output.solr/input.xmap?view=markup

For the original use case of this thread I added a generator:

  [the map:generator declaration was stripped by the list archive]

and as well a paginator transformer that calculates the next pages based
on start, rows and numFound:

  [the map:transformer declaration was stripped by the list archive]

We use it as follows:

  [the pipeline match combining the solr generator and the paginator
  transformer was stripped by the list archive]

You may be interested in the update generator as well. 

Please give feedback to [EMAIL PROTECTED] 

It really needs more testing beyond myself; you could be the first to
provide feedback.

  [a further sitemap snippet was stripped by the list archive]

HTH

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: solr + cocoon problem

2007-01-16 Thread Thorsten Scherler
On Tue, 2007-01-16 at 16:02 -0500, [EMAIL PROTECTED] wrote:
> Hi,
> 
> I am trying to implement a cocoon based application using solr for searching.
> In particular, I would like to forward the request from my response page to
> solr.  I have tried several alternatives, but none of them worked for me.
> 

Please see http://wiki.apache.org/solr/SolrForrest.

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: Calling Solr requests from java code - examples?

2007-01-16 Thread Thorsten Scherler
On Tue, 2007-01-16 at 13:56 +0100, Bertrand Delacretaz wrote:
> On 1/16/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> 
> > ...Have a look at
> > https://issues.apache.org/jira/browse/SOLR-86...
> 
> Right, I should have mentioned this one as well. I have linked SOLR-20
> and SOLR-86 now, so that people can see the various options for Java
> clients.

Cheers, mate. :)

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: XML querying

2007-01-16 Thread Thorsten Scherler
On Mon, 2007-01-15 at 13:42 +, Luis Neves wrote:
> Hi!
> 
> Thorsten Scherler wrote:
> 
> > On Mon, 2007-01-15 at 12:23 +, Luis Neves wrote:
> >> Hello.
> >> What I do now to index XML documents it's to use a Filter to strip the 
> >> markup, 
> >> this works but it's impossible to know where in the document is the match 
> >> located.
> >> What would it take to make possible to specify a filter query that accepts 
> >> xpath 
> >> expressions?... something like:
> >>
> >> fq=xmlField:/book/content/text()
> >>
> >> This way only the "/book/content/" element was searched.
> >>
> >> Did I make sense? Is this possible?
> > 
> > AFAIK short answer: no.
> > 
> > The field is ALWAYS plain text. There is no xmlField type.
> > 
> > ...but why don't you just add your text in multiple field when indexing.
> > 
> > Instead of plain stripping the markup do above xpath on your document
> > and create different fields. Like
> >  <field name="..." select="/book/content/text()"/>
> >  ...
> > 
> > Makes sense?
> 
> Yes, but I have documents with different schemas on the same "xml field", 
> also, 
> that way I  would have to know the schema of the documents being indexed 
> (which 
> I don't).
> 
> The schema I use is something like:
> 
> 
> 
> Where each distinct DocumentType has its own schema.
> 
> I could revise this approach to use an Solr instance for each DocumentType 
> but I 
> would have to find a way to "merge" results from the different instances 
> because 
> I also need to search across different DocumentTypes... I guess I'm SOL :-(
> 

I think you should explain your use case a wee bit more.

>>> What I do now to index XML documents it's to use a Filter to strip
the markup, 
> >> this works but it's impossible to know where in the document is the match 
> >> located.

why do you need to know where? 

Maybe we can think of something.

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: XML querying

2007-01-17 Thread Thorsten Scherler
On Wed, 2007-01-17 at 09:36 +, Luis Neves wrote:
> Hi,
> 
> Thorsten Scherler wrote:
> > On Mon, 2007-01-15 at 13:42 +, Luis Neves wrote:
> 
> > 
> > I think you should explain your use case a wee bit more.
> > 
> >>>> What I do now to index XML documents it's to use a Filter to strip
> > the markup, 
> >>>> this works but it's impossible to know where in the document is the 
> >>>> match located.
> > 
> > why do you need to know where? 
> 
> Poorly phrased from my part. Ideally I want to apply "lucene filters" to the 
> xml 
> content.
> Something like what Nux does:
> <http://dsd.lbl.gov/nux/api/nux/xom/pool/FullTextUtil.html>
> 

http://dsd.lbl.gov/nux/ ("Google-like realtime fulltext search via
Apache Lucene engine")

If you have a look at this you will see that the lucene search is plain
and not xquery based. It is more that you can define relations like in
SQL connecting two tables via keys. As I understand it, it will return
the docs that match the xpath /books/book[author="James" and
lucene:match(abstract, $query)] where the lucene match is based on a
normal lucene query.

I reckon it should be very easy to do something like this in a client
environment like cocoon/forrest. See the nux code to get an idea.
If I needed to solve this I would look for a component that gives me
XQuery, like nux, and a component that lets me query a solr server.

Then you "just" need a custom method that matches the documents for
which both components return a result.
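
A rough sketch of what I mean (the helper names are made up, standing in
for the real nux and solr client calls):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Run the XQuery side (e.g. nux) and the solr side independently, then
// keep only the documents both return. runXQuery() and runSolrQuery()
// are hypothetical stand-ins for the real calls.
public class CombinedSearch {

    public Set<String> search(String xquery, String solrQuery) {
        List<String> structuralHits = runXQuery(xquery);      // docs whose XML structure matches
        List<String> fullTextHits = runSolrQuery(solrQuery);  // docs matching the lucene query

        // the custom "match" step: intersect the two result sets by document id
        Set<String> result = new HashSet<String>(structuralHits);
        result.retainAll(fullTextHits);
        return result;
    }

    // hypothetical: would evaluate the XQuery against the XML store via nux
    private List<String> runXQuery(String xquery) {
        return Arrays.asList("doc1", "doc2");
    }

    // hypothetical: would query the solr server and collect the ids
    private List<String> runSolrQuery(String q) {
        return Arrays.asList("doc2", "doc3");
    }
}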

salu2

> 
> --
> Luis Neves



Re: solr + cocoon problem

2007-01-17 Thread Thorsten Scherler
On Wed, 2007-01-17 at 10:25 -0500, [EMAIL PROTECTED] wrote:
> Hi,
> 
> I agree, this is not a legal URL.  But the thing is that cocoon itself is
> sending the unescaped URL. 

...because you told it so.

You use

<map:generate src="http://hostname/solr/select/?q={request-param:q}"
type="file"/>

The request param module will not escape the param by default.
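
For illustration, this is the kind of escaping that has to happen
somewhere before the URL is sent (a minimal sketch, values made up):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EscapeParam {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String q = "title:\"solr & cocoon\"";
        // percent-encode the raw user input so the resulting URL is legal
        String escaped = URLEncoder.encode(q, "UTF-8");
        System.out.println("http://hostname/solr/select/?q=" + escaped);
    }
}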

salu2



Re: Solr "autostart"

2007-01-28 Thread Thorsten Scherler
On Sun, 2007-01-28 at 10:34 -0500, Tim Archambault wrote:
> Using Solr with Jetty on a linux VPS server. When I ssh in and run "start.jar" I
> can go to a web browser and reach the /solr/admin page with success. I can
> query with the whole "nine", no problems.

> However when I close out my terminal session (iBook) I cannot access the
> solr web interface. My intuition is that when  my terminal closes, port 8983
> is no longer available.


Try starting the server with
nohup java -jar start.jar > solr.log 2>&1 &

That should keep it running after you disconnect; nohup makes the process
ignore the hangup signal when your session closes.

HTH

salu2

> 
> How can I set my VPS up so that SOLR just works without manual prompting?
> Server restart, program failure, etc.
> 
> Thanks for any help.
> 
> Tim
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: Posting from Ant

2007-02-01 Thread Thorsten Scherler
On Thu, 2007-02-01 at 16:48 -0500, Erik Hatcher wrote:
> The benefit to having a  solution (now I'm beginning to  
> speak like a Rubyist, eh?  Ever toyed with Rake, Peter?) is that you  
> can handle errors yourself.
> 
> I never really expected the pipeline to be XML files -> XSLT *files* ->
> HTTP POST -> Solr.
> 
> The *files* part here is key.  Can't ya get your Cocoon-skinned cap  
> on and roll a pipeline that does it all on the fly with badass  
> compiled style sheet performance, IoC configurable, da works.   Right  
> Bess?  I'd be happy to collaborate with Bess to wire in a Cocoon  
> kinda Ant task wrapper if the world would be a better place with it.

I wrote something like this. 
http://wiki.apache.org/solr/SolrForrest

I am using it in my project in an ant task that will call the forrest
site target and request the indexing actions url (e.g.
index.solr.add.do)
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/
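
If you would rather do the POST from Java (e.g. inside a custom Ant
task), the curl commands from post.sh quoted further down translate more
or less directly; a rough sketch against the default local Solr URL
(class and method names are made up):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

public class SolrPoster {

    // POST one body to the update handler, mirroring what post.sh does with curl
    static void post(URL url, byte[] body) throws Exception {
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-type", "text/xml; charset=utf-8");
        try (OutputStream out = con.getOutputStream()) {
            out.write(body);
        }
        // read the response so the request completes
        try (InputStream in = con.getInputStream()) {
            in.transferTo(System.out);
        }
    }

    public static void main(String[] args) throws Exception {
        URL update = new URL("http://localhost:8983/solr/update");
        for (String f : args) {
            post(update, Files.readAllBytes(Path.of(f))); // one xml file per POST
        }
        post(update, "<commit/>".getBytes("UTF-8"));      // flush the changes
    }
}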

salu2

> 
>   Erik
> 
> On Feb 1, 2007, at 11:43 AM, Binkley, Peter wrote:
> 
> > Thanks, I'll try that out. I hope there aren't any encoding issues...
> > Nah, how likely is that? I'll report back.
> >
> > Peter
> >
> > -Original Message-
> > From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, February 01, 2007 6:38 AM
> > To: solr-user@lucene.apache.org
> > Subject: Fwd: Posting from Ant
> >
> > Ok, we have it on good authority that <foreach> is the way to go
> > for Ant -> POST -> Solr.
> >
> > Erik
> >
> >
> > Begin forwarded message:
> >
> >> From: Steve Loughran <[EMAIL PROTECTED]>
> >> Date: February 1, 2007 8:34:33 AM EST
> >> To: Erik Hatcher <[EMAIL PROTECTED]>
> >> Subject: Re: Posting from Ant
> >>
> >> On 01/02/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> >>> cool, thanks.  it only posts a single file, it looks like, but i
> >>> suppose the <foreach> ant-contrib task would be the way to go to
> >>> post a directory full of .xml files?   or is there now something in ant
> >>> that can do that iteration that i'm unaware of?
> >>
> >> well, someone could add multifile post, but foreach makes more sense
> >>
> >>>
> >>> woefully ignorant of the latest stuff in ant,
> >>> Erik
> >>>
> >>> On Feb 1, 2007, at 2:52 AM, Steve Loughran wrote:
> >>>
> >>>> yes, there is an antlib (not released, you need to build it
> >>> yourself)
> >>>> that does posts, including http forms posting.
> >>>>
> >>>> http://svn.apache.org/viewvc/ant/sandbox/antlibs/http/trunk/
> >>>>
> >>>> On 01/02/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> >>>>> Steve,
> >>>>>
> >>>>> Know of any HTTP POST tasks that could take a directory of .xml files
> >>>>> and post them to Solr?   We do it with curl like this, with Solr's
> >>>>> post.sh:
> >>>>>
> >>>>>FILES=$*
> >>>>>URL=http://localhost:8983/solr/update
> >>>>>
> >>>>>for f in $FILES; do
> >>>>>  echo Posting file $f to $URL
> >>>>>  curl $URL --data-binary @$f -H 'Content-type:text/xml;
> >>>>> charset=utf-8'
> >>>>>  echo
> >>>>>done
> >>>>>
> >>>>>#send the commit command to make sure all the changes are
> >>> flushed
> >>>>> and visible
> >>>>>curl $URL --data-binary '<commit/>'
> >>>>>
> >>>>> But something more Ant-centric would be tasty.
> >>>>>
> >>>>> Thanks,
> >>>>> Erik
> >>>>>
> >>>>>
> >>>>>
> >>>>> Begin forwarded message:
> >>>>>
> >>>>>> From: "Binkley, Peter" <[EMAIL PROTECTED]>
> >>>>>> Date: January 31, 2007 1:56:06 PM EST
> >>>>>> To: 
> >>>>>> Subject: Posting from Ant
> >>>>>> Reply-To: solr-user@lucene.apache.org
> >>>>>>
> >>>>>> Is there an Ant task out there somewhere that can POST
> >>> bunches of
> >>>>>> files
> >>>>>> to Solr, doing what the post.sh script does but with filesets?
> >>>>>>
> >>>>>> I've found the http post task
> >>>>>> (http://antelope.tigris.org/nonav/docs/manual/bk03ch17.html),
> >>>>> but it
> >>>>>> just posts name-value pairs, not files; and Slide's set of
> >>> webdav
> >>>>>> client
> >>>>>> tasks
> >>>>>> (http://gulus.usherbrooke.ca/pub/appl/apache/jakarta/slide/binaries/jakarta-slide-ant-webdav-bin-2.1.zip)
> >>>>>> has PUT and GET but not
> >>> POST. It
> >>>>>> shouldn't be hard to adapt one of these, but something pre-
> >>> existing
> >>>>>> would be better.
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>> Peter Binkley
> >>>>>> Digital Initiatives Technology Librarian Information Technology
> >>>>>> Services 4-30 Cameron Library University of Alberta Libraries
> >>>>>> Edmonton, Alberta Canada T6G 2J8
> >>>>>> Phone: (780) 492-3743
> >>>>>> Fax: (780) 492-9243
> >>>>>> e-mail: [EMAIL PROTECTED]
> >>>>>
> >>>>>
> >>>
> >>>
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XMLconsulting, training and solutions



Re: Analyzers and Tokenizers?

2007-02-06 Thread Thorsten Scherler
On Tue, 2007-02-06 at 17:27 +0100, rubdabadub wrote:
> Hi:
> 
> Are there more filters/tokenizers than the ones mentioned here..?
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> 
> I have found some in the example/schema.xml which are new ...
> 
> <fieldtype name="..." class="..." sortMissingLast="true" omitNorms="true">
>   <analyzer>
>     ... more ...
>   </analyzer>
> </fieldtype>
> 
> Is there any complete list somewhere ... or how can I find more info about them?

http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/analysis/

HTH

salu2
> 
> Kind regards,
-- 
Thorsten Scherler   thorsten.at.apache.org
Open Source Java & XML  consulting, training and solutions



Re: crawler feed?

2007-02-07 Thread Thorsten Scherler
On Wed, 2007-02-07 at 11:09 +0100, rubdabadub wrote:
> Hi:
> 
> Are there relatively stand-alone crawlers that are
> suitable/customizable for Solr? Has anyone done any trials.. I have
> seen some discussion about the cocoon crawler.. was that successful?

http://wiki.apache.org/solr/SolrForrest

I am using this approach in a custom project that is cocoon based, and it
is working very well. However, cocoon's crawler is not standalone but uses
the cocoon cli. I am using the solr/forrest plugin for the commit and for
dispatching the update. The indexing transformation in the plugin is a
wee bit different than the one in my project, since I needed to extract
more information from the documents to create better filters.

However, since the cocoon cli is no longer in 2.2 (cocoon-trunk) and
forrest uses it as its main component, I am keen to write a simple
crawler that could be reused for cocoon, forrest, solr, nutch, ...

I may start something pretty soon (I guess I will open a project in
Apache Labs) and will keep this list informed. My idea is to write a
simple crawler which could be easily extended by plugins. So if a
project/app needs special processing for a crawled url, one could write a
plugin to implement the functionality. A solr plugin for this crawler
would be very simple: basically it would parse e.g. the html page and
dispatch an update command for the extracted fields. I think one
should try to reuse as much code from nutch as possible for this parsing.
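
To illustrate the extension point I have in mind, a rough sketch of such
a plugin contract (all names made up, this is not an existing API):

import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical plugin interface: the core crawler fetches each url and
// hands the stream to the plugin registered for it.
public interface CrawlerPlugin {

    // extract the outgoing links so the core crawler can follow them
    List<String> extractLinks(InputStream page) throws IOException;

    // page-specific processing; a solr plugin would parse the page here
    // and dispatch an update command with the extracted fields
    void process(String url, InputStream page) throws IOException;
}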

If somebody is interested in such a standalone crawler project, I
welcome any help, ideas, suggestions, feedback and/or questions.

salu2
-- 
Thorsten Scherler   thorsten.at.apache.org
Open Source Java & XML  consulting, training and solutions



Re: crawler feed?

2007-02-07 Thread Thorsten Scherler
On Wed, 2007-02-07 at 18:03 +0200, Sami Siren wrote:
> rubdabadub wrote:
> > Hi:
> > 
> > Are there relatively stand-alone crawlers that are
> > suitable/customizable for Solr? Has anyone done any trials.. I have
> > seen some discussion about the cocoon crawler.. was that successful?
> 
> There's also an integration path available for Nutch[1] that i plan to
> integrate after 0.9.0 is out.

Sounds very nice, I just finished reading it. Thanks.

Today I submitted a proposal for an Apache Labs project called Apache
Droids.

http://mail-archives.apache.org/mod_mbox/labs-labs/200702.mbox/browser

Basic idea is to create a flexible crawler framework. The core should be
a simple crawler which could be easily extended by plugins. So if a
project/app needs special processing for a crawled url one could write a
plugin to implement the functionality.

salu2

> 
> --
>  Sami Siren
> 
> [1]http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XMLconsulting, training and solutions



[Droids] Re: crawler feed?

2007-02-08 Thread Thorsten Scherler
On Thu, 2007-02-08 at 14:40 +0100, rubdabadub wrote:
> Thorsten:
> 
> First of all, I read your lab idea with great interest, as I am in need
> of such a crawler. However, there are certain things that I would like to
> discuss. I am not sure what forum will be appropriate for this, but I
> will do my idea shooting here first; please tell me where I should
> post further comments.

Since it is not an official lab project yet, I am unsure myself, but I
think we should discuss details on [EMAIL PROTECTED] Please reply
to the labs ml.

> 
> A vertical search engine that focuses on a specific set of data (i.e.
> uses solr, for example, cos it provides the maximum field flexibility)
> would greatly benefit from such a crawler. E.g. the next big technorati
> or the next big event-finding solution can use your crawler to crawl
> feeds using a feed-plugin (maybe nutch plugins) or scrape websites for
> event info using some x-path/xquery stuff (personally I think xpath is
> a pain in the a... :-)

These, like you pointed out, are surely some use cases for the crawler in
combination with plugins.

Another is the wget-like crawl that an application can use to export a
static site (e.g. a CMS).

> 
> What I worry about is those issue that has to deal with
> 
> - updating crawls

Actually, if you look only at the crawl itself, there is no difference
between an update crawl and any other crawl.

> - how many threads per host

should be configurable. 

> - scale etc.

you mean a crawl cluster?

> 
> All the maintainer's headaches! 

That is why droids is a labs proposal. 

http://labs.apache.org/bylaws.html

All apache committers have write access, and when a lab is promoted, the
files are moved over to the incubation area.

>  I know you will use as much code as
> you can from Nutch, plus you are not planning to re-invent the wheel. But
> wouldn't it be much easier to jump into Sami's idea and make it better
> and more stand-alone, and still benefit from the Nutch community?

I will start a thread on nutch dev and see whether or not it is possible
to extract the crawler from the core, but the main idea is to keep
droids simple.

Imagine something like the following pseudo code:

public void crawl(String url) throws IOException {
    // resolve the stream
    InputStream stream = new URL(url).openStream();
    // look up the plugin that is registered for the stream
    Plugin plugin = lookupPlugin(stream);
    // extract the links (link pattern matcher)
    Link[] links = plugin.extractLinks(stream);
    // apply pattern plugins for storing/excluding links
    links = plugin.handleLinks(links);
    // pass the stream to the plugin for further processing
    plugin.main(stream);
}


> I wonder, wouldn't it be easy to push/pursue a route where the nutch
> crawler becomes a standalone crawler? no? I read a post about it on the list.
> 

Can you provide some links to get some background information? TIA.

> I would like to hear more about how your plan will evolve in terms of
> druid and why not join forces with Sami and co.?

I am more familiar with solr than nutch, I have to admit.

Like I said, all committers have write access on droids and everybody is
welcome to join the effort. Who knows, maybe the first droid will be a
standalone nutch crawler with plugin extension points, if some nutch
committer joins the lab.
 
Thanks rubdabadub for your feedback.

salu2

> 
> Regards
> 
> On 2/7/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> > On Wed, 2007-02-07 at 18:03 +0200, Sami Siren wrote:
> > > rubdabadub wrote:
> > > > Hi:
> > > >
> > > > Are there relatively stand-alone crawlers that are
> > > > suitable/customizable for Solr? Has anyone done any trials.. I have
> > > > seen some discussion about the cocoon crawler.. was that successful?
> > >
> > > There's also an integration path available for Nutch[1] that i plan to
> > > integrate after 0.9.0 is out.
> >
> > Sounds very nice, I just finished reading it. Thanks.
> >
> > Today I submitted a proposal for an Apache Labs project called Apache
> > Droids.
> >
> > http://mail-archives.apache.org/mod_mbox/labs-labs/200702.mbox/browser
> >
> > Basic idea is to create a flexible crawler framework. The core should be
> > a simple crawler which could be easily extended by plugins. So if a
> > project/app needs special processing for a crawled url one could write a
> > plugin to implement the functionality.
> >
> > salu2
> >
> > >
> > > --
> > >  Sami Siren
> > >
> > > [1]http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
> > --
> > Thorsten Scherler thorsten.at.apache.org
> > Open Source Java & XMLconsulting, training and solutions
> >
> >
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XMLconsulting, training and solutions



RE: Using cocoon to update index

2007-03-26 Thread Thorsten Scherler
On Mon, 2007-03-26 at 09:30 -0400, Winona Salesky wrote:
> Thanks Chris, I'll take another look at the forrest plugin.

Have a look as well at http://wiki.apache.org/solr/SolrForrest
it points out the cocoon components.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XMLconsulting, training and solutions



Re: SolrSearchGenerator for Cocoon (2.1)

2007-03-27 Thread Thorsten Scherler
On Tue, 2007-03-27 at 10:53 -0400, [EMAIL PROTECTED] wrote:
> Hi,
> 
> I looked at the SolrSearchGenerator (this is the part which is of interest to
> me), but I could not get it to work with Cocoon 2.1 yet.
> 
> It seems that there is no getParameters method on the
> org.apache.cocoon.environment.Request interface:
> http://cocoon.apache.org/2.1/apidocs/org/apache/cocoon/environment/Request.html
> I guess using the getParameterNames and getParameter methods instead
> should do the trick.
> 
> Or am I missing something?

No, you are right, the "getParameters" method is cocoon-trunk specific. I
just changed the code to be cocoon-2.1.x compatible.

http://svn.apache.org/viewvc?view=rev&rev=523081

Thanks for the feedback Mirko.

Now in cocoon-2.1.x, to use the plugin in your custom project please do
the following:
1) svn co http://svn.apache.org/repos/asf/forrest/trunk forrest (this
checkout is our $FORREST_HOME)
2) cd $FORREST_HOME/main; ./build.sh
3) cd $FORREST_HOME/whiteboard/plugins/org.apache.forrest.plugin.output.solr
4) $FORREST_HOME/tools/ant/bin/ant local-deploy
5) cp \
$FORREST_HOME/whiteboard/plugins/org.apache.forrest.plugin.output.solr/build/org.apache.forrest.plugin.output.solr.jar \
$cocoon-2.1.x_webapp/WEB-INF/lib/

From there you can use the cocoon components as usual in your project.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XMLconsulting, training and solutions



Re: Solr logo poll

2007-04-07 Thread Thorsten Scherler
B

Graffiti style.
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



[Standings] Solr logo poll

2007-04-10 Thread Thorsten Scherler
Hi all,

I did a small count; till now we have:
a) 21
b) 13

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: unsubscribe

2007-05-10 Thread Thorsten Scherler
On Thu, 2007-05-10 at 10:05 +0100, Kainth, Sachin wrote:
> unsubscribe

Hi Sachin,

you need to send to a different mailing address:
[EMAIL PROTECTED]

HTH

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Packaging solr for Debian: using debian-supplied lucene-*.jar

2007-06-04 Thread Thorsten Scherler
On Sun, 2007-06-03 at 09:55 +0200, Jan-Pascal van Best wrote:
> Hi all,
> 
> I'm working on packaging Solr for Debian. 

Very nice. :)

Since this is a developer topic, I think it should be discussed
on our dev list.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



RE: storing the document URI in the index

2007-06-12 Thread Thorsten Scherler
On Tue, 2007-06-12 at 16:33 +0200, Ard Schrijvers wrote:
> Thanks Yonik and Walter,
> 
> putting it that way, it does make good sense to not store the transient xml
> file, which is the situation in most of the use cases (I was thinking
> differently because I do have xml files on the file system or over http,
> like from a webdav call)
> 
> Anyway, thx for all answers, and again, sry for mails not indenting properly 
> at the moment, it irritates me as well :-)
> 
> Regards Ard

Hi Ard,

you may want to have a look at 
http://wiki.apache.org/solr/SolrForrest

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions