Re: SolrCloud Feedback

2011-02-10 Thread Thorsten Scherler

Hi Mark, hi all,

I just got a customer request to conduct an analysis on the state of
SolrCloud. 

He wants to see SolrCloud as part of the next Solr 1.5 release and is willing
to sponsor our dev time to close outstanding bugs and open issues that might
prevent the inclusion of SolrCloud in the next release. I need to give him a
list of issues and an estimate of how long it will take us to fix them.

I ran
https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+SOLR+AND+(summary+~+cloud+OR+description+~+cloud+OR+comment+~+cloud)+AND+resolution+%3D+Unresolved
which returns 8 bugs. Do you consider this a comprehensive list of open
issues, or are some important ones missing?

I read http://wiki.apache.org/solr/SolrCloud and it talks about a
branch of its own; however, when I review
https://issues.apache.org/jira/browse/SOLR-1873 I get the impression that
the work has already been merged back into trunk, right?

So what is the best starting point for testing: the branch or trunk?

TIA for any information

salu2
-- 
Thorsten Scherler 
codeBusters S.L. - web based systems

http://www.codebusters.es/


SolrCloud - Example C not working

2011-02-14 Thread Thorsten Scherler
Hi all,

I followed http://wiki.apache.org/solr/SolrCloud and everything worked
fine till I tried "Example C:".

I start all 4 servers but all of them keep looping through:

"java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn
$SendThread.run(ClientCnxn.java:1078)
Feb 14, 2011 1:31:16 PM org.apache.log4j.Category info
INFO: Opening socket connection to server localhost/127.0.0.1:9983
Feb 14, 2011 1:31:16 PM org.apache.log4j.Category warn
WARNING: Session 0x0 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn
$SendThread.run(ClientCnxn.java:1078)
Feb 14, 2011 1:31:16 PM org.apache.log4j.Category info
INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9900
Feb 14, 2011 1:31:16 PM org.apache.log4j.Category warn
WARNING: Session 0x0 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn
$SendThread.run(ClientCnxn.java:1078)
Feb 14, 2011 1:31:17 PM org.apache.log4j.Category info
INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9983
Feb 14, 2011 1:31:17 PM org.apache.log4j.Category warn
WARNING: Session 0x0 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn
$SendThread.run(ClientCnxn.java:1078)
Feb 14, 2011 1:31:19 PM org.apache.log4j.Category info
INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:8574
Feb 14, 2011 1:31:19 PM org.apache.log4j.Category warn
WARNING: Session 0x0 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn
$SendThread.run(ClientCnxn.java:1078)
Feb 14, 2011 1:31:20 PM org.apache.log4j.Category info
INFO: Opening socket connection to server localhost/127.0.0.1:8574
Feb 14, 2011 1:31:20 PM org.apache.log4j.Category warn
WARNING: Session 0x0 for server null, unexpected error, closing socket
connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at
sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
at org.apache.zookeeper.ClientCnxn
$SendThread.run(ClientCnxn.java:1078)

The problem seems to be that the ZooKeeper instances cannot connect to
each other, and so the ensemble never comes up at all.
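
In case it helps to narrow this down: a quick way to check which of the
embedded ZooKeeper servers is actually listening is to send the standard
"ruok" four-letter command to each client port seen in the log above. This
is just a diagnostic sketch I hacked together, not part of the wiki example:

import socket

# ZooKeeper client ports from Example C, as seen in the log loop above.
ZK_PORTS = [9983, 8574, 9900]

def ruok(host, port, timeout=2.0):
    """Send ZooKeeper's 'ruok' command; a live server answers 'imok'."""
    try:
        s = socket.create_connection((host, port), timeout)
    except socket.error as e:
        return "connection failed: %s" % e
    try:
        s.sendall(b"ruok")
        return s.recv(16).decode("ascii") or "no answer"
    finally:
        s.close()

for port in ZK_PORTS:
    print("localhost:%d -> %s" % (port, ruok("localhost", port)))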

I am using revision 1070473 for the tests. Does anybody have an idea?

salu2
-- 
Thorsten Scherler 
codeBusters S.L. - web based systems

http://www.codebusters.es/




Re: SolrCloud - Example C not working

2011-02-15 Thread Thorsten Scherler
Hmm, nobody has an idea? Does Example C work fine for everybody else?

salu2

On Mon, 2011-02-14 at 14:08 +0100, Thorsten Scherler wrote:
> Hi all,
> 
> I followed http://wiki.apache.org/solr/SolrCloud and everything worked
> fine till I tried "Example C:".
> 
> I start all 4 server but all of them keep looping through:
> 
> "java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at org.apache.zookeeper.ClientCnxn
> $SendThread.run(ClientCnxn.java:1078)
> Feb 14, 2011 1:31:16 PM org.apache.log4j.Category info
> INFO: Opening socket connection to server localhost/127.0.0.1:9983
> Feb 14, 2011 1:31:16 PM org.apache.log4j.Category warn
> WARNING: Session 0x0 for server null, unexpected error, closing socket
> connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at org.apache.zookeeper.ClientCnxn
> $SendThread.run(ClientCnxn.java:1078)
> Feb 14, 2011 1:31:16 PM org.apache.log4j.Category info
> INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9900
> Feb 14, 2011 1:31:16 PM org.apache.log4j.Category warn
> WARNING: Session 0x0 for server null, unexpected error, closing socket
> connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at org.apache.zookeeper.ClientCnxn
> $SendThread.run(ClientCnxn.java:1078)
> Feb 14, 2011 1:31:17 PM org.apache.log4j.Category info
> INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:9983
> Feb 14, 2011 1:31:17 PM org.apache.log4j.Category warn
> WARNING: Session 0x0 for server null, unexpected error, closing socket
> connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at org.apache.zookeeper.ClientCnxn
> $SendThread.run(ClientCnxn.java:1078)
> Feb 14, 2011 1:31:19 PM org.apache.log4j.Category info
> INFO: Opening socket connection to server localhost/0:0:0:0:0:0:0:1:8574
> Feb 14, 2011 1:31:19 PM org.apache.log4j.Category warn
> WARNING: Session 0x0 for server null, unexpected error, closing socket
> connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at org.apache.zookeeper.ClientCnxn
> $SendThread.run(ClientCnxn.java:1078)
> Feb 14, 2011 1:31:20 PM org.apache.log4j.Category info
> INFO: Opening socket connection to server localhost/127.0.0.1:8574
> Feb 14, 2011 1:31:20 PM org.apache.log4j.Category warn
> WARNING: Session 0x0 for server null, unexpected error, closing socket
> connection and attempting reconnect
> java.net.ConnectException: Connection refused
> at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> at
> sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
> at org.apache.zookeeper.ClientCnxn
> $SendThread.run(ClientCnxn.java:1078)
> 
> The problem seems that the zk instances can not connects to the
> different nodes and so do not get up at all.
> 
> I am using revision 1070473 for the tests. Anybody has an idea?
> 
> salu2

-- 
Thorsten Scherler 
codeBusters S.L. - web based systems

http://www.codebusters.es/




[solrCloud] Distributed IDF - scoring in the cloud

2011-02-18 Thread Thorsten Scherler
Hi all,

I am working through the solrCloud examples, and one thing I am not
clear about is the scoring in a distributed search.

I did a small test where I used the "Example A: Simple two shard
cluster" from wiki:SolrCloud and additionally added

java -Durl=http://localhost:7574/solr/collection1/update -jar post.jar
ipod_other.xml

java -Durl=http://localhost:8983/solr/collection1/update -jar post.jar
monitor2.xml

Now requesting
http://localhost:8983/solr/collection1/select?distrib=true&q=electronics&fl=score&shards=localhost:8983/solr,localhost:7574/solr
on both hosts will return the same result. Here we get the score for
each hit based on the shard-specific score, and the hits are merged into
one result doc.

However, when I add monitor2.xml to 7574 as well, which previously did not
contain it, the scoring changes depending on the server I request.

The score returned for 8983 is always 0.09289607, whether distrib=true or false.

The score returned for 7574 is always 0.121383816, whether distrib=true or false.

So is it correct to assume that if a document is indexed in both shards,
the score that predominates is the one from the host that was queried?
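
For reference, a minimal script to reproduce the comparison (roughly what I
did by hand with the URLs above, using plain urllib; wt=json is assumed to be
enabled, as in the example solrconfig):

import json
import urllib2  # urllib.request on Python 3

SHARDS = "localhost:8983/solr,localhost:7574/solr"

def top_score(port, distrib):
    """Return the top hit's score for q=electronics on the given node."""
    url = ("http://localhost:%d/solr/collection1/select"
           "?q=electronics&fl=score&wt=json&distrib=%s&shards=%s"
           % (port, str(distrib).lower(), SHARDS))
    docs = json.load(urllib2.urlopen(url))["response"]["docs"]
    return docs[0]["score"] if docs else None

for port in (8983, 7574):
    for distrib in (True, False):
        print("%d distrib=%s -> %s" % (port, distrib, top_score(port, distrib)))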

My client plans to distribute the current index into different shards.
For example, each "Consejería" (regional ministry) should be hosted in its
own shard. The critical point for the client is that the scoring of a
distributed search stays the same as with the big single index they use
right now.

As I understand it, the current solrCloud implementation makes no attempt
to harmonize the score across shards.

In my research I came across
http://markmail.org/message/bhhfwymz5y7lvoj7
"The "IDF" part of the relevancy score is the only place that
distributed search scoring won't "match up" with no distributed
scoring because the document frequency used for the term is local to
every core instead of global.  If you distribute your documents fairly
randomly to the different shards, this won't matter.

There is a patch in the works to add global idf, but I think that even
when it's committed, it will default to off because of the higher cost
associated with it." the patch is
https://issues.apache.org/jira/browse/SOLR-1632

However, the last comment is from 26/Jul/10, reporting that the patch fails,
and a comment from Yonik gives the impression that it is not ready to use:

"It looks like the issue is this: rewrite() doesn't work for function
queries (there is no propagation mechanism to go through value sources).
This is a problem when real queries are embedded in function queries."

Is there general interest in bringing SOLR-1632 into trunk (especially
for solrCloud)?

Or might it be better to look into something that scales the index into
hbase, so the client does not lose the scoring?

TIA for your feedback
-- 
Thorsten Scherler 
codeBusters S.L. - web based systems

http://www.codebusters.es/





big index vs. lots of small ones

2010-01-20 Thread Thorsten Scherler
Hi all,

I have to do an analysis of the following use case.

I am working as a consultant for a public company. We are discussing
offering each public institution its own search server in the future,
(probably) based on Apache Solr. However, users of our portal should
be able to search all indexes.

The problematic part for our customer is that a meta search over various
indexes, which later merges the responses, will change the scoring.

Imagine you have the two indexes
- public health department (A)
- press relations department (B)

Now say you have 300 documents in A and only one in B about "influenza A".
The B server will return the only document in its index with a very high
score, since, being the only one, it gets a very high "base" (IDF) score,
correct?

On the other hand, A may have much more important documents, but they will
not get the same "base" score.

This means that on a merge the document from server B will most likely be
at the top of the list.
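
To make the effect concrete, here is a back-of-the-envelope calculation with
Lucene's classic IDF formula, idf(t) = 1 + ln(numDocs/(docFreq+1)); the
numbers are made up for illustration:

import math

def idf(num_docs, doc_freq):
    # Lucene DefaultSimilarity: idf(t) = 1 + ln(numDocs / (docFreq + 1))
    return 1.0 + math.log(num_docs / float(doc_freq + 1))

# Index A: the term is common there (300 matches out of 10,000 docs).
# Index B: the term is rare there (1 match out of 10,000 docs).
print("idf in A: %.2f" % idf(10000, 300))  # ~4.50
print("idf in B: %.2f" % idf(10000, 1))    # ~9.52

The lone matching document in B gets roughly twice the IDF weight of any of
A's documents, so it floats to the top of a naive merge.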

To prevent this phenomenon we are looking into merging all the
standalone indexes into one big index, but that leads to other
problems because it will become pretty big pretty fast.

So here my questions:

- What are other people doing to solve this problem?
- What is the best way with Solr to solve the problem of the "base"
scoring?
- What is the best way to have multiple indexes in solr?
- Is it possible to get rid of the "base" scoring in solr?

TIA for any information.

salu2
-- 
Thorsten Scherler 
Open Source Java 

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)






Re: big index vs. lots of small ones

2010-01-25 Thread Thorsten Scherler
On Wed, 2010-01-20 at 08:38 -0800, Marc Sturlese wrote:
> Check out this patch witch solve the distributed IDF's problem:
> https://issues.apache.org/jira/browse/SOLR-1632
> I think it fixes what you are explaining. The price you pay is that there
> are 2 requests per shard. If I am not worng the first is to get term
> frequencies and needed info and the second one is the proper search request.
> The patch also includes caching for terms in the first request.
> 

Nice!

Thank you very much, Mark.

How are things going in Barcelona?

salu2

> 
> Thorsten Scherler-3 wrote:
> > 
> > Hi all,
> > 
> > I have to do an analyses about following usecase.
> > 
> > I am working as consultant in a public company. We are talking about to
> > offer in the future each public institution its own search server
> > (probably) based on Apache Solr. However the user of our portal should
> > be able to search all indexes.
> > 
> > The problematic part for our customer is that a meta search on various
> > indexes which then later merges the response will change the scoring.
> > 
> > Imagine you have the two indexes
> > - public health department (A)
> > - press relations department (B)
> > 
> > Now you have 300 documents in A and only one in B about "influenza A".
> > The B server will return the only document in its index with a very high
> > score, since being the only one it gets a very high "base" score,
> > correct?
> > 
> > On the other hand A may have much more important documents but they will
> > not get the same "base" score.
> > 
> > Meaning on a merge most likely the document from Server B will be top of
> > the list.
> > 
> > To prevent this phenomenon we are looking into merging all the
> > standalone indexes in on big index but that will lead us in other
> > problems because it will become pretty big pretty fast.
> > 
> > So here my questions:
> > 
> > - What are other people doing to solve this problem?
> > - What is the best way with Solr to solve the problem of the "base"
> > scoring?
> > - What is the best way to have multiple indexes in solr?
> > - Is it possible to get rid of the "base" scoring in solr?
> > 
> > TIA for any informations.
> > 
> > salu2
> > -- 
> > Thorsten Scherler 
> > Open Source Java 
> > 
> > Sociedad Andaluza para el Desarrollo de la Sociedad 
> > de la Información, S.A.U. (SADESI)
> > 
> > 
> > 
> > 
> > 
> > 
> 
-- 
Thorsten Scherler 
Open Source Java 

Sociedad Andaluza para el Desarrollo de la Sociedad 
de la Información, S.A.U. (SADESI)






Re: The mechanism of data replication in Solr?

2007-09-05 Thread Thorsten Scherler
On Wed, 2007-09-05 at 15:56 +0800, Dong Wang wrote:
> Hello, everybody:-)
> I'm interested with the mechanism of data replciation in Solr, In the
> "Introduction to the solr enterprise Search Server", Replication is
> one of features of Solr, but I can't find anything about replication
> issues on the Web site and documents, including how to split the
> index, how to distribute the chunks of index, how to placement the
> replica, eager replicaton  or lazy replication..etc. I think  they are
> different from the problem in HDFS.
> Can anybody help me? Thank you in advance.

http://wiki.apache.org/solr/CollectionDistribution

HTH
> 
> Best Wishes.
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Indexing very large files.

2007-09-06 Thread Thorsten Scherler
On Thu, 2007-09-06 at 08:55 +0200, Brian Carmalt wrote:
> Hello again,
> 
> I run Solr on Tomcat under windows and use the tomcat monitor to start 
> the service. I have set the minimum heap
> size to be 512MB and then maximum to be 1024mb. The system has 2 Gigs of 
> ram. The error that I get after sending
> approximately 300 MB is:
> 
> java.lang.OutOfMemoryError: Java heap space
> at org.xmlpull.mxp1.MXParser.fillBuf(MXParser.java:2947)
> at org.xmlpull.mxp1.MXParser.more(MXParser.java:3026)
> at org.xmlpull.mxp1.MXParser.nextImpl(MXParser.java:1384)
> at org.xmlpull.mxp1.MXParser.next(MXParser.java:1093)
> at org.xmlpull.mxp1.MXParser.nextText(MXParser.java:1058)
> at 
> org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:332)
> at 
> org.apache.solr.handler.XmlUpdateRequestHandler.update(XmlUpdateRequestHandler.java:162)
> at 
> org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:84)
> at 
> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:77)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:658)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:191)
> at 
> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:159)
> at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:230)
> at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
> at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
> at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:104)
> at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at 
> org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:261)
> at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
> at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:581)
> at 
> org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
> at java.lang.Thread.run(Thread.java:619)
> 
> After sleeping on the problem I see that it does not directly stem from 
> Solr, but from the
> module  org.xmlpull.mxp1.MXParser. Hmmm. I'm open to sugestions and ideas.

Which version of Solr do you use?

http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/handler/XmlUpdateRequestHandler.java?view=markup

The trunk version of the XmlUpdateRequestHandler is now based on StAX.
You may want to try whether that works better.

Please try and report back.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Tagging using SOLR

2007-09-06 Thread Thorsten Scherler
On Thu, 2007-09-06 at 12:59 +0530, Doss wrote:
> Dear all,
> 
> We are running an appalication built using SOLR, now we are trying to build
> a tagging system using the existing SOLR indexed field called
> "tag_keywords", this field has different keywords seperated by comma, please
> give suggestions on how can we build tagging system using this field?

http://wiki.apache.org/solr/ConfiguringSolr

http://wiki.apache.org/solr/SchemaXml
Define a new field named "keyword" and use "text_ws" as its type;
instead of commas, separate the keywords with whitespace. In schema.xml
that is roughly:

...
<field name="keyword" type="text_ws" indexed="true" stored="true"
       multiValued="true"/>
...


HTH

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Indexing very large files.

2007-09-06 Thread Thorsten Scherler
On Thu, 2007-09-06 at 11:26 +0200, Brian Carmalt wrote:
> Hallo again,
> 
> I checked out the solr source and built the 1.3-dev version and then I 
> tried to index the same file to the new server.
> I do get a different exception trace, but the result is the same.
> 
> java.lang.OutOfMemoryError: Java heap space
> at java.util.Arrays.copyOf(Arrays.java:2882)
> at 
> java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)

It seems that you are reaching the limit because of the StringBuilder.

Did you try to raise the memory to the max, like:
java -Xms1536m -Xmx1788m -jar start.jar

Anyway, you will have to look into

SolrInputDocument readDoc(XMLStreamReader parser) throws XMLStreamException {
  ...
  StringBuilder text = new StringBuilder();
  ...
  case XMLStreamConstants.CHARACTERS:
    text.append( parser.getText() );
    break;
  ...
}

The problem is that the "text" object grows bigger than the heap; maybe
invoking garbage collection before will help.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



RSS syndication Plugin

2007-09-06 Thread Thorsten Scherler
Hi all,

I am curious whether somebody has written an RSS plugin for solr.

The idea is to provide an RSS syndication link for the current search.

It should be really easy to implement, since it would be just a
transformation from the solr XML to RSS, which can easily be done with a
simple xsl.

Has somebody already done this?

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: RSS syndication Plugin

2007-09-06 Thread Thorsten Scherler
On Thu, 2007-09-06 at 09:07 -0400, Ryan McKinley wrote:
> perhaps:
> https://issues.apache.org/jira/browse/SOLR-208
> 
> in http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/xslt/
> 
> check:
> example_atom.xsl
> example_rss.xsl

Awesome.

Thanks very much, Ryan, for pointing me in the right direction, and Brian
Whitman for his contribution.
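
For the archive: those stylesheets go through Solr's XSLT response writer,
so (if I read SOLR-208 right) the feed for a search is just a normal select
URL with the writer parameters added, e.g.

http://localhost:8983/solr/select?q=solr&wt=xslt&tr=example_rss.xsl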

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Strange behavior when searching with accents

2007-09-20 Thread Thorsten Scherler
On Thu, 2007-09-20 at 10:11 +0200, Thierry Collogne wrote:
> Hello,
> 
> We are experiencing some strange behavior while searching with words
> containing accents.
> We are using two examples "rené" and "matthé"
> 
> When we search for "rené" or for "rene", we get the same results, so that is
> ok.
> But when we search for "matthé" or for "matthe", we get two totally
> different results.
> 
> Can someone tell me why this happens? We would like the results to be the
> same.

That highly depends on your schema. Do you use the
solr.ISOLatin1AccentFilterFactory?

I am using the following and it works like a charm (the accent filter
sits right after the tokenizer, for both index and query):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    ...
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    ...
  </analyzer>
</fieldType>


HTH

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Strange behavior when searching with accents

2007-09-20 Thread Thorsten Scherler
On Thu, 2007-09-20 at 13:33 +0200, Thierry Collogne wrote:
> We are using this schema definition
> 


Thierry, try to move the solr.ISOLatin1AccentFilterFactory up the filter
chain, like:

...
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ISOLatin1AccentFilterFactory"/>
...

for both indexing and query.

This way you make sure that all accents are gone before you do any
further filtering.

You may need to reindex all documents to make sure we are not going to
use the old index.

HTH

salu2

> 
>   
> 
> 
> 
>  generateWordParts="1" generateNumberParts="1" catenateWords="1"
> catenateNumbers="1" catenateAll="0"/>
> 
> 
> 
> 
>   
>   
> 
>  ignoreCase="true" expand="true"/>
> 
>  generateWordParts="1" generateNumberParts="1" catenateWords="0"
> catenateNumbers="0" catenateAll="0"/>
> 
> 
> 
> 
>   
> 
> 
> I will take a look at the analyzer took.
> 
> Thank you both for the quick response.
> 
> On 20/09/2007, Bertrand Delacretaz <[EMAIL PROTECTED]> wrote:
> >
> > On 9/20/07, Thierry Collogne <[EMAIL PROTECTED]> wrote:
> >
> > > ..when we search for "matthé" or for "matthe", we get two totally
> > > different results
> >
> > The analyzer admin tool should help you find out what's happening, see
> >
> > http://wiki.apache.org/solr/FAQ#head-b25df8c8393bbcca28f1f344c432975002e29ca9
> >
> > -Bertrand
> >
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Strange behavior when searching with accents

2007-09-20 Thread Thorsten Scherler
On Thu, 2007-09-20 at 14:01 +0200, Thierry Collogne wrote:
> I have entered the the matthé term in the the analyzer, but as far as I
> understand, it should be ok. I have made some screenshots with the results.
> 
> http://farm2.static.flickr.com/1407/1412619772_0b697789cd_o.jpg
> 
> http://farm2.static.flickr.com/1245/1412619774_3351b287bc_o.jpg
> 
> I find it strange that the second screenshost doesn"t give any matches.
> 
> Can someone take a look at them and perhaps clarify why it does not work?

See my other response, but in the 2nd screenshot the "query" field has
been changed to the non-accented form.

Also, you may want to use the "verbose output" option for a better
analysis.

salu2

> 
> Thank you.
> 
> 
> On 20/09/2007, Thierry Collogne < [EMAIL PROTECTED]> wrote:
> >
> > We are using this schema definition
> >
> > 
> >   
> > 
> > 
> > 
> >  > generateWordParts="1" generateNumberParts="1" catenateWords="1"
> > catenateNumbers="1" catenateAll="0"/>
> > 
> > 
> > 
> > 
> >   
> >   
> > 
> >  > ignoreCase="true" expand="true"/>
> > 
> >  > generateWordParts="1" generateNumberParts="1" catenateWords="0"
> > catenateNumbers="0" catenateAll="0"/>
> > 
> > 
> > 
> > 
> >   
> > 
> >
> > I will take a look at the analyzer took.
> >
> > Thank you both for the quick response.
> >
> > On 20/09/2007, Bertrand Delacretaz < [EMAIL PROTECTED] > wrote:
> > >
> > > On 9/20/07, Thierry Collogne < [EMAIL PROTECTED]> wrote:
> > >
> > > > ..when we search for "matthé" or for "matthe", we get two totally
> > > > different results
> > >
> > > The analyzer admin tool should help you find out what's happening, see
> > > http://wiki.apache.org/solr/FAQ#head-b25df8c8393bbcca28f1f344c432975002e29ca9
> > >
> > >
> > > -Bertrand
> > >
> >
> >
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Strange behavior when searching with accents

2007-09-20 Thread Thorsten Scherler
On Thu, 2007-09-20 at 15:27 +0200, Bertrand Delacretaz wrote:
> On 9/20/07, Thierry Collogne <[EMAIL PROTECTED]> wrote:
> 
> > ...Thank you very much. Moving the  up in the chain fixed it
> 
> Yes, the problem was the EnglishPorterFilterFactory before the accents
> removal: the stemmer doesn't know about accents, so no stemming
> occured on "matthé" whereas "matthe" was stemmed to "matth".
> 
> BTW, your "rené" example makes me think you're indexing french, if
> that's the case you might want to use a stemmer configured for that
> language, for example
> 
>class="Solr.SnowballPorterFilterFactory"
>   language="French"/>

Bertrand, does the French Snowball work fine?

A colleague of mine exchanged mails with Porter about the Spanish stemmer,
and he came to the conclusion that it does not really work well for
Spanish:

"So -orio on the whole changes meaning too much (acceso = access,
accessorio = accessory differ as much in Spanish as English; -atorio
similarly (aclarar to  rinse, clear (in a very general sense), brighten
up; aclaratorio = explanatory). 

Diminutives, augmentatives usually fall under (a) and (c). -illo, -ote,
-isimo are in this category. 

-al and -iz look like plausible candidates for ending removal, but,
unlike their English counterparts, removing them makes little difference
or improvement. Similarly with -ion removal after -s. 

There is a difficulty with pure vowel endings, and the stemmer can't
always get this right. So in English 'academic' is stemmed to 'academ'
but 'academy' does not lose the final -y (or -i). This explains the
residual vowels with -io, -ia 
endings etc."

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



RE: Strange behavior when searching with accents

2007-09-21 Thread Thorsten Scherler
On Thu, 2007-09-20 at 11:13 -0700, Lance Norskog wrote:
> English and French are messy, so heuristic methods are the only possible.
> Spanish is rigorously clean, and stemming should be done from the declension
> rules and irregular conjugation tables. This involves large (fast) tables in
> ram rather than small (slow) string-shuffling.
> 

Interesting. Do you have a link to some documentation on how to implement this?

salu2

> Lance Norskog
> 
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
> Bertrand Delacretaz
> Sent: Thursday, September 20, 2007 8:11 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Strange behavior when searching with accents
> 
> On 9/20/07, Thorsten Scherler <[EMAIL PROTECTED]>
> wrote:
> > ...Betrand, does the French Snowball work fine?...
> 
> I've seen some weirdnesses, like "tennis" and "tenir" (means to hold) both
> stemmed to "ten", but in all of our (simple) tests it was ok.
> 
> The application where we're using it does not require high precision though,
> so it looked good enough and we didn't do create very extensive tests for
> it.
> 
> -Bertrand
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Scripts not working on cron - always asking for password

2007-09-21 Thread Thorsten Scherler
>> 
> >>>>> Hi, there,
> >>>>> 
> >>>>> I used an absolute path for the "dir" param in the solrconfig.xml as
> >>>>> below:
> >>>>> 
> >>>>> 
> >>>>>   snapshooter
> >>>>>   /var/SolrHome/solr/bin
> >>>>>   true
> >>>>>arg1 arg2 
> >>>>>MYVAR=val1 
> >>>>> 
> >>>>> 
> >>>>> However, I got "snapshooter: not found"  exception thrown in
> >>>> catalina.out.
> >>>>> I don't see why this doesn't work. Anything I'm missing?
> >>>>> 
> >>>>> 
> >>>>> Many thanks,
> >>>>> 
> >>>>> -Hui
> >>>>> 
> >>>> 
> >>> 
> >>> 
> >>> 
> >>> --
> >>> Regards,
> >>> 
> >>> -Hui
> >>> 
> >> 
> > 
> > 
> 
> 
> http://www.bbc.co.uk/
> This e-mail (and any attachments) is confidential and may contain personal 
> views which are not the views of the BBC unless specifically stated.
> If you have received it in error, please delete it from your system.
> Do not use, copy or disclose the information in any way nor act in reliance 
> on it and notify the sender immediately.
> Please note that the BBC monitors e-mails sent or received.
> Further communication will signify your consent to this.
>   
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: How to get all the search results - python

2007-09-24 Thread Thorsten Scherler
On Mon, 2007-09-24 at 14:34 +0530, Roopesh P Raj wrote:
> Hi,
> 
> I am using solr setup in Tomcat 5.5 with python 2.4 using python client 
> solr.py. 
> 
> When I search, all the results are not returned. 
> 
> The method call for searching is as follows : rows specifies the number of 
> rows.
> data = c.search(q='query', fl='id score unique_id Message-ID To From 
> Subject',rows=50, wt='python')
> 
> I want to specify that I want all the rows. How can I do that ?

Hi Roopesh,

I am not sure whether I understand your problem. 

Is it the limitation of rows/pagination?
If so, why not use a really high number (like rows=100)?

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: How to get all the search results - python

2007-09-24 Thread Thorsten Scherler
On Mon, 2007-09-24 at 16:29 +0530, Roopesh P Raj wrote:
> > Hi Roopesh,
> 
> > I am not sure whether I understand your problem. 
> 
> > Is it the limitation of rows/pagination? 
> > If so why not using a real high number (like rows=100)?
> 
> > salu2
> 
> Hi,
> 
> Assigning a high number will solve my problem. (I thought that there will 
> something like rows='all' to do it).
> 
> Can I do pagination using the python client? 

I am not a python expert but I think so.

> How can I specify the starting position, offset etc for 
> pagination through the python client? 

http://wiki.apache.org/solr/CommonQueryParameters

It should work as described in the above document (with the start
parameter).

e.g. 
data = c.search(q='query', fl='id score unique_id Message-ID To From
Subject',rows=50, wt='python',start=50)

HTH
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: How to get all the search results - python

2007-09-25 Thread Thorsten Scherler
On Tue, 2007-09-25 at 10:03 +0530, Roopesh P Raj wrote:

DISCLAIMER:
Please, I am subscribed to the user list, so there is no need to write
me directly or cc me in your response. Moreover, since we are an open
source project, off-list communication is suboptimal and harmful to the
community. The community has many eyes which can spot possible problems
with a solution and propose better ones. Further, the mailing list has
an archive, where proven solutions can be searched. If we do everything
in off-list mails, no solutions go into the archive and we always have
to repeat the same mails.

PLEASE write to the ml!

> > http://wiki.apache.org/solr/CommonQueryParameters
> 
> > It should work as described in the above document (with the start
> > parameter.
> 
> > e.g. 
> > data = c.search(q='query', fl='id score unique_id Message-ID To From
> > Subject',rows=50, wt='python',start=50)
> 
> > HTH
> > --
> 
> Hi,
> 
> I my application there is a provision to copy the archive based on date 
> indexed. 
> In this case the number of search results may exceed the high number I have 
> assigned to rows, say rows=1000. I wanted to avoid this situation. In 
> this 
> situation I don't want paginated queries. 
> 
> Can you please tell me how to approach this particular situation.

I think the best way is to
1) get the first response page (rows=50, start=0),
2) parse the response to see how many results you have in total,
3) do a loop (rows=50, start=50*x) and call solr till you have all
results (see the sketch below).
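
A rough sketch of that loop with plain urllib and wt=json (untested; if you
stay with the solr.py client, adapt the parsing to whatever c.search
returns):

import json
import urllib
import urllib2  # urllib.request on Python 3

def fetch_all(base_url, query, page_size=50):
    """Page through a result set with rows/start until numFound is reached."""
    docs, start, num_found = [], 0, None
    while num_found is None or start < num_found:
        url = "%s/select?q=%s&wt=json&rows=%d&start=%d" % (
            base_url, urllib.quote(query), page_size, start)
        response = json.load(urllib2.urlopen(url))["response"]
        num_found = response["numFound"]
        docs.extend(response["docs"])
        start += page_size
    return docs

all_docs = fetch_all("http://localhost:8983/solr", "query")
print(len(all_docs))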

Like Jérôme stated:
On Mon, 2007-09-24 at 12:45 +0100, Jérôme Etévé wrote:
> By design, it's not very efficient to ask for a large number of
> results with solr/lucene. I think you will face performance and memory
> problems if you do that. 

HTH

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Problem with html code inside xml

2007-09-25 Thread Thorsten Scherler
On Tue, 2007-09-25 at 12:06 +0100, Jérôme Etévé wrote:
> If I understand, you want to keep the raw html code in solr like that
> (in your posting xml file):
> 
> 
>   
> 
> 
> I think you should encode your content to protect these xml entities:
> <  ->  <
> > -> >
> " -> "
> & -> &
> 
> If you use perl, have a look at HTML::Entities.

AFAIR you cannot use raw tags; they always get transformed to
entities. The solution is to apply an xsl transformation to the
response that transforms the entities back to tags.

Have a look at the thread 
http://marc.info/?t=11677583791&r=1&w=2
and especially at
http://marc.info/?l=solr-user&m=116782664828926&w=2
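
For completeness, Jérôme's encoding step in Python would be roughly (the
field name here is made up for the example):

from xml.sax.saxutils import escape

# escape() protects & < > by default; pass extra entities for quotes.
raw_html = '<div class="paragraph">Les débats</div>'
encoded = escape(raw_html, {'"': "&quot;"})

doc = '<add><doc><field name="content">%s</field></doc></add>' % encoded
print(doc)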

HTH

salu2

> 
> 
> On 9/25/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> > Hello,
> >
> > I've got some problem with html code who is embedded in xml file:
> >
> > Sample source .
> >
> > 
> > 
> > 
> >  Les débats
> > 
> > 
> > Le premier tour des élections fédérales se 
> > déroulera le 21
> > octobre prochain. D'ici là, La 1ère vous propose plusieurs rendez-
> > vous, dont plusieurs grands débats à l'enseigne de Forums.
> > 
> > 
> > 
> > 
> > my para textehere
> > 
> > 
> > Vous trouverez sur cette page toutes les 
> > dates et les heures de
> > ces différents rendez-vous ainsi que le nom et les partis des
> > débatteurs. De plus, vous pourrez également écouter ou réécouter
> > l'ensemble de ces émissions.
> > 
> > 
> > 
> > -
> > When a make a query on solr I've got something like that in the
> > source code of the xml result:
> >
> > http://www.w3.org/1999/xhtml";>
> > <
> > div
> > class
> > =
> > "paragraph"
> > >
> > <
> > div
> > class
> > =
> > "paragraphTitle"
> > />
> > −
> > <
> > ...
> >
> > It is not exactly what I want. I want to keep the html tags, that all
> > without formatting.
> >
> > So the br tags and a tags are well formed in xml and json result, but
> > the div tags are not kept.
> > -
> > In the schema.xml I've got this for the html content
> >
> > 
> >
> >> stored="true" multiValued="true"/>
> >
> > -
> >
> > Any help would be appreciate.
> >
> > Thanks in advance.
> >
> > S. Christin
> >
> >
> >
> >
> >
> >
> 
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Converting German special characters / umlaute

2007-09-28 Thread Thorsten Scherler
On Thu, 2007-09-27 at 13:26 -0400, J.J. Larrea wrote:
> At 12:13 PM -0400 9/27/07, Steven Rowe wrote:
> >Chris Hostetter wrote:
...
> As for implementation, the first part could easily and flexibly accomplished 
> with the current PatternReplaceFilter, and I'm thinking the second could be 
> done with an extension to that or better yet a new Filter which allows 
> parsing synonymous tokens from a flat to overlaid format, e.g. something on 
> the order of:
> 
>   pattern="(.*)(ü|ue)(.*)"
>  replacement="$1ue$3|$1u$3"
>  tokensep="|"  
>  replace="first"/>
> 
> or perhaps better,
> 
>   pattern="(.*)(ü|ue)(.*)"
>  replacement="$1ue$3|$1u$3"
>  replace="first"/>
>   tokensep="|"/>   
> 
> which in my fantasy implementation would map:
> 
> Müller -> Mueller|Muller
> Mueller -> Mueller|Muller
> Muller -> Muller
> 
> and could be run at index-time and/or query-time as appropriate.
> 
> >Does anyone know if there are other (Latin-1-utilizing) languages
> >besides German with standardized diacritic substitutions that involve
> >something other than just stripping the diacritics?
> 
> I'm curious about this too.
> 

I am German, but working in Spain, so I have not faced the problem so
far. Anyhow, IMO

Müller -> Mueller
Mueller -> Mueller

is right; shortening the word further does not seem right, since it
changes the meaning too much.

Further:
groß -> gross
gross -> gross

ß is pronounced 'sz' but only replaced by 'ss'.

salu2

> - J.J.
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Search results problem

2007-10-17 Thread Thorsten Scherler
On Wed, 2007-10-17 at 20:44 +1000, Pieter Berkel wrote:
> There is a configuration option called "" in
> solrconfig.xmlwith the default value of 10,000.  You may need to
> increase this value if
> you are indexing fields that are longer.
> 

Is there a way to define an unlimited value? Like -1?

TIA

salu2

> 
> 
> On 17/10/2007, Maximilian Hütter <[EMAIL PROTECTED]> wrote:
> >
> > Daniel Naber schrieb:
> > > On Tuesday 16 October 2007 12:03, Maximilian Hütter wrote:
> > >
> > >> the content of one document is completely contained in another,
> > >> but search for a special word I only get one document as result.
> > >> I am absolutely sure it is contained in the other document, but I will
> > >> only get the "parent" doc if I add a word.
> > >
> > > You should try debugging the problem with Luke, e.g. use "reconstruct &
> > > edit" to see if the term is really indexed in both documents.
> > >
> > > Regards
> > >  Daniel
> > >
> >
> > Thank you for the tip, after using luke I can see that the term is
> > really missing in the other document.
> > Is there a size restriction for field content in Solr/Lucene? Because
> > from the "fulltext" field I use as default field (after luke
> > reconstruction) seem to be missing a lot strings I expected to find there.
> >
> > Best regards,
> >
> > Max
> >
> > --
> > Maximilian Hütter
> > blue elephant systems GmbH
> > Wollgrasweg 49
> > D-70599 Stuttgart
> >
> > Tel:  (+49) 0711 - 45 10 17 578
> > Fax:  (+49) 0711 - 45 10 17 573
> > e-mail :  [EMAIL PROTECTED]
> > Sitz   :  Stuttgart, Amtsgericht Stuttgart, HRB 24106
> > Geschäftsführer:  Joachim Hörnle, Thomas Gentsch, Holger Dietrich
> >
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Tagging in solr

2007-10-19 Thread Thorsten Scherler
On Fri, 2007-10-19 at 11:01 +0100, Spas Poptchev wrote:
> Hi,
>  
> what i want to do is to store tags that belong to products. Each tag should 
> also store information about how often it was used with a certain product.
> So for example:
>  
> product1
> cool 5=> product1 was tagged 5 times with cool
>  
> What would be the best way to implement this kind of stuff in solr?

There is a wiki page on some brainstorming on how to implement  
tagging within Solr: <http://wiki.apache.org/solr/UserTagDesign>

It's easy enough to have a tag_keywords field, but updating a single  
tag_keywords field is not so straightforward without sending the  
entire document to Solr every time it is tagged.  See SOLR-139's  
extensive comments and patches to see what you're getting into.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: escaping characters and security

2007-11-06 Thread Thorsten Scherler
On Tue, 2007-11-06 at 11:52 -0500, Micah Wedemeyer wrote:
> Are there any security risks to passing a query directly to Solr without
> doing any sort of escaping?  I am using URL encoding, so '&' and such
> are being encoded into their %XX equivalents.
> 
> Still, should I be doing anything else?  Is there such a thing as a
> Solr-injection attack?

http://wiki.apache.org/solr/mySolr

"Typically it's not recommended do have your front end users/clients
hitting Solr directly as part of an HTML form submit ... the more
conventional way to think of it is that Solr is a backend service, which
your application can talk to over HTTP -- if you were dealing with a
database, you wouldn't expect that you could generate an HTML form for
your clients and then have them submit that form in some way that
resulted in their browser using JDBC (or ODBC) to communicate directly
with your database, their client would communicate with your App, which
would validate their input, impose some security checks on the input,
and then execute the underlying query to your database -- working with
Solr should be very similar, it just so happens that instead of using
JDBC or some other binary protocol, Solr uses HTTP, and you *can* talk
to it directly from a web browser, but that's really more of a debugging
feature then anything else."
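
That said, if raw user input does end up in the q parameter, it is worth
neutralizing Lucene's query syntax characters before URL-encoding, e.g.
something like this (the character list is Lucene's documented set of
query special characters):

import re
import urllib  # urllib.parse.quote on Python 3

# Lucene special characters: + - && || ! ( ) { } [ ] ^ " ~ * ? : \
SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\]|&&|\|\|)')

def escape_query(user_input):
    """Backslash-escape Lucene operators so user text is matched literally."""
    return SPECIAL.sub(r'\\\1', user_input)

q = escape_query('foo (bar) -baz')
print("http://localhost:8983/solr/select?q=" + urllib.quote(q))

Note this does not neutralize the bare AND/OR/NOT keywords; handle those
in your application if they matter to you.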

HTH

salu2

> 
> Thanks,
> Micah
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Help with Debian solr/jetty install?

2007-11-21 Thread Thorsten Scherler
On Tue, 2007-11-20 at 22:50 -0800, Otis Gospodnetic wrote:
> Phillip,
> 
> I won't go into details, but I'll point out that the Java compiler is called 
> javac and if memory serves me well, it is defined in one of Jetty's XML 
> config files in its etc/ dir.  The java compiler is used to compile JSPs that 
> Solr uses for the admin UI.  So, make sure you have javac and make sure Jetty 
> can find it.
>  

e.g. 

cd ~
vim .bashrc

...
export JAVA_HOME=/home/thorsten/opt/java
export PATH=$JAVA_HOME/bin:$PATH

The important thing is that $JAVA_HOME points to the JDK and it is first
in your path!

salu2

> Otis
> 
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> - Original Message 
> From: Phillip Farber <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Tuesday, November 20, 2007 5:55:27 PM
> Subject: Help with Debian solr/jetty install?
> 
> 
> Hi,
> 
> I've successfully run as far as the example admin page on Debian linux
>  2.6.
> 
> So I installed the solr-jetty packaged for Debian testing which gives
>  me 
> Jetty 5.1.14-1 and Solr 1.2.0+ds1-1.  Jetty starts fine and so does the
>  
> Solr home page at http://localhost:8280/solr
> 
> But I get an error when I try to run http://localhost:8280/solr/admin
> 
> HTTP ERROR: 500
> No Java compiler available
> 
> I have sun-java6-jre and sun-java6-jdk packages installed.  I'm new to 
> servlet containers and java webapps.  What should I be looking for to 
> fix this or what information could I provide the list to get me moving 
> forward from here?
> 
> I've included the trace from the Jetty log, and the java properties
>  dump 
> from the example below.
> 
> Thanks,
> Phil
> 
> ---
> 
> Java properties (from the example):
> --
> 
> sun.boot.library.path = /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386
> java.vm.version = 1.6.0-b105
> java.vm.name = Java HotSpot(TM) Client VM
> user.dir = /tmp/apache-solr-1.2.0/example
> java.runtime.version = 1.6.0-b105
> os.arch = i386
> java.io.tmpdir = /tmp
> 
> java.library.path = 
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386/client:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/i386:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/../lib/i386:/usr/java/packages/lib/i386:/lib:/usr/lib
> java.class.version = 50.0
> jetty.home = /tmp/apache-solr-1.2.0/example
> sun.management.compiler = HotSpot Client Compiler
> os.version = 2.6.22-2-686
> java.class.path = 
> /tmp/apache-solr-1.2.0/example:/tmp/apache-solr-1.2.0/example/lib/jetty-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jetty-util-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/servlet-api-2.5-6.1.3.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/ant-1.6.5.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/core-3.1.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-2.1.jar:/tmp/apache-solr-1.2.0/example/lib/jsp-2.1/jsp-api-2.1.jar:/usr/share/ant/lib/ant.jar
> java.home = /usr/lib/jvm/java-6-sun-1.6.0.00/jre
> java.version = 1.6.0
> java.ext.dirs = 
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/ext:/usr/java/packages/lib/ext
> sun.boot.class.path = 
> /usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/resources.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/rt.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/sunrsasign.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jsse.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/jce.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/lib/charsets.jar:/usr/lib/jvm/java-6-sun-1.6.0.00/jre/classes
> 
> 
> 
> 
> Jetty log (from the error under Debian Solr/Jetty):
> 
> 
> org.apache.jasper.JasperException: No Java compiler available
> at 
> org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:460)
> at 
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:367)
> at
>  org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:329)
> at
>  org.apache.jasper.servlet.JspServlet.service(JspServlet.java:265)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at
>  org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:428)
> at 
> org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:473)
> at
>  org.mortbay.jetty.servlet.Dispatcher.dispatch(Dispatcher.java:286)
> at
>  org.mortbay.jetty.servlet.Dispatcher.forward(Dispatcher.java:171)
> at org.mortbay.jetty.servlet.Default.handleGet(Default.java:302)
> at org.mortbay.jetty.servlet.Default.service(Default.java:223)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at
>  org.mo

Get last updated/committed document

2007-11-23 Thread Thorsten Scherler
Hi all,

I need to ask solr to return me the id of the last committed document.

Is there a way to achieve this via a standard lucene query, or do I need
a custom connector that gives me this information?

TIA for any information

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Get last updated/committed document

2007-11-26 Thread Thorsten Scherler
On Sat, 2007-11-24 at 00:17 +1100, climbingrose wrote:
> Assuming that you have the timestamp field defined:
> q=*:*&sort=timestamp desc
> 

Thanks.
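
For the archive, the concrete request (assuming a stored "timestamp" field
filled with default="NOW" at index time, and the JSON writer enabled):

import json
import urllib2  # urllib.request on Python 3

url = ("http://localhost:8983/solr/select"
       "?q=*:*&sort=timestamp%20desc&rows=1&fl=id&wt=json")
docs = json.load(urllib2.urlopen(url))["response"]["docs"]
print(docs[0]["id"] if docs else "index is empty")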

salu2

> On Nov 23, 2007 10:43 PM, Thorsten Scherler
> <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I need to ask solr to return me the id of the last committed document.
> >
> > Is there a way to archive this via a standard lucene query or do I need
> > a custom connector that gives me this information?
> >
> > TIA for any information
> >
> > salu2
> > --
> > Thorsten Scherler thorsten.at.apache.org
> > Open Source Java  consulting, training and solutions
> >
> >
> 
> 
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: solr to work for my web application

2008-02-13 Thread Thorsten Scherler
On Wed, 2008-02-13 at 00:06 -0800, newBea wrote:
> hi 
> 
> I am new to solr/lucene...I have installed solr nightly version..its working
> very fine.
> 
> But it is working for the exampledocs present in the example folder of the
> nightly version of solr. I need solr to work for my current web
> application...I am using tomcat5.5.23 for the application(Windows)...using
> jetty to start solr from outside of the webapps folder.
> 
> Is there any way to start the jetty using tomcat?
> 
> Help would be appreciated...

some links that you may get started:
http://wiki.apache.org/solr
http://wiki.apache.org/solr/mySolr
http://wiki.apache.org/solr/SolrTomcat

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: solr to work for my web application

2008-02-13 Thread Thorsten Scherler
On Wed, 2008-02-13 at 03:42 -0800, newBea wrote:
> Hi Thorsten,
> 
> I have my application running on 8080 port with tomcat 5.5.23I am
> starting solr on port 8983 with jetty server using command "java -jar
> start.jar".
> 
> Both the server gets started...now any search I make on tomcat application
> is interacting with solr very well. The problem is "schema.xml" and
> "solrconfig.xml" in the conf directory are default one. But after adding
> customized schema name parameter and required fields, solr is not working as
> required.

Can you post the modification you made to both files?

> 
> Customized code for parsing the xml generated from solr is working
> fine...but it is unable to find the uniquekey field which we set for all the
> documents in the schema documentand thus result is 0 means nothing.
> 

Hmm, what is your update command and your unique key?

We would need to see this modification to tell you what may be wrong.

Did you try http://YOUR_HOST:8983/solr/admin/luke?wt=xslt&tr=luke.xsl

What does this give?

salu2

> I am not able to find the solution for this one... any suggestions wud be
> appreciated...thanks in advance. 
> 
> Thorsten Scherler-3 wrote:
> > 
> > On Wed, 2008-02-13 at 00:06 -0800, newBea wrote:
> >> hi 
> >> 
> >> I am new to solr/lucene...I have installed solr nightly version..its
> >> working
> >> very fine.
> >> 
> >> But it is working for the exampledocs present in the example folder of
> >> the
> >> nightly version of solr. I need solr to work for my current web
> >> application...I am using tomcat5.5.23 for the
> >> application(Windows)...using
> >> jetty to start solr from outside of the webapps folder.
> >> 
> >> Is there any way to start the jetty using tomcat?
> >> 
> >> Help would be appreciated...
> > 
> > some links that you may get started:
> > http://wiki.apache.org/solr
> > http://wiki.apache.org/solr/mySolr
> > http://wiki.apache.org/solr/SolrTomcat
> > 
> > salu2
> > -- 
> > Thorsten Scherler thorsten.at.apache.org
> > Open Source Java  consulting, training and solutions
> > 
> > 
> > 
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: solr to work for my web application

2008-02-13 Thread Thorsten Scherler
On Wed, 2008-02-13 at 05:04 -0800, newBea wrote:
> I havnt used luke.xsl. Ya but the link provided by u gives me "Solr Luke
> Request Handler Response"...
> 
>  is simple string as: csid

So you have:

<uniqueKey>csid</uniqueKey>

and

<field name="csid" type="string" indexed="true" stored="true" required="true"/>

> 
> till now I am updating docs thru command prompt as : post.jar *.xml
> http://localhost:8983/update

What do the docs look like? I mean, since you changed the sample config,
you send changed documents as well, right? How do they look?

> 
> I am not clear on how do I post xml docs

Well, like you said, with the post.jar, which then sends your
modified docs; but there are many ways to trigger an add command to solr.
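
For example, mirroring the earlier posts in this archive (adjust the -Durl
value to wherever your solr instance lives; note the /solr/ context path):

java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml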

>  or wud xml docs be posted while I
> request solr thru tomcat at the time of searching text...

To search text from tomcat you will need a servlet or something
similar that contacts the solr server for the search result and then
handles the response (e.g. applies a custom xsl to the results).



> 
> This manually procedure when I update the xml docs on exampledocs folder
> inside distribution package restrict it to exampledocs itself

No, either copy the jar to the folder where you have your documents or
add it to the PATH.

> ...I am not
> getting a way where my sites text get searched by solr...Do I need to copy
> start.jar and relevant folders in my working directory for web application.

Hmm, it seems that you have not understood the second paragraph of
http://wiki.apache.org/solr/mySolr

"Typically it's not recommended to have your front end users/clients
hitting Solr directly as part of an HTML form submit ... the more
conventional way to think of it is that Solr is a backend service, which
your application can talk to over HTTP ..."

Meaning you have two different servers running. Alternatively you can run
solr in the same tomcat as your application. If you follow SolrTomcat
from the wiki it will be installed as a "solr" servlet. Your application
will then communicate with this servlet.
salu2

> 
> any help?
> 
> Thorsten Scherler-3 wrote:
> > 
> > On Wed, 2008-02-13 at 03:42 -0800, newBea wrote:
> >> Hi Thorsten,
> >> 
> >> I have my application running on 8080 port with tomcat 5.5.23I am
> >> starting solr on port 8983 with jetty server using command "java -jar
> >> start.jar".
> >> 
> >> Both the server gets started...now any search I make on tomcat
> >> application
> >> is interacting with solr very well. The problem is "schema.xml" and
> >> "solrconfig.xml" in the conf directory are default one. But after adding
> >> customized schema name parameter and required fields, solr is not working
> >> as
> >> required.
> > 
> > Can you post the modification you made to both files?
> > 
> >> 
> >> Customized code for parsing the xml generated from solr is working
> >> fine...but it is unable to find the uniquekey field which we set for all
> >> the
> >> documents in the schema documentand thus result is 0 means nothing.
> >> 
> > 
> > Hmm, what is your update command and your unique key?
> > 
> > We would need to see this modification to tell you what may be wrong.
> > 
> > Did you try http://YOUR_HOST:8983/solr/admin/luke?wt=xslt&tr=luke.xsl
> > 
> > What does this gives?
> > 
> > salu2
> > 
> >> I am not able to find the solution for this one... any suggestions wud be
> >> appreciated...thanks in advance. 
> >> 
> >> Thorsten Scherler-3 wrote:
> >> > 
> >> > On Wed, 2008-02-13 at 00:06 -0800, newBea wrote:
> >> >> hi 
> >> >> 
> >> >> I am new to solr/lucene...I have installed solr nightly version..its
> >> >> working
> >> >> very fine.
> >> >> 
> >> >> But it is working for the exampledocs present in the example folder of
> >> >> the
> >> >> nightly version of solr. I need solr to work for my current web
> >> >> application...I am using tomcat5.5.23 for the
> >> >> application(Windows)...using
> >> >> jetty to start solr from outside of the webapps folder.
> >> >> 
> >> >> Is there any way to start the jetty using tomcat?
> >> >> 
> >> >> Help would be appreciated...
> >> > 
> >> > some links that you may get started:
> >> > http://wiki.apache.org/solr
> >> > http://wiki.apache.org/solr/mySolr
> >> > http://wiki.apache.org/solr/SolrTomcat
> >> > 
> >> > salu2
> >> > -- 
> >> > Thorsten Scherler
> >> thorsten.at.apache.org
> >> > Open Source Java  consulting, training and
> >> solutions
> >> > 
> >> > 
> >> > 
> >> 
> > -- 
> > Thorsten Scherler thorsten.at.apache.org
> > Open Source Java  consulting, training and solutions
> > 
> > 
> > 
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: solr to work for my web application

2008-02-19 Thread Thorsten Scherler
On Thu, 2008-02-14 at 23:16 -0800, newBea wrote:
> Hi Thorsten...
> 
> SOrry for giving u much trouble but I need some answer regarding solr...plz
> help...
> 
> Question1
> I am using tomcat 5.5.23 so for JNDI setup of solr, adding solr.xml with
> context fragment as below in the tomcat5.5/...catalina/localhost.
> 
> <Context docBase="..." crossContext="true">
>   <Environment name="solr/home" type="java.lang.String"
>    value="D:/Projects/csdb/solr" override="true" />
> </Context>
> 
> Is it the correct way of doing it? 

Yes as I understand the wiki page.

> Or do I need to add context fragment in
> the server.xml of tomcat5.5?
> 
> Question2
> I am starting solr server using start.jar from another location on C:
> drive...whereas my home location indicated on D: drive. Is it the root coz I
> am not getting the search result?

Hmm, as I understand it you are starting two instances of solr! One in
tomcat and the other in jetty. Why do you want that? If you have solr on
tomcat you do not need the jetty anymore. It makes no sense under
normal circumstances to do this.

> 
> Question3
> I have added the <dataDir> parameter as C:\solr\data in
> solrconfig.xml...

That seems to be wrong. It should read
<dataDir>${solr.data.dir:C:\solr\data}</dataDir> but I am not using
windows so I am not sure whether you may need to escape the path.

salu2

> but the indexes are not getting stored there...indexes for
> search are getting stored in the default dir of solr...any suggestions
> 
> Thanks in advance...
> 
> 
> Thorsten Scherler wrote:
> > 
> > On Wed, 2008-02-13 at 05:04 -0800, newBea wrote:
> >> I havnt used luke.xsl. Ya but the link provided by u gives me "Solr Luke
> >> Request Handler Response"...
> >> 
> >> <uniqueKey> is simple string as: csid
> > 
> > So you have:
> > <uniqueKey>csid</uniqueKey>
> > 
> > and
> > <field name="csid" ... required="true" /> 
> > 
> > 
> >> 
> >> till now I am updating docs thru command prompt as : post.jar *.xml
> >> http://localhost:8983/update
> > 
> > how do the docs look like? I mean since you changed the sample config
> > you send changed documents as well, right? How do they look?
> > 
> >> 
> >> I am not clear on how do I post xml docs
> > 
> > Well like you said, with the post.jar and then you will send your
> > modified docs but there are many ways to trigger an add command to solr.
> > 
> >>  or wud xml docs be posted while I
> >> request solr thru tomcat at the time of searching text...
> > 
> > To search text from tomcat you will need to have a servlet or something
> > similar that contacts the solr server for the search result and the
> > handle the response (e.g. apply custom xsl to the results).
> > 
> > 
> > 
> >> 
> >> This manually procedure when I update the xml docs on exampledocs folder
> >> inside distribution package restrict it to exampledocs itself
> > 
> > No, either copy the jar to the folder where you have your documents or
> > add it to the PATH.
> > 
> >> ...I am not
> >> getting a way where my sites text get searched by solr...Do I need to
> >> copy
> >> start.jar and relevant folders in my working directory for web
> >> application.
> > 
> > Hmm, it seems that you not have understood the second paragraph of 
> > http://wiki.apache.org/solr/mySolr
> > 
> > "Typically it's not recommended to have your front end users/clients
> > hitting Solr directly as part of an HTML form submit ... the more
> > conventional way to think of it is that Solr is a backend service, which
> > your application can talk to over HTTP ..."
> > 
> > Meaning you have two different server running. Alternatively you can run
> > solr in the same tomcat as you application. If you follow SolrTomcat
> > from the wiki it will be install as "solr" servlet. Your application
> > will then communicate with this serlvet.
> > 
> > salu2
> > 
> >> 
> >> any help?
> >> 
> >> Thorsten Scherler-3 wrote:
> >> > 
> >> > On Wed, 2008-02-13 at 03:42 -0800, newBea wrote:
> >> >> Hi Thorsten,
> >> >> 
> >> >> I have my application running on 8080 port with tomcat 5.5.23I am
> >> >> starting solr on port 8983 with jetty server using command "java -jar
> >> >> start.jar".
> >> >> 
> >> >> Both the server gets started...now any search I make on tomcat
> >> >> application
> >> >> is interacting with solr very well. The problem is "schema.xml" and
> >> >> &q

Re: How do I secure solr server?

2008-02-21 Thread Thorsten Scherler
On Thu, 2008-02-21 at 01:46 -0500, Mel Brand wrote:
> Hi guys,
> 
> I run solr on a separate server from the application server and I'd
> like to know how to protect it. 

best with a firewall.

> I'd like to know how to prevent
> someone from communicating to the server and also prevent unauthorized
> access (through the web) to admin page.

I would not expose http://yourServer:8983 at all. I would use an Apache
httpd server as proxy and implement the ac there.

salu2

> 
> Any help is extremely appreciated!! :)
> 
> Thanks,
> 
> Mel
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: solr to work for my web application

2008-02-22 Thread Thorsten Scherler
On Fri, 2008-02-22 at 04:11 -0800, newBea wrote:
> Hi Thorsten,
> 
> Many thanks for ur replies so far...finally i set up correct environment for
> Solr. Its working:clap:

:)

Congrats, glad you got it running.

> 
> Solr Rocks!

Indeed. :)

salu2

> 
> Thorsten Scherler wrote:
> > 
> > On Thu, 2008-02-14 at 23:16 -0800, newBea wrote:
> >> Hi Thorsten...
> >> 
> >> SOrry for giving u much trouble but I need some answer regarding
> >> solr...plz
> >> help...
> >> 
> >> Question1
> >> I am using tomcat 5.5.23 so for JNDI setup of solr, adding solr.xml with
> >> context fragment as below in the tomcat5.5/...catalina/localhost.
> >> 
> >> <Context docBase="..." crossContext="true">
> >>   <Environment name="solr/home" type="java.lang.String"
> >>    value="D:/Projects/csdb/solr" override="true" />
> >> </Context>
> >> 
> >> Is it the correct way of doing it? 
> > 
> > Yes as I understand the wiki page.
> > 
> >> Or do I need to add context fragment in
> >> the server.xml of tomcat5.5?
> >> 
> >> Question2
> >> I am starting solr server using start.jar from another location on C:
> >> drive...whereas my home location indicated on D: drive. Is it the root
> >> coz I
> >> am not getting the search result?
> > 
> > Hmm, as I understand it you are starting two instance of solr! One as a
> > tomcat and the other as jetty. Why do you want that? If you have solr on
> > tomcat you do not need the jetty anymore. I does make 0 sense under
> > normal circumstances to do this.
> > 
> >> 
> >> Question3
> >> I have added the <dataDir> parameter as C:\solr\data in
> >> solrconfig.xml...
> > 
> > That seems to be wrong. It should read
> > <dataDir>${solr.data.dir:C:\solr\data}</dataDir> but I am not using
> > windows so I am not sure whether you may need to escape the path.
> > 
> > salu2
> > 
> >> but the indexes are not getting stored there...indexes for
> >> search are getting stored in the default dir of solr...any suggestions
> >> 
> >> Thanks in advance...
> >> 
> >> 
> >> Thorsten Scherler wrote:
> >> > 
> >> > On Wed, 2008-02-13 at 05:04 -0800, newBea wrote:
> >> >> I havnt used luke.xsl. Ya but the link provided by u gives me "Solr
> >> Luke
> >> >> Request Handler Response"...
> >> >> 
> >> >> <uniqueKey> is simple string as: csid
> >> > 
> >> > So you have:
> >> > <uniqueKey>csid</uniqueKey>
> >> > 
> >> > and
> >> > <field name="csid" ... required="true" /> 
> >> > 
> >> > 
> >> >> 
> >> >> till now I am updating docs thru command prompt as : post.jar *.xml
> >> >> http://localhost:8983/update
> >> > 
> >> > how do the docs look like? I mean since you changed the sample config
> >> > you send changed documents as well, right? How do they look?
> >> > 
> >> >> 
> >> >> I am not clear on how do I post xml docs
> >> > 
> >> > Well like you said, with the post.jar and then you will send your
> >> > modified docs but there are many ways to trigger an add command to
> >> solr.
> >> > 
> >> >>  or wud xml docs be posted while I
> >> >> request solr thru tomcat at the time of searching text...
> >> > 
> >> > To search text from tomcat you will need to have a servlet or something
> >> > similar that contacts the solr server for the search result and the
> >> > handle the response (e.g. apply custom xsl to the results).
> >> > 
> >> > 
> >> > 
> >> >> 
> >> >> This manually procedure when I update the xml docs on exampledocs
> >> folder
> >> >> inside distribution package restrict it to exampledocs itself
> >> > 
> >> > No, either copy the jar to the folder where you have your documents or
> >> > add it to the PATH.
> >> > 
> >> >> ...I am not
> >> >> getting a way where my sites text get searched by solr...Do I need to
> >> >> copy
> >> >> start.jar and relevant folders in my working directory for web
> >> >> application.
> >> > 
> >> > Hmm, it seems that you not have understood the second paragraph of 
> >> > http://wiki.apache.org/solr/mySolr
> >> > 
> >> > "Typically it's not recommended to have your

Re: out of memory every time

2008-03-03 Thread Thorsten Scherler
On Mon, 2008-03-03 at 21:43 +0200, Justin wrote:
> I'm indexing a large number of documents.
> 
> As a server I'm using the /solr/example/start.jar
> 
> No matter how much memory I allocate it fails around 7200 documents.

How do you allocate the memory?

Something like:
java -Xms512M -Xmx1500M -jar start.jar

You may have a closer look as well at
http://java.sun.com/j2se/1.5.0/docs/guide/vm/gc-ergonomics.html
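
If in doubt whether the flags were picked up, you can ask the VM what it
actually got (a quick check of my own, nothing solr specific):

public class MaxMem {
    public static void main(String[] args) {
        // prints the -Xmx ceiling the VM is really running with
        System.out.println("max heap: "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024) + " MB");
    }
}

Run it with the same flags you pass to start.jar and compare.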

HTH

salu2

> I am committing every 100 docs, and optimizing every 300.
> 
> all of my xml's contain on doc, and can range in size from 2k to 700k.
> 
> when I restart the start.jar it again reports out of memory.
> 
> 
> a sample document looks like this:
> 
> 
>  
>   1851
>   TRAJ20
>   12049
>name="ft:external_ids.SourceAccession:15532">ENSG0211869
>   28735
>   HUgn28735
>   TRA_
>   TRAJ20
>   9953837
>name="ft:external_ids.SourceAccession:15538">ENSG0211869
>   T cell receptor alpha
> joining 20
>   14q11.2
>   14q11
>   14q11.2
>   AE000662.1
>   M94081.1
>   CH471078.2
>   NC_14.7
>   NT_026437.11
>   NG_001332.2
>   8188290
>   The human T-cell receptor
> TCRAC/TCRDC (C alpha/C delta) region: organization,sequence, and evolution
> of 97.6 kb of DNA.
>   Koop B.F.
>   Rowen L.
>   Hood L.
>   Wang K.
>   Kuo C.L.
>   Seto D.
>   Lenstra J.A.
>   Howard S.
>   Shan W.
>   Deshpande P.
>   31311_at
>   
> 
> 
> 
> 
> the schema is (in summary):
> 
> <field name="..." ... multiValued="false" omitNorms="true"/>
> <field name="..." ... multiValued="true"  omitNorms="true"/>
> 
> <field name="..." ... stored="true"  omitNorms="true"/>
> <field name="..." ... omitNorms="true"/>
> 
> 
> 
> <uniqueKey>PK</uniqueKey>
> <defaultSearchField>text</defaultSearchField>
> 
> 
> 
> 
> 
> 
> and my conf is:
> <useCompoundFile>false</useCompoundFile>
> <mergeFactor>100</mergeFactor>
> <maxBufferedDocs>900</maxBufferedDocs>
> <maxMergeDocs>2147483647</maxMergeDocs>
> <maxFieldLength>1
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Beginner questions: Jetty and solr with utf-8 + cached page + dedup

2008-03-26 Thread Thorsten Scherler
On Tue, 2008-03-25 at 10:56 -0700, Vinci wrote:
> Hi,
> 
> Thank for your reply.
> Question for apply xslt: If I use saxon, where should the saxon.jar located
> if I using the example jetty server? lib/ inside example/ or outside the
> example/?

http://wiki.apache.org/solr/mySolr
"...
Typically it's not recommended to have your front end users/clients
hitting Solr directly as part of an HTML form submit
..."

In the above page there you find answers to many of your questions.

HTH

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



search engine for regional bulletins

2006-11-28 Thread Thorsten Scherler
Hi all,

I am developing a search engine for a governmental body. This search
engine has to index pure xml documents which follow a custom xml schema.
The xml documents contain information about laws and official
announcements for Andalusia.

I need to implement different filters for the search. The current search
engine, which can be found here [1], would need to be extended with
ranges on organizational bodies, kind of announcement (law,
resolution, ...), ...

I played a bit with Nutch 0.8 and asked myself whether it is the best
tool for the task. I got nutch to index the xml documents and I can
search the index as well, but I would need to add filter conditions for
the search. The alternative I see would be pure lucene, since I am not
really "crawling" the site: the documents are not linked with each
other; instead all the files (which have to be indexed) are put in the
urls/bulletin file. Then Zaheed pointed me to Solr and I played around
a wee bit. 

To give you a better impression of the underlying architecture and xml
documents, each weekday there is a new bulletin (containing approx. 100
- 200 pages) eg [2]. This bulletin is stored on the file system and needs
to be indexed. 

We have two different document types: summaries and dispositions. The
summary looks like:

  1. DISPOSICIONES GENERALES
  
 Decreto
  178/2006, de 10 de octubre, por el que se establecen normas de
  protección de la avifauna para las instalaciones eléctricas de
  alta tensión
  
  

  Resolución de 10 de octubre de 2006, de la Dirección General de
  Tesorería y Deuda Pública, por la que se realiza una
  convocatoria de subasta de carácter ordinario dentro del
  Programa de Emisión de Bonos y Obligaciones de la Junta de
  Andalucía.
  


Following the tutorial and looking at the examples it seems that solr
only supports one document type. 


<add><doc>
  <field name="id">3007WFP</field>
  <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
  ...
</doc></add>


The root element <add> is "just" the command telling the server that we
want to add the document. Does that mean I would need to stick with this
doctype and transform our internal format for adding the document
information?

Further, since the project is for a customer, I would need a released
version when I put my engine in production. When does this community
expect to make its first release, or better asked, what are the
blockers?

TIA for any information.

salu2

[1] http://andaluciajunta.es/portal/aj-bojaBuscador/0,22815,,00.html 
[2]
http://andaluciajunta.es/portal/boletines/2006/11/aj-bojaVerPagina-2006-11/0,23167,bi%253D693228039889,00.html



Re: search engine for regional bulletins

2006-11-28 Thread Thorsten Scherler
On Tue, 2006-11-28 at 10:00 +0100, Bertrand Delacretaz wrote:
> Hi Thorsten, good to see you here!

:)

Hi Bertrand, thanks very much for this warm welcome and I am as well
glad to meet you here.

> 
> On 11/28/06, Thorsten Scherler
> <[EMAIL PROTECTED]> wrote:
> 
> > ...Following the tutorial and looking at the examples it seems that solr
> > only supports one document type.
> >
> > <add><doc>
> >   <field name="id">3007WFP</field>
> >   <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
> >   ...
> > </doc></add>
> 
> That's right, to add documents to a Solr index you need to transform
> them to this model. You're basically creating fields to be indexed,
> and the Solr schema.xml allows you to define precisely how you want
> each field to be indexed, including strict data types, pluggable
> Lucene analyzers, etc.
> 
> This means some work in converting your content model to an "indexing
> model", but it's very worth it as it gives you very precise control
> about what you index and how.
> 

Yeah, I thought about it last night and I came to the same conclusion.
The "extra" work involved is "just" a xsl transformation in my use case,
so not really the biggest part of this project.
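
For what it's worth, that conversion step is only a few lines of JAXP.
A sketch of my own (bulletin2solr.xsl is an assumed stylesheet name; it
would map our bulletin schema to solr's <add><doc> format):

import java.io.File;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class Bulletin2Solr {
    public static void main(String[] args) throws Exception {
        // compile the stylesheet that maps bulletin xml -> solr add command
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new File("bulletin2solr.xsl")));
        // args[0] is the bulletin document to convert
        t.transform(new StreamSource(new File(args[0])),
                    new StreamResult(new File(args[0] + ".solr.xml")));
    }
}

The resulting file can then be posted to solr's update url as usual.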

> > ...Further since the project is for a customer I would need a released
> > version when I put my engine in production. When does this community
> > expect to make its first release, or better asked which are the
> > blockers?...
> 
> I'm relatively new here so I'll let others complete this info, but
> IIUC the only work needed to do a first release is to make sure all
> source files are "clean" w.r.t required Apache license notices. I
> don't think there are any technical blockers for a release, many of us
> are happily using Solr on production sites.

That is good to hear. So if somebody (e.g. me) checked all files for
cleanliness, then we could release, right? Perfect.

> 
> You might want to look at these links for more info:
>   http://wiki.apache.org/solr/SolrResources
>   http://wiki.apache.org/solr/PublicServers

Thanks very much Bertrand, I will look at this information. I am still
evaluating what is best for this project, but solr sounds very
interesting ATM. 

salu2
> 
> -Bertrand



Re: search engine for regional bulletins

2006-11-28 Thread Thorsten Scherler
On Tue, 2006-11-28 at 11:30 -0500, Yonik Seeley wrote:
> On 11/28/06, Thorsten Scherler
> <[EMAIL PROTECTED]> wrote:
> > That is good to hear, so if somebody (e.g. me) would check all files for
> > cleanness then we could release, right? Perfect.
> 
> Correct.  All IP issues have been cleared, so It's just a matter of
> taking the time to put the release into a form that will be accepted
> by the incubator.  I expect we will be making a release candidate
> within a few weeks.  Of course the incubator guys always finds
> problems,  so getting an actual release out takes longer.
> 

Yeah, I have been in the incubator with lenya and we gained some valuable
experience back then. Further I see many committers here with experience
in different Apache PMCs, so hopefully we get it straight right away and
the incubator PMC does not find many issues.

I will try to help the best I can.

> -Yonik

Thanks Yonik.

salu2




solr index reusable with nutch?

2006-12-13 Thread Thorsten Scherler
Hi all,

is it possible to directly use the solr index in nutch?

My client is creating a portal search based on nutch. In this portal
there is my project as well, and ATM I prefer to go with solr instead of
nutch since it is much better for my use case.

Now the question is whether the portal search engine could use the solr
index for my part of the portal.

Can somebody point me to related documentation?

TIA

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)



Re: solr index reusable with nutch?

2006-12-13 Thread Thorsten Scherler
On Wed, 2006-12-13 at 07:45 -0800, Otis Gospodnetic wrote:
> Hi,
> 
> Solr should be able to search any Lucene index,

ok, good to know. :) 

So can I guess that the same is true for nutch? Meaning the index solr
is creating could be used by a nutch searcher.

>  not just those created by Solr itself, as long as you configure it properly 
> via schema.xml.  

http://wiki.apache.org/solr/SchemaXml?highlight=%28schema%29

> Thus, you should be able to use Solr to search an index created by Nutch. 

In my use case I need the reverse. Nutch searches the index created by
my solr application. The application is just one component in the portal
and the portal will provide a "global" search engine which should use
the index from solr.

>  Haven't tried it.  It would be nice if you could contribute the 
> configuration for doing this.
> 

As I figure it out I will keep you informed.

Thanks for the feedback.

salu2

> Otis
> 
> - Original Message 
> From: Thorsten Scherler <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, December 13, 2006 8:26:51 AM
> Subject: solr index reusable with nutch?
> 
> Hi all,
> 
> is it possible to directly use the solr index in nutch?
> 
> My client is creating a portal search based on nutch. In this portal
> there is as well my project and ATM I prefer to go with solr instead of
> nutch since it its much better for my use case.
> 
> Now the question is whether the portal search engine could use the solr
> index for my part of the portal.
> 
> Can somebody point me to related documentation?
> 
> TIA
> 
> salu2



Re: solr index reusable with nutch?

2006-12-15 Thread Thorsten Scherler
On Thu, 2006-12-14 at 11:14 -0800, Chris Hostetter wrote:
> : In my use case I need the reverse. Nutch searches the index created by
> : my solr application. The application is just one component in the portal
> : and the portal will provide a "global" search engine which should use
> : the index from solr.
> 
> If you have a compatible schema, then it should be possible ... but if
> your goal is to make an index with a biz object specific schema and then
> use it as a single collection/source in a nutch installation, that may not
> sork ... 

Yeah, that makes sense. 

> i'm not sure how flexible Nutch is about the indexes it can
> hanlde: it's probably a question best asked on the Nutch user list.
> 

Yeah, you are right.

Thanks for the feedback.

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)



Re: solr index reusable with nutch?

2006-12-20 Thread Thorsten Scherler
On Thu, 2006-12-14 at 11:14 -0800, Chris Hostetter wrote:
> : In my use case I need the reverse. Nutch searches the index created by
> : my solr application. The application is just one component in the portal
> : and the portal will provide a "global" search engine which should use
> : the index from solr.
> 
> If you have a compatible schema, then it should be possible ... but if
> your goal is to make an index with a biz object specific schema and then
> use it as a single collection/source in a nutch installation, that may not
> sork ... i'm not sure how flexible Nutch is about the indexes it can
> hanlde: it's probably a question best asked on the Nutch user list.

I did some testing with nutch searching over a solr index. Like Chris
said, a "compatible schema" is the only important point on this issue.

To put it in other words, nutch by default searches the "content" field
and returns some fields by default. So if you are not keen to
write your own nutch plugin for your custom solr schema, just make sure
that you use the field name="content" to store your main text. You can
further enhance the integration by using the "nutch" names for
"important" fields. 

Further I have a url field in my schema and it is the only field
that I see in the response of nutch.

sh bin/nutch org.apache.nutch.searcher.NutchBean presidencia
Total hits: 3
 0 null//2006/209/disposition/19923-a.html

 1 null//2006/209/disposition/20246-a.html

 2 null//2006/209/disposition/20034-a.html

This is good enough for my client and me since I can transform that
afterward. :)

Thanks Chris and Otis for your feedback.

salu2

> 
> 
> 
> 
> -Hoss
> 



Re: Realtime directory change...

2006-12-22 Thread Thorsten Scherler
On Thu, 2006-12-21 at 12:23 -0800, escher2k wrote:
> Hi,
>   We currently use Lucene to do index user data every couple of hours - the
> index is completely rebuilt,
> the old index is archived and the new one copied over to the directory.
> Example -
> 
> /bin/cp ${LOG_FILE} ${CRON_ROOT}/index/help/
> /bin/rm -rf ${INDEX_ROOT}/archive/help.${DATE}
> /bin/cp -R ${CRON_ROOT}/index/help ${INDEX_ROOT}/help.new
> /bin/mv ${INDEX_ROOT}/help ${INDEX_ROOT}/archive/help.${DATE}
> /bin/mv ${INDEX_ROOT}/help.new ${INDEX_ROOT}/help
> 
> This works fine since the index is retrieved every time from the disk. Is it
> possible to do the same with Solr ? 
> Assuming we also use caching to speed up the retrieval, is there a way to
> invalidate some/all caches when
> this done ?
> 

Did you look into 
http://wiki.apache.org/solr/CollectionDistribution
http://wiki.apache.org/solr/SolrCollectionDistributionScripts
http://wiki.apache.org/solr/SolrCollectionDistributionOperationsOutline

I am still very new to solr but it sounds like it is exactly what you
need (as others have said as well). 

HTH

salu2


> Thanks.
> 



Re: Help with spellchecker integration

2006-12-22 Thread Thorsten Scherler
On Thu, 2006-12-21 at 21:27 -0800, Otis Gospodnetic wrote: 
> Hi,
> I'm trying to integrate the Lucene-based spellchecker 
> (http://wiki.apache.org/jakarta-lucene/SpellChecker + contrib/spellchecker 
> under Lucene) with Solr (http://issues.apache.org/jira/browse/SOLR-81) in 
> order to provide a query spellchecking service (you enter Speers and it 
> suggest pant^H^H ... Spears).  I've created a generic NGramTokenizer (+ 
> NGramTokenizerFactory + unit test) that I'll attach to SOLR-81 shortly.
> 
> What I'm not yet sure about is:
> 1) integration of this generic n-grammer with that Lucene SpellChecker code - 
> SpellChecker & TRStringDistance classes in particular.

Hmm, reading SOLR-81, you actually have everything you need.

> 2) mapping n-gram Tokens that come out of my NGramTokenizer to specific field 
> names, like 3start, 4start, gram1, gram2, gram3 is there is scheme.xml 
> trick one can use to accomplish this?

It is in the issue:
...
[the schema.xml snippet from the issue was stripped by the list archive]

The above shows how to configure the second (spellcheck) index; however,
if you want to update both indexes at the same time you need to write
your own implementation of the update servlet.

> 3) once 2) is done, getting the request handler(?) to n-gram the query 
> appropriately and hit the SpellChecker index to try and find alternative 
> spelling suggestions.

hmm, not sure, actually IMHO that highly depends on how you plan to use
it in the end. I mean there is more than one way to use spell check.

In the issue they talked about AJAX suggestions, but that would be IMO
before the actual search request. If you want to have it in the request
handler then you need to decide how and when the spellchecker comes into
play.

Like whether it runs only if the normal search does not return a result,
or in parallel. Parallel would search in the spell check index for
alternatives, use these alternatives to dispatch the alternative word
query and later on parse the result directly into the output writer. Here
you have again different alternatives, e.g. you can attack the solr index
directly (losing all the cool features) 

Or you want the google thingy "Did you mean".

... in any form, 
start with:

public class NGramRequestHandler extends StandardRequestHandler
    implements SolrRequestHandler, SolrInfoMBean {
  public void handleRequest(SolrQueryRequest req, SolrQueryResponse rsp) {
    // Depending on the use case do your processing here
  }
}

This way you just need to implement the class specific methods. 


> 
> Damn, that's a lot of unknowns... on top of that my computer started freezing 
> every half an hour.  Hi Murphy.
> 
> 
> 
> Any pointers will be greatly appreciated. Thanks,

HTH a wee bit.

salu2

> Otis
> 
> 
> 



Re: Solr 1.1 released

2006-12-25 Thread Thorsten Scherler
On Fri, 2006-12-22 at 17:07 -0500, Yonik Seeley wrote:
> Solr 1.1 is now available for download! 

Very nice. :)

Thanks a lot to this community and especially to Yonik who packed the
release.

salu2



Is there a BasicSummarizer for solr?

2007-01-02 Thread Thorsten Scherler
Hi all,

I need to implement a summary function with solr like there is in nutch.
Basically it returns x words before and after the query term to show the
content where the term is embedded (like google does).

In nutch this functionality is provided by 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary-basic/
and especially 
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary-basic/src/java/org/apache/nutch/summary/basic/BasicSummarizer.java?view=markup

There is another similar plugin/class in
http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary-lucene/

Is there something similar in solr?

If not which is the best way to implement this functionality?

TIA for any tips.

salu2



Re: Is there a BasicSummarizer for solr?

2007-01-02 Thread Thorsten Scherler
On Tue, 2007-01-02 at 08:14 -0500, Erik Hatcher wrote:
> Thorsten - there is support for the Lucene Highlighter built into  
> Solr.  You can see details of how to use it here:
> 
>   <http://wiki.apache.org/solr/HighlightingParameters>
> 
>Erik
> 

:)  

Cheers Erik, with this information and a small change in my schema
(I changed stored="false" to stored="true" on my main content field), I
get exactly what I needed.

Now I have to see the effect of storing the content in the index
regarding size and response time.

Thanks again.

salu2

> 
> On Jan 2, 2007, at 7:26 AM, Thorsten Scherler wrote:
> 
> > Hi all,
> >
> > I need to implement a summary function with solr like there is in  
> > nutch.
> > Basically it returns x words before and after the query term to  
> > show the
> > content where the term is embedded (like as google does).
> >
> > In nutch this functionality is provided by
> > http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary- 
> > basic/
> > and especially
> > http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary- 
> > basic/src/java/org/apache/nutch/summary/basic/BasicSummarizer.java? 
> > view=markup
> >
> > There is another similar plugin/class in
> > http://svn.apache.org/viewvc/lucene/nutch/trunk/src/plugin/summary- 
> > lucene/
> >
> > Is there something similar in solr?
> >
> > If not which is the best way to implement this functionality?
> >
> > TIA for any tips.
> >
> > salu2
> 



How to tell the highlighter not to escape?

2007-01-02 Thread Thorsten Scherler
Hi all,

I am playing around with the highlighter and found that all highlight
terms get escaped.

I mean solr will return 
 &lt;em&gt;TERM&lt;/em&gt; and not
 <em>TERM</em> 

I am not sure where this escaping is happening but I would need the
highlighting to NOT escape the hl.simple.pre and hl.simple.post tags
since it is horror to work with cdata sections in xsl.

I had a look in the lucene highlighter and it seems that it does not
escape the tags.

Can somebody point me to the code which is responsible for escaping and
maybe give me a tip on how I can patch it to make this configurable. 

TIA

salu2



Re: How to tell the highlighter not to escape?

2007-01-03 Thread Thorsten Scherler
On Wed, 2007-01-03 at 02:16 +, Edward Garrett wrote:
> thorsten,
> 
> see the following for discussion. your case is indeed an annoyance--the
> thread below discusses motivations for it and ways of working around it. (i
> too confess that i wish it were not so.)
> 
> http://www.mail-archive.com/solr-user@lucene.apache.org/msg01483.html

Thanks Edward, the problem with the suggestion in the above thread is
that:
"just create an XSL that
generates XML and unescapes the fields you know will contain wellformed
XML data -- then apply your second transform client side"

Is not possible with xsl. See e.g. 
http://www.biglist.com/lists/xsl-list/archives/200109/msg00318.html
"> How can I match the Cdata Section?!?
>
You can't, the XPath data model regards CDATA as merely an input shortcut,
not as an information-bearing part of the XML content. In other words,
"" and "x" look exactly the same to the XSLT processor.

Mike Kay"

Michael Kay is the xsl guru and I can say as well from my own experience
that one would need to write a custom parser, since <em>TERM</em>
is equal to &lt;em&gt;TERM&lt;/em&gt; and this in xsl is a string (XPath
would match text()). 

IMO the highlighter should really return pure xml and not escape it. 
I will have a look in the XmlResponseWriter; maybe I find a way to change
this.

salu2


> 
> -edward
> 
> On 1/2/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> >
> > Hi Thorsten,
> >
> > The highlighter does not escape anything itself: you are seeing the
> > results of solr's automatic escaping of xml data within its xml
> > response.  This should be transparent (your xml decoder should
> > un-escape the values on the way out).  I'm not really familiar with
> > xslt so I'm unsure why that isn't so (perhaps it is automatically
> > html-escaping the values after un-xml-escaping them?)
> >
> > Be careful of documents containing html fragments natively.
> >
> > cheers,
> > -MIke
> >
> > On 1/2/07, Thorsten Scherler <[EMAIL PROTECTED]>
> > wrote:
> > > Hi all,
> > >
> > > I am playing around with the highlighter and found that all highlight
> > > terms get escaped.
> > >
> > > I mean solr will return
> > >  &lt;em&gt;TERM&lt;/em&gt; and not
> > >  <em>TERM</em> 
> > >
> > > I am not sure where this escaping is happening but I would need the
> > > highlighting to NOT escape the hl.simple.pre and hl.simple.post tag
> > > since it is horror to work with cdata sections in xsl.
> > >
> > > I had a look in the lucene highlighter and it seem that it does not
> > > escape the tags.
> > >
> > > Can somebody point me to code which is responsible for escaping and
> > > maybe give me a tip how I can patch to make it configurable.
> > >
> > > TIA
> > >
> > > salu2
> > >
> > >
> >
> 
> 
> 
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: How to tell the highlighter not to escape?

2007-01-03 Thread Thorsten Scherler
On Wed, 2007-01-03 at 12:06 +, Edward Garrett wrote:
> for what it's worth, i wrote a recursive template in xsl that replaces the
> escaped characters with actual elements. here, the variable $val would be
> the tag, e.g. "em". this has been working okay for me so far.

Yeah, many thanks for posting this template. This is actually
"imitating" a parser. 

However I still think the highlighter should return unescaped tags for
highlighting. There is IMO no benefit to the current behavior.

Thanks again Edward.

salu2

> 
> [the recursive xsl template was stripped by the list archive; only
> this fragment survives:
>  select="substring($insideEm, string-length($preEm)+5)"/> ]
> 
> On 1/3/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> >
> > On Wed, 2007-01-03 at 02:16 +, Edward Garrett wrote:
> > > thorsten,
> > >
> > > see the following for discussion. your case is indeed an annoyance--the
> > > thread below discusses motivations for it and ways of working around it.
> > (i
> > > too confess that i wish it were not so.)
> > >
> > > http://www.mail-archive.com/solr-user@lucene.apache.org/msg01483.html
> >
> > Thanks Edward, the problem is with the suggestion in the above thread is
> > that:
> > "just create an XSL that
> > generates XML and unescapes the fields you know will contain wellformed
> > XML data -- then apply your second transform client side"
> >
> > Is not possible with xsl. See e.g.
> > http://www.biglist.com/lists/xsl-list/archives/200109/msg00318.html
> > "> How can I match the Cdata Section?!?
> > >
> > You can't, the XPath data model regards CDATA as merely an input shortcut,
> > not as an information-bearing part of the XML content. In other words,
> > "" and "x" look exactly the same to the XSLT processor.
> >
> > Mike Kay"
> >
> > Michael Kay is the xsl guru and I can say as well from my own experience
> > one would need to write a custom parser since <em>TERM</em>
> > is equal to &lt;em&gt;TERM&lt;/em&gt; and this in xsl is a string (XPath
> > would match text()).
> >
> > IMO the highlighter should really return pure xml and not escape it.
> > I will have a look in the XmlResponseWriter maybe I find a way to change
> > this.
> >
> > salu2
> >
> >
> > >
> > > -edward
> > >
> > > On 1/2/07, Mike Klaas <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi Thorsten,
> > > >
> > > > The highlighter does not escape anything itself: you are seeing the
> > > > results of solr's automatic escaping of xml data within its xml
> > > > response.  This should be transparent (your xml decoder should
> > > > un-escape the values on the way out).  I'm not really familiar with
> > > > xslt so I'm unsure why that isn't so (perhaps it is automatically
> > > > html-escaping the values after un-xml-escaping them?)
> > > >
> > > > Be careful of documents containing html fragments natively.
> > > >
> > > > cheers,
> > > > -MIke
> > > >
> > > > On 1/2/07, Thorsten Scherler <
> > [EMAIL PROTECTED]>
> > > > wrote:
> > > > > Hi all,
> > > > >
> > > > > I am playing around with the highlighter and found that all
> > highlight
> > > > > terms get escaped.
> > > > >
> > > > > I mean solr will return
> > > > >  &lt;em&gt;TERM&lt;/em&gt; and not
> > > > >  <em>TERM</em> 
> > > > >
> > > > > I am not sure where this escaping is happening but I would need the
> > > > > highlighting to NOT escape the hl.simple.pre and hl.simple.post tag
> > > > > since it is horror to work with cdata sections in xsl.
> > > > >
> > > > > I had a look in the lucene highlighter and it seem that it does not
> > > > > escape the tags.
> > > > >
> > > > > Can somebody point me to code which is responsible for escaping and
> > > > > maybe give me a tip how I can patch to make it configurable.
> > > > >
> > > > > TIA
> > > > >
> > > > > salu2
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > --
> > thorsten
> >
> > "Together we stand, divided we fall!"
> > Hey you (Pink Floyd)
> >
> >
> >
> 
> 
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




[ANN] Apache Forrest/Cocoon based solr client plugin

2007-01-07 Thread Thorsten Scherler
Hi all,

I am happy to announce that I just added an Apache Forrest based Apache
Solr client plugin to the forrest whiteboard. It may be of interest
for those using Apache Cocoon based Apache Forrest and Apache Lucene
based Apache Solr.

org.apache.forrest.plugin.output.solr generates Apache Solr documents
from Apache Forrest xdos. Further when run with the Apache Forrest
Dispatcher it provides a GUI to manage your project in solr and a search
interface to search your solr server.

The documentation and a couple of screenshots can be found at
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/

The source code can be found at
http://svn.apache.org/viewvc/forrest/trunk/whiteboard/plugins/org.apache.forrest.plugin.output.solr/

Have fun with it and please provide feedback to this list.

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: Seeking FAQs

2007-01-08 Thread Thorsten Scherler
On Sat, 2007-01-06 at 10:25 -0500, David Halsted wrote:
> I wonder what would happen if we used a clustering engine like Carrot
> to categorize either the e-mails in the archive or the results of
> searches against them?  Perhaps we'd find some candidates for the FAQ
> that way.

Not sure about tools, but IMO this works fine when done by
users/committers. I think the one who asked the question on the list is
a likely candidate to add an entry in the FAQ.

The typical scenario should be:
user asks question -> user get answers from community -> user adds FAQ
entry with the solution that worked for her

This way the one asking the question can give a little something back to
the community.

If you follow the lists a bit one can identify some faq's right away:
- Searching multiple indices 
- Clustering solr (custom scorer, highlighter, ...)
- ...


> 
> Dave
> 
> On 1/5/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> > Hey everybody,
> >
> > I was lookin at the FAQ today, and I realized it hasn't really changed
> > much in the past year ... in fact, only two people besides myself have
> > added questions (thanks Thorsten and Darren) in the entire time Solr
> > has been in incubation -- which is not to say that Erik and Respaldo's
> > efforts to fix my typo's aren't equally helpful :)
> >
> > http://wiki.apache.org/solr/FAQ
> >
> > In my experience, FAQs are one of the few pieces of documentation that are
> > really hard for developers to write, because we are so use to dealing with
> > the systems we work on, we don't allways notice when a question has been
> > asked more then once or twice (unless it gets asked over and over and
> > *over*).  The best source of FAQ updates tend to come from users who have
> > a question, and either find the answer in the mailing list archives, or
> > notice the same question asked by someone else later.
> >

Yes, I totally agree. Sometimes the content for the solution can be
found in the wiki. One would just need to link to the wiki page from the
FAQ.

> > So If there are any "gotchas" you remember having when you first started
> > using Solr, or questions you've noticed asked more then once please feel
> > free to add them to the wiki.  The Convention is to only add a question if
> > you're also adding an answer, but even if you don't think a satisfactory
> > answer has ever been given, or you're not sure how to best summarize
> > multiple answers given in the past, just including links to
> > instances in the mailing list archives where the question was asked is
> > helpful -- both in the short term as pointers for people looking for help,
> > and in the long term as starter points for people who want to flesh out a
> > detailed answer.
> >

In the long run, wiki content that has proven to be a solution should
IMO go directly into the official documentation. 

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: newbie question on determining fieldtype

2007-01-08 Thread Thorsten Scherler
On Mon, 2007-01-08 at 10:29 -0300, mike topper wrote:
> Hi,
> 
> I have a question that I couldn't find the exact answer to. 
> 
> I have some fields that I want to add to my schema but will never be 
> searched on.  They are only used as additional information about a 
> document when retrieved.  They are integers, so should i just have the 
> field be:
> 
> <field name="..." type="integer" indexed="false" stored="true"/>
> 
> I'm pretty sure this is right, but I just wanted to check that I'm not 
> missing any speedups from using a different field
> or adding some other parameters.
> 

Seems pretty right to me.

Did you read 
http://wiki.apache.org/solr/SchemaXml

and see the comments there on the indexed and stored attributes?

HTH
salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: Performance tuning

2007-01-11 Thread Thorsten Scherler
On Thu, 2007-01-11 at 14:57 +, Stephanie Belton wrote:
> Hello,
> 
>  
> 
> Solr is now up and running on our production environment and working great. 
> However it is taking up a lot of extra CPU and memory (CPU usage has doubled 
> and memory is swapping). Is there any documentation on performance tuning? 
> There seems to be a lot of useful info in the server output but I don’t 
> understand it.
> 
>  
> 
> E.g.
> filterCache{lookups=0,hits=0,hitratio=0.00,inserts=537,evictions=0,size=337,cumulative_lookups=4723,cumulative_hits=3708,cumulative_hitratio=0.78,cumulative_inserts=4647,cumulative_evictions=72}
> 
> 
> queryResultCache{lookups=0,hits=0,hitratio=0.00,inserts=256,evictions=0,size=256,cumulative_lookups=3779,cumulative_hits=552,cumulative_hitratio=0.14,cumulative_inserts=3632,cumulative_evictions=0}
> 
> 
> documentCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,cumulative_lookups=66005,cumulative_hits=2460,cumulative_hitratio=0.03,cumulative_inserts=63545,cumulative_evictions=4195}
> 
>  
> 
> etc. what should I be watching out for?
> 

Hi Stephanie,

did you see http://wiki.apache.org/solr/SolrPerformanceFactors?

Further you may consider balancing the load via
http://wiki.apache.org/solr/CollectionDistribution

HTH

salu2

>  
> 
> Thanks
> 
> Stephanie
> 



Re: How can I update a specific field of an existing document?

2007-01-11 Thread Thorsten Scherler
On Thu, 2007-01-11 at 10:19 -0600, Iris Soto wrote:
> Hello everybody,
> I want update a specific field in a document, but i don't find how do it 
> in the documentation of Solr.
> Is that posible?, I need to index only a field for a document, Do i have 
> to index all the document for this?
> The problem is that i have to transform a bizdata object to a file 
> content xml in java,  i should to build all the document xml step by 
> step, field by field, retrieving all the bizdata of database to be 
> passed to Solr.
> 

On Thu, 2007-01-11 at 06:43 -0500, Erik Hatcher wrote:
> In Lucene to update a document the operation is really a delete  
> followed by an add.  You will need to add the complete document as  
> there is no such "update only a field" semantics in Lucene. 

This is from a thread in the dev list.

So no, it is not possible to just update one field.

HTH

salu2

> Thanks in advance.
> 



Re: How can I update a specific field of an existing document?

2007-01-11 Thread Thorsten Scherler
On Thu, 2007-01-11 at 17:48 +0100, Thorsten Scherler wrote:
> On Thu, 2007-01-11 at 10:19 -0600, Iris Soto wrote:
> > Hello everybody,
> > I want update a specific field in a document, but i don't find how do it 
> > in the documentation of Solr.
> > Is that posible?, I need to index only a field for a document, Do i have 
> > to index all the document for this?

No, just the one document. Let's say you have a CMS and you edit one
document. You will need to re-index only this document, by using the
solr add statement for the whole document (not one field only).
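
A minimal sketch of such a re-index from java (my own illustration; the
field names and the id value are made up):

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class ReindexOneDoc {
    public static void main(String[] args) throws Exception {
        // the complete document, not just the changed field
        String doc = "<add><doc>"
            + "<field name=\"id\">42</field>"
            + "<field name=\"content\">the edited text</field>"
            + "</doc></add>";
        HttpURLConnection con = (HttpURLConnection)
            new URL("http://localhost:8983/solr/update").openConnection();
        con.setDoOutput(true);
        con.setRequestMethod("POST");
        con.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        OutputStream out = con.getOutputStream();
        out.write(doc.getBytes("UTF-8"));
        out.close();
        System.out.println("update status: " + con.getResponseCode());
        // post a <commit/> the same way to make the change searchable
    }
}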

> > The problem is that i have to transform a bizdata object to a file 
> > content xml in java,  i should to build all the document xml step by 
> > step, field by field, retrieving all the bizdata of database to be 
> > passed to Solr.

see above, only for the document where the fields are changed. I wrote a
small cocoon based plugin in forrest doing the cms related example.

It adds a document related solr gui for a cms like system. Maybe that
gives you some ideas for your own app.


> > 
> 
> On Thu, 2007-01-11 at 06:43 -0500, Erik Hatcher wrote:
> > In Lucene to update a document the operation is really a delete  
> > followed by an add.  You will need to add the complete document as  
> > there is no such "update only a field" semantics in Lucene. 
> 
> This is from a thread in the dev list.

could not access the archive the first time:
http://www.nabble.com/forum/ViewPost.jtp?post=8275908&framed=y

HTH

salu2

> 
> So no it is not possible to just update one field.
> 
> HTH
> 
> salu2
> 
> > Thanks in advance.
> > 
> 
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: [ANN] Apache Forrest/Cocoon based solr client plugin

2007-01-10 Thread Thorsten Scherler
On Tue, 2007-01-09 at 22:50 -0500, Yonik Seeley wrote:
> Thanks Thorsten,
> 
> Knowing nothing about cocoon and little about forrest, I'm not sure
> exactly what this does :-)
> 

jeje, fair enough. 

You know forrest from the solr webpage. What I did is a small generic
way to access the solr server with cocoon/forrest. 

What it does is mainly solving (basic) SOLR-20 & SOLR-30 for cocoon. You
can update and select content from the solr server connecting to the
http interface. 

The nice thing is the power of cocoon that Bertrand is always talking
about. ;) We use the output of the solr server as is and use it in the
transformation pipeline. 

The update interface is 
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/images/gui-actionbar.png
and it returns a small success/error page (depending on the solr
response). This interface is halfway url specific (add and delete) and
you can execute the commit and optimize commands on every page.

It is based on the solr generator which is a wrapper of  
http://svn.apache.org/viewvc/forrest/trunk/whiteboard/plugins/org.apache.forrest.plugin.output.solr/src/java/org/apache/forrest/http/client/PostFile.java?view=markup

which is a simple class to post a file from one url to another. The
response body is provided as stream and as string. I wrote this simple
class since the patches of SOLR-20 & SOLR-30 are not yet applied. 

> I'll take a guess in non-cocoon/forrest speech: does it allow you to
> update a Solr server with the content of your website at the same time
> you generate (or change) the site?

Well, it is not working so far in the static build, meaning
"forrest" (not sure ATM why myself), which would do exactly what you say
regarding generating the site. In "forrest run", the dynamic mode of
forrest, however, it lets ...


>So it's a push model of web
> indexing instead of spidering? 

Exactly. 

To finish the above sentence: ... it lets you push update commands to
the server based on each selected page.

>  The search-box I understand, but
> presumably that needs to point to a running Solr server somewhere.

Yes.
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/index.html
"...
The host server urls can be configured by adding the following
properties to your project forrest.properties.xml in case you do not use
the default values.

<property name="..." value="http://localhost:8983/solr/select"/>
<property name="..." value="http://localhost:8983/solr/update"/> 
..."

The forrest.properties.xml is new in 0.8-dev.

The result will be transformed to something like:
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/images/result.png

I added a transformer that adds the paginator part to the solr select result. 
The paginator is the "Result pages" part of the above screenshot. 

Hmm, that makes me wonder whether that (the paginator) would be better off
directly in solr core. 


wdyt?

salu2
> 
> -Yonik
> 
> On 1/7/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> > Hi all,
> >
> > I am happy to announce that I just add a Apache Forrest based Apache
> > Solr client plugin to the forrest whiteboard. It may be from interest
> > for the ones using Apache Cocoon based Apache Forrest and Apache Lucene
> > based Apache Solr.
> >
> > org.apache.forrest.plugin.output.solr generates Apache Solr documents
> > from Apache Forrest xdos. Further when run with the Apache Forrest
> > Dispatcher it provides a GUI to manage your project in solr and a search
> > interface to search your solr server.
> >
> > The documentation and a couple of screenshots can be found at
> > http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/
> >
> > The source code can be found at
> > http://svn.apache.org/viewvc/forrest/trunk/whiteboard/plugins/org.apache.forrest.plugin.output.solr/
> >
> > Have fun with it and please provide feedback to this list.
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: XML querying

2007-01-15 Thread Thorsten Scherler
On Mon, 2007-01-15 at 12:23 +, Luis Neves wrote:
> Hello.
> What I do now to index XML documents it's to use a Filter to strip the 
> markup, 
> this works but it's impossible to know where in the document is the match 
> located.
> What would it take to make possible to specify a filter query that accepts 
> xpath 
> expressions?... something like:
> 
> fq=xmlField:/book/content/text()
> 
> This way only the "/book/content/" element was searched.
> 
> Did I make sense? Is this possible?

AFAIK short answer: no.

The field is ALWAYS plain text. There is no xmlField type.

...but why don't you just add your text in multiple fields when indexing?

Instead of plain stripping the markup, do the above xpath on your document
and create different fields. Like
 <field name="..." select="/book/content/text()"/>
 ...

Makes sense?
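
A quick java sketch of that idea, purely illustrative (the xpath is from
your example, everything else is assumed):

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

public class XPathFields {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(new File(args[0]));
        XPath xpath = XPathFactory.newInstance().newXPath();
        // one xpath per solr field instead of stripping all markup
        String content = xpath.evaluate("/book/content", doc);
        // real code would xml-escape the value before printing
        System.out.println("<field name=\"content\">" + content + "</field>");
    }
}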

HTH

salu2

> 
> --
> Luis Neves



Re: Calling Solr requests from java code - examples?

2007-01-16 Thread Thorsten Scherler
On Tue, 2007-01-16 at 12:52 +0100, [EMAIL PROTECTED] wrote:
> Thanks!
> 
> and how would you do it calling it from another web application, let's  
> say from a servlet or so? I need to do some stuff in my web java code,  
> then call the Solr service and do some more stuff afterwards
> 

Have a look at 
https://issues.apache.org/jira/browse/SOLR-86

HTH

salu2




Re: Converting Solr response back to pojo's, experiences?

2007-01-16 Thread Thorsten Scherler
On Tue, 2007-01-16 at 14:58 +0100, [EMAIL PROTECTED] wrote:
> Anyone having experience converting xml responses back to pojo's,  
> which technologies have you used?
> 
> Anyone doing json <-> pojo's?

Using pure xml myself but have a look at 
https://issues.apache.org/jira/browse/SOLR-20
and 
https://issues.apache.org/jira/secure/attachment/12348567/solr-client.zip

HTH
salu2

> 
> Grtz
> 



Re: solr + cocoon problem

2007-01-16 Thread Thorsten Scherler
On Tue, 2007-01-16 at 16:19 -0500, Walter Lewis wrote:
> [EMAIL PROTECTED] wrote:
> > Any ideas on how to implement a cocoon layer above solr?

I just finished a forrest plugin (in the whiteboard, our testing ground
in forrest) that does what you asked for, plus some pagination.
Forrest is cocoon based so you just have to build the plugin jar and add
it to your cocoon project. Please ask on the forrest list if you have
problems.

http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/

> You're far from the only one approaching solr via cocoon ... :)
> 
> The approach we took, passes the search parameters to a "solrsearch" 
> stylesheet, the heart of which is a  block that embeds the 
> solr results.  A further transformation prepares the results of the solr 
> query for display.

That was my first version for above plugin as well, but since forrest
makes use of the cocoon crawler I needed something with a default search
string for offline generation.

You should have a closer look at 
http://svn.apache.org/viewvc/forrest/trunk/whiteboard/plugins/org.apache.forrest.plugin.output.solr/output.xmap?view=markup
and 
http://svn.apache.org/viewvc/forrest/trunk/whiteboard/plugins/org.apache.forrest.plugin.output.solr/input.xmap?view=markup

For the original use case of this thread I added a generator:

  [the map:generator declaration was stripped by the list archive]

and as well a paginator transformer that calculates the next pages based
on start, rows and numFound:

  [the map:transformer declaration was stripped by the list archive]

We use it as follows:

  [the pipeline match combining the solr generator and the paginator
  transformer was stripped by the list archive]

You may be interested in the update generator as well. 

Please give feedback to [EMAIL PROTECTED] 

It really needs more testing beyond myself; you could be the first to
provide feedback.

  [a further sitemap snippet was stripped by the list archive]

HTH

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: solr + cocoon problem

2007-01-16 Thread Thorsten Scherler
On Tue, 2007-01-16 at 16:02 -0500, [EMAIL PROTECTED] wrote:
> Hi,
> 
> I am trying to implement a cocoon based application using solr for searching.
> In particular, I would like to forward the request from my response page to
> solr.  I have tried several alternatives, but none of them worked for me.
> 

Please see http://wiki.apache.org/solr/SolrForrest.

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: Calling Solr requests from java code - examples?

2007-01-16 Thread Thorsten Scherler
On Tue, 2007-01-16 at 13:56 +0100, Bertrand Delacretaz wrote:
> On 1/16/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> 
> > ...Have a look at
> > https://issues.apache.org/jira/browse/SOLR-86...
> 
> Right, I should have mentioned this one as well. I have linked SOLR-20
> and SOLR-86 now, so that people can see the various options for Java
> clients.

Cheers, mate. :)

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: XML querying

2007-01-16 Thread Thorsten Scherler
On Mon, 2007-01-15 at 13:42 +, Luis Neves wrote:
> Hi!
> 
> Thorsten Scherler wrote:
> 
> > On Mon, 2007-01-15 at 12:23 +, Luis Neves wrote:
> >> Hello.
> >> What I do now to index XML documents it's to use a Filter to strip the 
> >> markup, 
> >> this works but it's impossible to know where in the document is the match 
> >> located.
> >> What would it take to make possible to specify a filter query that accepts 
> >> xpath 
> >> expressions?... something like:
> >>
> >> fq=xmlField:/book/content/text()
> >>
> >> This way only the "/book/content/" element was searched.
> >>
> >> Did I make sense? Is this possible?
> > 
> > AFAIK short answer: no.
> > 
> > The field is ALWAYS plain text. There is no xmlField type.
> > 
> > ...but why don't you just add your text in multiple field when indexing.
> > 
> > Instead of plain stripping the markup do above xpath on your document
> > and create different fields. Like
> >  <field name="..." select="/book/content/text()"/>
> >  ...
> > 
> > Makes sense?
> 
> Yes, but I have documents with different schemas on the same "xml field", 
> also, 
> that way I  would have to know the schema of the documents being indexed 
> (which 
> I don't).
> 
> The schema I use is something like:
> 
> 
> 
> Where each distinct DocumentType has its own schema.
> 
> I could revise this approach to use an Solr instance for each DocumentType 
> but I 
> would have to find a way to "merge" results from the different instances 
> because 
> I also need to search across different DocumentTypes... I guess I'm SOL :-(
> 

I think you should explain your use case a wee bit more.

>>> What I do now to index XML documents it's to use a Filter to strip
the markup, 
> >> this works but it's impossible to know where in the document is the match 
> >> located.

why do you need to know where? 

Maybe we can think of something.

salu2
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: XML querying

2007-01-17 Thread Thorsten Scherler
On Wed, 2007-01-17 at 09:36 +, Luis Neves wrote:
> Hi,
> 
> Thorsten Scherler wrote:
> > On Mon, 2007-01-15 at 13:42 +, Luis Neves wrote:
> 
> > 
> > I think you should explain your use case a wee bit more.
> > 
> >>>> What I do now to index XML documents it's to use a Filter to strip
> > the markup, 
> >>>> this works but it's impossible to know where in the document is the 
> >>>> match located.
> > 
> > why do you need to know where? 
> 
> Poorly phrased from my part. Ideally I want to apply "lucene filters" to the 
> xml 
> content.
> Something like what Nux does:
> <http://dsd.lbl.gov/nux/api/nux/xom/pool/FullTextUtil.html>
> 

http://dsd.lbl.gov/nux/ ("Google-like realtime fulltext search via
Apache Lucene engine")

If you have a look at this you will see that the lucene search is plain
and not xquery based. It is more that you can define relations like in
SQL connecting two tables via keys. As I understand it, it will return
the docs that match the xpath /books/book[author="James" and
lucene:match(abstract, $query)] where the lucene match is based on a
normal lucene query.

I reckon it should be very easy to do something like this in a client
environment like cocoon/forrest. See the nux code to get an idea.
If I needed to solve this I would look for a component that gives me
XQuery, like nux, and a component that lets me query a solr server.

Then you "just" need a custom method that matches the documents for
which both components return a result.
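
A rough sketch of what I mean (the helper names are made up, standing in
for the real nux and solr client calls):

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Run the XQuery side (e.g. nux) and the solr side independently, then
// keep only the documents both return. runXQuery() and runSolrQuery()
// are hypothetical stand-ins for the real calls.
public class CombinedSearch {

    public Set<String> search(String xquery, String solrQuery) {
        List<String> structuralHits = runXQuery(xquery);      // docs whose XML structure matches
        List<String> fullTextHits = runSolrQuery(solrQuery);  // docs matching the lucene query

        // the custom "match" step: intersect the two result sets by document id
        Set<String> result = new HashSet<String>(structuralHits);
        result.retainAll(fullTextHits);
        return result;
    }

    // hypothetical: would evaluate the XQuery against the XML store via nux
    private List<String> runXQuery(String xquery) {
        return Arrays.asList("doc1", "doc2");
    }

    // hypothetical: would query the solr server and collect the ids
    private List<String> runSolrQuery(String q) {
        return Arrays.asList("doc2", "doc3");
    }
}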

salu2

> 
> --
> Luis Neves



Re: solr + cocoon problem

2007-01-17 Thread Thorsten Scherler
On Wed, 2007-01-17 at 10:25 -0500, [EMAIL PROTECTED] wrote:
> Hi,
> 
> I agree, this is not a legal URL.  But the thing is that cocoon itself is
> sending the unescaped URL. 

...because you told it so.

You use

<map:generate src="http://hostname/solr/select/?q={request-param:q}"
type="file"/>

The request param module will not escape the param by default.
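
For illustration, this is the kind of escaping that has to happen
somewhere before the URL is sent (a minimal sketch, values made up):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EscapeParam {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String q = "title:\"solr & cocoon\"";
        // percent-encode the raw user input so the resulting URL is legal
        String escaped = URLEncoder.encode(q, "UTF-8");
        System.out.println("http://hostname/solr/select/?q=" + escaped);
    }
}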

salu2



Re: Solr "autostart"

2007-01-28 Thread Thorsten Scherler
On Sun, 2007-01-28 at 10:34 -0500, Tim Archambault wrote:
> Using Solr with Jetty on a linux VPS server. When I ssh in and run "start.jar" I
> can go to a web browser and reach the /solr/admin page with success. I can
> query with the whole "nine", no problems.

> However when I close out my terminal session (iBook) I cannot access the
> solr web interface. My intuition is that when  my terminal closes, port 8983
> is no longer available.


Try starting the server with
nohup java -jar start.jar > solr.log 2>&1 &

That should keep it running after you disconnect; nohup makes the process
ignore the hangup signal when your session closes.

HTH

salu2

> 
> How can I set my VPS up so that SOLR just works without manual prompting?
> Server restart, program failure, etc.
> 
> Thanks for any help.
> 
> Tim
-- 
thorsten

"Together we stand, divided we fall!" 
Hey you (Pink Floyd)




Re: Posting from Ant

2007-02-01 Thread Thorsten Scherler
On Thu, 2007-02-01 at 16:48 -0500, Erik Hatcher wrote:
> The benefit to having a  solution (now I'm beginning to  
> speak like a Rubyist, eh?  Ever toyed with Rake, Peter?) is that you  
> can handle errors yourself.
> 
> I never really expected the pipeline to be XML files -> XSLT *files* ->
> HTTP POST -> Solr.
> 
> The *files* part here is key.  Can't ya get your Cocoon-skinned cap  
> on and roll a pipeline that does it all on the fly with badass  
> compiled style sheet performance, IoC configurable, da works.   Right  
> Bess?  I'd be happy to collaborate with Bess to wire in a Cocoon  
> kinda Ant task wrapper if the world would be a better place with it.

I wrote something like this. 
http://wiki.apache.org/solr/SolrForrest

I am using it in my project in an ant task that will call the forrest
site target and request the indexing actions url (e.g.
index.solr.add.do)
http://forrest.apache.org/pluginDocs/plugins_0_80/org.apache.forrest.plugin.output.solr/
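
If you would rather do the POST from Java (e.g. inside a custom Ant
task), the curl commands from post.sh quoted further down translate more
or less directly; a rough sketch against the default local Solr URL
(class and method names are made up):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

public class SolrPoster {

    // POST one body to the update handler, mirroring what post.sh does with curl
    static void post(URL url, byte[] body) throws Exception {
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("POST");
        con.setDoOutput(true);
        con.setRequestProperty("Content-type", "text/xml; charset=utf-8");
        try (OutputStream out = con.getOutputStream()) {
            out.write(body);
        }
        // read the response so the request completes
        try (InputStream in = con.getInputStream()) {
            in.transferTo(System.out);
        }
    }

    public static void main(String[] args) throws Exception {
        URL update = new URL("http://localhost:8983/solr/update");
        for (String f : args) {
            post(update, Files.readAllBytes(Path.of(f))); // one xml file per POST
        }
        post(update, "<commit/>".getBytes("UTF-8"));      // flush the changes
    }
}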

salu2

> 
>   Erik
> 
> On Feb 1, 2007, at 11:43 AM, Binkley, Peter wrote:
> 
> > Thanks, I'll try that out. I hope there aren't any encoding issues...
> > Nah, how likely is that? I'll report back.
> >
> > Peter
> >
> > -Original Message-
> > From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, February 01, 2007 6:38 AM
> > To: solr-user@lucene.apache.org
> > Subject: Fwd: Posting from Ant
> >
> > Ok, we have it on good authority that <foreach> is the way to go
> > for Ant -> POST -> Solr.
> >
> > Erik
> >
> >
> > Begin forwarded message:
> >
> >> From: Steve Loughran <[EMAIL PROTECTED]>
> >> Date: February 1, 2007 8:34:33 AM EST
> >> To: Erik Hatcher <[EMAIL PROTECTED]>
> >> Subject: Re: Posting from Ant
> >>
> >> On 01/02/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> >>> cool, thanks.  it only posts a single file, it looks like, but i
> >>> suppose the <foreach> ant-contrib task would be the way to go to
> >>> post a directory full of .xml files?   or is there now something in ant
> >>> that can do that iteration that i'm unaware of?
> >>
> >> well, someone could add multifile post, but foreach makes more sense
> >>
> >>>
> >>> woefully ignorant of the latest stuff in ant,
> >>> Erik
> >>>
> >>> On Feb 1, 2007, at 2:52 AM, Steve Loughran wrote:
> >>>
> >>>> yes, there is an antlib (not released, you need to build it
> >>> yourself)
> >>>> that does posts, including http forms posting.
> >>>>
> >>>> http://svn.apache.org/viewvc/ant/sandbox/antlibs/http/trunk/
> >>>>
> >>>> On 01/02/07, Erik Hatcher <[EMAIL PROTECTED]> wrote:
> >>>>> Steve,
> >>>>>
> >>>>> Know of any HTTP POST tasks that could take a directory of .xml files
> >>>>> and post them to Solr?   We do it with curl like this, with Solr's
> >>>>> post.sh:
> >>>>>
> >>>>>FILES=$*
> >>>>>URL=http://localhost:8983/solr/update
> >>>>>
> >>>>>for f in $FILES; do
> >>>>>  echo Posting file $f to $URL
> >>>>>  curl $URL --data-binary @$f -H 'Content-type:text/xml;
> >>>>> charset=utf-8'
> >>>>>  echo
> >>>>>done
> >>>>>
> >>>>>#send the commit command to make sure all the changes are
> >>> flushed
> >>>>> and visible
> >>>>>curl $URL --data-binary '<commit/>'
> >>>>>
> >>>>> But something more Ant-centric would be tasty.
> >>>>>
> >>>>> Thanks,
> >>>>> Erik
> >>>>>
> >>>>>
> >>>>>
> >>>>> Begin forwarded message:
> >>>>>
> >>>>>> From: "Binkley, Peter" <[EMAIL PROTECTED]>
> >>>>>> Date: January 31, 2007 1:56:06 PM EST
> >>>>>> To: 
> >>>>>> Subject: Posting from Ant
> >>>>>> Reply-To: solr-user@lucene.apache.org
> >>>>>>
> >>>>>> Is there an Ant task out there somewhere that can POST
> >>> bunches of
> >>>>>> files
> >>>>>> to Solr, doing what the post.sh script does but with filesets?
> >>>>>>
> >>>>>> I've found the http post task
> >>>>>> (http://antelope.tigris.org/nonav/docs/manual/bk03ch17.html),
> >>>>> but it
> >>>>>> just posts name-value pairs, not files; and Slide's set of
> >>> webdav
> >>>>>> client
> >>>>>> tasks
> >>>>>> (http://gulus.usherbrooke.ca/pub/appl/apache/jakarta/slide/binaries/jakarta-slide-ant-webdav-bin-2.1.zip)
> >>>>>> has PUT and GET but not
> >>> POST. It
> >>>>>> shouldn't be hard to adapt one of these, but something pre-
> >>> existing
> >>>>>> would be better.
> >>>>>>
> >>>>>> Peter
> >>>>>>
> >>>>>> Peter Binkley
> >>>>>> Digital Initiatives Technology Librarian Information Technology
> >>>>>> Services 4-30 Cameron Library University of Alberta Libraries
> >>>>>> Edmonton, Alberta Canada T6G 2J8
> >>>>>> Phone: (780) 492-3743
> >>>>>> Fax: (780) 492-9243
> >>>>>> e-mail: [EMAIL PROTECTED]
> >>>>>
> >>>>>
> >>>
> >>>
> 
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XMLconsulting, training and solutions



Re: Analyzers and Tokenizers?

2007-02-06 Thread Thorsten Scherler
On Tue, 2007-02-06 at 17:27 +0100, rubdabadub wrote:
> Hi:
> 
> Are there more filters/tokenizers than the ones mentioned here..?
> 
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
> 
> I have found some in the example/schema.xml which are new ...
> 
> <fieldtype name="..." class="..." sortMissingLast="true" omitNorms="true">
>   <analyzer>
>     ... more ...
>   </analyzer>
> </fieldtype>
> 
> Is there any complete list somewhere ... or how can I find more info about them?

http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/analysis/

HTH

salu2
> 
> Kind regards,
-- 
Thorsten Scherler   thorsten.at.apache.org
Open Source Java & XML  consulting, training and solutions



Re: crawler feed?

2007-02-07 Thread Thorsten Scherler
On Wed, 2007-02-07 at 11:09 +0100, rubdabadub wrote:
> Hi:
> 
> Are there relatively stand-alone crawlers that are
> suitable/customizable for Solr? Has anyone done any trials.. I have
> seen some discussion about the cocoon crawler.. was that successful?

http://wiki.apache.org/solr/SolrForrest

I am using this approach in a custom project that is cocoon based, and it
is working very well. However, cocoon's crawler is not standalone but uses
the cocoon cli. I am using the solr/forrest plugin for the commit and for
dispatching the update. The indexing transformation in the plugin is a
wee bit different than the one in my project, since I needed to extract
more information from the documents to create better filters.

However, since the cocoon cli is no longer in 2.2 (cocoon-trunk) and
forrest uses it as its main component, I am keen to write a simple
crawler that could be reused for cocoon, forrest, solr, nutch, ...

I may start something pretty soon (I guess I will open a project in
Apache Labs) and will keep this list informed. My idea is to write a
simple crawler which could be easily extended by plugins. So if a
project/app needs special processing for a crawled url, one could write a
plugin to implement the functionality. A solr plugin for this crawler
would be very simple: basically it would parse e.g. the html page and
dispatch an update command for the extracted fields. I think one
should try to reuse as much code from nutch as possible for this parsing.
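
To illustrate the extension point I have in mind, a rough sketch of such
a plugin contract (all names made up, this is not an existing API):

import java.io.IOException;
import java.io.InputStream;
import java.util.List;

// Hypothetical plugin interface: the core crawler fetches each url and
// hands the stream to the plugin registered for it.
public interface CrawlerPlugin {

    // extract the outgoing links so the core crawler can follow them
    List<String> extractLinks(InputStream page) throws IOException;

    // page-specific processing; a solr plugin would parse the page here
    // and dispatch an update command with the extracted fields
    void process(String url, InputStream page) throws IOException;
}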

If somebody is interested in such a standalone crawler project, I
welcome any help, ideas, suggestions, feedback and/or questions.

salu2
-- 
Thorsten Scherler   thorsten.at.apache.org
Open Source Java & XML  consulting, training and solutions



Re: crawler feed?

2007-02-07 Thread Thorsten Scherler
On Wed, 2007-02-07 at 18:03 +0200, Sami Siren wrote:
> rubdabadub wrote:
> > Hi:
> > 
> > Are there relatively stand-alone crawlers that are
> > suitable/customizable for Solr? Has anyone done any trials.. I have
> > seen some discussion about the cocoon crawler.. was that successful?
> 
> There's also an integration path available for Nutch[1] that i plan to
> integrate after 0.9.0 is out.

Sounds very nice, I just finished reading it. Thanks.

Today I submitted a proposal for an Apache Labs project called Apache
Droids.

http://mail-archives.apache.org/mod_mbox/labs-labs/200702.mbox/browser

Basic idea is to create a flexible crawler framework. The core should be
a simple crawler which could be easily extended by plugins. So if a
project/app needs special processing for a crawled url one could write a
plugin to implement the functionality.

salu2

> 
> --
>  Sami Siren
> 
> [1]http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XMLconsulting, training and solutions



[Droids] Re: crawler feed?

2007-02-08 Thread Thorsten Scherler
On Thu, 2007-02-08 at 14:40 +0100, rubdabadub wrote:
> Thorsten:
> 
> First of all, I read your lab idea with great interest, as I am in need
> of such a crawler. However, there are certain things that I would like to
> discuss. I am not sure what forum will be appropriate for this, but I
> will do my idea shooting here first; please tell me where I should
> post further comments.

Since it is not an official lab project yet, I am unsure myself, but I
think we should discuss details on [EMAIL PROTECTED] Please reply
to the labs ml.

> 
> A vertical search engine that focuses on a specific set of data (i.e.
> uses solr, for example, cos it provides the maximum field flexibility)
> would greatly benefit from such a crawler. E.g. the next big technorati
> or the next big event-finding solution can use your crawler to crawl
> feeds using a feed-plugin (maybe nutch plugins) or scrape websites for
> event info using some x-path/xquery stuff (personally I think xpath is
> a pain in the a... :-)

These, like you pointed out, are surely some use cases for the crawler in
combination with plugins.

Another is the wget-like crawl that an application can use to export a
static site (e.g. a CMS).

> 
> What I worry about is those issue that has to deal with
> 
> - updating crawls

Actually, if you look only at the crawl itself, there is no difference
between an update crawl and any other crawl.

> - how many threads per host

should be configurable. 

> - scale etc.

you mean a crawl cluster?

> 
> All the maintainer's headaches! 

That is why droids is a labs proposal. 

http://labs.apache.org/bylaws.html

All apache committers have write access, and when a lab is promoted, the
files are moved over to the incubation area.

>  I know you will use as much code as
> you can from Nutch, plus you are not planning to re-invent the wheel. But
> wouldn't it be much easier to jump into Sami's idea and make it better
> and more stand-alone, and still benefit from the Nutch community?

I will start a thread on nutch dev and see whether or not it is possible
to extract the crawler from the core, but the main idea is to keep
droids simple.

Imagine something like the following pseudo code:

public void crawl(String url) throws IOException {
    // resolve the stream
    InputStream stream = new URL(url).openStream();
    // look up the plugin that is registered for the stream
    Plugin plugin = lookupPlugin(stream);
    // extract the links (link pattern matcher)
    Link[] links = plugin.extractLinks(stream);
    // apply pattern plugins for storing/excluding links
    links = plugin.handleLinks(links);
    // pass the stream to the plugin for further processing
    plugin.main(stream);
}


> I wonder, wouldn't it be easy to push/pursue a route where the nutch
> crawler becomes a standalone crawler? no? I read a post about it on the list.
> 

Can you provide some links to get some background information? TIA.

> I would like to hear more about how your plan will evolve in terms of
> druid and why not join forces with Sami and co.?

I am more familiar with solr than nutch, I have to admit.

Like I said, all committers have write access on droids and everybody is
welcome to join the effort. Who knows, maybe the first droid will be a
standalone nutch crawler with plugin extension points, if some nutch
committer joins the lab.
 
Thanks rubdabadub for your feedback.

salu2

> 
> Regards
> 
> On 2/7/07, Thorsten Scherler <[EMAIL PROTECTED]> wrote:
> > On Wed, 2007-02-07 at 18:03 +0200, Sami Siren wrote:
> > > rubdabadub wrote:
> > > > Hi:
> > > >
> > > > Are there relatively stand-alone crawlers that are
> > > > suitable/customizable for Solr? Has anyone done any trials.. I have
> > > > seen some discussion about the cocoon crawler.. was that successful?
> > >
> > > There's also an integration path available for Nutch[1] that i plan to
> > > integrate after 0.9.0 is out.
> >
> > Sounds very nice, I just finished reading it. Thanks.
> >
> > Today I submitted a proposal for an Apache Labs project called Apache
> > Droids.
> >
> > http://mail-archives.apache.org/mod_mbox/labs-labs/200702.mbox/browser
> >
> > Basic idea is to create a flexible crawler framework. The core should be
> > a simple crawler which could be easily extended by plugins. So if a
> > project/app needs special processing for a crawled url one could write a
> > plugin to implement the functionality.
> >
> > salu2
> >
> > >
> > > --
> > >  Sami Siren
> > >
> > > [1]http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
> > --
> > Thorsten Scherler thorsten.at.apache.org
> > Open Source Java & XMLconsulting, training and solutions
> >
> >
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XMLconsulting, training and solutions



RE: Using cocoon to update index

2007-03-26 Thread Thorsten Scherler
On Mon, 2007-03-26 at 09:30 -0400, Winona Salesky wrote:
> Thanks Chris, I'll take another look at the forrest plugin.

Have a look as well at http://wiki.apache.org/solr/SolrForrest
it points out the cocoon components.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XMLconsulting, training and solutions



Re: SolrSearchGenerator for Cocoon (2.1)

2007-03-27 Thread Thorsten Scherler
On Tue, 2007-03-27 at 10:53 -0400, [EMAIL PROTECTED] wrote:
> Hi,
> 
> I looked at the SolrSearchGenerator (this is the part which is of interest to
> me), but I could not get it to work with Cocoon 2.1 yet.
> 
> It seems that there is no getParameters method on the
> org.apache.cocoon.environment.Request interface:
> http://cocoon.apache.org/2.1/apidocs/org/apache/cocoon/environment/Request.html
> I guess using the getParameterNames and getParameter methods instead
> should do the trick.
> 
> Or am I missing something?

No, you are right, the "getParameters" method is cocoon-trunk specific. I
just changed the code to be cocoon-2.1.x compatible.

http://svn.apache.org/viewvc?view=rev&rev=523081

Thanks for the feedback Mirko.

Now in cocoon-2.1.x, to use the plugin in your custom project please do
the following:
1) svn co http://svn.apache.org/repos/asf/forrest/trunk forrest (this
checkout is our $FORREST_HOME)
2) cd $FORREST_HOME/main; ./build.sh
3) cd $FORREST_HOME/whiteboard/plugins/org.apache.forrest.plugin.output.solr
4) $FORREST_HOME/tools/ant/bin/ant local-deploy
5) cp \
$FORREST_HOME/whiteboard/plugins/org.apache.forrest.plugin.output.solr/build/org.apache.forrest.plugin.output.solr.jar \
$cocoon-2.1.x_webapp/WEB-INF/lib/

From there you can use the cocoon components as usual in your project.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java & XMLconsulting, training and solutions



Re: Solr logo poll

2007-04-07 Thread Thorsten Scherler
B

Graffiti style.
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



[Standings] Solr logo poll

2007-04-10 Thread Thorsten Scherler
Hi all,

I did a small count; till now we have:
a) 21
b) 13

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: unsubscribe

2007-05-10 Thread Thorsten Scherler
On Thu, 2007-05-10 at 10:05 +0100, Kainth, Sachin wrote:
> unsubscribe

Hi Sachin,

you need to send to a different mailing address:
[EMAIL PROTECTED]

HTH

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



Re: Packaging solr for Debian: using debian-supplied lucene-*.jar

2007-06-04 Thread Thorsten Scherler
On Sun, 2007-06-03 at 09:55 +0200, Jan-Pascal van Best wrote:
> Hi all,
> 
> I'm working on packaging Solr for Debian. 

Very nice. :)

Since this is a developer topic, I think it should be discussed
on our dev list.

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions



RE: storing the document URI in the index

2007-06-12 Thread Thorsten Scherler
On Tue, 2007-06-12 at 16:33 +0200, Ard Schrijvers wrote:
> Thanks Yonik and Walter,
> 
> putting it that way, it does make good sense to not store the transient xml
> file, which is the situation in most of the use cases (I was thinking
> differently because I do have xml files on the file system or over http,
> like from a webdav call)
> 
> Anyway, thx for all answers, and again, sry for mails not indenting properly 
> at the moment, it irritates me as well :-)
> 
> Regards Ard

Hi Ard,

you may want to have a look at 
http://wiki.apache.org/solr/SolrForrest

salu2
-- 
Thorsten Scherler thorsten.at.apache.org
Open Source Java  consulting, training and solutions