Re: Search Multiple indexes In Solr

2007-11-08 Thread zx zhang
It is said that this new feature will be added in Solr 1.3, but I am not sure
about that.

I think the following may be useful for you:
https://issues.apache.org/jira/browse/SOLR-303
https://issues.apache.org/jira/browse/SOLR-255


2007/11/8, j 90 <[EMAIL PROTECTED]>:
>
> Hi, I'm new to Solr but very familiar with Lucene.
>
> Is there a way to have Solr search in more than one index, much like the
> MultiSearcher in Lucene ?
>
> If so, how do I configure the location of the indexes ?
>


Re: SOLR 1.2 - Duplicate Documents??

2007-11-08 Thread Yonik Seeley
On Nov 7, 2007 12:30 PM, realw5 <[EMAIL PROTECTED]> wrote:
> We did have Tomcat crash once (JVM OutOfMem) during an indexing process,
> could that be a possible source of the issue?

Yes.
Deletes are buffered and carried out in a different phase.

-Yonik


AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread Hausherr, Jens
Hi, 

if you just need to preserve the xml for storing you could simply wrap the xml 
markup in CDATA. Splitting your structure beforehand and using dynamic fields 
might be a viable solution...

eg. 

<add>
  <doc>
    <field name="content"><![CDATA[
      <foo>value 1</foo>
      <foo>value 2</foo>
    ]]></field>
  </doc>
</add>

Mit freundlichen Grüßen / Best Regards / Avec mes meilleures salutations

 
Jens Hausherr 
 
Dipl.-Wirtsch.Inf. (Univ.) 
Senior Consultant 
 
Tel: 040-27071-233
Fax: 040-27071-244
Fax: +49-(0)178-998866-097
Mobile: +49-(0)178-8866-097
 
mailto:[EMAIL PROTECTED]
 
Unilog Avinci - a LogicaCMG company
Am Sandtorkai 72
D-20457 Hamburg
http://www.unilog.de  
 
Unilog Avinci GmbH
Zettachring 4, 70567 Stuttgart
Amtsgericht Stuttgart HRB 721369
Geschäftsführer: Torsten Straß / Eric Guyot / Rudolf Kuhn / Olaf Scholz
 


This e-mail and any attachment is for authorised use by the intended 
recipient(s) only. It may contain proprietary material, confidential 
information and/or be subject to legal privilege. It should not be copied, 
disclosed to, retained or used by, any other party. If you are not an intended 
recipient then please promptly delete this e-mail and any attachment and all 
copies and inform the sender. Thank you.


Discovering RequestHandler parameters at runtime

2007-11-08 Thread Grant Ingersoll

Hi,

Is there any way to interrogate a RequestHandler to discover what  
parameters it supports at runtime?  Kind of like a BeanInfo for  
RequestHandlers?  Has anyone else thought about doing this and what it  
might look like?  Seems like it would be useful for building dynamic  
web forms.


Thanks,
Grant


RE: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread Binkley, Peter
I've used eXist for this kind of thing and had good experiences, once I
got a grip on Xquery (which is definitely worth learning). But I've only
used it for small collections (under 10k documents); I gather its
effective ceiling is much lower than Solr's. 

Possibly it will be possible to use Lucene's new payloads to do this
kind of thing (at least, storing Xpath information is one of the
proposed uses: http://lucene.grantingersoll.com/2007/03/18/payloads/ ),
as Erik Hatcher suggested in relation to
https://issues.apache.org/jira/browse/SOLR-380 .

Peter

-Original Message-
From: David Neubert [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 07, 2007 9:52 PM
To: solr-user@lucene.apache.org
Subject: Re: What is the best way to index xml data preserving the mark
up?

Thanks Walter -- 

I am aware of MarkLogic -- and agree -- but I have a very low budget on
licensed software in this case (near 0) -- 

have you used eXists or Xindices? 

Dave

- Original Message 
From: Walter Underwood <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, November 7, 2007 11:37:38 PM
Subject: Re: What is the best way to index xml data preserving the mark
up?

If you really, really need to preserve the XML structure, you'll be
doing a LOT of work to make Solr do that. It might be cheaper to start
with software that already does that. I recommend MarkLogic -- I know
the principals there, and it is some seriously fine software. Not free
or open, but very, very good.

If your problem can be expressed in a flat field model, then your
problem is mapping your document model into Solr. You might be able to
use structured field names to represent the XML context, but that is
just a guess.

With a mixed corpus of XML and arbitrary text, requiring special
handling of XML, yow, that's a lot of work.

One thought -- you can do flat fields in an XML engine (like MarkLogic)
much more easily than you can do XML in a flat field engine (like
Lucene).

wunder

On 11/7/07 8:18 PM, "David Neubert" <[EMAIL PROTECTED]> wrote:

> I am sure this is a 101 question, but I am a bit confused about indexing
> xml data using SOLR.
> 
> I have rich xml content (books) that needs to be searched at granular levels
> (specifically paragraph and sentence levels, very accurately, no
> approximations).  My source text has exact paragraph and sentence tags for
> this purpose.  I have built this app in previous versions (using other
> search engines) by indexing the text twice: (1) where every paragraph was a
> virtual document and (2) where every sentence was a virtual document -- both
> extracted from the source file (which was a single xml file for the entire
> book).  I have of course thought about using an XML engine like eXist or
> Xindice, but I prefer the stability, user base and performance that
> Lucene/SOLR seems to have, and there is also a large body of text that is
> regular documents and not well-formed XML.
> 
> I am brand new to SOLR (one day) and at a basic level understand SOLR's
> nice simple xml scheme for adding documents:
> 
> 
> <add>
>   <doc>
>     <field name="foo">foo value 1</field>
>     <field name="foo">foo value 2</field>
>   </doc>
>   ...
> </add>
> 
> But my problem is that I believe I need to preserve the xml markup at the
> paragraph and sentence levels, so I was hoping to create a content field
> that could just contain the source xml for the paragraph or sentence
> respectively.  There are reasons for this that I won't go into -- a lot of
> granular work in this app, accessing paragraphs and sentences.
> 
> Obviously an XML mechanism that could leverage the xml structure (via
> XPath or XPointers) would work great.  Still, I think Lucene can do this in
> a field-level way -- and I also can't imagine that users who are indexing
> XML documents have to go through the trouble of stripping all the markup
> before indexing?  Hopefully I am missing something basic?
> 
> It would be great to be pointed in the right direction on this matter.
> 
> I think I need something along this line:
> 
> 
> <add>
>   <doc>
>     <field name="content"><![CDATA[
>       <foo>value 1</foo>
>       <foo>value 2</foo>
>     ]]></field>
>   </doc>
> </add>
> 
> Maybe the overall question is -- what is the best way to index XML content
> using SOLR -- is all this tag stripping really necessary?
> 
> Thanks for any help,
> 
> Dave
> 
> 
> 
> 
> 
> __
> Do You Yahoo!?
> Tired of spam?  Yahoo! Mail has the best spam protection around 
> http://mail.yahoo.com






__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around
http://mail.yahoo.com 


Re: AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread David Neubert
Thanks -- CDATA might be useful -- and I was looking into dynamic fields as a 
solution as well -- I think a combination of the two might work.
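One caveat with the CDATA approach discussed in this thread: the wrapped markup may itself contain the sequence "]]>", which would terminate the CDATA section early and break the update message. A minimal sketch of a safe wrapper (illustrative only; `CdataUtil` is not a Solr class):

```java
public class CdataUtil {
    /** Wraps raw markup in a CDATA section. Any embedded "]]>" is split
     *  across two adjacent CDATA sections so the result stays well-formed. */
    public static String wrapCdata(String rawXml) {
        return "<![CDATA[" + rawXml.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        // A paragraph/sentence fragment like the ones discussed in this thread:
        System.out.println(wrapCdata("<p><s>value 1</s><s>value 2</s></p>"));
    }
}
```

An XML parser reading the two adjacent CDATA sections concatenates their contents, so the original "]]>" is recovered intact on the way out.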

- Original Message 
From: "Hausherr, Jens" <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, November 8, 2007 4:03:02 AM
Subject: AW: What is the best way to index xml data preserving the mark up?


Hi, 

if you just need to preserve the xml for storing you could simply wrap
 the xml markup in CDATA. Splitting your structure beforehand and using
 dynamic fields might be a viable solution...

eg. 

<add>
  <doc>
    <field name="content"><![CDATA[
      <foo>value 1</foo>
      <foo>value 2</foo>
    ]]></field>
  </doc>
</add>
 


Re: Discovering RequestHandler parameters at runtime

2007-11-08 Thread Chris Hostetter

: > Is there anyway to interrogate a RequestHandler to discover what parameters
: > it supports at runtime?  Kind of like a BeanInfo for RequestHandlers?  Has

: Also, check:
: http://wiki.apache.org/solr/MakeSolrMoreSelfService

Yeah, that wiki is as far as i ever got.  note that it vastly predates a 
lot of the LukeRequestHandler type stuff and even the general attitude of 
moving more towards RequestHandlers as general processing units of solr 
for handling all requests (even admin style requests)

Note that while it might be handy to have something like BeanInfo where 
the *class* tells you what params it supports, the important feature would 
be something where the *instance* tells you what params it supports, 
because it won't want to advertise params that it has invariants set for.  
(i touch on this in that wiki)

Ultimately i think it would be good if RequestHandlers implemented a 
method that returned a big data structure containing everything they 
wanted to "advertise" about themselves. and most of the admin screen and 
the "form.jsp" in the current codebase got replaced by a 
"FormRequestHandler" that would inspect the SolrCore for a list of all 
RequestHandlers that were advertising themselves and create forms for 
them.

-Hoss



Re: Tomcat JNDI Settings

2007-11-08 Thread Wayne Graham
Hi Hoss,

I just wanted to follow up to the list on this one...I could never get
the JNDI settings to work with Tomcat. I went to Jetty and everything is
working quite nicely.

Wayne

Chris Hostetter wrote:
> : Thanks for getting back to me. The folder /var/lib/tomcat5/solr/home
> : exists as does /var/lib/tomcat5/solr/home/conf/solrconfig.xml. It's
> : basically a copy of the files from examples folder at this point.
> : 
> : I put war files in /var/lib/tomcat5/webapps, so I have the
> : apache-solr-1.2.0.war file outside of the webapps folder.
> : 
> : Are there any special permissions these files need? I have them owned by
> : the tomcat user.
> 
> that should be fine ... is /var/lib/tomcat5/solr/home/ writable by the 
> tomcat user so it can make the ./data and ./data/index directories?
> 
> are you sure there aren't any other errors in the logs above the one you 
> mentioned already?
> 
> 
> 
> 
> -Hoss
> 

-- 
/**
 * Wayne Graham
 * Earl Gregg Swem Library
 * PO Box 8794
 * Williamsburg, VA 23188
 * 757.221.3112
 * http://swem.wm.edu/blogs/waynegraham/
 */



Re: Discovering RequestHandler parameters at runtime

2007-11-08 Thread Ryan McKinley

Grant Ingersoll wrote:

Hi,

Is there any way to interrogate a RequestHandler to discover what 
parameters it supports at runtime?  Kind of like a BeanInfo for 
RequestHandlers?  Has anyone else thought about doing this and what it 
might look like?  Seems like it would be useful for building dynamic web 
forms.




currently there is not...  I started down that route a while ago, but 
got distracted by other things.  I think it's a good idea.


Also, check:
http://wiki.apache.org/solr/MakeSolrMoreSelfService

ryan


Re: AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread Chris Hostetter

: Thanks -- C-Data might be useful -- and I was looking into dynamic 
: fields as solution as well -- I think a combination of the two might 
: work.

I must admit i haven't been following this thread that closely, so i'm not 
sure how much of the "structure" of the XML you want to preserve for the 
purposes of querying, or if it's just an issue of wanting to store the raw 
XML, but on the broader topic of indexing/searching arbitrary XML, i'd 
like to throw out a few misc ideas i've had in the past that you might 
want to run with...

1) there's a Jira issue i opened a while back with a rough patch for 
applying user-specific XSLTs on the server to transform arbitrary XML 
into the Solr XML update format (i don't have the issue number handy, and 
my browser is in the throes of death at the moment).  this might solve the 
"i want to send solr XML in my own schema, and i want to be able to tell 
it how to pull out various pieces to use as field values" problem.

2) I was once toying with the idea of an XPathTokenizer.  it would parse 
the fieldValues as XML, then apply arbitrary configured XPath expressions 
against the DOM and use the resulting NodeList to produce the TokenStream.
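The XPathTokenizer idea above can be sketched with only the JDK's XML APIs (the class name and behavior here are hypothetical, not an existing Solr tokenizer): parse the field value as a DOM, evaluate a configured XPath, and collect each matching node's text as the raw material for tokens.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathFieldValues {
    /** Returns the text content of every node matched by the XPath
     *  expression when the field value is parsed as XML. */
    public static List<String> select(String fieldValue, String xpathExpr)
            throws Exception {
        Document dom = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(fieldValue)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(xpathExpr, dom, XPathConstants.NODESET);
        List<String> texts = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getTextContent());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Extract sentence-level text from a paragraph, as in this thread.
        System.out.println(select("<p><s>foo one</s><s>bar two</s></p>", "//s"));
    }
}
```

A real tokenizer would then feed each extracted string through a downstream analyzer rather than printing it.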


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com



-Hoss



Re: How to do GeoSpatial search in SOLR/Lucene

2007-11-08 Thread Chris Hostetter
: How to do Geo Spatial search in SOLR/Lucene?

i still haven't had a chance to play with any of the good stuff people 
have been talking about, but there have been several recent threads 
talking about it...

http://www.nabble.com/forum/Search.jtp?query=geographic&local=y&forum=14479



-Hoss



Multiple indexes

2007-11-08 Thread Jae Joo
Hi,

I am looking for a way to utilize multiple indexes in a single Solr
instance.
I saw that there is a patch (SOLR-215) available and would like to ask someone
who knows how to use multiple indexes.

Thanks,

Jae Joo


Re: AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread Chris Hostetter

: Seems like Lucene/SOLR is geared to take record and non-xml-oriented 
: content and put it into XML format for ingest -- but really can't digest 
: XML content itself at all without significant setup and constraints.  I 
: am surprised -- but I could really use it for my project big time.

Lucene is geared towards indexing records containing key=>value 
pairs.  The values are then passed to "Analyzers" to break them up into 
individual terms.

Solr is geared towards providing a non-Java interface to accept those 
Documents and hand them off to Lucene, and to providing a simple way to 
define Analyzers using configuration without compiling custom java code.  
A specific XML format is one way to communicate with Solr what those 
"records" are, CSV is another, ... other generic formats can be added as 
plugins.

(Mind you -- Lucene and Solr are "geared" for a lot of things in addition 
to those, but for the purposes of this conversation, and the focus on 
indexing, those are the distinctions).

the aspect of your situation that neither Solr nor Lucene 
really focuses on is extracting the key->val pairs from a larger stream 
of text (ie: XML in a user defined schema).   this is where something like 
the XSLT approach i described could be helpful: you (as more of an expert 
on the XML Schema of your documents than solr) could write an XSLT for 
extracting the field=>value pairs for each doc, to give to Solr.

you could do the same thing client side before sending the data to Solr -- 
the Jira issue i referred to (SOLR-285 BTW) would just allow this transform 
to happen server side)
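As a sketch of that client-side transform (the stylesheet and field names below are invented for illustration and are not the actual SOLR-285 patch), the JDK's built-in TrAX API is enough:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class CustomXmlToSolr {
    // Hypothetical stylesheet mapping a custom <book>/<para> schema onto
    // Solr's <add><doc><field> update format.
    private static final String XSLT =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output omit-xml-declaration='yes'/>"
        + "<xsl:template match='/book'>"
        + "<add><xsl:apply-templates select='para'/></add>"
        + "</xsl:template>"
        + "<xsl:template match='para'>"
        + "<doc>"
        + "<field name='id'><xsl:value-of select='@id'/></field>"
        + "<field name='text'><xsl:value-of select='.'/></field>"
        + "</doc>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Transforms a document in the custom schema into a Solr update message. */
    public static String toSolrUpdate(String customXml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSLT)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(customXml)),
                    new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toSolrUpdate(
            "<book><para id='p1'>A first paragraph.</para></book>"));
    }
}
```

Running the same stylesheet server side is exactly what the patch discussed above would enable.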



-Hoss



Re: AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread Tricia Williams

Hi Dave,

This sounds like what I've been trying to work out with 
https://issues.apache.org/jira/browse/SOLR-380.  The idea that I'm 
running with right now is indexing the xml and storing the data in the 
xml tags as a Payload.  Payload is a relatively new idea from  Lucene.  
A custom SolrHighlighter provides position hits (our need for this is 
highlighting on an image while searching the OCR text of the image) and 
some context to where they appear in the document using the stored Payload.


Tricia

David Neubert wrote:

Chris

I'll try to track down your Jira issue.

(2) sounds very helpful -- I am only 2 days old in SOLR/Lucene experience, but 
know what I need -- and basically it's to search by the main granules in an xml 
document, which usually turn out to be, for books: book (rarely), chapter (more 
often), paragraph (often), sentence (often).  Then there are niceties like 
chapter title, headings, etc., but I can live without those -- it seems like if 
you can exploit the text nodes of arbitrary XML you are looking good; if not, 
you have a lot of machination in front of you.

Seems like Lucene/SOLR is geared to take record and non-xml-oriented content 
and put it into XML format for ingest -- but really can't digest XML content 
itself at all without significant setup and constraints.  I am surprised -- but 
I could really use it for my project big time.

Another related problem I am having (which I will probably repost separately) 
is boolean searches across fields with multiple values.  At this point, because 
of my workarounds for Lucene (to this point) I am indexing paragraphs as 
single documents with multiple fields, thinking I could copy the sentences to 
text.  That way, I can search field text (for the paragraph) -- and search 
field sentence -- for sentence granularity.  The problem is that a search for 
sentence:foo AND sentence:bar matches if foo matches in any sentence of the 
paragraph and bar also matches in any sentence of the paragraph.  I need it to 
match only if foo and bar are found in the same sentence.  If this can't be 
done, it looks like I will have to index paragraphs as documents, and 
redundantly index sentences as unique documents.  Again, I will post this 
question separately immediately.

Thanks,

Dave
  




Boolean matches in a unique instance of a multi-value field?

2007-11-08 Thread David Neubert


Is it possible to find boolean matches (foo AND bar) in a single unique 
instance of a multi-value field?  So if foo is found in one instance of the 
multi-value field, and bar is found in another instance, this WOULD NOT be a 
match; it is a match only if both words are found in the same instance of the 
multi-value field.
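To make the desired semantics concrete: a flat AND over the whole field matches across instances, while what is being asked for requires every term to occur within one value. A plain-Java sketch of the per-instance check (illustrative only; this is not how Lucene evaluates queries):

```java
import java.util.Arrays;
import java.util.List;

public class SameInstanceAnd {
    /** True only if a single value of the multi-valued field contains
     *  every query term (per-instance AND semantics). */
    public static boolean matches(List<String> fieldValues, String... terms) {
        for (String value : fieldValues) {
            boolean allPresent = true;
            for (String term : terms) {
                if (!value.contains(term)) { allPresent = false; break; }
            }
            if (allPresent) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> sentences = Arrays.asList("foo in one sentence",
                                               "bar in another sentence");
        // foo and bar occur in different instances: no match.
        System.out.println(matches(sentences, "foo", "bar"));  // false
        // both terms in the same instance: match.
        System.out.println(matches(Arrays.asList("foo and bar together"),
                                   "foo", "bar"));  // true
    }
}
```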

Thanks,

Dave




__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 




__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: Simple sorting questions

2007-11-08 Thread Chris Hostetter

: 1. There appears to be (at least) two ways to specify sorting, one
: involving an append to the q parm and the other using the sort parm.
: Are these exactly equivalent?
: 
:http://localhost/solr/select/?q=martha;author+asc
:http://localhost/solr/select/?q=martha&sort=author+asc

They should be, but the first form is heavily deprecated and should not be 
used.

: 2. The docs say that sorting can only be applied to non-multivalued
: fields.  Does this mean that sorting won't work *at all* for
: multi-valued fields or only that the behaviour is indeterminate?

The behavior is undefined, in that it might return results in an 
indeterminant order, or it might flat out fail -- it all depends on the 
nature of the data in the field.

Note: it's not specifically that the field must be "non-multivalued" ... 
even if a field says multiValued="false" it still might not be a valid 
field to sort on if it uses an Analyzer that produces multiple tokens per 
field value (so *most* TextField based fields won't work, unless you use 
the KeywordTokenizer or something equivalent)

: Based on a brief test, sorting a multi-valued field appeared to work
: by picking an arbitrary value when multiple values are present and

as i recall, that will happen when the number of distinct terms indexed 
for that field is less than the number of documents in the index ... but 
if tomorrow you add a document that contains a bunch of new terms, and 
shifts the balance so that there are more terms than documents, any search 
attempting to sort on that field will start to fail completely.

(the specifics of why that happens relate to the underlying Lucene 
FieldCache specifics ... i won't bother trying to explain it or even to 
defend it, because i'm not fond of it at all -- but i haven't thought of 
any easy ways to improve it that don't suffer performance penalties for 
the more common case of people sorting on fields that are "ok" to sort 
on).




-Hoss



Re: Multiple indexes

2007-11-08 Thread John Reuning
I've had good luck with MultiCore, but you have to sync trunk from svn 
and apply the most recent patch in SOLR-350.


https://issues.apache.org/jira/browse/SOLR-350

-jrr

Jae Joo wrote:

Hi,

I am looking for a way to utilize multiple indexes in a single Solr
instance.
I saw that there is a patch (SOLR-215) available and would like to ask someone
who knows how to use multiple indexes.

Thanks,

Jae Joo





Re: Tomcat JNDI Settings

2007-11-08 Thread Chris Hostetter
: I just wanted to follow up to the list on this one...I could never get
: the JNDI settings to work with Tomcat. I went to Jetty and everything is

I'm not sure what to tell you.  

I've been prepping my ApacheCon demo for next week using Tomcat and JNDI 
and i haven't had any problems ... i've got a few helper scripts that 
save me typing when i set it up (they use "sh -x" to echo the shell 
commands they execute when they run), but here's everything i do just so 
you can see what i've got going on ... it might help you figure out what's 
not working about your setup.

At the end of all of this Solr is up and running in tomcat using my 
configured SolrHome...

[EMAIL PROTECTED]:/var/tmp/ac-demo$ pwd
/var/tmp/ac-demo
[EMAIL PROTECTED]:/var/tmp/ac-demo$ ls
books-solr-home   demo-links.html raw-data   
tomcat-context.xml
create-tomcat-context.sh  install-tomcat-and-solr.sh  tar-balls
[EMAIL PROTECTED]:/var/tmp/ac-demo$ find books-solr-home/
books-solr-home/
books-solr-home/conf
books-solr-home/conf/xslt
books-solr-home/conf/xslt/example.xsl
books-solr-home/conf/xslt/example_atom.xsl
books-solr-home/conf/schema_minimal.xml
books-solr-home/conf/solrconfig.xml
books-solr-home/conf/synonyms.txt
books-solr-home/conf/schema_books.xml
books-solr-home/conf/schema.xml
[EMAIL PROTECTED]:/var/tmp/ac-demo$ cat tomcat-context.xml
<Context docBase="/var/tmp/ac-demo/apache-solr-1.2.0/dist/apache-solr-1.2.0.war"
         debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String"
               value="/var/tmp/ac-demo/books-solr-home" override="true"/>
</Context>

[EMAIL PROTECTED]:/var/tmp/ac-demo$ ./install-tomcat-and-solr.sh
+ cd /var/tmp/ac-demo/
+ tar -xzf tar-balls/apache-tomcat-6.0.14.tar.gz
+ tar -xzf tar-balls/apache-solr-1.2.0.tgz
[EMAIL PROTECTED]:/var/tmp/ac-demo$ ls
apache-solr-1.2.0 books-solr-home   demo-links.html 
raw-data   tomcat-context.xml
apache-tomcat-6.0.14  create-tomcat-context.sh  install-tomcat-and-solr.sh  
tar-balls
[EMAIL PROTECTED]:/var/tmp/ac-demo$ ./create-tomcat-context.sh
+ mkdir -p apache-tomcat-6.0.14/conf/Catalina/localhost/
+ cp tomcat-context.xml 
apache-tomcat-6.0.14/conf/Catalina/localhost/books-solr.xml
[EMAIL PROTECTED]:/var/tmp/ac-demo$ apache-tomcat-6.0.14/bin/catalina.sh
Using CATALINA_BASE:   /var/tmp/ac-demo/apache-tomcat-6.0.14
Using CATALINA_HOME:   /var/tmp/ac-demo/apache-tomcat-6.0.14
Using CATALINA_TMPDIR: /var/tmp/ac-demo/apache-tomcat-6.0.14/temp
Using JRE_HOME:   /opt/jdk1.5
Usage: catalina.sh ( commands ... )
commands:
  debug Start Catalina in a debugger
  debug -security   Debug Catalina with a security manager
  jpda startStart Catalina under JPDA debugger
  run   Start Catalina in the current window
  run -security Start in the current window with security manager
  start Start Catalina in a separate window
  start -security   Start in a separate window with security manager
  stop  Stop Catalina
  stop -force   Stop Catalina (followed by kill -KILL)
  version   What version of tomcat are you running?
[EMAIL PROTECTED]:/var/tmp/ac-demo$ apache-tomcat-6.0.14/bin/catalina.sh start
Using CATALINA_BASE:   /var/tmp/ac-demo/apache-tomcat-6.0.14
Using CATALINA_HOME:   /var/tmp/ac-demo/apache-tomcat-6.0.14
Using CATALINA_TMPDIR: /var/tmp/ac-demo/apache-tomcat-6.0.14/temp
Using JRE_HOME:   /opt/jdk1.5
[EMAIL PROTECTED]:/var/tmp/ac-demo$



Re: AW: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread David Neubert
Chris

I'll try to track down your Jira issue.

(2) sounds very helpful -- I am only 2 days old in SOLR/Lucene experience, but 
know what I need -- and basically it's to search by the main granules in an xml 
document, which usually turn out to be, for books: book (rarely), chapter (more 
often), paragraph (often), sentence (often).  Then there are niceties like 
chapter title, headings, etc., but I can live without those -- it seems like if 
you can exploit the text nodes of arbitrary XML you are looking good; if not, 
you have a lot of machination in front of you.

Seems like Lucene/SOLR is geared to take record and non-xml-oriented content 
and put it into XML format for ingest -- but really can't digest XML content 
itself at all without significant setup and constraints.  I am surprised -- but 
I could really use it for my project big time.

Another related problem I am having (which I will probably repost separately) 
is boolean searches across fields with multiple values.  At this point, because 
of my workarounds for Lucene (to this point) I am indexing paragraphs as 
single documents with multiple fields, thinking I could copy the sentences to 
text.  That way, I can search field text (for the paragraph) -- and search 
field sentence -- for sentence granularity.  The problem is that a search for 
sentence:foo AND sentence:bar matches if foo matches in any sentence of the 
paragraph and bar also matches in any sentence of the paragraph.  I need it to 
match only if foo and bar are found in the same sentence.  If this can't be 
done, it looks like I will have to index paragraphs as documents, and 
redundantly index sentences as unique documents.  Again, I will post this 
question separately immediately.

Thanks,

Dave


- Original Message 
From: Chris Hostetter <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, November 8, 2007 1:19:40 PM
Subject: Re: AW: What is the best way to index xml data preserving the mark up?



: Thanks -- C-Data might be useful -- and I was looking into dynamic 
: fields as solution as well -- I think a combination of the two might 
: work.

I must admit i haven't been following this thread that closely, so i'm
 not 
sure how much of the "structure" of the XML you want to preserve for
 the 
purposes of querying, or if it's jsut an issue of wanting to store the
 raw 
XML, but on the the broader topic of indexing/searching arbitrary XML,
 i'd 
like to through out a few misc ideas i've had in the past that you
 might 
want to run with...

1) there's a Jira issue i pened a while back with a rough patch for 
applying a user specific XSLTs on the server to transforming arbitrary
 XML 
into the Solr XML update format (i don't have the issue number handy,
 and 
my browser is in the throws of death at the moment).  this might solve
 the 
"i want to send solr XML in my own schema, and i want to be able to
 tell 
it how to pull out various pieces to use as a field values.

2) I was once toying with the idea of an XPathTokenizer.  it would
 parse 
the fieldValues as XML, then apply arbitrary configured XPath
 expressions 
against the DOM and use the resulting NodeList to produce the
 TokenStream.


__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com



-Hoss






__
Do You Yahoo!?
Tired of spam?  Yahoo! Mail has the best spam protection around 
http://mail.yahoo.com 

Re: How to do GeoSpatial search in SOLR/Lucene

2007-11-08 Thread patrick o'leary




There is an SF project (locallucene) with geographical searching in
progress. (As in if it works for you and meets your needs
then please help yourself, if it doesn't and requires additional work,
then please help us) ;-)

https://sourceforge.net/projects/locallucene/

A basic solr port & demo, which is out of date but still gives an
idea of the capabilities, is available with some details at
http://www.nsshutdown.com/blog/index.php?itemid=87 (download is called
solr-example.tgz)

A good few folks are using locallucene and seem happy, we're still
working on the localsolr port, and a few
folks in the solr community are really helping out a lot!

Once we feel we're in much firmer release position then we'll
contribute back to the ASF, but for now we're on SF.
A few things we're working on are improving filter caching (geo
boundaries caching) and looking at limitations
of geographical searching.
e.g. You can only process so many unique lats/longs on one box. But
we're working on it.

Hope that helps
Patrick

Chris Hostetter wrote:

  : How to do Geo Spatial search in SOLR/Lucene?

i still haven't had a chance to play with any of the good stuff people 
have been talking about, but there have been several recent threads 
talking about it...

http://www.nabble.com/forum/Search.jtp?query=geographic&local=y&forum=14479



-Hoss


  


-- 
Patrick O'Leary


You see, wire telegraph is a kind of a very, very long cat. You pull his tail in New York and his head is meowing in Los Angeles.
 Do you understand this? 
And radio operates exactly the same way: you send signals here, they receive them there. The only difference is that there is no cat.
  - Albert Einstein






Re: where to hook in to SOLR to read field-label from functionquery

2007-11-08 Thread Chris Hostetter

: Say I have a custom functionquery MinFloatFunction which takes as its
: arguments an array of valuesources. 
: 
: MinFloatFunction(ValueSource[] sources)
: 
: In my case all these valuesources are the values of a collection of fields.

a ValueSource isn't required to be field specific (it may already be the 
mathematical combination of multiple other fields) so there is no generic 
way to get the "field name" from a ValueSource ... but you could define 
your MinFloatFunction to only accept FieldCacheSource[] as input ... hmmm, 
except that FieldCacheSource doesn't expose the field name.  so instead you 
write...

  public class MyFieldCacheSource extends FieldCacheSource {
public MyFieldCacheSource(String field) {
  super(field);
}
public String getField() {
  return field;
}
  }
  public class MinFloatFunction ... {
public MinFloatFunction(MyFieldCacheSource[] values);
  }


: For this I designed a schema in which each 'row' in the index represents a
: product (indepdent of variants) (which takes care of the 1 variant max) and
: every variant is represented as 2 fields in this row:
: 
: variant_p_* <-- represents price (stored / indexed)
: variant_source_*  <-- represents the other fields dependent on the
: variant (stored / multivalued)

Note: if you have a lot of variants you may wind up with the same problem 
as described here...

http://www.nabble.com/sorting-on-dynamic-fields---good%2C-bad%2C-neither--tf4694098.html

...because of the underlying FieldCache usage in FieldCacheValueSource


-Hoss



Re: What is the best way to index xml data preserving the mark up?

2007-11-08 Thread David Neubert
Thanks, I think storing the XPath is where I will ultimately wind up -- I will 
look into the links recommended below.

It's an interesting debate where the break-even point is between Lucene 
storing XPath info -- utilizing that for lookup and position within DOM 
structures -- versus a full-fledged XML engine.  Most corporations are in 
mixed mode -- I am surprised that Lucene (or some other vendor) doesn't really 
focus on handling both easily.  Maybe I just need to clue in on the Lucene way 
of handling XML (which so far, as you suggest, seems to be a combo of using 
dynamic fields and storing XPath info).

Dave


- Original Message 
From: "Binkley, Peter" <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, November 8, 2007 11:23:46 AM
Subject: RE: What is the best way to index xml data preserving the mark up?

I've used eXist for this kind of thing and had good experiences, once I
got a grip on XQuery (which is definitely worth learning). But I've only
used it for small collections (under 10k documents); I gather its
effective ceiling is much lower than Solr's.

Possibly it will be possible to use Lucene's new payloads to do this
kind of thing (at least, storing Xpath information is one of the
proposed uses: http://lucene.grantingersoll.com/2007/03/18/payloads/ ),
as Erik Hatcher suggested in relation to
https://issues.apache.org/jira/browse/SOLR-380 .

Peter

-Original Message-
From: David Neubert [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 07, 2007 9:52 PM
To: solr-user@lucene.apache.org
Subject: Re: What is the best way to index xml data preserving the mark
up?

Thanks Walter -- 

I am aware of MarkLogic -- and agree -- but I have a very low budget on
licensed software in this case (near 0) -- 

Have you used eXist or Xindice?

Dave

- Original Message 
From: Walter Underwood <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Wednesday, November 7, 2007 11:37:38 PM
Subject: Re: What is the best way to index xml data preserving the mark
up?

If you really, really need to preserve the XML structure, you'll be
doing a LOT of work to make Solr do that. It might be cheaper to start
with software that already does that. I recommend MarkLogic -- I know
the principals there, and it is some seriously fine software. Not free
or open, but very, very good.

If your problem can be expressed in a flat field model, then your problem
problem is mapping your document model into Solr. You might be able to
use structured field names to represent the XML context, but that is
just a guess.

With a mixed corpus of XML and arbitrary text, requiring special
handling of XML, yow, that's a lot of work.

One thought -- you can do flat fields in an XML engine (like MarkLogic)
much more easily than you can do XML in a flat field engine (like
Lucene).

wunder

On 11/7/07 8:18 PM, "David Neubert" <[EMAIL PROTECTED]> wrote:

> I am sure this is a 101 question, but I am a bit confused about
> indexing xml data using SOLR.
> 
> 
> I have rich xml content (books) that needs to be searched at granular
> levels (specifically paragraph and sentence levels, very accurately, no
> approximations).  My source text has exact paragraph and sentence tags
> for this purpose.  I have built this app in previous versions (using
> other search engines), indexing the text twice: (1) where every
> paragraph was a virtual document and (2) where every sentence was a
> virtual document -- both extracted from the source file (which was a
> single xml file for the entire book).  I have of course thought about
> using an XML engine such as eXist or Xindice, but I prefer the
> stability, user base and performance that Lucene/SOLR seems to have,
> and there is also a large body of text that consists of regular
> documents and not well-formed XML.
> 
> I am brand new to SOLR (one day) and at a basic level understand SOLR's
> nice simple xml scheme for adding documents:
> 
> <add>
>   <doc>
>     <field name="foo">foo value 1</field>
>     <field name="foo">foo value 2</field>
>   </doc>
>   ...
> </add>
> 
> But my problem is that I believe I need to preserve the xml markup at
> the paragraph and sentence levels, so I was hoping to create a content
> field that could just contain the source xml for the paragraph or
> sentence respectively.  There are reasons for this that I won't go
> into -- a lot of granular work in this app, accessing pars and sens.
> 
> Obviously an XML mechanism that could leverage the xml structure (via
> XPath or XPointer) would work great.  Still, I think Lucene can do this
> in a field-level way -- and I also can't imagine that users who are
> indexing XML documents have to go through the trouble of stripping all
> the markup before indexing?  Hopefully I'm missing something basic?
> 
> It would be great to be pointed in the right direction on this matter?
> 
> I think I need something along this line:
> 
> <add>
>   <doc>
>     <field name="foo">value 1</field>
>     <field name="foo">value 2</field>
>     <field name="content">...the source xml markup...</field>
>   </doc>
> </add>
> Maybe the overall question -- what is the best way to index XML content
> using SOLR -- is 

2Gb process on 32 bits

2007-11-08 Thread Isart Montane

Hi all,

I'm experiencing some trouble when trying to launch solr with more 
than 1.6GB of heap. My server is an FC5 box with 8GB RAM, but when I start solr like this


java -Xmx2000m -jar start.jar

I get the following errors:

Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.

I've tried to start a virtual machine like this

java -Xmx2000m -version

but I get the same errors.

I've read there's a kernel limitation of 2Gb per process on a 32-bit 
architecture, and I just want to know if anybody knows an alternative to 
getting a new 64-bit server.


Thanks
Isart


Re: Boolean matches in a unique instance of a multi-value field?

2007-11-08 Thread Chris Hostetter

: Is it possible to find boolean matches (foo AND bar) in a single unique 
: instance of a multi-value field.  So if foo is found in one instance of 
: multi-value field, and is also found in another instance of the 
: multi-value field -- this WOULD NOT be a match, but only if both words 
: are found in the same instance of the multi-value field.

The conventional "trick" for this is to use a positionIncrementGap (an 
option on TextFields) which is larger than any single "sentence" will be, 
then at query time, instead of using a boolean query, use a sloppy phrase.  
So if you assume "sentences" are never more than 100 words long, use 
positionIncrementGap=100 and query for "foo bar"~100

since there will be an (emulated) "gap" of 100 positions between each distinct 
sentence value, the only way "foo" will appear within 100 positions of "bar" 
is if they both appear in the same sentence.
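A sketch of what that trick looks like in schema.xml (field and type names here are illustrative; note the attribute is spelled positionIncrementGap):

```xml
<!-- schema.xml: 100-position gap between successive values of a
     multi-valued field, so a sloppy phrase cannot match across values -->
<fieldType name="text_sentence" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<field name="sentence" type="text_sentence" indexed="true" stored="true"
       multiValued="true"/>
```

Each sentence is then added as a separate value of the sentence field, and the query becomes q=sentence:"foo bar"~100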

for things like sentence:"+foo +(bar baz yak)" you need to use SpanQueries 
instead of PhraseQueries ... which don't have native query parser support, 
so you'd need a custom plugin to help with that.

-Hoss



Re: Score of exact matches

2007-11-08 Thread Papalagi Pakeha
On 11/6/07, Walter Underwood <[EMAIL PROTECTED]> wrote:
> This is fairly straightforward and works well with the DisMax
> handler. Index the text into three different fields with three
> different sets of analyzers. Use something like this in the
> request handler:
> [...]
> <str name="qf">exact^16 noaccent^4 stemmed</str>

Thanks, that's exactly what I needed. Being new to Solr I didn't know
exactly how the filters and analyzers work together. With your hint I
learned it all and now it works beautifully :-)

PaPa
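For reference, the three-field setup Walter describes can be wired up with copyField, so one source value is analyzed three different ways at index time (field names, analyzer types and boosts below are illustrative, not from the original mail):

```xml
<!-- schema.xml: one source field copied into three differently-analyzed
     fields; text_noaccent and text_stemmed are assumed custom types -->
<field name="exact"    type="string"        indexed="true" stored="false"/>
<field name="noaccent" type="text_noaccent" indexed="true" stored="false"/>
<field name="stemmed"  type="text_stemmed"  indexed="true" stored="false"/>

<copyField source="title" dest="exact"/>
<copyField source="title" dest="noaccent"/>
<copyField source="title" dest="stemmed"/>

<!-- solrconfig.xml: DisMax handler boosting the stricter fields highest -->
<requestHandler name="dismax" class="solr.DisMaxRequestHandler">
  <lst name="defaults">
    <str name="qf">exact^16 noaccent^4 stemmed</str>
  </lst>
</requestHandler>
```

A document matching on the exact field then scores far above a match found only via stemming, which is what pushes exact matches to the top.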