Re: Handling disparate data sources in Solr

2007-01-08 Thread Alan Burlison
Walter Underwood wrote: Cracking documents and spidering URLs are both big, big problems. PDF is a horrid mess, as are old versions of MS Office. Proxies, logins, cookies, all sorts of issues show up with fetching URLs, along with a fun variety of misbehaving servers. I remember crashing one ser

Re: Handling disparate data sources in Solr

2007-01-08 Thread Alan Burlison
Erik Hatcher wrote: The idea of having Solr handle various document types is a good one, for sure. I'm not sure what specifics would need to be implemented, but I at least wanted to reply and say it's a good idea! Care has to be taken when passing a URL to Solr for it to go fetch, though. T

Re: Handling disparate data sources in Solr

2007-01-08 Thread Walter Underwood
On 1/7/07 7:24 AM, "Erik Hatcher" <[EMAIL PROTECTED]> wrote: > Care has to be taken when passing a URL to Solr for it to go fetch, > though. There are a lot of complexities in fetching resources via > HTTP, especially when handing something off to Solr which should be > behind a firewall and may

Re: Handling disparate data sources in Solr

2007-01-07 Thread Erik Hatcher
The idea of having Solr handle various document types is a good one, for sure. I'm not sure what specifics would need to be implemented, but I at least wanted to reply and say it's a good idea! Care has to be taken when passing a URL to Solr for it to go fetch, though. There are a lot of c

Re: Handling disparate data sources in Solr

2007-01-04 Thread Alan Burlison
Original problem statement: -- I'm considering using Solr to replace an existing bare-metal Lucene deployment - the current Lucene setup is embedded inside an existing monolithic webapp, and I want to factor out the search functionality into a separate webapp so it can be reused more e

Re: Handling disparate data sources in Solr

2007-01-04 Thread Alan Burlison
Chris Hostetter wrote: For your purposes, if you've got a system that works and does the Document conversion for you, then you are probably right: Solr may not be a useful addition to your architecture. Solr doesn't really attempt to solve the problem of parsing different kinds of data streams

Re: Handling disparate data sources in Solr

2006-12-24 Thread Walter Underwood
On 12/23/06 5:28 AM, "Alan Burlison" <[EMAIL PROTECTED]> wrote: >> You could do it in Solr. The difficulty is that arbitrary binary data >> is not easily transferred via xml. So you must specify that the input >> is in base64 or some other encoding. Then you could decode it on the >> fly using

Re: Handling disparate data sources in Solr

2006-12-24 Thread Alan Burlison
Chris Hostetter wrote: : Why won't cdata work? Because your binary data might contain the byte sequence: 0x5D 0x5D 0x3E -- indicating the end of the CDATA section. CDATA is short for "Character DATA" -- you can't put arbitrary binary data in it (or even arbitrary text) and be sure that it will wor
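
The point about CDATA can be illustrated with a minimal Python sketch. It is not Solr's actual input format: the `doc` and `field` element names and the `encoding` attribute are invented for this example, and the payload is made up. It only shows that a payload containing the bytes 0x5D 0x5D 0x3E ("]]>") survives an XML round trip once it is base64-encoded.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical binary payload that happens to contain 0x5D 0x5D 0x3E ("]]>").
payload = b"\x00\x01binary]]>more\xff\xfe"

# "]]>" would terminate a CDATA section early, so these bytes cannot be
# dropped into CDATA as-is.
assert b"]]>" in payload

# Base64 output uses only A-Z, a-z, 0-9, "+", "/" and "=", all safe in XML text.
encoded = base64.b64encode(payload).decode("ascii")

# Illustrative document; element and attribute names are assumptions.
doc = ET.Element("doc")
field = ET.SubElement(doc, "field", {"name": "content", "encoding": "base64"})
field.text = encoded
xml_text = ET.tostring(doc, encoding="unicode")

# Round trip: parse the XML again and decode the field back to bytes.
parsed = ET.fromstring(xml_text)
decoded = base64.b64decode(parsed.find("field").text)
```

The receiving side still has to know the field is base64 and decode it before analysis, which is the extra step Mike Klaas describes elsewhere in the thread.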

Re: Handling disparate data sources in Solr

2006-12-23 Thread Chris Hostetter
: > You could do it in Solr. The difficulty is that arbitrary binary data : > is not easily transferred via xml. So you must specify that the input : > is in base64 or some other encoding. Then you could decode it on the : > fly using a custom Analyzer before passing it along. : : Why won't cda

Re: Handling disparate data sources in Solr

2006-12-23 Thread Chris Hostetter
: > omitNorms lets you not use field norms for certain fields when : > calculating document matching score. This can save you some RAM. : Thanks. What effect does this have on the quality of the returned : matches? Are there any guidelines as to when you would disable field : norms, and on whi

Re: Handling disparate data sources in Solr

2006-12-23 Thread Otis Gospodnetic
Hi Alan, - Original Message From: Alan Burlison <[EMAIL PROTECTED]> To: solr-user@lucene.apache.org Sent: Saturday, December 23, 2006 8:19:21 AM Subject: Re: Handling disparate data sources in Solr Otis Gospodnetic wrote: > omitNorms lets you not use field norms for cert

Re: Handling disparate data sources in Solr

2006-12-23 Thread Alan Burlison
Bertrand Delacretaz wrote: My "Subversion and Solr" presentation from the last Cocoon GetTogether might give you ideas for how to handle this, see the link at http://wiki.apache.org/solr/SolrResources. Hmm, I'm beginning to think the only way to do this is to write a complete custom front-end

Re: Handling disparate data sources in Solr

2006-12-23 Thread Alan Burlison
Mike Klaas wrote: You could do it in Solr. The difficulty is that arbitrary binary data is not easily transferred via xml. So you must specify that the input is in base64 or some other encoding. Then you could decode it on the fly using a custom Analyzer before passing it along. Why won't c

Re: Handling disparate data sources in Solr

2006-12-23 Thread Alan Burlison
Otis Gospodnetic wrote: omitNorms lets you not use field norms for certain fields when calculating document matching score. This can save you some RAM. See http://issues.apache.org/jira/browse/LUCENE-448. Thanks. What effect does this have on the quality of the returned matches? Are there

Re: Handling disparate data sources in Solr

2006-12-23 Thread Bertrand Delacretaz
On 12/23/06, Alan Burlison <[EMAIL PROTECTED]> wrote: ...As well as centralising the index, I also want to centralise the handling of the different document types... My "Subversion and Solr" presentation from the last Cocoon GetTogether might give you ideas for how to handle this, see the link

Re: Handling disparate data sources in Solr

2006-12-22 Thread Mike Klaas
On 12/22/06, Alan Burlison <[EMAIL PROTECTED]> wrote: At present the content of the Lucene index comes from many different sources (web pages, documents, blog posts etc) and can be in different formats (plaintext, HTML, PDF etc). All the various content types are rendered to plaintext before being
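
The "render everything to plaintext first" step described above might look roughly like the following Python sketch. It is an illustration only, not the poster's code: the function names and the `CONVERTERS` table are hypothetical, only plain text and HTML are handled, and a real PDF or MS Office converter would need a dedicated library that is not shown here.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the character data of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(raw: bytes) -> str:
    # Naive extraction: script and style contents are kept too.
    parser = _TextExtractor()
    parser.feed(raw.decode("utf-8", errors="replace"))
    parser.close()
    # Collapse runs of whitespace left behind by the markup.
    return " ".join(" ".join(parser.parts).split())


def plain_to_text(raw: bytes) -> str:
    return raw.decode("utf-8", errors="replace")


# Hypothetical dispatch table keyed by MIME type.
CONVERTERS = {
    "text/plain": plain_to_text,
    "text/html": html_to_text,
}


def render_to_plaintext(content_type: str, raw: bytes) -> str:
    """Return indexable plaintext, or raise ValueError for unsupported types."""
    try:
        converter = CONVERTERS[content_type]
    except KeyError:
        raise ValueError(f"no converter registered for {content_type!r}")
    return converter(raw)
```

Keeping the converters in one table means a new document type is added in a single place, whichever side of the indexing interface ends up owning the conversion.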

Re: Handling disparate data sources in Solr

2006-12-22 Thread Otis Gospodnetic
Alan, omitNorms lets you not use field norms for certain fields when calculating document matching score. This can save you some RAM. See http://issues.apache.org/jira/browse/LUCENE-448 . For position increment gap, have a look at http://lucene.apache.org/java/docs/api/org/apache/lucene/analy
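
As a rough illustration of what omitting norms trades away, here is a small Python sketch. It assumes Lucene's classic default length normalization of 1/sqrt(number of terms in the field) and a storage cost of one byte per document per field with norms enabled. Both figures describe the default similarity of that era rather than anything stated in this thread, and the helper names are invented for the example.

```python
import math


def length_norm(num_terms: int) -> float:
    """Classic default length normalization: shorter fields score higher."""
    return 1.0 / math.sqrt(num_terms)


def norms_ram_bytes(num_docs: int, num_fields_with_norms: int) -> int:
    """Assumed cost of norms: one byte per document per field."""
    return num_docs * num_fields_with_norms


# A match in a 4-term title gets a larger norm than one in a 400-term body.
short_field = length_norm(4)    # 0.5
long_field = length_norm(400)   # 0.05

# With omitNorms the factor is effectively constant, so field length (and any
# index-time boost folded into the norm) no longer influences the score.
estimated_bytes = norms_ram_bytes(1_000_000, 10)
```

Under these assumptions, omitting norms is a reasonable trade for fields where length carries no relevance signal, such as identifiers or dates, and a poorer one for free-text fields.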