Walter Underwood wrote:
Cracking documents and spidering URLs are both big, big problems.
PDF is a horrid mess, as are old versions of MS Office. Proxies,
logins, cookies, all sorts of issues show up when fetching URLs,
along with a fun variety of misbehaving servers.
I remember crashing one server ...
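As a rough illustration of the kind of defensiveness Walter means (nothing from the thread itself; the class name, timeouts and size cap below are invented), a fetcher that hands URLs to an indexer needs at least hard timeouts and a response-size limit before it even gets to proxies, logins and cookies:

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical sketch: hard timeouts and a size cap so one misbehaving
// server can't hang or exhaust the indexer. Proxies, logins and cookies
// would all need additional handling on top of this.
public class DefensiveFetcher {
    private static final int MAX_BYTES = 10 * 1024 * 1024; // refuse bodies over 10 MB

    public static byte[] fetch(String url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5000);  // fail fast on unreachable hosts
        conn.setReadTimeout(10000);    // fail fast on servers that stall mid-response
        conn.setInstanceFollowRedirects(true);
        try (InputStream in = conn.getInputStream()) {
            java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
                if (out.size() > MAX_BYTES) {
                    throw new IOException("Response too large: " + url);
                }
            }
            return out.toByteArray();
        } finally {
            conn.disconnect();
        }
    }
}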
Erik Hatcher wrote:
The idea of having Solr handle various document types is a good one, for
sure. I'm not sure what specifics would need to be implemented, but I
at least wanted to reply and say it's a good idea!
Care has to be taken when passing a URL to Solr for it to go fetch,
though. There are a lot of complexities in fetching resources via HTTP ...
On 1/7/07 7:24 AM, "Erik Hatcher" <[EMAIL PROTECTED]> wrote:
> Care has to be taken when passing a URL to Solr for it to go fetch,
> though. There are a lot of complexities in fetching resources via
> HTTP, especially when handing something off to Solr which should be
> behind a firewall and may
The idea of having Solr handle various document types is a good one,
for sure. I'm not sure what specifics would need to be implemented,
but I at least wanted to reply and say it's a good idea!
Care has to be taken when passing a URL to Solr for it to go fetch,
though. There are a lot of complexities in fetching resources via HTTP ...
Original problem statement:
--
I'm considering using Solr to replace an existing bare-metal Lucene
deployment - the current Lucene setup is embedded inside an existing
monolithic webapp, and I want to factor out the search functionality
into a separate webapp so it can be reused more easily ...
Chris Hostetter wrote:
For your purposes, if you've got a system that works and does the Document
conversion for you, then you are probably right: Solr may not be a useful
addition to your architecture. Solr doesn't really attempt to solve the
problem of parsing different kinds of data streams
On 12/23/06 5:28 AM, "Alan Burlison" <[EMAIL PROTECTED]> wrote:
>> You could do it in Solr. The difficulty is that arbitrary binary data
>> is not easily transferred via xml. So you must specify that the input
>> is in base64 or some other encoding. Then you could decode it on the
>> fly using a custom Analyzer before passing it along.
Chris Hostetter wrote:
: Why won't cdata work?
because your binary data might contain the byte sequence 0x5D 0x5D 0x3E --
indicating the end of the CDATA section. CDATA is short for "Character
DATA" -- you can't put arbitrary binary data (or even arbitrary text) in
it and be sure that it will work
: > You could do it in Solr. The difficulty is that arbitrary binary data
: > is not easily transferred via xml. So you must specify that the input
: > is in base64 or some other encoding. Then you could decode it on the
: > fly using a custom Analyzer before passing it along.
:
: Why won't cdata work?
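A tiny demonstration of Hoss's point, and of why the base64 route works (java.util.Base64 is a modern stand-in here; a 2006-era client would have used something like commons-codec):

import java.util.Base64;

public class CdataProblem {
    public static void main(String[] args) {
        byte[] data = {0x5D, 0x5D, 0x3E}; // the bytes of "]]>"

        // Splicing the raw bytes into a CDATA section ends it prematurely:
        String broken = "<field name=\"body\"><![CDATA[" + new String(data) + "]]></field>";
        System.out.println(broken); // the embedded ]]> terminates the CDATA early

        // Base64 output uses only A-Z, a-z, 0-9, '+', '/' and '=' -- all XML-safe:
        System.out.println(Base64.getEncoder().encodeToString(data)); // prints "XV0+"
    }
}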
: > omitNorms lets you omit field norms for certain fields when
: > calculating the document matching score. This can save you some RAM.
: Thanks. What effect does this have on the quality of the returned
: matches? Are there any guidelines as to when you would disable field
: norms, and on which fields?
Hi Alan,
- Original Message
From: Alan Burlison <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Saturday, December 23, 2006 8:19:21 AM
Subject: Re: Handling disparate data sources in Solr
Otis Gospodnetic wrote:
> omitNorms lets you omit field norms for certain fields when calculating
> the document matching score. ...
Bertrand Delacretaz wrote:
My "Subversion and Solr" presentation from the last Cocoon GetTogether
might give you ideas for how to handle this, see the link at
http://wiki.apache.org/solr/SolrResources.
Hmm, I'm beginning to think the only way to do this is to write a
complete custom front-end
Mike Klaas wrote:
You could do it in Solr. The difficulty is that arbitrary binary data
is not easily transferred via xml. So you must specify that the input
is in base64 or some other encoding. Then you could decode it on the
fly using a custom Analyzer before passing it along.
Why won't cdata work?
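One way Mike's suggestion could look, sketched against the Lucene 2.x Analyzer API (java.util.Base64 is anachronistic for 2006 but keeps the sketch self-contained; it assumes the payload is plain text that was base64-encoded only to survive the XML transport):

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Base64;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

// Hypothetical sketch: base64-decode the field value on the fly, then
// tokenize it normally. Assumes the encoded payload is plain text that
// has already been extracted from the original binary document.
public class Base64DecodingAnalyzer extends Analyzer {
    private final Analyzer delegate = new StandardAnalyzer();

    public TokenStream tokenStream(String fieldName, Reader reader) {
        StringBuilder encoded = new StringBuilder();
        char[] buf = new char[4096];
        try {
            int n;
            while ((n = reader.read(buf)) != -1) {
                encoded.append(buf, 0, n);
            }
            byte[] decoded = Base64.getMimeDecoder().decode(encoded.toString());
            String text = new String(decoded, "UTF-8");
            return delegate.tokenStream(fieldName, new StringReader(text));
        } catch (IOException e) {
            throw new RuntimeException("could not decode field " + fieldName, e);
        }
    }
}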
Otis Gospodnetic wrote:
omitNorms lets you omit field norms for certain fields when
calculating the document matching score. This can save you some RAM.
See http://issues.apache.org/jira/browse/LUCENE-448.
Thanks. What effect does this have on the quality of the returned
matches? Are there any guidelines as to when you would disable field
norms, and on which fields?
On 12/23/06, Alan Burlison <[EMAIL PROTECTED]> wrote:
...As well as centralising the index, I also want
to centralise the handling of the different document types...
My "Subversion and Solr" presentation from the last Cocoon GetTogether
might give you ideas for how to handle this, see the link at
http://wiki.apache.org/solr/SolrResources.
On 12/22/06, Alan Burlison <[EMAIL PROTECTED]> wrote:
At present the content of the Lucene index comes from many different
sources (web pages, documents, blog posts etc) and can be different
formats (plaintext, HTML, PDF etc). All the various content types are
rendered to plaintext before being indexed.
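A minimal sketch of the pattern Alan describes -- source-specific converters render their content to plaintext, and one shared client posts the result to Solr's XML update handler. The URL and field names below are invented; they would come from the actual schema:

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SolrPlaintextIndexer {
    public static void index(String id, String plaintext) throws Exception {
        String xml = "<add><doc>"
                + "<field name=\"id\">" + escape(id) + "</field>"
                + "<field name=\"body\">" + escape(plaintext) + "</field>"
                + "</doc></add>";
        HttpURLConnection conn = (HttpURLConnection)
                new URL("http://localhost:8983/solr/update").openConnection();
        conn.setDoOutput(true);
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(xml.getBytes("UTF-8"));
        }
        if (conn.getResponseCode() != 200) {
            throw new RuntimeException("update failed: " + conn.getResponseCode());
        }
    }

    // Escaping the text sidesteps the CDATA/binary issue discussed
    // elsewhere in this thread entirely.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}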
Alan,
omitNorms lets you omit field norms for certain fields when calculating
the document matching score. This can save you some RAM. See
http://issues.apache.org/jira/browse/LUCENE-448 .
For position increment gap, have a look at
http://lucene.apache.org/java/docs/api/org/apache/lucene/analy
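In raw Lucene terms, this is the per-Field switch that LUCENE-448 added (a sketch against the Lucene 2.x API; in Solr the same knob is the omitNorms attribute on a field or field type in schema.xml, and positionIncrementGap is the spacing an Analyzer reports between multiple values of the same field):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Sketch: norms cost one byte per document per field, so omitting them
// on fields where length normalization adds nothing saves that RAM.
public class NormsExample {
    public static Document build(String text, String category) {
        Document doc = new Document();
        doc.add(new Field("body", text, Field.Store.NO, Field.Index.TOKENIZED));
        Field cat = new Field("category", category, Field.Store.YES, Field.Index.TOKENIZED);
        cat.setOmitNorms(true); // no length normalization or index-time boost for this field
        doc.add(cat);
        return doc;
    }
}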