On 1/7/07 7:24 AM, "Erik Hatcher" <[EMAIL PROTECTED]> wrote:
> Care has to be taken when passing a URL to Solr for it to go fetch,
> though. There are a lot of complexities in fetching resources via
> HTTP, especially when handing something off to Solr which should be
> behind a firewall and may not be able to see the web as you would
> with your browser.

Cracking documents and spidering URLs are both big, big problems.

PDF is a horrid mess, as are old versions of MS Office. Proxies,
logins, cookies, all sorts of issues show up when fetching URLs, along
with a fun variety of misbehaving servers. I remember crashing one
server with 25 GET requests before we implemented session cookies in
our spider. Each cookieless request started a new session on the
server, which used up all of its DB connections and killed it.

If you need to do a lot of spidering and to parse many kinds of
documents, I don't know of an open source solution for that. Products
like Ultraseek and the Googlebox are about your only choices.

wunder
--
Walter Underwood
Search Guru, Netflix
Former Architect for Ultraseek
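
P.S. For anyone rolling their own spider, a rough sketch of the
session-cookie fix in Java. This is illustrative only, not Ultraseek's
actual code; the class name, URL list, and user-agent string are made up.

    import java.net.CookieHandler;
    import java.net.CookieManager;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class SpiderFetcher {
        public static void main(String[] args) throws Exception {
            // Install a process-wide cookie store so every later
            // HttpURLConnection replays the server's session cookie.
            // Without this, each GET looks like a brand-new session,
            // and on some servers every session pins a DB connection.
            CookieHandler.setDefault(new CookieManager());

            String[] urls = { /* URLs queued for fetching */ };
            for (String u : urls) {
                HttpURLConnection conn =
                        (HttpURLConnection) new URL(u).openConnection();
                conn.setRequestMethod("GET");
                conn.setRequestProperty("User-Agent", "example-spider/0.1");
                int status = conn.getResponseCode(); // cookies captured and replayed automatically
                // ... read the body and hand it to the document cracker ...
                conn.disconnect();
            }
        }
    }

With the CookieManager installed, 25 GETs ride on one server session
instead of opening 25 of them.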
