Solr is indexing XML only?

2006-04-26 Thread David Trattnig
Hello!

I'd like to set up/develop a search server. I thought I would use Lucene,
then I read about Solr, so I worked through the Solr tutorial. At first I
was really happy about the features it adds on top of Lucene, but now I've
noticed that Solr can only index XML files. Or am I completely wrong?

What should I use for the following situation:

1. Copy HTML-files to the Live-Server (via RSync)
2. Index them by the search engine
3. Exclude some "tagged" files (these files for example would have a
specific meta-data-tag)
4. Exclude HTML-tags and other unworthy stuff

How much development work would that be with Lucene or Solr (if it is
possible at all)?

Any help would be appreciated!

Thx in advance,
david


Re: Solr is indexing XML only?

2006-04-26 Thread Erik Hatcher

David,

Solr doesn't index XML files, but rather XML is used as the wrapper  
of the text that does get indexed.  The document structure is defined  
in schema.xml, and the field text to be indexed is sent wrapped in an  
XML request.


Regarding your scenario, you would need to write code that parses the
HTML as desired (taking any exclude rules into account), wraps the text
to be indexed (along with any metadata such as the HTML filename or
URL) in XML, and POSTs it to Solr using the XML structure described
here:




The XML request body is just a carrier of the data in a structured  
way, nothing more.
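
As a rough illustration of the point above, here is a minimal Python sketch that wraps already-extracted text in Solr's <add><doc><field> update message. The field names (id, title, text) and the update URL are assumptions made for the example; the real field names have to match whatever your schema.xml defines.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(doc):
    """Wrap a dict of field name -> value in a Solr <add><doc> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > when serializing
    return ET.tostring(add, encoding="unicode")


def post_to_solr(xml_body, url="http://localhost:8983/solr/update"):
    """POST an update message. The URL is an assumption, not a given."""
    request = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Hypothetical document: the field names must exist in your schema.xml.
    message = build_add_message({
        "id": "docs/index.html",
        "title": "Welcome",
        "text": "Plain text extracted from the page, tags & markup removed.",
    })
    print(message)
```

Added documents only become searchable after a separate <commit/> message is posted to the same update URL.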


Erik



Re: Solr is indexing XML only?

2006-04-26 Thread Bill Au
With Solr you can index anything Lucene can index, since Solr uses
Lucene under the covers.  The input to Solr is in XML format.  You
will need to process the data you want to index (i.e. exclude certain
files and remove HTML tags) and put it into Solr's input format.

Bill




Re: Solr is indexing XML only?

2006-04-26 Thread Chris Hostetter
: will need to process that data you want to index (ie exclude certain
: files and remove HTML tags) and put them into Solr's input format.

minor clarification: Solr does ship with two Tokenizers that do a pretty
good job of throwing away HTML markup, so you don't have to parse it
yourself -- but they are still analyzers: all of the tokens they produce
go into one field, and there's no way to use them to parse an entire HTML
file and put the <title> in one field and the <body> in another.
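
If different parts of a page should end up in different fields, that split has to happen in your own code before the text is sent to Solr. A minimal sketch using Python's standard html.parser is below. It is one possible approach, not something Solr provides: the class and function names are made up for the example, and the meta tag used to mark excluded pages (search-exclude = true) is purely hypothetical.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect <title> text, <body> text and <meta name=... content=...> pairs."""

    TRACKED = {"title", "body", "script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title_parts = []
        self.body_parts = []
        self.meta = {}
        self._open = []  # stack of tracked elements that are currently open

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attr = dict(attrs)
            if "name" in attr:
                self.meta[attr["name"].lower()] = attr.get("content") or ""
        elif tag in self.TRACKED:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        if not self._open:
            return
        current = self._open[-1]
        if current == "title":
            self.title_parts.append(data)
        elif current == "body":
            self.body_parts.append(data)
        # text inside <script> and <style> is dropped


def squash(parts):
    """Join text chunks and collapse runs of whitespace."""
    return " ".join("".join(parts).split())


def extract_fields(html, exclude_meta=("search-exclude", "true")):
    """Return title/text fields, or None if the page is tagged for exclusion."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    name, value = exclude_meta  # hypothetical marker, adjust to your own tag
    if parser.meta.get(name) == value:
        return None
    return {"title": squash(parser.title_parts), "text": squash(parser.body_parts)}
```

This does not try to cope with badly nested or unclosed tags; real-world HTML may call for a more forgiving parser.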

: > 1. Copy HTML-files to the Live-Server (via RSync)
: > 2. Index them by the search engine
: > 3. Exclude some "tagged" files (these files for example would have a
: > specific meta-data-tag)
: > 4. Exclude HTML-tags and other unworthy stuff
: >
: > How much work of development would that be with Lucene or Solr (If
: > possible)?

With the exception of item #4 in your list (which I addressed above),
the amount of work necessary to process your files and extract the text
you want to index will largely be the same regardless of whether you use
Lucene or Solr -- what Solr provides is all of the "service" layer stuff,
for example...
  * an HTTP based api so the file processing and the searching don't have
to live on the same machine.
  * a schema that allows you to say "this text should be searchable, and
this number should be sortable" without needing to hardcode those rules
into your indexer .. you can change your mind later and only modify your
schema, not your code.
  * a really smart caching system that knows when the data in your index
has been modified.

...etc.



-Hoss



distributing indexes via solr

2006-04-26 Thread Johnny Monsod
Hi,

Suppose I want the xml input submitted to solr to be distributed among a
fixed set of partitions; basically, something like round-robin among each of
them, so that each directory has a relatively equal size in terms of # of
segments.  Is there an easy way to do this?  I took a quick look at the solr
source code and it looks like the servlet bootstrap associates itself with a
SolrIndexWriter instance that is specifically tied to a directory, so
updates always go to a single directory.
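
To make the round-robin idea concrete, here is a small client-side sketch in Python. It is only an illustration under assumptions: it supposes each partition is a separate Solr update endpoint with its own index directory, and the partition URLs and the class name are hypothetical. Nothing here is a built-in Solr feature.

```python
from itertools import cycle


class RoundRobinRouter:
    """Hand out partition endpoints in strict rotation."""

    def __init__(self, partitions):
        partitions = list(partitions)
        if not partitions:
            raise ValueError("at least one partition is required")
        self._rotation = cycle(partitions)

    def route(self, xml_message):
        """Return (partition, message); the caller does the actual POST."""
        return next(self._rotation), xml_message


if __name__ == "__main__":
    # Hypothetical endpoints, one per index directory.
    router = RoundRobinRouter([
        "http://host1:8983/solr/update",
        "http://host2:8983/solr/update",
        "http://host3:8983/solr/update",
    ])
    for n in range(5):
        target, body = router.route("<add><doc>...</doc></add>")
        print(n, target)
```

Plain rotation keeps document counts roughly even, but it does not remember where a document went, so re-sending an updated document could land on a different partition than the original.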

thanks in advance,
John