Re: Parsing/indexing XML data

2006-04-14 Thread Yonik Seeley
On 4/14/06, Ken Krugler <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I've got some fields that will contain embedded XML. Two questions
> relating to that:
>
> 1. It appears as though I'll need to XML-escape the field data, as
> otherwise Solr complains about finding a start tag (one of the embedded
> tags) before it finds the end tag for a field.
>
> Is this an expected constraint?

It's not just embedded XML you have to escape... even in normal text
you need to escape XML reserved characters such as '<' or '&'.
For most text protocols, there is not much of a way around escaping stuff.

> And is XML-escaping the data the best way to handle it? This is kind
> of related to question #2...

If you are doing the escaping yourself rather than relying on an XML
library, wrapping things in a CDATA section might be easier for you.
That would prevent the necessity of escaping every character.
http://www.w3schools.com/xml/xml_cdata.asp
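To make the two options concrete, here is a small sketch in Python (the payload and variable names are just for illustration; any XML-safe encoding works, Solr itself doesn't enter into it):

```python
from xml.sax.saxutils import escape

# Field data containing embedded XML.
payload = '<title>Cats & Dogs</title>'

# Option 1: escape the reserved characters before embedding.
escaped = escape(payload)
# -> '&lt;title&gt;Cats &amp; Dogs&lt;/title&gt;'

# Option 2: wrap the raw data in a CDATA section. A literal "]]>"
# inside the data would end the section early, so split it if it
# can occur.
cdata = '<![CDATA[' + payload.replace(']]>', ']]]]><![CDATA[>') + ']]>'
```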

> 2. What would be the easiest way to ignore XML tag data while
> indexing these types of XML-containing fields?

The HTMLStrip* tokenizers might work for you, especially if the fields
aren't well-formed XML or HTML.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
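A field type wired to one of those tokenizers might look roughly like this in schema.xml (a hypothetical sketch; check the wiki page above for the exact factory names available in your Solr version):

```xml
<!-- Hypothetical text_xml field type: strips markup at index time. -->
<fieldtype name="text_xml" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```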

> It seems like I could
> define a new field type (e.g. text_xml) and set the associated
> tokenizer class to something new that I create. Though I'd have to
> un-escape the data (ick) before parsing it to skip tags.

You wouldn't have to un-escape the data before parsing because the XML
parser in Solr will have already done that.
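This is standard behavior for any conforming XML parser, not just Solr's; a quick Python illustration of entities being decoded during parsing:

```python
import xml.etree.ElementTree as ET

# The field value arrives escaped; the parser hands back the original
# characters, so downstream code sees real tags and no manual
# un-escaping is needed.
doc = ET.fromstring('<field name="body">&lt;b&gt;bold&lt;/b&gt; &amp; more</field>')
text = doc.text  # '<b>bold</b> & more'
```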

-Yonik


Re: Parsing/indexing XML data

2006-04-14 Thread Ken Krugler

Hi Yonik,

Thanks for the fast response.


> I've got some fields that will contain embedded XML. Two questions
> relating to that:
>
> 1. It appears as though I'll need to XML-escape the field data, as
> otherwise Solr complains about finding a start tag (one of the embedded
> tags) before it finds the end tag for a field.
>
> Is this an expected constraint?


It's not just embedded XML you have to escape... even in normal text
you need to escape XML reserved characters such as '<' or '&'.
For most text protocols, there is not much of a way around escaping stuff.


OK, just curious. What made it a bit confusing is that for my API, if 
the input is well-formed XML, then my XML parser is happy - it's only 
reserved XML characters that need to be escaped.



> And is XML-escaping the data the best way to handle it? This is kind
> of related to question #2...


If you are doing the escaping yourself rather than relying on an XML
library, wrapping things in a CDATA section might be easier for you.
That would prevent the necessity of escaping every character.
http://www.w3schools.com/xml/xml_cdata.asp


I'm using the org.apache.commons.lang.StringEscapeUtils class, which 
seems to work pretty well for this.
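The useful property of any such escaping utility is that it round-trips cleanly; shown here with Python's stdlib equivalents rather than the commons-lang class itself (StringEscapeUtils.escapeXml/unescapeXml behave similarly and additionally cover the quote entities):

```python
from xml.sax.saxutils import escape, unescape

# Escaping then un-escaping must return the original field data.
original = 'if (a < b && c > 0) return "done";'
round_trip = unescape(escape(original))
```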



 > 2. What would be the easiest way to ignore XML tag data while
 > indexing these types of XML-containing fields?

The HTMLStrip* tokenizers might work for you, especially if the fields
aren't well-formed XML or HTML.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters


> It seems like I could
> define a new field type (e.g. text_xml) and set the associated
> tokenizer class to something new that I create. Though I'd have to
> un-escape the data (ick) before parsing it to skip tags.


You wouldn't have to un-escape the data before parsing because the XML
parser in Solr will have already done that.


Great advice - let me give that a try.

Thanks!

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"


RE: Interest in Extending SOLR

2006-04-14 Thread Chris Hostetter

: One last note: last night, I did spend a bit of time looking into what
: exactly it would mean to add support for object types in SOLR. I
: modified the code base to support the object type tag in the schema,
: providing a working proof of concept (I'm happy to send a sample schema
: if anybody is interested). The main changes:

: * Modify IndexSchema to keep an object type
: * Provide a factory in SolrCore that returns the correct
: instance of SolrCore based on object type
: * Modify loading of schema to load one copy per object type

I'm confused ... once you made these modifications, did you have a
separate index per objectType, each with its own schema? ... the separate
SolrCore instances seem to suggest total isolation, so there was no way to
query across all objectTypes?


-Hoss



RE: Interest in Extending SOLR

2006-04-14 Thread Chris Hostetter

: I definitely like the idea of support for multiple indexes based on
: partitioning data that is NOT tied to a predefined element named
: objectType. If we combine this with Chris' mention of completing the
: work to support multiple schemas via multiple webapps in the same
: servlet container, then I no longer see an immediate need to have more
: than one schema per webapp. The concept would be:

Yonik already added support for multiple webapp instances (with unique
schemas) to the Near Term task list ... I've also added a
brainstorming page to the wiki with some ideas for implementing index
partitioning to the "Ideas for the future" section...

http://wiki.apache.org/solr/TaskList
http://wiki.apache.org/solr/IndexPartitioning

...the more I think about it, though, the less I'm convinced this is
absolutely necessary.  I have a feeling that the built-in DocSet caching
Solr does and the search methods that allow you to filter by a DocSet (or
a query which is converted to a DocSet) would probably be "fast enough"
most times.

I would encourage you to experiment more with Solr and test out its
performance before assuming you have to get down into the nitty-gritty
stuff and partition the index (just because it improved the performance of
straight Lucene, doesn't mean Solr's built-in caching mechanisms aren't
already better)
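For instance, rather than a separate partition per objectType, every document could carry an objectType field and searches could be restricted by a cached filter; the parameter names below are illustrative of the standard request handler, not a committed API:

```
# Hypothetical request: the fq clause is resolved to a DocSet once,
# cached, and intersected with subsequent queries.
http://localhost:8983/solr/select?q=title:lucene&fq=objectType:book
```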


-Hoss