Re: Parsing/indexing XML data
On 4/14/06, Ken Krugler <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I've got some fields that will contain embedded XML. Two questions
> relating to that:
>
> 1. It appears as though I'll need to XML-escape the field data, as
> otherwise Solr complains about finding a start tag (one of the
> embedded tags) before it finds the end tag for a field.
>
> Is this an expected constraint?

It's not just embedded XML you have to escape... even in normal text
you need to escape XML reserved characters such as '<' or '&'. For
most text protocols, there is not much of a way around escaping stuff.

> And is XML-escaping the data the best way to handle it? This is kind
> of related to question #2...

If you are doing the escaping yourself rather than relying on an XML
library, wrapping things in a CDATA section might be easier for you,
since it avoids having to escape individual characters.
http://www.w3schools.com/xml/xml_cdata.asp

> 2. What would be the easiest way to ignore XML tag data while
> indexing these types of XML-containing fields?

The HTMLStrip* tokenizers might work for you, especially if the fields
aren't well-formed XML or HTML.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

> It seems like I could
> define a new field type (e.g. text_xml) and set the associated
> tokenizer class to something new that I create. Though I'd have to
> un-escape the data (ick) before parsing it to skip tags.

You wouldn't have to un-escape the data before parsing, because the
XML parser in Solr will have already done that.

-Yonik
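The two options Yonik mentions above — escaping the XML reserved
characters, or wrapping the value in a CDATA section — can be sketched
in plain Java. This is an illustrative sketch only, not Solr code; the
class and method names are hypothetical:

```java
// Minimal sketch of the two approaches: escaping XML reserved
// characters vs. wrapping the field value in a CDATA section.
// (Hypothetical names; this is not a Solr API.)
public class XmlFieldEscaping {

    // Escape the five XML reserved characters so that embedded markup
    // is treated as plain text by the parser. '&' must be replaced
    // first, or the other replacements would be double-escaped.
    static String escapeXml(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;")
                .replace("'", "&apos;");
    }

    // Wrap in CDATA instead: no per-character escaping, but the one
    // sequence a CDATA section cannot contain ("]]>") is handled by
    // splitting it across two adjacent CDATA sections.
    static String wrapCdata(String s) {
        return "<![CDATA[" + s.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        String field = "<b>bold</b> & more";
        System.out.println(escapeXml(field));
        // → &lt;b&gt;bold&lt;/b&gt; &amp; more
        System.out.println(wrapCdata(field));
        // → <![CDATA[<b>bold</b> & more]]>
    }
}
```

As Yonik notes, CDATA is mainly attractive when you are building the
update message by hand; an XML library will do the escaping for you.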
Re: Parsing/indexing XML data
Hi Yonik,

Thanks for the fast response.

>> I've got some fields that will contain embedded XML. Two questions
>> relating to that:
>>
>> 1. It appears as though I'll need to XML-escape the field data, as
>> otherwise Solr complains about finding a start tag (one of the
>> embedded tags) before it finds the end tag for a field.
>>
>> Is this an expected constraint?
>
> It's not just embedded XML you have to escape... even in normal text
> you need to escape XML reserved characters such as '<' or '&'. For
> most text protocols, there is not much of a way around escaping stuff.

OK, just curious. What made it a bit confusing is that for my API, if
the input is well-formed XML, then my XML parser is happy - it's only
reserved XML characters that need to be escaped.

>> And is XML-escaping the data the best way to handle it? This is kind
>> of related to question #2...
>
> If you are doing the escaping yourself rather than relying on an XML
> library, wrapping things in a CDATA section might be easier for you,
> since it avoids having to escape individual characters.
> http://www.w3schools.com/xml/xml_cdata.asp

I'm using the org.apache.commons.lang.StringEscapeUtils class, which
seems to work pretty well for this.

>> 2. What would be the easiest way to ignore XML tag data while
>> indexing these types of XML-containing fields?
>
> The HTMLStrip* tokenizers might work for you, especially if the fields
> aren't well-formed XML or HTML.
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

>> It seems like I could define a new field type (e.g. text_xml) and
>> set the associated tokenizer class to something new that I create.
>> Though I'd have to un-escape the data (ick) before parsing it to
>> skip tags.
>
> You wouldn't have to un-escape the data before parsing, because the
> XML parser in Solr will have already done that.

Great advice - let me give that a try. Thanks!

-- Ken

--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
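The text_xml field type Ken describes, combined with the HTMLStrip
tokenizers Yonik suggests, might look roughly like this in schema.xml.
This is a hedged sketch: the factory class names are taken from the
wiki page linked above, and should be checked against the Solr version
in use:

```xml
<!-- Sketch of a field type whose analyzer strips embedded markup
     before tokenizing, so tags are ignored at index time.
     Verify class names against your Solr release. -->
<fieldtype name="text_xml" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.HTMLStripWhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
```

With this in place, no custom tokenizer code is needed: fields declared
with type="text_xml" would have their tags stripped during analysis,
after Solr's XML parser has already un-escaped the field value.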
RE: Interest in Extending SOLR
: One last note: last night, I did spend a bit of time looking into what
: exactly it would mean to add support for object types in SOLR. I
: modified the code base to support the object type tag in the schema,
: providing a working proof of concept (I'm happy to send a sample schema
: if anybody is interested). The main changes:
:   * Modify IndexSchema to keep an object type
:   * Provide a factory in SolrCore that returns the correct
:     instance of SolrCore based on object type
:   * Modify loading of schema to load one copy per object type

I'm confused ... once you made these modifications, did you have a
separate index per objectType, each with its own schema? ... The
separate SolrCore instances seem to suggest total isolation, so there
was no way to query across all objectTypes?

-Hoss
RE: Interest in Extending SOLR
: I definitely like the idea of support for multiple indexes based on
: partitioning data that is NOT tied to a predefined element named
: objectType. If we combine this with Chris' mention of completing the
: work to support multiple schemas via multiple webapps in the same
: servlet container, then I no longer see an immediate need to have more
: than one schema per webapp. The concept would be:

Yonik already added support for multiple webapp instances (with unique
schemas) to the Near Term task list ... I've also added a brainstorming
page with some ideas for implementing index partitioning to the wiki's
"Ideas for the future" section...

http://wiki.apache.org/solr/TaskList
http://wiki.apache.org/solr/IndexPartitioning

...The more I think about it, though, the less I'm convinced this is
absolutely necessary. I have a feeling that the built-in DocSet caching
Solr does, and the search methods that let you filter by a DocSet (or
by a query which is converted to a DocSet), would probably be "fast
enough" most of the time.

I would encourage you to experiment more with Solr and test its
performance before assuming you have to get down into the nitty-gritty
and partition the index (just because partitioning improved the
performance of straight Lucene doesn't mean Solr's built-in caching
mechanisms aren't already better).

-Hoss
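The alternative Hoss sketches — one shared index with a discriminator
field, filtered at query time, instead of a physically separate index
per object type — can be illustrated with a toy example. This uses no
Solr or Lucene APIs at all (all names here are hypothetical); it only
shows the shape of the idea:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration (no Solr/Lucene APIs): every object type lives in
// one logical "index", and a query is restricted to one type by an
// objectType filter rather than by routing to a separate index.
public class TypeFilterSketch {

    static final List<Map<String, String>> index = new ArrayList<>();

    // Add a document carrying its objectType as an ordinary field.
    static void add(String objectType, String title) {
        Map<String, String> doc = new HashMap<>();
        doc.put("objectType", objectType);
        doc.put("title", title);
        index.add(doc);
    }

    // Combine the user's query (a substring match here) with an
    // objectType filter, in the spirit of Solr's cached DocSet filters.
    static List<String> search(String objectType, String term) {
        List<String> hits = new ArrayList<>();
        for (Map<String, String> doc : index) {
            if (doc.get("objectType").equals(objectType)
                    && doc.get("title").contains(term)) {
                hits.add(doc.get("title"));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        add("book", "Lucene in Action");
        add("article", "Lucene caching notes");
        System.out.println(search("book", "Lucene"));
        // only the "book" document matches
    }
}
```

In real Solr the filter side of this is what the cached DocSets buy
you: the per-type filter is computed once and intersected cheaply with
each query, which is why Hoss suggests measuring before partitioning.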