Hoss,

There were a few comments about schema files on MarkMail between you and Grant a couple of months ago, with no big demand for one for the schema.xml file. Before I drop this, would you consider taking a look at the XSD file below for schema.xml and perhaps submitting it to the SVN repository? I can send it to you in a zip file so that it is formatted more nicely.
Thanks,
Peter

-----Original Message-----
From: Lenahan, Peter
Sent: Saturday, November 15, 2008 10:08 PM
To: solr-user@lucene.apache.org
Subject: :TODO: we should try to make a DTD for the schema, DONE as XSD instead

From: Peter Lenahan

I noticed on the Solr wiki the comment " :TODO: we should try to make a DTD for the schema ".

Well, that wasn't very hard, and an XSD file is much better to work with than a DTD, so I created an XSD file. But I have a few questions about the results of applying the XSD to the various schema.xml files I downloaded. Actually, running the tests was more work than writing the schema.xsd file.

The schema.xsd file is at the bottom of this posting. The tests were not all clean.

First off, to use the schema.xsd file, you need to place it in the directory that contains the schema.xml file. Then you change the <schema> element in the schema.xml file to reference the XSD file as follows.

<schema name="Solr schema.xml file" version="1.1"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:noNamespaceSchemaLocation="schema.xsd">

With this declaration the elements are validated without a namespace. It would be better if all the elements were in an assigned namespace, but testing that would require changing all the elements in the files.

Here are the results of applying the XSD to all the schema.xml files available in the Solr download. I have a few questions about them.

In some schema.xml files the order of the elements "defaultSearchField" and "uniqueKey" is reversed. I allowed this, but I was not sure whether it was intentional. Should they be allowed in either order?

I am also not a committer, so would someone like to step up and add the XSD file to the source control system?

Thanks,
Peter

Test Summary:

I ran the tests by hand on each of the files; I didn't create a test script.

First there were 12 passing files.
There were a couple of files that contained attributes for which the documentation was not complete.
Then there were 3 failures. I think that these are errors in the schema.xml files.

============================================================================
1 - PASSED
src\test\test-files\solr\conf
Validating schema-not-required-unique-key.xml...
The XML document schema-not-required-unique-key.xml is valid
============================================================================
2 - PASSED
apache-solr-1.3.0\contrib\dataimporthandler\src\test\resources\solr\conf
Validating dataimport-schema.xml...
The XML document dataimport-schema.xml is valid
============================================================================
3 - PASSED
apache-solr-1.3.0\client\ruby\solr-ruby\solr\test\
Validating schema.xml...
The XML document schema.xml is valid
============================================================================
4 - PASSED
Validating apache-solr-1.3.0\example\example-DIH\solr\db\conf\schema.xml...
The XML document schema.xml is valid
============================================================================
5 - PASSED
Validating apache-solr-1.3.0\example\multicore\core0\conf\schema.xml...
The XML document schema.xml is valid
============================================================================
6 - PASSED
Validating apache-solr-1.3.0\example\multicore\core1\conf\schema.xml...
The XML document schema.xml is valid
============================================================================
7 - PASSED
Validating apache-solr-1.3.0\example\solr\conf\schema.xml...
The XML document schema.xml is valid
============================================================================
8 - PASSED
apache-solr-1.3.0\src\test\test-files\solr\conf
Validating bad-schema.xml...
file:///c:/transform/tests/bad-schema.xml:44,11: Element 'fieldType' is not valid for content model: '(field,dynamicField)'
The XML document bad-schema.xml is NOT valid (1 errors)

I think that this is the correct result, because the fieldType occurs in the middle of the fields declaration.
============================================================================
9 - PASSED
apache-solr-1.3.0\src\test\test-files\solr
Validating crazy-path-to-schema.xml...
The XML document crazy-path-to-schema.xml is valid
============================================================================
10 - PASSED
apache-solr-1.3.0\client\ruby\solr-ruby\solr\conf\
Validating schema.xml...
The XML document schema.xml is valid
============================================================================
11 - PASSED
apache-solr-1.3.0\src\test\test-files\solr\shared\conf
Validating schema.xml...
The XML document schema.xml is valid
============================================================================
12 - PASSED, Initially FAILED but Now PASSING

When I validated this schema file there were three errors. These attributes don't seem to be documented elsewhere, and I didn't find the doc for the class ExternalFileField. I found some material on SourceForge, but I didn't follow through on it.

I have added these attributes to the schema.xsd file. If this is correct, they should be documented somewhere. If this is incorrect, then the attributes should be removed from the schema.

apache-solr-1.3.0\src\test\test-files\solr\conf
Validating schema11.xml...
schema11.xml:237,132: Attribute 'keyField' is not declared for element 'fieldType'
schema11.xml:237,132: Attribute 'defVal' is not declared for element 'fieldType'
schema11.xml:237,132: Attribute 'valType' is not declared for element 'fieldType'

Line:237: <fieldType name="file" keyField="id" defVal="1" stored="false" indexed="false" class="solr.ExternalFileField" valType="float"/>

The XML document schema11.xml is NOT valid (3 errors)
--------------------
After adding the attributes missing from the Wiki. Finally PASSING
apache-solr-1.3.0\src\test\test-files\solr\conf
Validating schema11.xml...
The XML document schema11.xml is valid
============================================================================
13 - FAILED(1) - Question about attributes?
apache-solr-1.3.0\src\test\test-files\solr\conf
Validating schema-required-fields.xml...

I was not sure about this one. I didn't want to add these missing attributes to the schema unless they are correct, and they don't seem to be in the documentation. Can anyone tell me if they should be there?

Is there a 'name' attribute on the filter element?
Is there a 'sortMissingFirst' attribute on the field element?

Also, there are a couple of cases where the analyzer doesn't have a tokenizer and filter. I assume that is correct, so I changed the definition of the analyzer element from what I saw in the documentation to allow no tokenizer and to allow a class attribute on the analyzer tag. I imagine that this format should be documented on the wiki as well. I didn't do it because I wasn't sure.
file:///c:/transform/tests/schema-required-fields.xml:116,72: Attribute 'class' is not declared for element 'analyzer'
file:///c:/transform/tests/schema-required-fields.xml:116,72: Element 'analyzer' is not valid for content model: '(tokenizer,filter)'
file:///c:/transform/tests/schema-required-fields.xml:270,89: Attribute 'name' is not declared for element 'filter'
file:///c:/transform/tests/schema-required-fields.xml:365,96: Attribute 'sortMissingFirst' is not declared for element 'field'
The XML document schema-required-fields.xml is NOT valid (4 errors)

The first 2 errors can be resolved easily if this is the correct syntax.

My initial definition of the analyzer element was a required tokenizer followed by optional filters.

<xs:element name="analyzer">
  <xs:complexType>
    <xs:sequence>
      <xs:element ref="tokenizer"/>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="filter"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:ID"/>
    <xs:attribute name="type" type="xs:NCName"/>
  </xs:complexType>
</xs:element>

To allow no tokenizer and to allow the class attribute on the analyzer element, I changed the schema as follows.

<xs:element name="analyzer">
  <xs:complexType>
    <xs:sequence>
      <xs:element minOccurs="0" ref="tokenizer"/>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="filter"/>
    </xs:sequence>
    <xs:attribute name="id" type="xs:ID"/>
    <xs:attribute name="type" type="xs:NCName"/>
    <xs:attribute name="class" type="xs:string"/>
  </xs:complexType>
</xs:element>

My question really is: is this correct?

Now my test results look as follows.

Validating schema-required-fields.xml...
file:///c:/transform/tests/schema-required-fields.xml:270,89: Attribute 'name' is not declared for element 'filter'
file:///c:/transform/tests/schema-required-fields.xml:365,96: Attribute 'sortMissingFirst' is not declared for element 'field'
The XML document schema-required-fields.xml is NOT valid (2 errors)
============================================================================
14 - FAILED(2) - Error in the schema file

copyField is defined incorrectly in this schema.

Validating apache-solr-1.3.0\example\example-DIH\solr\rss\conf\schema.xml...
/apache-solr-1.3.0/example/example-DIH/solr/rss/conf/schema.xml:303,11: Element 'copyField' is not valid for content model: '(field,dynamicField)'
The XML document schema.xml is NOT valid (1 errors)
============================================================================
15 - FAILED(3) 2 problems in this file

FAILED: not sure if I should add these attributes to the elements filter and field.

DOC MISSING in Wiki "preserveOriginal"
\src\test\test-files\solr\conf\schema.xml
preserveOriginal="1" is missing from the Wiki page when defining the filter solr.WordDelimiterFilterFactory

From: http://marc.info/?l=solr-dev&m=121483672330304&w=2
"Geoffrey Young (JIRA)" <jira () apache ! org>
Date: 2008-06-30 14:37:45

When doing prefix searching, you need to hang on to the original term, otherwise you'll miss many matches you should be making.

Data: ABC-12345
WordDelimiterFilter may change this into
  ABC 12345 ABC12345
A user may enter a search such as
  ABC\-123*
which will fail to find a match given the above scenario.

The attached patch will allow the use of the "preserveOriginal" option to WordDelimiterFilter and will analyze as
  ABC 12345 ABC12345 ABC-12345
in which case we will get a positive match.
---------------------
15a - FAILED(3a)

Not sure if I should add these attributes to the elements filter and field.

apache-solr-1.3.0\src\test\test-files\solr\conf\schema.xml
Validating schema.xml...
schema.xml:287,89: Attribute 'name' is not declared for element 'filter'
schema.xml:384,96: Attribute 'sortMissingFirst' is not declared for element 'field'
The XML document schema.xml is NOT valid (2 errors)

Error the same as: (apache-solr-1.3.0\src\test\test-files\solr\conf\schema-required-fields.xml)
============================================================================

I can send this in a different format, perhaps a zip file, if you wish.

Peter

schema.xsd

<?xml version="1.0" encoding="UTF-8"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="qualified">
  <xs:annotation>
    <xs:documentation xml:lang="EN">
      Please see the Solr wiki for complete details of the elements and
      attributes defined in this file.

      For more information on how to customize this file, please see
      http://wiki.apache.org/solr/SchemaXml

      To learn about XML Schema files, there is a primer available at:
      http://www.w3.org/TR/xmlschema-0/

      This XML Schema represents the definitions of the Solr schema file
      elements and attributes. The Solr schema file is used within the Solr
      product to define the relationships of fields to correctly index and
      search the data in the Lucene index.
      Since the Solr product is extensible in many ways, if you create your
      own filter and define parameter names for it that have not previously
      been defined as filter attributes, you will need to add those attribute
      names to the "filterAttributeGroup" attribute group within this file
      for your schema file to validate correctly.

      Created: November 2008
    </xs:documentation>
  </xs:annotation>

  <xs:simpleType name="logicalConjunction">
    <xs:restriction base="xs:string">
      <xs:enumeration value="AND"/>
      <xs:enumeration value="OR"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="x01Type">
    <xs:annotation>
      <xs:documentation> </xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:string">
      <xs:pattern value="[0-1]"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="phonicFiltersType">
    <xs:annotation>
      <xs:documentation>For use with org.apache.lucene.analysis.PhoneticFilter</xs:documentation>
    </xs:annotation>
    <xs:restriction base="xs:string">
      <xs:enumeration value="DoubleMetaphone"/>
      <xs:enumeration value="Metaphone"/>
      <xs:enumeration value="Soundex"/>
      <xs:enumeration value="RefinedSoundex"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:simpleType name="porterStemmerlanguageType">
    <xs:annotation>
      <xs:documentation>For use with org.apache.lucene.analysis.SnowballPorterFilter</xs:documentation>
    </xs:annotation>
    <xs:restriction id="langList" base="xs:string">
      <xs:enumeration value="Danish"/>
      <xs:enumeration value="Dutch"/>
      <xs:enumeration value="English"/>
      <xs:enumeration value="Finnish"/>
      <xs:enumeration value="French"/>
      <xs:enumeration value="German2"/>
      <xs:enumeration value="German"/>
      <xs:enumeration value="Italian"/>
      <xs:enumeration value="Kp"/>
      <xs:enumeration value="Lovins"/>
      <xs:enumeration value="Norwegian"/>
      <xs:enumeration value="Porter"/>
      <xs:enumeration value="Portuguese"/>
      <xs:enumeration value="Russian"/>
      <xs:enumeration value="Spanish"/>
      <xs:enumeration value="Swedish"/>
    </xs:restriction>
  </xs:simpleType>

  <xs:attributeGroup name="fieldtypeAttributesGroup">
    <xs:attribute name="class" type="xs:NCName" use="required"/>
    <!-- Field types that store text (TextField, StrField) support compression of stored content -->
    <xs:attribute name="compressed" type="xs:boolean"/>
    <!-- compressThreshold is the minimum length required for text compression to be invoked.
         This applies only if compressed=true; a common pattern is to set compressThreshold
         on the field type definition, and turn compression on and off in the individual
         field definitions. -->
    <xs:attribute name="compressThreshold" type="xs:integer"/>
    <!-- used for class="solr.ExternalFileField" -->
    <xs:attribute name="defVal" type="xs:string"/>
    <!-- -->
    <xs:attribute name="id" type="xs:ID" use="optional"/>
    <!-- -->
    <xs:attribute name="indexed" type="xs:boolean" use="optional"/>
    <!-- -->
    <!-- used for class="solr.ExternalFileField" -->
    <xs:attribute name="keyField" type="xs:string" use="optional"/>
    <xs:attribute name="multiValued" type="xs:boolean" use="optional"/>
    <!-- -->
    <xs:attribute name="name" type="xs:NCName" use="required"/>
    <!-- -->
    <xs:attribute name="omitNorms" type="xs:boolean" use="optional"/>
    <!-- -->
    <xs:attribute name="positionIncrementGap" type="xs:nonNegativeInteger" use="optional"/>
    <!-- -->
    <xs:attribute name="sortMissingFirst" type="xs:boolean" use="optional"/>
    <!-- -->
    <xs:attribute name="sortMissingLast" type="xs:boolean" use="optional"/>
    <!-- -->
    <xs:attribute name="stored" type="xs:boolean" use="optional"/>
    <!-- -->
    <xs:attribute name="termPositions" type="xs:boolean" use="optional"/>
    <!-- -->
    <xs:attribute name="termVectors" type="xs:boolean" use="optional"/>
    <!-- used for class="solr.ExternalFileField" -->
    <xs:attribute name="valType" type="xs:string" use="optional"/>
  </xs:attributeGroup>

  <xs:attributeGroup name="filterAttributeGroup">
    <!-- The "class" attribute is required for all filters. -->
    <xs:attribute name="class" type="xs:NCName" use="required"/>
    <!-- The "catenateAll" attribute is required for "WordDelimiterFilterFactory" filters. -->
    <xs:attribute name="catenateAll" type="x01Type" use="optional"/>
    <!-- The "catenateNumbers" attribute is required for "WordDelimiterFilterFactory" filters. -->
    <xs:attribute name="catenateNumbers" type="x01Type" use="optional"/>
    <!-- The "catenateWords" attribute is required for "WordDelimiterFilterFactory" filters. -->
    <xs:attribute name="catenateWords" type="x01Type" use="optional"/>
    <!-- The "enablePositionIncrements" attribute is required for "StopFilterFactory" filters. -->
    <xs:attribute name="enablePositionIncrements" type="xs:boolean" use="optional"/>
    <!-- The "encoder" attribute is required for "PhoneticFilterFactory" filters. -->
    <xs:attribute name="encoder" type="phonicFiltersType" use="optional"/>
    <!-- The "expand" attribute is required for "SynonymFilterFactory" filters. -->
    <xs:attribute name="expand" type="xs:boolean" use="optional"/>
    <!-- The "generateNumberParts" attribute is required for "WordDelimiterFilterFactory" filters. -->
    <xs:attribute name="generateNumberParts" type="x01Type" use="optional"/>
    <!-- The "generateWordParts" attribute is required for "WordDelimiterFilterFactory" filters. -->
    <xs:attribute name="generateWordParts" type="x01Type" use="optional"/>
    <!-- The "id" attribute is required for "" filters. -->
    <xs:attribute name="id" type="xs:ID" use="optional"/>
    <!-- The "ignoreCase" attribute is optional for the "StopFilterFactory, SynonymFilterFactory". -->
    <xs:attribute name="ignoreCase" type="xs:boolean" use="optional"/>
    <!-- The "inject" attribute is required for "PhoneticFilterFactory" filters. -->
    <xs:attribute name="inject" type="xs:boolean" use="optional"/>
    <!-- The "language" attribute is required for "SnowballPorterFilterFactory" filters. -->
    <xs:attribute name="language" type="porterStemmerlanguageType" use="optional"/>
    <!-- The "max" attribute is required for the "LengthFilterFactory" filters. -->
    <xs:attribute name="max" type="xs:nonNegativeInteger" use="optional"/>
    <!-- The "maxShingleSize" attribute is required for the "ShingleFilterFactory" filters. -->
    <xs:attribute name="maxShingleSize" type="xs:nonNegativeInteger" default="2" use="optional"/>
    <!-- The "min" attribute is required for the "LengthFilterFactory" filters. -->
    <xs:attribute name="min" type="xs:nonNegativeInteger" use="optional"/>
    <!-- The "outputUnigrams" attribute is required for "ShingleFilterFactory" filters. -->
    <xs:attribute name="outputUnigrams" type="xs:boolean" use="optional"/>
    <!-- The "pattern" attribute is required for "PatternTokenizerFactory" filters. -->
    <xs:attribute name="pattern" type="xs:string" use="optional"/>
    <!-- solr.WordDelimiterFilterFactory new feature in
         Add the ability to preserve the original term when using WordDelimiterFilter -->
    <xs:attribute name="preserveOriginal" type="x01Type" use="optional"/>
    <!-- The "protected" attribute is required for "EnglishPorterFilterFactory" filters. -->
    <xs:attribute name="protected" type="xs:NCName" use="optional"/>
    <!-- The "replace" attribute is required for "" filters. -->
    <xs:attribute name="replace" type="xs:NCName" use="optional"/>
    <!-- The "replacement" attribute is required for "" filters. -->
    <xs:attribute name="replacement" type="xs:string" use="optional"/>
    <!-- The "splitOnCaseChange" attribute is required for "WordDelimiterFilter " filters. -->
    <xs:attribute name="splitOnCaseChange" type="x01Type" use="optional"/>
    <!-- The "synonyms" attribute is required for "SynonymFilterFactory" filters. -->
    <xs:attribute name="synonyms" type="xs:NCName" use="optional"/>
    <!-- The "updateOffsets" attribute is optional for "TrimFilterFactory" filters. -->
    <xs:attribute name="updateOffsets" type="xs:boolean" use="optional"/>
    <!-- The "words" attribute is optional for "StopFilterFactory" filters.
         default words will be used if not specified. -->
    <xs:attribute name="words" type="xs:NCName" use="optional"/>
  </xs:attributeGroup>

  <xs:attributeGroup name="fieldAttributeGroup">
    <xs:attribute name="default" type="xs:NMTOKEN"/>
    <xs:attribute name="compressed" type="xs:boolean"/>
    <xs:attribute name="compressThreshold" type="xs:integer"/>
    <xs:attribute name="id" type="xs:ID"/>
    <xs:attribute name="indexed" type="xs:boolean"/>
    <xs:attribute name="multiValued" type="xs:boolean"/>
    <xs:attribute name="name" use="required" type="xs:string"/>
    <xs:attribute name="omitNorms" type="xs:boolean"/>
    <xs:attribute name="required" type="xs:boolean"/>
    <xs:attribute name="stored" type="xs:boolean"/>
    <xs:attribute name="termOffsets" type="xs:boolean"/>
    <xs:attribute name="termPositions" type="xs:boolean"/>
    <xs:attribute name="termVectors" type="xs:boolean"/>
    <xs:attribute name="type" use="required" type="xs:string"/>
  </xs:attributeGroup>

  <xs:element name="schema">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="types"/>
        <xs:element ref="fields"/>
        <xs:choice maxOccurs="unbounded">
          <xs:element maxOccurs="1" ref="defaultSearchField"/>
          <xs:element maxOccurs="1" ref="uniqueKey"/>
        </xs:choice>
        <xs:element minOccurs="0" ref="solrQueryParser"/>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="copyField"/>
        <xs:element ref="similarity" minOccurs="0" maxOccurs="1"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:ID"/>
      <xs:attribute name="name" use="required" type="xs:string"/>
      <xs:attribute name="version" use="required" type="xs:decimal"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="types">
    <xs:complexType>
      <xs:choice maxOccurs="unbounded">
        <!-- Both elements fieldType and fieldtype are the same -->
        <xs:element maxOccurs="unbounded" ref="fieldType"/>
        <xs:element maxOccurs="unbounded" ref="fieldtype"/>
      </xs:choice>
      <xs:attribute name="id" type="xs:ID"/>
    </xs:complexType>
  </xs:element>

  <!-- Both elements fieldType and fieldtype are the same -->
  <xs:element name="fieldType" type="fieldTypeORfieldtype"/>
  <xs:element name="fieldtype" type="fieldTypeORfieldtype"/>

  <xs:complexType name="fieldTypeORfieldtype">
    <xs:sequence>
      <xs:element minOccurs="0" maxOccurs="unbounded" ref="analyzer"/>
    </xs:sequence>
    <xs:attributeGroup ref="fieldtypeAttributesGroup"/>
  </xs:complexType>

  <xs:element name="analyzer">
    <xs:complexType>
      <xs:sequence minOccurs="0">
        <xs:element ref="tokenizer"/>
        <xs:element minOccurs="0" maxOccurs="unbounded" ref="filter"/>
      </xs:sequence>
      <xs:attribute name="class" type="xs:string"/>
      <xs:attribute name="id" type="xs:ID"/>
      <xs:attribute name="type" type="xs:NCName"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="tokenizer">
    <xs:complexType>
      <xs:attribute name="class" use="required" type="xs:NCName"/>
      <xs:attribute name="id" type="xs:ID"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="filter">
    <xs:complexType>
      <xs:attributeGroup ref="filterAttributeGroup"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="fields">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="field"/>
        <xs:element maxOccurs="unbounded" minOccurs="0" ref="dynamicField"/>
      </xs:sequence>
      <xs:attribute name="id" type="xs:ID"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="field">
    <xs:complexType>
      <xs:attributeGroup ref="fieldAttributeGroup"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="dynamicField">
    <xs:complexType>
      <xs:attributeGroup ref="fieldAttributeGroup"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="uniqueKey" type="xs:NCName"/>
  <xs:element name="defaultSearchField" type="xs:NCName"/>

  <xs:element name="solrQueryParser">
    <xs:complexType>
      <xs:attribute name="defaultOperator" use="optional" type="logicalConjunction" default="OR"/>
      <xs:attribute name="id" type="xs:ID"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="copyField">
    <xs:complexType>
      <xs:attribute name="dest" use="required" type="xs:string"/>
      <xs:attribute name="id" type="xs:ID"/>
      <xs:attribute name="source" use="required" type="xs:string"/>
    </xs:complexType>
  </xs:element>

  <xs:element name="similarity">
    <xs:complexType>
      <xs:sequence minOccurs="0">
        <!-- org.apache.solr.schema.CustomSimilarityFactory -->
        <xs:element name="str">
          <xs:complexType mixed="true">
            <xs:attribute name="name" type="xs:string"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
      <xs:attribute name="class" use="required" type="xs:NCName"/>
      <xs:attribute name="id" type="xs:ID"/>
    </xs:complexType>
  </xs:element>
</xs:schema>
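
The edit to the <schema> element described earlier in this message (adding the xsi namespace and xsi:noNamespaceSchemaLocation="schema.xsd") was made by hand in each file. It could probably be scripted along the following lines. This is only a minimal sketch using Python's standard library, not something that was run as part of the tests above. The function name add_xsd_reference is hypothetical, and the sketch assumes the root element is <schema> with no namespace. It only adds the reference; it does not validate anything, so a separate XSD validator is still needed. The default ElementTree parser also drops XML comments, so it is best run on copies of the files.

```python
import xml.etree.ElementTree as ET

# Namespace that defines the noNamespaceSchemaLocation attribute.
XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"


def add_xsd_reference(xml_text: str, xsd_name: str = "schema.xsd") -> str:
    """Return xml_text with xsi:noNamespaceSchemaLocation set on the root.

    Hypothetical helper for illustration only; assumes a <schema> root
    element that is not in any namespace.
    """
    # Serialize the namespace with the conventional "xsi" prefix
    # instead of an auto-generated one such as "ns0".
    ET.register_namespace("xsi", XSI_NS)

    root = ET.fromstring(xml_text)
    if root.tag != "schema":
        raise ValueError(f"expected <schema> root element, found <{root.tag}>")

    # Attributes in a namespace use the {namespace-uri}localname form.
    root.set(f"{{{XSI_NS}}}noNamespaceSchemaLocation", xsd_name)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = '<schema name="example" version="1.1"><types/><fields/></schema>'
    print(add_xsd_reference(sample))
```

Existing attributes such as name and version are kept; only the schema reference is added to the root element.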