Benoit,

Are you familiar with the VuFind project (http://www.vufind.org)? It's
worth a look at the PHP code in the import folder to see how the indexing
works (there's an XSL transformation that then updates the index).
I've also written some initial code that uses embedded Solr to do this
indexing directly from MARC format files, including holding the entire
MARCXML record in the index.
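
To make the "hold the entire record in the index" idea concrete, here is a
minimal sketch (not the VuFind code, and not my embedded Solr code) of one way
to do it: put the record into the stored field as escaped text rather than as
nested XML, so Solr only ever sees a string. The field names come from your
schema.xml; the helper names subfield and build_solr_add are made up for this
example, and the script assumes one MARCXML record per input string.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"


def subfield(record, tag, code):
    """Return the text of the first matching MARC subfield, or None."""
    path = (
        f"{{{MARC_NS}}}datafield[@tag='{tag}']/"
        f"{{{MARC_NS}}}subfield[@code='{code}']"
    )
    node = record.find(path)
    return node.text.strip() if node is not None and node.text else None


def build_solr_add(marcxml):
    """Build a Solr <add> message; the raw record travels as escaped text."""
    record = ET.fromstring(marcxml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = {
        "title": subfield(record, "245", "a"),
        "author": subfield(record, "100", "a"),
        # Assigning the string to .text makes ElementTree escape <, > and &,
        # so the record is field content and not child elements.
        "originalRecord": marcxml,
    }
    for name, value in fields.items():
        if value is not None:
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")
```

The presentation layer can then parse the stored originalRecord string as XML
again and pick whichever MARC fields it wants to display. Wrapping the record
in a CDATA section inside the XSLT output should achieve the same thing, as
long as the record itself never contains "]]>".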

You can contact me off-list if you have questions...

Wayne

Walter Underwood wrote:
> Solr is not an XML engine (or a MARC engine). It uses XML as an input format
> for fielded data. It does not index or search arbitrary XML. You need to
> convert your XML into Solr's format.
> 
> I would recommend expressing MARC in a Solr schema, then working on the
> input XML. The input XML depends on the schema.
> 
> If you need an XML engine, I'd recommend MarkLogic (commercial), a very
> good product.
> 
> wunder
> 
> On 10/5/07 12:44 AM, "PAUWELS  Benoit" <[EMAIL PROTECTED]> wrote:
> 
>> Hi,
>>
>> I wish to index well formed xml documents as they are.
>>
>> I have a database filled with MARCXML records. An example of these looks like
>> this:
>>
>>  
>>
>>         <record
>>
>>             ns0:schemaLocation="http://www.loc.gov/MARC21/slim
>> http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
>>
>>             xmlns="http://www.loc.gov/MARC21/slim"
>> xmlns:ns0="http://www.w3.org/2001/XMLSchema-instance">
>>
>>             <leader>00000nam  22      a 4500</leader>
>>
>>             <controlfield tag="001">000500000</controlfield>
>>
>>             <controlfield tag="005">20050826220257.0</controlfield>
>>
>>             <controlfield tag="008">000710s1998    xx      r     000 0 dut
>> d</controlfield>
>>
>>             <datafield ind1=" " ind2=" " tag="040">
>>
>>                 <subfield code="a">Univ</subfield>
>>
>>             </datafield>
>>
>>             <datafield ind1="1" ind2=" " tag="100">
>>
>>                 <subfield code="a">van Wetten, J. W.</subfield>
>>
>>             </datafield>
>>
>>             <datafield ind1="1" ind2="3" tag="245">
>>
>>                 <subfield code="a">De positie van vrouwen in de 
>> asielprocedure
>> /</subfield>
>>
>>                 <subfield code="c">J.W. van Wetten, N. Dijkhof, F.
>> Heide.</subfield>
>>
>>             </datafield>
>>
>>         </record>
>>
>>  
>>
>> The idea is to create Lucene indexes on specific MARC fields and store the
>> complete MARC record in Lucene 'as is'. In the presentation layer of my
>> application I would then have this complete MARC record at hand, and as such
>> have full flexibility on which MARC fields to display. So I want to create 
>> the
>> following record through XSLT and feed this to SOLR.
>>
>>  
>>
>> <doc>
>>
>> <field name="title">De positie van vrouwen in de asielprocedure</field>
>>
>> <field name="author">van Wetten, J. W.</field>
>>
>> ...
>>
>> <field name="originalRecord">
>>
>>   <record
>>
>>             ns0:schemaLocation="http://www.loc.gov/MARC21/slim
>> http://www.loc.gov/standards/marcxml/schema/MARC21slim.xsd"
>>
>>             xmlns="http://www.loc.gov/MARC21/slim"
>> xmlns:ns0="http://www.w3.org/2001/XMLSchema-instance">
>>
>>             <leader>00000nam  22      a 4500</leader>
>>
>>             <controlfield tag="001">000500000</controlfield>
>>
>>             <controlfield tag="005">20050826220257.0</controlfield>
>>
>>             <controlfield tag="008">000710s1998    xx      r     000 0 dut
>> d</controlfield>
>>
>>             <datafield ind1=" " ind2=" " tag="040">
>>
>>                 <subfield code="a">UGent</subfield>
>>
>>             </datafield>
>>
>>             <datafield ind1="1" ind2=" " tag="100">
>>
>>                 <subfield code="a">van Wetten, J. W.</subfield>
>>
>>             </datafield>
>>
>>             <datafield ind1="1" ind2="3" tag="245">
>>
>>                 <subfield code="a">De positie van vrouwen in de 
>> asielprocedure
>> /</subfield>
>>
>>                 <subfield code="c">J.W. van Wetten, N. Dijkhof, F.
>> Heide.</subfield>
>>
>>             </datafield>
>>
>>         </record>
>>
>> </field>
>>
>> </doc>
>>
>>  
>>
>> I have the following in my schema.xml:
>>
>>  
>>
>> <field name="author" type="text" indexed="true" stored="true"
>> termVectors="true"/>
>>
>> <field name="title" type="text" indexed="true" stored="true"
>> termVectors="true"/>
>>
>> <field name="originalRecord" type="text" indexed="false" stored="true"/>
>>
>>  
>>
>>  
>>
>> SOLR has of course a problem with the XML in the 'originalRecord' field.
>>
>> Is there a solution to this? Has anyone done this before?
>>
>>  
>>
>> Thanks a lot.
>>
>> Benoit.
>>
>>  
>>
>>  
>>
>> =============================
>>
>> PAUWELS Benoit
>>
>> Université Libre de Bruxelles - Libraries
>>
>> Head of Automation
>>
>> Av. F.D. Roosevelt 50, CP 180
>>
>> 1050 BRUSSELS
>>
>> Belgium
>>
>> Tel: + 32 2 650 23 91
>>
>> Fax: + 32 2 650 23 91
>>
>> =============================
>>
>>  
>>
>>  
>>
> 


-- 
/**
 * Wayne Graham
 * Earl Gregg Swem Library
 * PO Box 8794
 * Williamsburg, VA 23188
 * 757.221.3112
 * http://swem.wm.edu/blogs/waynegraham/
 */
