Hi all, I am developing a search engine for a governmental body. This search engine has to index plain XML documents that follow a custom XML schema. The documents contain information about laws and official announcements for Andalusia.
I need to implement different filters for the search. The current search engine, which can be found here [1], would need to be extended with filters for organizational bodies, kind of announcement (law, resolution, ...), and so on.

I played a bit with Nutch 0.8 and asked myself whether it is the best tool for the task. I got Nutch to index the XML documents and I can search the index as well, but I would need to add filter conditions to the search. The alternative I see would be pure Lucene, since I am not really "crawling" the site: the documents are not linked with each other, so I put all the files that have to be indexed in the urls/bulletin file. Then Zaheed pointed me to Solr, and I have played around with it a wee bit.

To give you a better impression of the underlying architecture and XML documents: each weekday there is a new bulletin (containing approx. 100 - 200 pages), e.g. [2]. This bulletin is stored on the file system and needs to be indexed. We have two different document types: summaries and dispositions. The summary looks like:

  <summary year="2006" number="209" date="27-10-2006" section="1" startPage="8" endPage="20">
    <title>1. DISPOSICIONES GENERALES</title>
    <organisation name="Consejería de la Presidencia">
      <disposition bojaYear="2006" bojaNumber="209" bojaSection="1" type="Decreto" startPage="8" endPage="10" date="10-11-2006" detail="999952" law="178/2006">
        Decreto 178/2006, de 10 de octubre, por el que se establecen normas de protección de la avifauna para las instalaciones eléctricas de alta tensión
      </disposition>
    </organisation>
    <organisation name="Consejería de Economia y Hacienda">
      <disposition bojaYear="2006" bojaNumber="209" bojaSection="1" type="Resolución" startPage="10" endPage="12" date="10-11-2006" detail="999961">
        Resolución de 10 de octubre de 2006, de la Dirección General de Tesorería y Deuda Pública, por la que se realiza una convocatoria de subasta de carácter ordinario dentro del Programa de Emisión de Bonos y Obligaciones de la Junta de Andalucía.
      </disposition>
    </organisation>
  </summary>

Following the tutorial and looking at the examples, it seems that Solr only supports one document type:

  <add><doc>
    <field name="id">3007WFP</field>
    <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
    <!-- ... -->
  </doc></add>

The root element add is "just" the command telling the server that we want to add the document. Does that mean I would need to stick with this doctype and transform our internal format into it when adding the document information?

Further, since the project is for a customer, I would need a released version before I put my engine into production. When does this community expect to make its first release, or, better asked, what are the blockers?

TIA for any information.

salu2

[1] http://andaluciajunta.es/portal/aj-bojaBuscador/0,22815,,00.html
[2] http://andaluciajunta.es/portal/boletines/2006/11/aj-bojaVerPagina-2006-11/0,23167,bi%253D693228039889,00.html
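
P.S. To make the transformation question concrete, here is a minimal sketch (Python, standard library only) of how I imagine turning one of our summaries into the <add><doc> format from the tutorial, with one Solr document per disposition. The field names (id, organisation, type, text) and the use of the detail attribute as the unique id are my own assumptions for illustration; they are not taken from any existing Solr schema.

```python
import xml.etree.ElementTree as ET

# Shortened sample in our internal summary format (one disposition only).
SUMMARY = """<summary year="2006" number="209" date="27-10-2006" section="1" startPage="8" endPage="20">
  <title>1. DISPOSICIONES GENERALES</title>
  <organisation name="Consejería de la Presidencia">
    <disposition bojaYear="2006" bojaNumber="209" bojaSection="1" type="Decreto"
                 startPage="8" endPage="10" date="10-11-2006" detail="999952" law="178/2006">
      Decreto 178/2006, de 10 de octubre, por el que se establecen normas de protección de la avifauna para las instalaciones eléctricas de alta tensión
    </disposition>
  </organisation>
</summary>"""


def add_field(doc, name, value):
    """Append <field name="...">value</field> to a <doc> element."""
    field = ET.SubElement(doc, "field", name=name)
    field.text = value


def summary_to_solr_add(summary_xml):
    """Build an <add> message containing one <doc> per <disposition>.

    Field names are hypothetical; a real schema would define its own.
    """
    summary = ET.fromstring(summary_xml)
    add = ET.Element("add")
    for org in summary.findall("organisation"):
        for disp in org.findall("disposition"):
            doc = ET.SubElement(add, "doc")
            add_field(doc, "id", disp.get("detail"))        # assumed unique key
            add_field(doc, "organisation", org.get("name"))  # filter: organizational body
            add_field(doc, "type", disp.get("type"))         # filter: kind of announcement
            add_field(doc, "text", (disp.text or "").strip())
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(summary_to_solr_add(SUMMARY))
```

Flattening the organisation name into each disposition document is what would let the search filter on organizational body and announcement type, if that is indeed the intended way to use Solr.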