Hi all,

I am developing a search engine for a governmental body. The search
engine has to index pure XML documents that follow a custom XML schema.
The documents contain information about laws and official
announcements for Andalusia.
I need to implement several filters for the search. The current search
engine, which can be found at [1], would need to be extended with
filters on the organizational body, the kind of announcement (law,
resolution, ...), and so on.

I played a bit with Nutch 0.8 and asked myself whether it is the best
tool for the task. I got Nutch to index the XML documents and I can
search the index as well, but I would need to add filter conditions to
the search. The alternative I see would be pure Lucene, since I am not
really "crawling" the site: the documents are not linked to each
other, so I put all the files that have to be indexed into the
urls/bulletin file. Then Zaheed pointed me to Solr, and I have played
around with it a wee bit.

To give you a better impression of the underlying architecture and XML
documents: each weekday there is a new bulletin (approx. 100 to 200
pages), e.g. [2]. The bulletin is stored on the file system and needs
to be indexed.

We have two different document types: summaries and dispositions. A
summary looks like:
<summary year="2006" number="209" date="27-10-2006" section="1"
  startPage="8" endPage="20">
  <title>1. DISPOSICIONES GENERALES</title>
  <organisation name="Consejería de la Presidencia">
    <disposition bojaYear="2006" bojaNumber="209"
      bojaSection="1" type="Decreto" startPage="8" endPage="10"
      date="10-11-2006" detail="999952" law="178/2006"> Decreto
      178/2006, de 10 de octubre, por el que se establecen normas de
      protección de la avifauna para las instalaciones eléctricas de
      alta tensión</disposition>
  </organisation>
  <organisation name="Consejería de Economia y Hacienda">
    <disposition bojaYear="2006" bojaNumber="209"
      bojaSection="1" type="Resolución" startPage="10"
      endPage="12" date="10-11-2006" detail="999961">
      Resolución de 10 de octubre de 2006, de la Dirección General de
      Tesorería y Deuda Pública, por la que se realiza una
      convocatoria de subasta de carácter ordinario dentro del
      Programa de Emisión de Bonos y Obligaciones de la Junta de
      Andalucía.</disposition>
  </organisation>
</summary>

Following the tutorial and looking at the examples, it seems that Solr
only supports one document type:

<add><doc>
  <field name="id">3007WFP</field>
  <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
  <!-- ... -->
</doc></add>

The root element add is "just" the command telling the server that we
want to add the document. Does that mean I would need to stick with
this document format and transform our internal format into it before
adding the document information?
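
To make the question concrete, here is a rough sketch of the kind of
transformation I imagine, written in Python only for illustration. The
Solr field names (id, type, organisation, bojaYear, bojaNumber, text)
are my own assumptions and would have to be declared in the Solr
schema; using the detail attribute as the unique id is also just a
guess. The sample input is a shortened copy of the summary above, and
the sketch emits one <doc> per disposition.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the summary shown above (disposition text abbreviated).
SUMMARY_XML = """<summary year="2006" number="209" date="27-10-2006" section="1"
  startPage="8" endPage="20">
  <title>1. DISPOSICIONES GENERALES</title>
  <organisation name="Consejería de la Presidencia">
    <disposition bojaYear="2006" bojaNumber="209" bojaSection="1"
      type="Decreto" startPage="8" endPage="10" date="10-11-2006"
      detail="999952" law="178/2006"> Decreto
      178/2006, de 10 de octubre</disposition>
  </organisation>
</summary>"""


def summary_to_solr_add(summary_xml: str) -> str:
    """Turn one summary document into a Solr <add> message.

    Each <disposition> becomes its own <doc>. All field names are
    hypothetical and must match whatever the Solr schema defines.
    """
    summary = ET.fromstring(summary_xml)
    add = ET.Element("add")
    for org in summary.findall("organisation"):
        for disp in org.findall("disposition"):
            doc = ET.SubElement(add, "doc")
            fields = {
                "id": disp.get("detail"),          # assumed to be unique
                "type": disp.get("type"),
                "organisation": org.get("name"),
                "bojaYear": disp.get("bojaYear"),
                "bojaNumber": disp.get("bojaNumber"),
                # Collapse the line breaks and indentation of the source text.
                "text": " ".join((disp.text or "").split()),
            }
            for name, value in fields.items():
                if value is not None:
                    field = ET.SubElement(doc, "field", {"name": name})
                    field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(summary_to_solr_add(SUMMARY_XML))
```

If something like this is the intended approach, the filters I need
(organisation, type) would simply become fields of each document.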

Furthermore, since the project is for a customer, I would need a released
version before I put my engine into production. When does this community
expect to make its first release, or, better asked, what are the
blockers?

TIA for any information.

salu2

[1] http://andaluciajunta.es/portal/aj-bojaBuscador/0,22815,,00.html 
[2] http://andaluciajunta.es/portal/boletines/2006/11/aj-bojaVerPagina-2006-11/0,23167,bi%253D693228039889,00.html
