On 4 November 2014 10:42, Lemke, Michael ST/HZA-ZSW <lemke...@schaeffler.com> wrote:
> On Tuesday, November 04, 2014 4:07 PM
> Alexandre Rafalovitch wrote:
>>
>> What are you actually trying to do on a business level?
>
> I am importing a wiki extract and the goal here is to extract the
> wiki's language from the filename.
>
> The language is also in an attribute within the imported XML,
> but it has a namespace. DIH doesn't find the attribute. I tried
> with and without the namespace. I'd actually prefer that option.
>
> Example:
> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/"
>   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>   xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/
>   http://www.mediawiki.org/xml/export-0.3.xsd"
>   version="0.3" xml:lang="en">
>
> Both
>   xpath="/mediawiki/@xml:lang"
>   xpath="/mediawiki/@lang"
> return nothing, while
>   xpath="/mediawiki/@version"
> correctly picks up the version attribute.
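Assuming your data-config entity looks roughly like the sketch below
(the entity, column and file names are my guesses, not taken from your
mail):

  <dataConfig>
    <dataSource type="FileDataSource" encoding="UTF-8"/>
    <document>
      <entity name="page"
              processor="XPathEntityProcessor"
              stream="true"
              forEach="/mediawiki/page"
              url="/path/to/dewiki-pages-articles.xml">
        <!-- commonField="true" copies a value seen outside the forEach
             scope (here: on the root element) onto every record; you
             presumably already rely on that for @version -->
        <field column="language"  xpath="/mediawiki/@lang"    commonField="true"/>
        <field column="version"   xpath="/mediawiki/@version" commonField="true"/>
        <field column="title"     xpath="/mediawiki/page/title"/>
        <field column="xxCONTENT" xpath="/mediawiki/page/revision/text"/>
      </entity>
    </document>
  </dataConfig>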
DIH ignores/does not support namespaces, so I would expect
xpath="/mediawiki/@lang" to work. Unless it is something that the XML
parser strips away. Possible.

>> Maybe that's
>> something that can be handled better by sticking an
>> UpdateRequestProcessor chain _after_ DIH?
>
> Haven't looked at that. Is it as simple as DIH?

Simpler :-) And you can find the full list of the processors at
http://www.solr-start.com/info/update-request-processors/
(There is a bare-bones chain sketch at the end of this mail.)

>>
>> As to your configuration, you have the xxCONTENT column definition twice.
>> It might be working, but I think it is non-deterministic.
>
> In fact there are many more xxCONTENT definitions. The idea is to
> apply many unrelated regex substitutions. That part does work.
> The actual goal is to replace mediawiki's wikitext with plain
> text.

Don't know if this is helpful:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/wikipedia/WikipediaTokenizerFactory.html
That's much later in the chain, once the text is already in Solr
(a sample field type is sketched at the end as well).

Out of clues on everything else.

Regards,
   Alex.
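To make the processor-chain idea concrete, a bare-bones chain
definition in solrconfig.xml would look something like this (the chain
name, script name, field name and regex are all made up for
illustration, untested):

  <updateRequestProcessorChain name="wiki-cleanup">
    <!-- optional: derive a language field with a small script -->
    <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <str name="script">extract-language.js</str>
    </processor>
    <!-- strip one bit of wikitext, e.g. turn [[Target|Label]] into Label -->
    <processor class="solr.RegexReplaceProcessorFactory">
      <str name="fieldName">xxCONTENT</str>
      <str name="pattern">\[\[([^|\]]*\|)?([^\]]*)\]\]</str>
      <str name="replacement">$2</str>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory"/>
    <processor class="solr.RunUpdateProcessorFactory"/>
  </updateRequestProcessorChain>

You would then point the /dataimport handler at it, e.g. with
<str name="update.chain">wiki-cleanup</str> in its defaults, or by
passing update.chain as a request parameter.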
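And if you want to try the WikipediaTokenizerFactory route, it goes
into a field type in schema.xml, roughly like this (the type name is
invented and the analysis chain is deliberately minimal):

  <fieldType name="text_wiki" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- tokenizer that understands wikitext (links, headings, bold/italic, ...) -->
      <tokenizer class="solr.WikipediaTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Keep in mind this only changes how the text is tokenized for search;
the stored value stays as raw wikitext.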