On 4 November 2014 10:42, Lemke, Michael  ST/HZA-ZSW
<lemke...@schaeffler.com> wrote:
> On Tuesday, November 04, 2014 4:07 PM
> Alexandre Rafalovitch wrote:
>>
>>What are you actually trying to do on a business level?
>
> I am importing a wiki extract and the goal here is to extract the
> wiki's language from the filename.
>
> The language is also in an attribute within the imported xml
> but it has a namespace.  DIH doesn't find the attribute.  I tried,
> with or without the namespace.  I'd actually prefer that option.
>
> Example:
> <mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/"
> xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
> xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/
> http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
>
> Both
> xpath="/mediawiki/@xml:lang"
> xpath="/mediawiki/@lang"
> return nothing while
> xpath="/mediawiki/@version"
> correctly picks up the version attribute.

DIH does not support namespaces; it simply ignores them. So I would
expect 'xpath="/mediawiki/@lang"' to work, unless the XML parser
strips that attribute away before DIH sees it. That is possible.
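
For what it's worth, here is a minimal sketch of how a namespace-aware
parser reports that attribute. It uses Python's standard
xml.etree.ElementTree purely as an illustration, not DIH's own XPath
reader, so it says nothing definite about what DIH sees. The sample is
a trimmed copy of the root element quoted above, and the variable names
are my own.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the root element from the quoted example.
SAMPLE = (
    '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" '
    'version="0.3" xml:lang="en"></mediawiki>'
)

# The reserved "xml" prefix is always bound to this namespace URI.
XML_NS = "http://www.w3.org/XML/1998/namespace"

root = ET.fromstring(SAMPLE)

# ElementTree stores namespaced attributes under their expanded name,
# "{namespace-uri}localname", so the short forms are not keys here.
version = root.get("version")
plain_lang = root.get("lang")
prefixed_lang = root.get("xml:lang")
qualified_lang = root.get("{%s}lang" % XML_NS)

print(version, plain_lang, prefixed_lang, qualified_lang)
```

With this particular parser only the fully qualified name finds the
value, while the unqualified "version" attribute is found directly.
Whether DIH's reader behaves the same way is an open question here.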

>
>>Maybe that's
>>something that can be handled better by sticking an
>>UpdateRequestProcessor chain _after_ DIH?
>
> Haven't looked at that.  Is it as simple as DIH?

Simpler :-) And you can find the full list of the processors at
http://www.solr-start.com/info/update-request-processors/

>
>>
>>As to your configuration, you have xxCONTENT column definition twice.
>>It might be working, but I think it is non-deterministic.
>
> In fact there are many more xxCONTENT definitions.  The idea is to
> apply many unrelated regex substitutions.  That part does work.
> The actual goal is to replace mediawiki's wikitext with plain
> text.

Don't know if this is helpful:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/lucene/analysis/wikipedia/WikipediaTokenizerFactory.html
Note that it applies much later in the chain, once the text is already
in Solr.
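
To make the "many unrelated regex substitutions" idea concrete, here is
a rough sketch of an ordered substitution chain, written in Python
rather than as DIH configuration. The three patterns and the names
RULES and strip_wikitext are illustrative assumptions of mine, not the
rules from the original config, and they cover only a few wikitext
constructs.

```python
import re

# Ordered (pattern, replacement) pairs. Each rule runs on the output of
# the previous one, so the order is explicit rather than accidental.
RULES = [
    # [[target|label]] -> label, and [[target]] -> target
    (re.compile(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]"), r"\1"),
    # Drop ''italic'' and '''bold''' quote markers
    (re.compile(r"'{2,}"), ""),
    # == Heading == -> Heading (one heading per line)
    (re.compile(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", re.MULTILINE), r"\1"),
]


def strip_wikitext(text: str) -> str:
    """Apply every rule in RULES, in order, and return the result."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


sample = "== Intro ==\n'''Solr''' uses [[Apache Lucene|Lucene]] and [[Java]]."
plain = strip_wikitext(sample)
print(plain)
```

Keeping the rules in a single ordered list avoids relying on how
repeated definitions of the same column happen to be evaluated.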

Out of clues on everything else.

Regards,
   Alex.
