Trouble getting "langid.map.individual" setting to work in Solr 5.0.x

David Smith Mon, 03 Aug 2015 08:06:29 -0700

I am trying to use the “langid.map.individual” setting so that field “a” can be 
detected as, say, English and mapped to “a_en”, while in the same document 
field “b” is detected as, say, German and mapped to “b_de”.


What happens in my tests is that the global language is detected (for example, 
German), but BOTH fields are mapped to “_de” as a result.  I cannot get 
individual detection or mapping to work.  Am I misunderstanding the purpose of 
this setting?

Here is the resulting document from my test:

----------------
      {
        "id": "1005!22345",
        "language": [
          "de"
        ],
        "a_de": "A title that should be detected as English with high confidence",
        "b_de": "Die Einführung einer anlasslosen Speicherung von Passagierdaten für alle Flüge aus einem Nicht-EU-Staat in die EU und umgekehrt ist näher gerückt. Der Ausschuss des EU-Parlaments für bürgerliche Freiheiten, Justiz und Inneres (LIBE) hat heute mit knapper Mehrheit für einen entsprechenden Richtlinien-Entwurf der EU-Kommission gestimmt. Bürgerrechtler, Grüne und Linke halten die geplante Richtlinie für eine andere Form der anlasslosen Vorratsdatenspeicherung, die alle Flugreisenden zu Verdächtigen mache.",
        "_version_": 1508494723734569000
      }
----------------

I expected “a_de” to be “a_en”, and the “language” multi-valued field to have 
“en” and “de”.

Here is my configuration in solrconfig.xml:

--------------------
    <updateRequestProcessorChain name="langid" default="true">
        <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
            <lst name="defaults">
                <str name="langid">true</str>
                <str name="langid.fl">a,b</str>
                <str name="langid.map">true</str>
                <str name="langid.map.individual">true</str>
                <str name="langid.langField">language</str>
                <str name="langid.map.lcmap">af:uns,ar:uns,bg:uns,bn:uns,cs:uns,da:uns,el:uns,et:uns,fa:uns,fi:uns,gu:uns,he:uns,hi:uns,hr:uns,hu:uns,id:uns,ja:uns,kn:uns,ko:uns,lt:uns,lv:uns,mk:uns,ml:uns,mr:uns,ne:uns,nl:uns,no:uns,pa:uns,pl:uns,ro:uns,ru:uns,sk:uns,sl:uns,so:uns,sq:uns,sv:uns,sw:uns,ta:uns,te:uns,th:uns,tl:uns,tr:uns,uk:uns,ur:uns,vi:uns,zh-cn:uns,zh-tw:uns</str>
                <str name="langid.fallback">en</str>
            </lst>
        </processor>
        <processor class="solr.LogUpdateProcessorFactory" />
        <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
--------------------
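
On the expectation that a multi-valued field would list both “en” and “de”: as far as the Solr reference guide describes it, langid.langField only receives the single document-level language, while the languages found per field with langid.map.individual are collected in a separate field named by langid.langsField. A minimal sketch of the extra parameter follows. It assumes a multi-valued string field called “languages” exists in the schema; that field name is hypothetical and not part of the configuration above. It would not by itself change which language is detected for each field.

```xml
<!-- Sketch only: parameters inside the <lst name="defaults"> of the langid processor. -->
<!-- Single document-level language, as in the configuration above. -->
<str name="langid.langField">language</str>
<!-- Assumption: "languages" is a hypothetical multi-valued field defined in the schema. -->
<str name="langid.langsField">languages</str>
```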


The debug output of lang detect, during indexing, is as follows:

-------------------
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language detected de with certainty 0.9999964723182276
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Detected main document language from fields [a, b]: de
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field a
DEBUG - 2015-08-03 14:37:54.451; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field b
DEBUG - 2015-08-03 14:37:54.453; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language detected de with certainty 0.9999964723182276
DEBUG - 2015-08-03 14:37:54.453; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field a using individually detected language de
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Doing mapping from a with language de to field a_de
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.454; org.eclipse.jetty.webapp.WebAppClassLoader; loaded class org.apache.solr.common.SolrInputField from WebAppClassLoader=525571@80503
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Removing old field a
DEBUG - 2015-08-03 14:37:54.455; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field a
DEBUG - 2015-08-03 14:37:54.455; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field b
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language detected de with certainty 0.9999980402022373
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field b using individually detected language de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Doing mapping from b with language de to field b_de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Removing old field b
-------------

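From this, my takeaway is that every time the 
LangDetectLanguageIdentifierUpdateProcessor is asked to detect the language, it 
uses the content of field a AND field b, even when it is mapping a single field 
(note the "Appending field a" and "Appending field b" lines before each 
detection).  But I can’t quite tell from this output.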
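
One way to check that suspicion might be to give each field its own processor instance, so that langid.fl names a single field and there is nothing else to concatenate. The sketch below is untested and uses only documented langid parameters. The chain name “langid-per-field” and the fields “language_a” and “language_b” are hypothetical, and the two fields would have to exist in the schema. Separate language fields are used because, as far as I understand it, with the default langid.overwrite=false a processor skips detection when its langField already holds a value.

```xml
<updateRequestProcessorChain name="langid-per-field">
    <!-- Detects and maps field "a" only. -->
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
        <lst name="defaults">
            <str name="langid.fl">a</str>
            <str name="langid.map">true</str>
            <!-- Hypothetical field; must be defined in the schema. -->
            <str name="langid.langField">language_a</str>
            <str name="langid.fallback">en</str>
        </lst>
    </processor>
    <!-- Detects and maps field "b" only. -->
    <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
        <lst name="defaults">
            <str name="langid.fl">b</str>
            <str name="langid.map">true</str>
            <!-- Hypothetical field; must be defined in the schema. -->
            <str name="langid.langField">language_b</str>
            <str name="langid.fallback">en</str>
        </lst>
    </processor>
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

If “a” then maps to “a_en” and “b” to “b_de”, that would suggest that detection in the single-processor setup runs over the combined content of both fields.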

Any insight appreciated.

Regards,

David
