Matt Hov created SOLR-15041:
-------------------------------

             Summary: CSV update handler can't handle line breaks/new lines 
together with field split/separators for multivalued fields
                 Key: SOLR-15041
                 URL: https://issues.apache.org/jira/browse/SOLR-15041
             Project: Solr
          Issue Type: Bug
      Security Level: Public (Default Security Level. Issues are Public)
          Components: update
    Affects Versions: 8.4
         Environment: Ubuntu 20.04 8 CPU 60GB+ ram
            Reporter: Matt Hov


I've been using the /update/csv handler to bulk import large amounts of data 
with great success, but I believe I've found a corner case in the CSV parsing 
when a multi-valued string field contains a new-line character.

As soon as you specify 
{{f.[fieldname].split=true&f.[fieldname].separator=[something]}}, the 
multi-value split parsing stops at the first line break.

My managed schema:
{code:java}
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true" />
<fieldType name="strings" class="solr.StrField" sortMissingLast="true" multiValued="true" docValues="true" />
<dynamicField name="*_str" type="string" indexed="true" stored="false" />
<dynamicField name="*_strs" type="strings" indexed="true" stored="false" />
{code}
Example POST URL (I'm using ! as the split character for test1_strs and test2_strs):
{code:java}
http://[myserver]/solr/[mycore]/update/csv?commitWithin=1000&f.test1_strs.split=true&f.test1_strs.separator=!&f.test2_strs.split=true&f.test2_strs.separator=!{code}
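For anyone reproducing this from a script, the request URL above could be assembled roughly as follows. This is only a sketch: the helper name {{build_csv_update_url}}, the host {{http://localhost:8983}} and the core name {{mycore}} are placeholders I made up for illustration; only the query parameters come from the URL above. Note that {{urlencode}} percent-encodes the ! separator as %21.

```python
from urllib.parse import urlencode


def build_csv_update_url(base, core, split_fields, separator="!", commit_within=1000):
    """Build a /update/csv URL that enables per-field splitting.

    `base` and `core` are placeholders; the parameter names mirror the
    example URL (commitWithin, f.<field>.split, f.<field>.separator).
    """
    params = [("commitWithin", str(commit_within))]
    for field in split_fields:
        params.append((f"f.{field}.split", "true"))
        params.append((f"f.{field}.separator", separator))
    # urlencode percent-encodes reserved characters such as "!"
    return f"{base}/solr/{core}/update/csv?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical host and core name, for illustration only.
    print(build_csv_update_url("http://localhost:8983", "mycore",
                               ["test1_strs", "test2_strs"]))
```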
CSV content (note that the new-lines are included but enclosed in double 
quotes; these new-lines need to be preserved as is):
{code:java}
id,title,test1_strs,test2_strs,test3_str
csv_test,title,"first line
with break!second line","first line!second_line","a line
break"
{code}
Resulting Solr Doc:
{code:java}
{
        "id":"csv_test",
        "title":"title",
        "_version_":1685718010076069888,
        "test1_strs":["first line "], 
        "test2_strs":["first line", "second_line"],
        "test3_str":"a line\r\nbreak"}]
  }
{code}
Note that in the single-valued {{test3_str}} the new-line is correctly preserved 
as \r\n (or just \n when the request is sent from code rather than manually).

{{test2_strs}} shows that the multi-value split on ! worked correctly.

{{test1_strs}} stops processing at the first value's new-line instead of at 
the actual separator that follows it.

Expected values should look like:
{code:java}
{
        "id":"csv_test",
        "title":"title",
        "_version_":1685718010076069888,
        "test1_strs":["first line\r\nwith break", "second line"], 
        "test2_strs":["first line", "second_line"],
        "test3_str":"a line\r\nbreak"}]
  }
{code}
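To make the expected result concrete, here is a small reference sketch in Python. It does not use Solr at all: it parses the same CSV with Python's standard {{csv}} module and then splits the multi-valued columns on !. The function name {{parse_with_split}} is my own, and the CSV is written with \r\n line endings as an assumption that matches the \r\n seen in {{test3_str}}. It only illustrates the order of operations I would expect (resolve the quoted field first, then split on the separator), not how Solr's handler is implemented.

```python
import csv
import io

# Same CSV as above, with \r\n line endings (an assumption for this sketch).
CSV_TEXT = (
    "id,title,test1_strs,test2_strs,test3_str\r\n"
    'csv_test,title,"first line\r\nwith break!second line",'
    '"first line!second_line","a line\r\nbreak"\r\n'
)


def parse_with_split(text, split_fields, separator="!"):
    """Parse CSV text, then split the named fields on `separator`.

    Quoted new-lines are resolved by the CSV reader before the split,
    so they survive inside the individual values.
    """
    reader = csv.DictReader(io.StringIO(text, newline=""))
    docs = []
    for row in reader:
        doc = {}
        for name, value in row.items():
            doc[name] = value.split(separator) if name in split_fields else value
        docs.append(doc)
    return docs


if __name__ == "__main__":
    for doc in parse_with_split(CSV_TEXT, {"test1_strs", "test2_strs"}):
        print(doc)
```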
 
I've tried pre-escaping line breaks, but all that gives me is the escaped 
new-line stored in Solr, which would then need to be post-processed on the 
consuming end to turn it back into \r\n (or \n), and that would be nontrivial 
to do.  Solr handles \n just fine in all other cases, so I consider preserving 
it here to be the expected behavior.

 

 

 


