Bug#964901: UTF-8 BOM can be moved while editing

Benno Schulenberg Sun, 12 Jul 2020 04:34:15 -0700

Hello Nils,

On 11-07-2020 at 21:56, Nils König wrote:
> when editing a UTF8 file in nano that contains a BOM (efbbbf) and inserting a 
> character at the beginning, the BOM bytes will move after the inserted 
> character. This can lead to breakages when such a file is being parsed by a 
> program


Ideally, a UTF-8 file should not contain a Byte Order Mark.  What if
I concatenate several files together?  Then the result might contain
BOMs embedded in the text.
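
To illustrate the concatenation case, here is a minimal sketch in C that
counts BOM byte sequences appearing anywhere other than at offset 0 of a
buffer.  The function name count_embedded_boms and the constant UTF8_BOM
are hypothetical, chosen for this illustration only; they are not part
of nano.

```c
#include <stddef.h>
#include <string.h>

/* The three bytes of U+FEFF encoded as UTF-8 (illustrative constant). */
static const char UTF8_BOM[] = "\xEF\xBB\xBF";

/* Count BOM byte sequences in buf that do NOT start at offset 0.
 * Hypothetical helper for illustration; not part of nano. */
size_t count_embedded_boms(const char *buf, size_t len)
{
    size_t count = 0;

    /* Start at 1 so that a leading BOM is not counted as embedded. */
    for (size_t i = 1; i + 3 <= len; i++)
        if (memcmp(buf + i, UTF8_BOM, 3) == 0)
            count++;

    return count;
}
```

For two BOM-prefixed files joined back to back, the second file's BOM
ends up in the middle of the result, so the function reports one
embedded BOM.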

As far as I know, a BOM is only a problem with files that come from
Windows or Google.  I do not know of any tool on Unix that adds a BOM
to a UTF-8 file.

> a BOM should, if at all present, only occur at the very beginning 
> of the file.

Ideally, yes.  But as shown above, if a file contains a BOM, the BOM
is bound to appear in other places too.  And the Unicode standard
does not forbid the BOM from occurring elsewhere -- in that case
it should be treated as a Zero Width No-Break Space (U+FEFF).

> Ideally nano should detect the presence of BOM and not have it be 
> editable/moveable.

I could mitigate the problem by placing the cursor after the BOM
when a file is opened.  (See attached patch.)  But you can still
delete the BOM with <Backspace>, or put the cursor on it with
<Left> or <Home>.  For nano, all characters are just a group of
bytes that can be added, deleted, restored, searched, and saved.
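
The idea of starting the cursor after a leading BOM can be sketched as a
small standalone helper.  This is only an illustration under stated
assumptions: the name initial_cursor_x is hypothetical, and the helper
takes the first line as a plain NUL-terminated string rather than using
nano's own buffer structures.  It does a prefix comparison, so it also
matches when text follows the BOM on the first line.

```c
#include <stddef.h>
#include <string.h>

/* Return the byte offset at which the cursor should start on the
 * first line: 3 when the line begins with a UTF-8 BOM, otherwise 0.
 * Hypothetical helper for illustration; not part of nano. */
size_t initial_cursor_x(const char *first_line)
{
    /* strncmp stops at a NUL, so lines shorter than 3 bytes are safe. */
    if (strncmp(first_line, "\xEF\xBB\xBF", 3) == 0)
        return 3;

    return 0;
}
```

This only changes where the cursor starts; the BOM remains ordinary
buffer content that can still be reached and deleted.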

If I made the BOM uneditable and unmovable, people could
no longer use nano to get rid of a BOM in a file.

  https://bugs.launchpad.net/ubuntu/+source/nano/+bug/1045062

Benno

diff --git a/src/files.c b/src/files.c
index 04476c44..aad58b78 100644
--- a/src/files.c
+++ b/src/files.c
@@ -459,7 +459,10 @@ bool open_buffer(const char *filename, bool new_one)
 		openfile->lock_filename = thelocksname;
 #endif
 		openfile->current = openfile->filetop;
-		openfile->current_x = 0;
+		if (strncmp(openfile->filetop->data, "\xEF\xBB\xBF", 3) == 0)
+			openfile->current_x = 3;
+		else
+			openfile->current_x = 0;
 		openfile->placewewant = 0;
 	}

