On Mon, 3 Jul 2023 08:29:11 +0700 Hairy Pixels via fpc-pascal <[email protected]> wrote:
> > On Jul 2, 2023, at 11:16 PM, Jer Haan <[email protected]> wrote:
> >
> > This table is copied from Wikipedia.<uencoding.pas>Hope it’s useful
> > for you. If you improve the code pls let me know.
>
> This is perfect, thanks! Much more complicated than I thought.
>
> I'm curious now, if you were going the other direction and parsing a
> string of different unicode characters with different code point
> sequence lengths how would you know which length it was? For example
> I started off knowing which unicode scalar to use by looking at a table
> but what if I had to find the character in a stream of text?
>
> I think UTF8 can have 1-4 byte characters so you could encounter 1
> byte character followed by 4 byte characters interleaved and there's
> no header or terminator for each character. How is this solved?

There is a header byte: the first byte of every code point encodes the
length of its sequence.

How you decode it depends on whether you want to check for invalid
UTF-8 sequences.

From LazUTF8:

function UTF8CodepointSizeFast(p: PChar): integer;
begin
  case p^ of
    #0..#191   : Result := 1;
    #192..#223 : Result := 2;
    #224..#239 : Result := 3;
    #240..#247 : Result := 4;
    else Result := 1; // An optimization + prevents compiler warning about uninitialized Result.
  end;
end;

Mattias
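To make the "header byte" idea concrete, here is a minimal sketch of walking
a byte buffer one code point at a time. CodepointSizeStrict, the Sample
buffer and the program name are hypothetical names made up for this
illustration; they are not part of LazUTF8. The strict variant only checks
the lead byte, the remaining buffer length and the 10xxxxxx pattern of the
continuation bytes. It does not reject every ill-formed sequence (for
example overlong 3- and 4-byte forms or encoded surrogates), so treat it as
a starting point rather than a full validator.

```pascal
program Utf8Walk;
{$mode objfpc}{$H+}

const
  // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1F600 (4 bytes)
  Sample: array[0..9] of Byte =
    ($61, $C3, $A9, $E2, $82, $AC, $F0, $9F, $98, $80);

// Hypothetical strict variant of the lead-byte lookup quoted above.
// Returns 0 if p^ cannot start a sequence, if the sequence would run past
// pEnd, or if a following byte is not a continuation byte (10xxxxxx).
function CodepointSizeStrict(p, pEnd: PChar): Integer;
var
  i: Integer;
begin
  case p^ of
    #0..#127   : Result := 1;  // ASCII
    #194..#223 : Result := 2;  // #192 and #193 would only give overlong forms
    #224..#239 : Result := 3;
    #240..#244 : Result := 4;  // #245 and above would exceed U+10FFFF
  else
    Exit(0);                   // stray continuation byte or invalid lead byte
  end;
  if pEnd - p < Result then
    Exit(0);                   // truncated sequence
  for i := 1 to Result - 1 do
    if (Ord(p[i]) and $C0) <> $80 then
      Exit(0);                 // expected a continuation byte
end;

var
  pStart, p, pEnd: PChar;
  n: Integer;
begin
  pStart := PChar(@Sample[0]);
  p := pStart;
  pEnd := pStart + Length(Sample);
  while p < pEnd do
  begin
    n := CodepointSizeStrict(p, pEnd);
    if n = 0 then
    begin
      WriteLn('offset ', p - pStart, ': invalid byte');
      n := 1;                  // resynchronise by skipping a single byte
    end
    else
      WriteLn('offset ', p - pStart, ': ', n, ' byte(s)');
    Inc(p, n);
  end;
end.
```

With the sample buffer this should report sequences of 1, 2, 3 and 4 bytes
at offsets 0, 1, 3 and 6. If you already know the input is valid UTF-8, the
cheaper lookup from LazUTF8 quoted above can drive the same loop.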
