Hello,

Inline.

At 22:09 on 14/04/2022, Kristjan Kure wrote:
Thank you, Rui. Not sure I got everything right, but here it is:

current_loc <- Sys.getlocale("LC_COLLATE")
#> [1] "Estonian_Estonia.1257"

"A" < "a"
#41 < 61
#> [1] FALSE
raw_A <- charToRaw("A") #41
raw_a <- charToRaw("a") #61
# Not OK - should be TRUE (41 is less than 61)

"A" > "a"
#41 > 61
#> [1] TRUE
raw_A <- charToRaw("A") #41
raw_a <- charToRaw("a") #61
# Not OK - should be FALSE (41 is not bigger than 61)

Sys.setlocale("LC_COLLATE", locale = "C")

"A" < "a"
#41 < 61
#> [1] TRUE
raw_A <- charToRaw("A") #41
raw_a <- charToRaw("a") #61

# OK - (41 is less than 61)

"A" > "a"
#41 > 61
#> [1] FALSE
raw_A <- charToRaw("A") #41
raw_a <- charToRaw("a") #61

# OK - (41 is not bigger than 61)

Sys.setlocale("LC_COLLATE", current_loc)

Conclusion: Comparing strings using charToRaw() only works correctly with locale = C?

You are still confusing the locale with the ASCII codes (the raw values).
Windows code pages like 1252 or your 1257 are supersets of ASCII, and the ASCII hex codes follow a sensible pattern. Upper case and lower case letters are 2^5 == 32 == 0x20 apart, so setting bit 5 (counting from 0 at the right) takes you from upper to lower case:

"A": 0100 0001 == 0x41
"a": 0110 0001 == 0x61

"B": 0100 0010
"b": 0110 0010

etc.
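
The bit pattern above can be checked directly in R. This is only a minimal sketch using base R functions (utf8ToInt, intToUtf8, bitwXor, bitwOr, bitwAnd, bitwNot); the variable names are made up for illustration and the trick applies to the plain ASCII letters only:

```r
# ASCII codes of "A" and "a" as integers
code_A <- utf8ToInt("A")   # 65, i.e. 0x41
code_a <- utf8ToInt("a")   # 97, i.e. 0x61

# The two codes differ only in the bit worth 2^5 == 32 == 0x20
case_bit <- bitwXor(code_A, code_a)

# Setting that bit maps upper to lower case; clearing it maps back
lower_from_upper <- intToUtf8(bitwOr(code_A, case_bit))
upper_from_lower <- intToUtf8(bitwAnd(code_a, bitwNot(case_bit)))
```

None of this depends on LC_COLLATE; it only manipulates the numeric codes.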

This only relates to human alphabets and languages because it's an attempt to make an electronic code usable to transmit, record and retrieve text in a human-readable way. But each language's lexicographic order need not follow this encoding's order, even if the encoding is what is used to record the text electronically. In the examples below you'll see that changing the locale does not change the numeric codes.

Comparing strings using charToRaw() only works correctly if what you want is to compare codes, not letters (in the sense of human writing).


old_loc <- Sys.getlocale("LC_COLLATE")

# hexadecimal base integers
raw_A <- charToRaw("A") # 0x41
raw_a <- charToRaw("a") # 0x61

raw_A < raw_a
#> [1] TRUE
raw_A > raw_a
#> [1] FALSE

as.integer(raw_A)
#> [1] 65
as.integer(raw_a)
#> [1] 97

Sys.setlocale("LC_COLLATE", locale = "C")
#> [1] "C"

(C_raw_A <- charToRaw("A")) # 0x41
#> [1] 41
(C_raw_a <- charToRaw("a")) # 0x61
#> [1] 61
C_raw_A < C_raw_a
#> [1] TRUE
C_raw_A > C_raw_a
#> [1] FALSE

identical(raw_A, C_raw_A)
#> [1] TRUE
identical(raw_a, C_raw_a)
#> [1] TRUE

Sys.setlocale("LC_COLLATE", old_loc)
#> [1] "Portuguese_Portugal.1252"
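
If what you want is an ordering by code regardless of the locale, here is a small sketch of two ways to get it without touching LC_COLLATE. It assumes plain ASCII strings; the vector x and the other names are only for illustration. As far as I know, ?sort documents that method = "radix" always collates character vectors as in the "C" locale.

```r
x <- c("b", "A", "a", "B")

# Radix sort collates as in the C locale, whatever LC_COLLATE is set to,
# so upper case letters come before lower case ones
by_code <- sort(x, method = "radix")

# Comparing the integer codes directly answers "is A before a?" by code,
# not by the language's alphabetical order
A_before_a <- utf8ToInt("A") < utf8ToInt("a")
```

A plain sort(x) or "A" < "a" still follows the current LC_COLLATE setting, which is why the results differ between your locale and "C".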


Hope this helps,

Rui Barradas


Regards,
Kristjan

On Thu, Apr 14, 2022 at 10:01 PM Rui Barradas <ruipbarra...@sapo.pt> wrote:

    Hello,

    1) The best I could find on lower case/upper case is [1];
    The Wikipedia page you link to is about a code page and the collating
    sequence is the same as ASCII so no, that's not it.

    2) In the cp1252 table "A" < "a", which follows the numeric order 0x41 <
    0x61. But what R is using is the locale's LC_COLLATE setting, not the "C"
    one.

    How to validate the end results? The best way is to check the current
    setting, with Sys.getlocale.



    [1] https://books.google.pt/books?id=GkajBQAAQBAJ&pg=PA259&lpg=PA259&dq=collating+sequence+portuguese&source=bl&ots=fVnUYHz0ev&sig=ACfU3U3xjpJfPNcWEfvwb_2nScYb89CeOw&hl=pt-PT&sa=X&ved=2ahUKEwiAoNTW-JP3AhVI1xoKHXT-C4oQ6AF6BAgUEAM#v=onepage&q=collating%20sequence%20portuguese&f=false


    Hope this helps,

    Rui Barradas

    At 16:33 on 14/04/2022, Kristjan Kure wrote:
     > Hi Rui
     >
     > Thank you for the code snippet.
     >
     > 1) How do you find your "Portuguese_Portugal.1252" symbols table now?
     > Is it this https://en.wikipedia.org/wiki/Windows-1252?
     >
     > 2) What attributes and values do you check to validate the end result?
     > I see there is a section "Codepage layout" and I can find "A" and "a"
     > symbols.
     >
     > What values on that table tell you "A" is bigger than "a"?
     > "A" < "a" # returns FALSE
     > "A" > "a" # returns TRUE
     >
     > PS! My locale is Estonian_Estonia.1257
     >
     > Regards,
     > Kristjan
     >
     > On Thu, Apr 14, 2022 at 5:05 PM Rui Barradas <ruipbarra...@sapo.pt> wrote:
     >
     >     Hello,
     >
     >     This is a locale issue: you are counting on the ASCII table codes, but
     >     that's only valid for the "C" locale.
     >
     >     old_loc <- Sys.getlocale("LC_COLLATE")
     >
     >     "A" < "a"
     >     #> [1] FALSE
     >     "A" > "a"
     >     #> [1] TRUE
     >
     >     Sys.setlocale("LC_COLLATE", locale = "C")
     >     #> [1] "C"
     >
     >     "A" < "a"
     >     #> [1] TRUE
     >     "A" > "a"
     >     #> [1] FALSE
     >
     >     Sys.setlocale("LC_COLLATE", old_loc)
     >     #> [1] "Portuguese_Portugal.1252"
     >
     >
     >     Hope this helps,
     >
     >     Rui Barradas
     >
     >     At 15:06 on 13/04/2022, Kristjan Kure wrote:
     >      > Hi!
     >      >
     >      > Sorry, I am a beginner in R.
     >      >
     >      > I was not able to find answers to my questions (tried Google, Stack
     >      > Overflow, etc). Please correct me if anything is wrong here.
     >      >
     >      > When comparing symbols/strings in R - raw numeric values are compared
     >      > symbol by symbol starting from left? If raw numeric values are not
     >      > used is there an ASCII / Unicode table where symbols have
     >      > values/ranking/order and R compares those values?
     >      >
     >      > *2) Comparing symbols*
     >      > Letter "a" raw value is 61, letter "b" raw value is 62? Is this correct?
     >      >
     >      > # Raw value for "a" = 61
     >      > a_raw <- charToRaw("a")
     >      > a_raw
     >      >
     >      > # Raw value for "b" = 62
     >      > b_raw <- charToRaw("b")
     >      > b_raw
     >      >
     >      > # equals TRUE
     >      > "a" < "b"
     >      >
     >      > Ok, so 61 is less than 62 so it's TRUE. Is this correct?
     >      >
     >      > *3) Comparing strings #1*
     >      > "1040" <= "12000"
     >      >
     >      > raw_1040 <- charToRaw("1040")
     >      > raw_1040
     >      > #31 *30* (comparison happens with the second symbol) 34 30
     >      >
     >      > raw_12000 <- charToRaw("12000")
     >      > raw_12000
     >      > #31 *32* (comparison happens with the second symbol) 30 30 30
     >      >
     >      > The symbol in the second position is 30 and it's less than 32. Equals to
     >      > true. Is this correct?
     >      >
     >      > *4) Comparing strings #2*
     >      > "1040" <= "10000"
     >      >
     >      > raw_1040 <- charToRaw("1040")
     >      > raw_1040
     >      > #31 30 *34*  (comparison happens with third symbol) 30
     >      >
     >      > raw_10000 <- charToRaw("10000")
     >      > raw_10000
     >      > #31 30 *30*  (comparison happens with third symbol) 30 30
     >      >
     >      > The symbol in the third position is 34 is greater than 30. Equals to
     >      > false. Is this correct?
     >      >
     >      > *5) Problem - Why does this equal FALSE?*
     >      > *"A" < "a"*
     >      >
     >      > 41 < 61 # FALSE?
     >      >
     >      > # Raw value for "A" = 41
     >      > A_raw <- charToRaw("A")
     >      > A_raw
     >      >
     >      > # Raw value for "a" = 61
     >      > a_raw <- charToRaw("a")
     >      > a_raw
     >      >
     >      > Why is capitalized "A" not less than lowercase "a"? Based on raw values
     >      > it should be. What am I missing here?
     >      >
     >      > Thanks
     >      > Kristjan
     >      >
     >      >       [[alternative HTML version deleted]]
     >      >
     >      > ______________________________________________
     >      > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
     >      > https://stat.ethz.ch/mailman/listinfo/r-help
     >      > PLEASE do read the posting guide
     >      > http://www.R-project.org/posting-guide.html
     >      > and provide commented, minimal, self-contained, reproducible code.
     >


