How to display complicated Chinese character: Biang.

2022-06-02 Thread jian he
Inspired by this thread:
https://www.postgresql.org/message-id/011f01d8757e%24f5d69700%24e183c500%24%40ndensan.co.jp
I am trying to display some special Chinese characters in PostgreSQL. For now
I am using PostgreSQL 15 beta1; the OS is Ubuntu 20.

localhost:5433 admin@test=# show LC_COLLATE;
+------------+
| lc_collate |
+------------+
| C.UTF-8    |
+------------+

localhost:5433 admin@test=# select icu_unicode_version();
+---------------------+
| icu_unicode_version |
+---------------------+
| 13.0                |
+---------------------+

icu_unicode_version() is provided by the icu_ext extension.

Wikipedia article about the character Biang: https://en.wikipedia.org/wiki/Biangbiang_noodles

quote:

> The character's traditional and simplified forms were added to Unicode
> version 13.0 in March 2020 in the CJK Unified Ideographs Extension G
> block of the newly allocated Tertiary Ideographic Plane.[19] The
> corresponding Unicode characters are:

Unicode character info: https://www.compart.com/en/unicode/U+30EDD

query

with strings(s) as (
  values (U&'\+0030EDD')
)
select s,
  octet_length(s),
  char_length(s),
  (select count(*) from icu_character_boundaries(s,'en')) as graphemes
from strings;

return

+-----+--------------+-------------+-----------+
|  s  | octet_length | char_length | graphemes |
+-----+--------------+-------------+-----------+
| ロD |            4 |           2 |         2 |
+-----+--------------+-------------+-----------+

This seems wrong; shouldn't graphemes be 1?
And I am not sure that U&'\+0030EDD' is the same as 𰻝.
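
One way to inspect what the literal actually contains (a sketch; in a
UTF-8 database, ascii() returns the Unicode code point):

-- Split the literal into characters and show each code point.
SELECT c, to_hex(ascii(c)) AS codepoint
FROM regexp_split_to_table(U&'\+0030EDD', '') AS t(c);
-- Returns ロ (30ed) and D (44), i.e. two separate characters.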





-- 
 I recommend David Deutsch's <>

  Jian


Re: How to display complicated Chinese character: Biang.

2022-06-02 Thread Laurenz Albe
On Thu, 2022-06-02 at 12:45 +0530, jian he wrote:
> I am trying to display some special Chinese characters in PostgreSQL.
> 
> localhost:5433 admin@test=# show LC_COLLATE;
> +------------+
> | lc_collate |
> +------------+
> | C.UTF-8    |
> +------------+
> 
> > with strings(s) as (
> >   values (U&'\+0030EDD')
> > )
> > select s,
> >   octet_length(s),
> >   char_length(s),
> >   (select count(*) from icu_character_boundaries(s,'en')) as graphemes
> > from strings;
> 
> +-----+--------------+-------------+-----------+
> |  s  | octet_length | char_length | graphemes |
> +-----+--------------+-------------+-----------+
> | ロD |            4 |           2 |         2 |
> +-----+--------------+-------------+-----------+
> 
> This seems wrong; shouldn't graphemes be 1?

You have an extra "0" there; "\+" unicode escapes have exactly 6 digits:

WITH strings(s) AS (
   VALUES (U&'\+030EDD')
)
select s,
   octet_length(s),
   char_length(s) 
from strings;

 s  │ octet_length │ char_length 
────┼──────────────┼─────────────
 𰻝 │            4 │           1
(1 row)
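
As a cross-check (a sketch, assuming a UTF-8 database encoding), chr()
should produce the same character, since it maps a Unicode code point to
its character and 0x30EDD is 200413 in decimal:

-- chr() takes a Unicode code point in a UTF-8 database.
SELECT chr(200413) = U&'\+030EDD' AS same_char;  -- true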

PostgreSQL doesn't have a function "icu_character_boundaries".

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com




Re: Extended multivariate statistics are ignored (potentially related to high null fraction, not sure)

2022-06-02 Thread Danny Shemesh
Hey Bruce, thanks for the prompt reply!

It reproduces on 12.11 and 13.7. We use a managed offering that has yet
to include PostgreSQL 14, so sadly I can't try expressional extended
statistics yet, nor can I attach gdb to the process to debug the flow.

If any other directions come to mind, I'd really appreciate hearing them.
Thanks again!
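
For reference, a minimal sketch of the kind of extended statistics I
mean (table and column names are made up):

-- dependencies, ndistinct and mcv are the multivariate
-- statistics kinds available in pg 12.
CREATE STATISTICS corr_stats (dependencies, ndistinct, mcv)
    ON col_a, col_b FROM big_table;
ANALYZE big_table;

-- Check that the statistics object exists:
SELECT stxname, stxkind FROM pg_statistic_ext;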

On Wed, Jun 1, 2022 at 11:08 PM Bruce Momjian  wrote:

> On Wed, Jun  1, 2022 at 07:28:58PM +0300, Danny Shemesh wrote:
> > Hey everyone,
> >
> > I'm working on improving gathered statistics on several large tables
> (2TB /
> > 500M records);
> > I've created extended stats on correlated columns, ran analyze, compared
> the
> > pre and post explains, and it seems as though the extended statistics are
> > ignored -
> > the estimation doesn't change much, and I can accurately derive the est.
> rows
> > from multiplying the single-column freqs from pg_stats with the supplied
> > values.
> >
> >
> > Context:
> > - pg 12.8
>
> You should be on the most recent version of 12.X, and I do see some
> extended statistics change in the later releases you are missing.
>
> --
>   Bruce Momjian  https://momjian.us
>   EDB  https://enterprisedb.com
>
>   Indecision is a decision.  Inaction is an action.  Mark Batterson
>
>


Accessing composite type elements

2022-06-02 Thread Garfield Lewis
Hi All,

I’m not sure if this is the right place for this, but I guess I can start
here. I’ve created a composite type like this:

CREATE TYPE myxml AS ( encoding_ int4, xml_ xml );

In my client-side C code I am using PQgetvalue to pull the data from the
row/column. However, since it is a composite type, it is unclear to me how
to get the individual members from the data. I have googled, but probably
for the wrong terms, because I cannot find any examples of this being done
anywhere. A hex dump of the data gives me this:


0x0002001700046f01008e01433c637573746f6d6572696e666f20786d6c6e733d22687474703a2f2f6c7a6c6162732e637573742e636f6d22204369643d22543130303130303031223e3c6e616d653e5461626c6520584d4c…

I can tell which portion is my encoding_ value and which section is the
xml_ data.

My best guess right now is:

  * 0x2 is the number of members
  * 0x4 and 0x143 are the lengths of the individual members
  * 0x17 and 0x8e are the OIDs for the member types

Is this the proper layout of these composite types? Can I go ahead and use this 
without possibly having it broken in the future? Are there any existing 
supported APIs that I can use instead to get this information?
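
One alternative I'm considering, sketched with a made-up table, is to
expand the composite on the server so each member arrives as its own
column:

-- Made-up table using the composite type from above.
CREATE TABLE docs (id int, val myxml);

-- (composite).field syntax returns the members as separate
-- columns, so PQgetvalue sees a plain int4 and a plain xml
-- value instead of one record blob.
SELECT (val).encoding_, (val).xml_ FROM docs;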

Regards
G


Re: Accessing composite type elements

2022-06-02 Thread Tom Lane
Garfield Lewis  writes:
> In my client-side C code I am using PQgetvalue to pull in the data from the 
> row/column. However, since it is a composite type, it is unclear to me how to 
> get the individual members from the data.

Binary representations are not too well documented :-(.  However,
looking at record_send() helps here.

> My best guess right now is:
>   *   0x2 is the number of members
>   *   0x4 and 0x143 are the lengths of the individual members
>   *   0x17 and 0x8e are the OID for the member type

Right, with the additional proviso that -1 "length" indicates a null
field value.
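
Concretely, record_send() emits a 4-byte field count, then for each
field a 4-byte type OID, a 4-byte length (-1 for NULL), and the raw
field data.  Your guessed OIDs check out against pg_type:

-- 0x17 = 23 and 0x8e = 142.
SELECT oid, typname FROM pg_type WHERE oid IN (23, 142);
--  oid | typname
-- -----+---------
--   23 | int4
--  142 | xml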

regards, tom lane




Re: [EXT] Re: Accessing composite type elements

2022-06-02 Thread Garfield Lewis
> Binary representations are not too well documented :-(.  However,
> looking at record_send() helps here.

will have a look…

> Right, with the additional proviso that -1 "length" indicates a null
> field value.

Thx, Tom… never thought of the null field…

--
Regards,
Garfield A. Lewis



Re: unoptimized nested loops

2022-06-02 Thread Jeff Janes
On Thu, Jun 2, 2022 at 12:32 AM Tom Lane  wrote:

> Jeff Janes  writes:
> > On Tue, May 31, 2022 at 4:04 PM Tim Kelly 
> wrote:
> >> I do not see evidence that the nested loop is trying to reduce overhead
> >> by using the smaller set.  It seems to want to scan on data first either
> >> way.
>
> > The planner probably doesn't know which one is smaller.
>
> There is not a lot of daylight between the cost estimates for
> "a nestloop-join b" and "b nestloop-join a", if we're considering
> plain seqscans on both tables and all else is equal.  It tends to
> come down to factors like which one is more densely populated.
>
> As best I can tell, the issue Tim's unhappy about is not so
> much the use of a nestloop as the lack of use of any index.
>

But it is using an index on one of the tables, on "id".  There is no reason
it would not be able to reverse that, doing the seq scan on the smaller (in
assumed bytes) table and using the id index on the larger (in bytes) table,
to great effect.

Based on the timing of the simple counts on "data" with and without the
WHERE, I'm pretty sure that what is going on here is that data.content is
large and resides mostly in TOAST.  When TOAST needs to be accessed it is
much slower than when it doesn't.  And that is also the cause of the
estimation problem, the oversized values are just assumed to be distinct,
and no histogram is generated. Without histogram boundaries to serve as a
randomish sample, the selectivity estimate falls back to something even
worse (pattern length), and gets it massively wrong.


> But "string like '%foo%'" is not at all optimizable with a
> btree index.  You might be able to get somewhere with a
> pg_trgm GIN or GIST index.
>

I agree with the recommendation, but not really the reasoning.  Having the
pg_trgm index on metadata.author might be even better than just getting
the planner to do the right thing without the index, but getting the
planner to do the right thing even without the index would also be a big
improvement over the current plan, if there were just a way to do it.  If
the planner explicitly accounted for TOAST cost, that would probably do
it.  Or if the selectivity estimates on data.content were better, that
would too.  But Tim can't reasonably do anything about those things, while
he can build the index.
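
A sketch of that recommendation, using the column named above (the
index name is made up):

-- A trigram GIN index lets author LIKE '%foo%' use an index
-- scan, so only the matching metadata rows drive the join.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX metadata_author_trgm
    ON metadata USING gin (author gin_trgm_ops);
ANALYZE metadata;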

Another thing he could try would be to force the correct index use by
keeping the inner join, but writing the join condition as
"on data.id = metadata.id || ''".
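
A hypothetical sketch of that trick (the query shape and pattern are
guesses, and it assumes id is of a text type so || applies): the
concatenation turns the metadata side into an expression, so the planner
can no longer probe an index on metadata.id and instead drives the
nestloop from metadata, probing data through its index on id.

SELECT d.content
FROM metadata m
JOIN data d ON d.id = m.id || ''
WHERE m.author LIKE '%foo%';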

Cheers,

Jeff


Re: [EXT] Re: Accessing composite type elements

2022-06-02 Thread Merlin Moncure
On Thu, Jun 2, 2022 at 12:05 PM Garfield Lewis 
wrote:

> > Binary representations are not too well documented :-(.  However,
> > looking at record_send() helps here.
>
> will have a look…
>
>
> > Right, with the additional proviso that -1 "length" indicates a null
> > field value.
>
> Thx, Tom… never thought the null field…
>


Take a look at libpqtypes. It's a client-side extension library for libpq
that implements the binary protocol.

https://github.com/pgagarinov/libpqtypes

merlin
