Why the index is not used ?

2018-10-06 Thread ROS Didier
Hi
I would like to submit the following problem to the PostgreSQL community. In my 
company, we have data encryption needs.
So I decided to use the following procedure :


(1)Creating a table with a bytea type column to store the encrypted data
CREATE TABLE cartedecredit(card_id SERIAL PRIMARY KEY, username VARCHAR(100), 
cc bytea);



(2)inserting encrypted data
INSERT INTO cartedecredit(username,cc)  SELECT 'individu ' || x.id, 
pgp_sym_encrypt('test value ' || x.id, 'motdepasse','compress-algo=2, 
cipher-algo=aes256') FROM generate_series(1,10) AS x(id);


(3)Querying the table
SELECT pgp_sym_decrypt(cc, 'motdepasse') FROM cartedecredit WHERE 
pgp_sym_decrypt(cc, 'motdepasse')='test value 32';

pgp_sym_decrypt

-

test value 32

(1 row)



Time: 115735.035 ms (01:55.735)
-> the execution time is very long. So, I decide to create an index



(4)Creating an index on encrypted data
CREATE INDEX idx_cartedecredit_cc02 ON cartedecredit(cc);


(5)Querying the table again

SELECT pgp_sym_decrypt(cc, 'motdepasse') FROM cartedecredit WHERE 
pgp_sym_decrypt(cc, 'motdepasse')='test value 32';
pgp_sym_decrypt

-

test value 32

(1 row)



Time: 118558.485 ms (01:58.558) -> almost 2 minutes !!
postgres=# explain analyze SELECT pgp_sym_decrypt(cc, 'motdepasse') FROM 
cartedecredit WHERE pgp_sym_decrypt(cc, 'motdepasse')='test value 32';

  QUERY PLAN

--

Seq Scan on cartedecredit  (cost=0.00..3647.25 rows=500 width=32) (actual 
time=60711.787..102920.509 rows=1 loops=1)

   Filter: (pgp_sym_decrypt(cc, 'motdepasse'::text) = 'test value 32'::text)

   Rows Removed by Filter: 9

Planning time: 0.112 ms

Execution time: 102920.585 ms

(5 rows)



? the index is not used in the execution plan. maybe because of the use of a 
function in the WHERE clause. I decide to modify the SQL query


(6)Querying the table
SELECT pgp_sym_decrypt(cc, 'motdepasse') FROM cartedecredit WHERE 
cc=pgp_sym_encrypt('test value 32', 'motdepasse');
pgp_sym_decrypt

-

(0 rows)



Time: 52659.571 ms (00:52.660)


? The execution time is very long and I get no result (!?)

QUERY PLAN

---

Seq Scan on cartedecredit  (cost=0.00..3646.00 rows=1 width=32) (actual 
time=61219.989..61219.989 rows=0 loops=1)

   Filter: (cc = pgp_sym_encrypt('test value 32'::text, 'motdepasse'::text))

   Rows Removed by Filter: 10

Planning time: 0.157 ms

Execution time: 61220.035 ms

(5 rows)



? My index is not used.

QUESTIONS :
-  why I get no result ?

-why the index is not used?

Thanks in advance

Best Regards
Didier



[cid:image002.png@01D14E0E.8515EB90]


Didier ROS
Expertise SGBD
DS IT/IT DMA/Solutions Groupe EDF/Expertise Applicative - SGBD







Ce message et toutes les pièces jointes (ci-après le 'Message') sont établis à 
l'intention exclusive des destinataires et les informations qui y figurent sont 
strictement confidentielles. Toute utilisation de ce Message non conforme à sa 
destination, toute diffusion ou toute publication totale ou partielle, est 
interdite sauf autorisation expresse.

Si vous n'êtes pas le destinataire de ce Message, il vous est interdit de le 
copier, de le faire suivre, de le divulguer ou d'en utiliser tout ou partie. Si 
vous avez reçu ce Message par erreur, merci de le supprimer de votre système, 
ainsi que toutes ses copies, et de n'en garder aucune trace sur quelque support 
que ce soit. Nous vous remercions également d'en avertir immédiatement 
l'expéditeur par retour du message.

Il est impossible de garantir que les communications par messagerie 
électronique arrivent en temps utile, sont sécurisées ou dénuées de toute 
erreur ou virus.


This message and any attachments (the 'Message') are intended solely for the 
addressees. The information contained in this Message is confidential. Any use 
of information contained in this Message not in accord with its purpose, any 
dissemination or disclosure, either whole or partial, is prohibited except 
formal approval.

If you are not the addressee, you may not copy, forward, disclose or use any 
part of it. If you have received this message in error, please delete it and 
all copies from your system and notify the sender immediately by return message.

E-mail communication cannot be guaranteed to be timely secure, error or 
virus-free.


Re: Why the index is not used ?

2018-10-06 Thread Pavel Stehule
so 6. 10. 2018 v 11:57 odesílatel ROS Didier  napsal:

> Hi
>
> I would like to submit the following problem to the PostgreSQL community.
> In my company, we have data encryption needs.
> So I decided to use the following procedure :
>
>
>
> (1)Creating a table with a bytea type column to store the encrypted
> data
> CREATE TABLE cartedecredit(card_id SERIAL PRIMARY KEY, username
> VARCHAR(100), cc bytea);
>
>
>
> (2)inserting encrypted data
> INSERT INTO cartedecredit(username,cc)  SELECT 'individu ' || x.id,
> pgp_sym_encrypt('test value ' || x.id, 'motdepasse','compress-algo=2,
> cipher-algo=aes256') FROM generate_series(1,10) AS x(id);
>
>
>
> (3)Querying the table
> SELECT pgp_sym_decrypt(cc, 'motdepasse') FROM cartedecredit WHERE
> pgp_sym_decrypt(cc, 'motdepasse')='test value 32';
>
> pgp_sym_decrypt
>
> -
>
> test value 32
>
> (1 row)
>
>
>
> Time: 115735.035 ms (01:55.735)
> -> the execution time is very long. So, I decide to create an index
>
>
>
> (4)Creating an index on encrypted data
> CREATE INDEX idx_cartedecredit_cc02 ON cartedecredit(cc);
>

this index cannot to help.

but functional index can cartedecredit(pgp_sym_decrypt(cc, 'motdepasse').
Unfortunately index file will be decrypted in this case.

CREATE INDEX ON


>
>
> (5)Querying the table again
>
> SELECT pgp_sym_decrypt(cc, 'motdepasse') FROM cartedecredit WHERE
> pgp_sym_decrypt(cc, 'motdepasse')='test value 32';
> pgp_sym_decrypt
>
> -
>
> test value 32
>
> (1 row)
>
>
>
> Time: 118558.485 ms (01:58.558) -> almost 2 minutes !!
> postgres=# explain analyze SELECT pgp_sym_decrypt(cc, 'motdepasse') FROM
> cartedecredit WHERE pgp_sym_decrypt(cc, 'motdepasse')='test value 32';
>
>   QUERY PLAN
>
>
> --
>
> Seq Scan on cartedecredit  (cost=0.00..3647.25 rows=500 width=32) (actual
> time=60711.787..102920.509 rows=1 loops=1)
>
>Filter: (pgp_sym_decrypt(cc, 'motdepasse'::text) = 'test value
> 32'::text)
>
>Rows Removed by Filter: 9
>
> Planning time: 0.112 ms
>
> Execution time: 102920.585 ms
>
> (5 rows)
>
>
>
> è the index is not used in the execution plan. maybe because of the use
> of a function in the WHERE clause. I decide to modify the SQL query
>
>
>
> (6)Querying the table
> SELECT pgp_sym_decrypt(cc, 'motdepasse') FROM cartedecredit WHERE 
> *cc*=pgp_sym_encrypt('test
> value 32', 'motdepasse');
>

it is strange - this should to use index, when there is usual index over cc
column.

What is result of explain analyze when you penalize seq scan by

set enable_seqscan to off



> pgp_sym_decrypt
>
> -
>
> (0 rows)
>
>
>
> Time: 52659.571 ms (00:52.660)
>
> è The execution time is very long and I get no result (!?)
>
> QUERY PLAN
>
>
> ---
>
> Seq Scan on cartedecredit  (cost=0.00..3646.00 rows=1 width=32) (actual
> time=61219.989..61219.989 rows=0 loops=1)
>
>Filter: (cc = pgp_sym_encrypt('test value 32'::text,
> 'motdepasse'::text))
>
>Rows Removed by Filter: 10
>
> Planning time: 0.157 ms
>
> Execution time: 61220.035 ms
>
> (5 rows)
>
>
>
> è My index is not used.
>
>
> QUESTIONS :
> -  why I get no result ?
>
> -why the index is not used?
>
> Thanks in advance
>
>
>
> Best Regards
> Didier
>
>
>
>
>
> [image: cid:image002.png@01D14E0E.8515EB90]
>
>
> * Didier ROS*
> * Expertise SGBD*
>
>
> *DS IT/IT DMA/Solutions Groupe EDF/Expertise Applicative - SGBD *
>
>
>
>
>
>
> Ce message et toutes les pièces jointes (ci-après le 'Message') sont
> établis à l'intention exclusive des destinataires et les informations qui y
> figurent sont strictement confidentielles. Toute utilisation de ce Message
> non conforme à sa destination, toute diffusion ou toute publication totale
> ou partielle, est interdite sauf autorisation expresse.
>
> Si vous n'êtes pas le destinataire de ce Message, il vous est interdit de
> le copier, de le faire suivre, de le divulguer ou d'en utiliser tout ou
> partie. Si vous avez reçu ce Message par erreur, merci de le supprimer de
> votre système, ainsi que toutes ses copies, et de n'en garder aucune trace
> sur quelque support que ce soit. Nous vous remercions également d'en
> avertir immédiatement l'expéditeur par retour du message.
>
> Il est impossible de garantir que les communications par messagerie
> électronique arrivent en temps utile, sont sécurisées ou dénuées de toute
> erreur ou virus.
> 
>
> This message and any attachments (the 'Message') are intended solely for
> the addressees. The information contained in this Message is confidential.
> Any use of information contained in this Message not in accord with its
> pur

Re: Why the index is not used ?

2018-10-06 Thread Vladimir Ryabtsev
Hello Didier,

(3), (5) to find the match, you decrypt the whole table, apparently this
take quite a long time.
Index cannot help here because indexes work on exact match of type and
value, but you compare mapped value, not indexed. Functional index should
help, but like it was said, it against the idea of encrypted storage.

(6) I never used pgp_sym_encrypt() but I see that in INSERT INTO you
supplied additional parameter 'compress-algo=2, cipher-algo=aes256' while
in (6) you did not. Probably this is the reason.

In general matching indexed bytea column should use index, you can ensure
in this populating the column unencrypted and using 'test value 32'::bytea
for match.
In you case I believe pgp_sym_encrypt() is not marked as STABLE or
IMMUTABLE that's why it will be evaluated for each row (very inefficient)
and cannot use index. From documentation:

"Since an index scan will evaluate the comparison value only once, not once
at each row, it is not valid to use a VOLATILE function in an index scan
condition."
https://www.postgresql.org/docs/10/static/xfunc-volatility.html

If you cannot add STABLE/IMMUTABLE to pgp_sym_encrypt() (which apparently
should be there), you can encrypt searched value as a separate operation
and then search in the table using basic value match.

Vlad


Text-indexing UTF-8 bytea, convert_from() immutability, null bytes...

2018-10-06 Thread Phil Endecott
Dear Experts,

I have a table that contains bytea data which sometimes is UTF-8 text. 
When it is - and that is indicated by another column - I want to 
text-search index it.  Something like this:

db=> create index ix on tbl using gin (to_tsvector('english',body)) where 
is_utf8_text;

(Note my client encoding is UTF-8.)

That doesn't work because to_tsvector doesn't take bytea.  So I tried 
to_tsvector('english',body::text) ; that almost works, but CRLFs in the 
data become \015\012 (i.e. 6 characters) in the text and the tsvector 
contains things like '12hello'.

Next I tried to_tsvector('english',convert_from(body::text,'UTF-8')).  That 
doesn't work because convert_from is not immutable.  (This is 9.6; maybe that 
has changed.)  Is there a good reason for that?  Maybe because I might change 
the client encoding? As a hack I tried ALTER FUNCTION to make it immutable, 
and now I get:

ERROR:  invalid byte sequence for encoding "UTF8": 0x00

Hmm, as far as I'm aware 0x00 is fine in UTF-8; what's that mean?  But 
actually I'd be more than happy to ignore invalid UTF-8 here, since I'm 
only using it for text search; there may be some truly invalid UTF-8 
in the data.  Is there a "permissive" mode for charset conversion?

(That error also suggests that the convert_from is not optimising the 
conversion from UTF-8 to UTF-8 to a no-op.)

Anyway: given the problem of creating a text search index over bytea data 
that contains UTF-8 text, which may include oddities like null bytes, what 
would you do?

Thanks for any suggestions!

Phil.





Re: Why the index is not used ?

2018-10-06 Thread Paul McGarry
I haven’t looked up what pgp_sym_encrypt() does but assuming it does encryption 
the way you should be for credit card data then it will be using a random salt 
and the same input value won’t encrypt to the same output value so

WHERE cc=pgp_sym_encrypt('test value 32', 'motdepasse');

wouldn’t work because the value generated by the function when you are 
searching on isn’t the same value as when you stored it.



Paul

> On 6 Oct 2018, at 19:57, ROS Didier  wrote:
> 
> WHERE cc=pgp_sym_encrypt('test value 32', 'motdepasse');


Re: Text-indexing UTF-8 bytea, convert_from() immutability, null bytes...

2018-10-06 Thread Andrew Gierth
> "Phil" == Phil Endecott  writes:

 Phil> Next I tried
 Phil> to_tsvector('english',convert_from(body::text,'UTF-8')). That
 Phil> doesn't work because convert_from is not immutable. (This is 9.6;
 Phil> maybe that has changed.) Is there a good reason for that?

I would guess because conversions are controlled by the pg_conversion
table which can be modified by create/drop conversion.

 Phil> Maybe because I might change the client encoding?

No, because convert_from converts from the specified encoding to the
server_encoding, not the client_encoding, and the server_encoding can't
be changed except at db creation time.

 Phil> As a hack I tried ALTER FUNCTION to make it immutable,

A better approach is to wrap it in a function of your own which is
declared immutable, rather than hacking the catalogs:

create function from_utf8(bytea) returns text language plpgsql immutable
  as $$ begin return convert_from($1, 'UTF8'); end; $$;

 Phil> and now I get:

 Phil> ERROR:  invalid byte sequence for encoding "UTF8": 0x00

 Phil> Hmm, as far as I'm aware 0x00 is fine in UTF-8; what's that mean?

PG doesn't allow 0x00 in text values regardless of encoding.

 Phil> But actually I'd be more than happy to ignore invalid UTF-8 here,
 Phil> since I'm only using it for text search; there may be some truly
 Phil> invalid UTF-8 in the data. Is there a "permissive" mode for
 Phil> charset conversion?

Unfortunately not.

 Phil> (That error also suggests that the convert_from is not optimising
 Phil> the conversion from UTF-8 to UTF-8 to a no-op.)

Indeed not, because it must validate that the data really is UTF-8
before treating it as such.

 Phil> Anyway: given the problem of creating a text search index over
 Phil> bytea data that contains UTF-8 text, which may include oddities
 Phil> like null bytes, what would you do?

You can search for 0x00 in a bytea using position() or LIKE. What do you
want to do with values that contain null bytes? or values which you
think are supposed to be valid utf8 text but are not?

-- 
Andrew (irc:RhodiumToad)