Peter van Dijk wrote:
> I can reproduce this ‘hang’ (actually it’s a very busy read() loop). A
> simpler ‘echo 1,2,3 > test2.csv’ does work, so this is just exposing a bug
> in the CSV implementation.
You're right --- it looks like ext/misc/csv.c is broken for one-column CSV
tables.  I can reproduce the problem with a minimal test.csv of "1\n", and I
can prevent it with a minimal test.csv of "1,\n".  I guess that's a separate
upstream bug! :-)

If I do "select * from t1" instead of "select count(*) from t1", I can see it
printing blank lines forever, which makes the problem much more obvious.

It handles a 256MB-ish 3-column CSV quite well, though; it can do simple
queries in 4s to 7s.

$ seq 33333333 | paste -sd ,,\\n >test.csv

$ command time sqlite3 <<< $'.load ./csv.so\nCREATE VIRTUAL TABLE temp.t1 USING csv(filename="test.csv"); SELECT count(*) FROM t1;'
11111111
3.59user 0.12system 0:03.71elapsed 99%CPU (0avgtext+0avgdata 3484maxresident)k
0inputs+0outputs (0major+272minor)pagefaults 0swaps

$ command time sqlite3 <<< $'.load ./csv.so\nCREATE VIRTUAL TABLE temp.t1 USING csv(filename="test.csv", schema="CREATE TABLE xs(x INT, y INT, z INT)"); SELECT count(*) FROM t1;'
11111111
3.77user 0.08system 0:03.85elapsed 99%CPU (0avgtext+0avgdata 3432maxresident)k
0inputs+0outputs (0major+271minor)pagefaults 0swaps

$ command time sqlite3 <<< $'.load ./csv.so\nCREATE VIRTUAL TABLE temp.t1 USING csv(filename="test.csv"); SELECT count(*) FROM t1 WHERE c1 = 1;'
4.82user 0.08system 0:04.90elapsed 99%CPU (0avgtext+0avgdata 3428maxresident)k
0inputs+0outputs (0major+272minor)pagefaults 0swaps

$ command time sqlite3 <<< $'.load ./csv.so\nCREATE VIRTUAL TABLE temp.t1 USING csv(filename="test.csv"); SELECT count(*) FROM t1 WHERE c1 BETWEEN 111111 AND 111113;'
73
6.47user 0.12system 0:06.59elapsed 100%CPU (0avgtext+0avgdata 3396maxresident)k
0inputs+0outputs (0major+273minor)pagefaults 0swaps

By comparison, the same test CSV takes 16s to import, 8s to export, and 0s to
1s to query (with further optimization available).

$ command time sqlite3 test.db <<< $'CREATE TABLE xs(x INT, y INT, z INT); \n.mode csv \n.import test.csv xs'
15.92user 0.20system 0:16.13elapsed 99%CPU (0avgtext+0avgdata 5296maxresident)k
0inputs+0outputs (0major+686minor)pagefaults 0swaps

$ command time sqlite3 test.db -csv 'SELECT * FROM xs;' >test2.csv
7.98user 0.27system 0:08.25elapsed 99%CPU (0avgtext+0avgdata 5028maxresident)k
0inputs+0outputs (0major+674minor)pagefaults 0swaps

$ cmp test.csv test2.csv

$ ls -hl test.csv test.db
-rw-r--r-- 1 root root 276M May 31 00:55 test.csv
-rw-r--r-- 1 root root 236M May 31 01:02 test.db

$ command time sqlite3 test.db <<< $'SELECT count(*) FROM xs;'
11111111
0.08user 0.09system 0:00.18elapsed 100%CPU (0avgtext+0avgdata 5040maxresident)k
0inputs+0outputs (0major+677minor)pagefaults 0swaps

$ command time sqlite3 test.db <<< $'SELECT count(*) FROM xs WHERE x = 1;'
1
0.63user 0.06system 0:00.70elapsed 100%CPU (0avgtext+0avgdata 5056maxresident)k
0inputs+0outputs (0major+676minor)pagefaults 0swaps

$ command time sqlite3 test.db <<< $'SELECT count(*) FROM xs WHERE x BETWEEN 111111 AND 111113;'
1
0.70user 0.07system 0:00.78elapsed 99%CPU (0avgtext+0avgdata 5108maxresident)k
0inputs+0outputs (0major+677minor)pagefaults 0swaps

So if your table's canonical version is stuck in CSV format (for legacy
reasons), importing it pays off iff you're going to be running many queries
before the CSV changes and you have to re-import it.
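To spell out the "further optimization available" above (a sketch I haven't
timed; the index name xs_x_idx is my own), once the data is imported into
test.db an ordinary index turns the BETWEEN query into an index range scan
instead of a full table scan:

$ sqlite3 test.db <<< 'CREATE INDEX xs_x_idx ON xs(x);'
$ command time sqlite3 test.db <<< 'SELECT count(*) FROM xs WHERE x BETWEEN 111111 AND 111113;'

Nothing like that is available for the csv virtual table, which as far as I
can tell always has to scan the whole file.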
PS: without slashes, ".load csv" searches the dynamic linker's library path
rather than PWD:

$ strace -e trace=file sqlite3 <<< '.load csv' |& cut -d\" -f2 | grep csv.so
/lib/x86_64-linux-gnu/csv.so
/usr/lib/x86_64-linux-gnu/csv.so
/lib/csv.so
/usr/lib/csv.so

$ strace -e trace=file sqlite3 <<< '.load test/csv' |& cut -d\" -f2 | grep csv.so
test/csv.so

So extensions like these can be installed as, say,
/lib/x86_64-linux-gnu/libsqlite3-ext-misc-csv.so, but not as
/lib/x86_64-linux-gnu/libsqlite3/ext/misc/csv.so.  That sqliteodbc package I
mentioned uses /lib/x86_64-linux-gnu/libsqlite3_mod_csvtable-0.9995.so.

László, based on the sqliteodbc version numbers, I think "it hasn't even
gotten a version 1.0 yet" is a deliberate versioning scheme by the author,
not a reflection on its code quality: http://www.ch-werner.de/sqliteodbc/
The latest release is from Feb 2018 and speaks of SQLite 3.22.0; the previous
release (the one in Debian) is from Dec 2016 and speaks of SQLite 3.15.2.

The sqliteodbc csvtable module has no documentation, but seems to work
roughly the same way as ext/misc/csv.c, except that all the arguments are
positional instead of key=value.

$ seq 10000000 | paste -sd ,\\n >test.csv

$ sqlite3 <<< $'.load ./csv\nCREATE VIRTUAL TABLE t USING csv(filename="test.csv", schema="CREATE TABLE t(x INT, y INT)");SELECT sum(x) FROM t WHERE x BETWEEN 10 AND 100;'
2475

$ sqlite3 <<< $'.load libsqlite3_mod_csvtable\nCREATE VIRTUAL TABLE t USING csvtable("test.csv");\n.header on\nSELECT * FROM t LIMIT 3;'
column_1
1.2
3.4
5.6

It seems to treat the comma as a decimal point (as in LC_NUMERIC=de_DE,
where decimal_point=",") rather than as a field separator --- whoops!
Adding more columns fixes THAT, at least.

$ seq 10000000 | paste -sd ,,,\\n >test.csv

$ sqlite3 <<< $'.load libsqlite3_mod_csvtable\nCREATE VIRTUAL TABLE t USING csvtable("test.csv");\n.header on\nSELECT * FROM t LIMIT 3;'
column_1|column_2|column_3|column_4
1|2|3|4
5|6|7|8
9|10|11|12
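PPS: to be explicit about the .load point above, a path (relative or
absolute) sidesteps the library-path search entirely, so both of these also
work (the absolute path is the one the Debian sqliteodbc package installs;
not re-tested here):

$ sqlite3 <<< '.load ./csv'
$ sqlite3 <<< '.load /lib/x86_64-linux-gnu/libsqlite3_mod_csvtable-0.9995.so'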