This got long, but I structured it so one can stop reading when one gets
bored or mired in minutiae and still answer constructively based on
whatever was read so far.

So feel free to stop reading after only a couple paragraphs and fire off a
response :-)

"+++" marks good stopping-points for this cumulative post/response deal.

I'll take any other random suggestions for how to drastically improve
full-text index/search performance, of course!  I examined the PostgreSQL
contrib/ directory solutions -- fulltextindex didn't help, and the other
option required software/processes we're not able to maintain at this
time.

THANKS!!!

----------------------------------------
Under what real-world conditions does a read-only replication of a DB on a
RAM disk make sense?
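
(For what it's worth, the rough procedure I had in mind -- completely
untested, with made-up paths, sizes, database names, and port numbers --
is a tmpfs mount with a second postmaster pointed at it:

mount -t tmpfs -o size=512m tmpfs /mnt/pgram       # made-up size and mount point
initdb -D /mnt/pgram/data                          # as the postgres user
pg_ctl -D /mnt/pgram/data -o "-p 5433" start       # second postmaster on its own port
createdb -p 5433 magazine_ram
pg_dump magazine | psql -p 5433 magazine_ram       # load the read-only copy

so corrections to that plan are as welcome as answers to the question.)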

Any quick tips for what NOT to do if I just want to run a quick crude
benchmark trial sort of thing using PostgreSQL and PHP?
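
(The kind of harness I mean is nothing fancier than microtime() around
pg_query() -- the connection parameters, table, and query below are all
invented placeholders:

<?php
// Crude query-timing harness, PHP 4 style (microtime() returns "msec sec").
function stopwatch() {
    list($usec, $sec) = explode(" ", microtime());
    return ((float)$usec + (float)$sec);
}

$conn = pg_connect("dbname=magazine user=search");   // made-up dbname/user
$sql  = "SELECT article_id FROM word_position WHERE word_id = 123";  // stand-in query

$start   = stopwatch();
$result  = pg_query($conn, $sql);
$elapsed = stopwatch() - $start;

echo pg_num_rows($result) . " rows in " . $elapsed . " seconds\n";
pg_close($conn);
?>

-- so tips on what would make numbers from something that crude misleading
are exactly what I'm after.)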

+++

A Google for "PostgreSQL RAMdisk" turned up only one brief article, and it
seemed geared toward a very small dataset for a hobbyist-type site where
performance was more about principle than real-world need.  (Though it was
a VERY nice article, with concise directions and good warnings about the
"what ifs".)

Perhaps I should have Googled for "MySQL RAMdisk" and hoped the parallels
were accurate?  I'll give it a shot, but am leery of invalid information
from that cross-db assumption.  Any comment on the likelihood of
non-PostgreSQL info being applicable?

Mailing List archives turned up some interesting discussions, but were
generally not conclusive.

+++

Or, to make this easier to answer (if somewhat less generally useful in
the archive), do you predict it would significantly boost performance (and
why or why not) for:

Intel Pentium III, 1.4 GHz
RAM 1G
PostgreSQL 7.3
    currently *NEEDS* shared_buffers = 256M, has 512M as safety
    also set sort_mem = 16384 - not sure it helped/hurt
    also tried to turn query/performance logging on,
    but never did find the darn log files to view anything useful
    (Help! -- the settings I mean are listed after the disk figures below)
PHP 4.2.3
The database itself is sucking down 2.6 Gig of hard drive space:
du -hc .
3.6M    ./base/1
3.6M    ./base/16975
4.0k    ./base/16976/pgsql_tmp
2.3G    ./base/16976
4.0k    ./base/3087869/pgsql_tmp
120M    ./base/3087869
2.5G    ./base
140k    ./global
113M    ./pg_xlog
720k    ./pg_clog
2.6G    .
2.6G    total
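
(For the logging question above: the settings I was poking at in
postgresql.conf are, as far as I can tell from the 7.3 docs, along these
lines --

log_statement = true        # log each query
log_duration = true         # log how long each query took
log_timestamp = true
syslog = 2                  # 0 = plain stderr, 2 = everything to syslog

-- and my understanding, which may well be wrong, is that unless syslog is
used the output just goes to the postmaster's stderr, i.e. wherever
"pg_ctl -l /some/logfile start" pointed it.  If that's not right, that is
probably exactly my problem.)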

+++

Obviously I cannot currently replicate that whole thing in RAM.

However, a large chunk (1 Gig-ish) of that is a 12 million row table of
(int4+int4+int4) records -- 96 bits, i.e. 12 bytes, of data each -- that I
was hoping would allow a self-join on word position (a concordance) faster
than running a RegEx across 18000 articles of 0 to 270K in length.
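
(To make that concrete, the self-join I have in mind is roughly the
following -- table and column names are invented for the example, and
"NEAR" is taken to mean "within 10 words":

SELECT DISTINCT p1.article_id
  FROM word_position p1, word_position p2
 WHERE p1.article_id = p2.article_id
   AND p1.word_id = 123                        -- stand-in id for word "a"
   AND p2.word_id = 456                        -- stand-in id for word "b"
   AND abs(p1.position - p2.position) <= 10;

I'm assuming an index on at least word_id for that to have any hope.)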

Another large chunk (750 Meg-ish) is a 10 million row table of
(int4+int4+smallint) records -- 80 bits, i.e. 10 bytes, of data each --
which is currently being used (and probably must remain) to cross-index
the dictionary with each article and each word's frequency in that
article.
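
(The queries against it are basically of this shape -- again, names are
invented for illustration:

SELECT c.article_id, c.frequency
  FROM dictionary d, article_word c            -- invented table names
 WHERE d.word = 'foo'
   AND c.word_id = d.word_id
 ORDER BY c.frequency DESC;

i.e. "which articles contain 'foo', most-frequent first".)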

This 750 Meg table brought queries down from somewhere between two minutes
and an unknown amount of time to the 10-second magnitude.  I doubt a RAM
disk is going to be enough of a boon to beat that, right?

I believe that before I added these tables, we were looking at ~200 Meg of
data, which makes a little more sense on a RAM disk.

+++

An answer that involves paying for more RAM and making it 40X as fast
would be fine, of course.

So if I trade the 1.8 Gig concordance[s] for a RAM disk, will the RAM disk
give 4X to ??X performance improvement to "make up" for the concordance
gains I've already achieved?

+++

Big Picture:
It's a custom search engine of texts from over 50 years' worth of a
magazine using some specific meta-data to provide searching.

Full-text index searching has been the biggest challenge.  No surprise
there -- But I had hoped for more definitive solutions "out there" in
algorithms and such-like.  There were off-the-shelf projects (eg
ht://dig), but they didn't "fit in" with some of the meta-data and design
constraints.

I've managed to create a concordance of article/word/frequency that has
sped up most queries from minutes to ~10 seconds.

Alas, the support for "a NEAR b" is at risk, because the RegEx on 18000
rows of text from 0 to 270K takes too long.  (We can drop NEAR if we have
to, maybe, but would rather not.)

Only now am I realizing I probably underestimated that 12 million number
-- it is probably closer to 20 million or even 40 million.  Hard to say
for sure until the actual text is indexed, or at least until I run a
random sampling.

I don't even have real data in that table yet -- I just used semi-random
numbers to populate it for benchmarking, to see what sort of gains I might
get.  (Answer:  4X is great, but it's not enough.)
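
(For "semi-random", read something along these lines -- a stripped-down
sketch with invented table name and ranges, not the actual load script:

<?php
// Fill a hypothetical word_position table with semi-random test rows.
$conn = pg_connect("dbname=magazine user=search");   // made-up connection params
for ($i = 0; $i < 12000000; $i++) {
    $article  = mt_rand(1, 18000);     // roughly the number of articles
    $word     = mt_rand(1, 100000);    // made-up dictionary size
    $position = mt_rand(0, 50000);     // word offset within the article
    pg_query($conn, "INSERT INTO word_position VALUES ($article, $word, $position)");
}
pg_close($conn);
?>

Row-by-row INSERTs like that are slow, of course (COPY is the faster
route), but the point is just the shape of the data.)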

And the concordance with full positional information (e.g. word "foo"
appears in article 57 at offset 42) requires a self-join over that
[under-]estimated 12-million-row table.

A sample table of that magnitude clocks in at 30 seconds just for two
words "a NEAR b" without even trying (yet) to correlate those with the
actual articles via the dictionary/concordance lookup.  And while 30
seconds sure beats two minutes, it's not good enough, particularly when I
haven't even factored in the actual article lookup in that crude
benchmark.

I think I've just about exhausted the query-munging solutions and am now
casting farther afield (EG this RAM disk idea) for solutions.

As a long-time Open Source advocate, I am hoping to publish some of my
experience, pending (A) successful resolution and (B) client->client
approval.  They've been pretty good on that sort of thing so far, but you
never know.  I'll post here if that works out.

+++

Excruciating hardware detail:

cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 11
model name      : Intel(R) Pentium(R) III CPU family      1400MHz
stepping        : 1
cpu MHz         : 1406.020
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 mmx fxsr sse
bogomips        : 2804.94

cat /proc/meminfo
        total:    used:    free:  shared: buffers:  cached:
Mem:  921235456 916316160  4919296  4612096 87429120 706134016
Swap: 2097143808   786432 2096357376
MemTotal:       899644 kB
MemFree:          4804 kB
MemShared:        4504 kB
Buffers:         85380 kB
Cached:         688816 kB
SwapCached:        768 kB
Active:         319436 kB
Inact_dirty:     33096 kB
Inact_clean:    426936 kB
Inact_target:   229376 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       899644 kB
LowFree:          4804 kB
SwapTotal:     2047992 kB
SwapFree:      2047224 kB

free -m
             total       used       free     shared    buffers     cached
Mem:           878        872          5          4         83        670
-/+ buffers/cache:        119        759
Swap:         1999          0       1999

If you've read this far, you deserve a Cookie or something! :-)

THANK YOU!!!


