Hi Mark,
On 27.06.2016 at 15:11, Mark Thomas wrote:
I believe I have an explanation for what is going on that fits both the
reported behaviour and the proposed fix.
Background
==========
OpenSSL tracks a list of the most recent errors for each thread in a
hash map keyed on the thread (int_thread_hash in err.c). Reading and
writing to this hash map is protected by a lock. The hash map is created
and populated lazily.
tc-native calls ERR_clear_error() before every call to
SSL_do_handshake(), SSL_read() and SSL_write(). The call to
ERR_clear_error() either clears the error list for the current thread or
inserts a new empty list into the hash map if the thread is not already
present.
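For illustration, a minimal sketch of that call pattern (the function
name is mine, not the literal tc-native code):

#include <openssl/err.h>
#include <openssl/ssl.h>

/* Sketch: every SSL I/O call is preceded by ERR_clear_error(). The
 * first such call on a given thread inserts an empty error list into
 * int_thread_hash, which needs the hash map's write lock. */
static int read_with_clear_errors(SSL *ssl, char *buf, int len)
{
    ERR_clear_error();
    return SSL_read(ssl, buf, len);
}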
The performance problem was tracked down to threads waiting in
ERR_clear_error() to obtain the write lock for the hash map.
The proposed solution was to call ERR_remove_thread_state() just before
the current Tomcat thread processing the connection is returned to the
thread pool. This method removes the current thread and its associated
error list from the hash map.
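In code form, the proposed fix amounts to something like this sketch
(the hook point is hypothetical; ERR_remove_thread_state(NULL) acts on
the calling thread):

#include <openssl/err.h>

/* Sketch: run just before the worker thread goes back to the pool.
 * NULL means "the current thread"; the call takes the write lock and
 * removes this thread's entry from int_thread_hash. */
static void before_thread_returns_to_pool(void)
{
    ERR_remove_thread_state(NULL);
}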
Analysis
========
The proposed solution, calling ERR_remove_thread_state(), adds a call
that also obtains the write lock for the hash map. This indicates that
the problem is not delays in obtaining the lock but contention for the
lock because one or more operations taking place within the lock are
taking a long time.
Removing unused threads from the hash map removes the bottleneck. This
points towards the hash map being the source of the problem.
Testing by the OP showed that as soon as a test requiring ~400
concurrent threads had been run, performance dropped significantly. It
did not get noticeably worse when the same 400-thread test was run repeatedly.
My testing indicated, on OSX at least, that the thread IDs used in the
hash map were stable and that uncontrolled growth of the hash map was
unlikely to be the cause.
The manner in which thread IDs are generated varies by platform. On
Linux, where this problem was observed, the thread ID is derived from
(is normally equal to) the memory address of the per thread errno
variable. This means that thread IDs tend to be concentrated in a
relatively narrow range of values. For example, in a simple 10 thread
test on OSX thread IDs ranged from 123145344839680 to 123145354387455.
Thread IDs therefore fall within a ~10^7 range out of a possible range
of 1.8x10^19, i.e. a very small, contiguous range.
Hash maps use hashing functions to ensure that entries are (roughly)
evenly distributed between the available buckets. The hash function,
err_state_hash, used for the thread IDs in OpenSSL is threadID * 13.
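A toy computation illustrates the problem. Per-thread errno addresses
tend to be spaced by a strongly aligned stride (the 8 MB below is an
assumed per-thread spacing, and 16 is lhash's MIN_NODES bucket count;
lhash's real bucket selection is linear hashing, but it still reduces
the hash with a modulo, so the effect is the same). Multiplying by 13,
an odd number, preserves the shared low-order zero bits, so every ID
lands in the same bucket:

#include <stdio.h>

int main(void)
{
    unsigned long base     = 123145344839680UL; /* low end of the OSX range above */
    unsigned long stride   = 1UL << 23;         /* assumed 8 MB per-thread spacing */
    unsigned long nbuckets = 16;                /* lhash MIN_NODES */
    int t;

    for (t = 0; t < 10; t++) {
        unsigned long id = base + (unsigned long)t * stride;
        /* err_state_hash: threadID * 13; bucket = hash mod table size */
        printf("thread %2d: bucket %lu\n", t, (id * 13) % nbuckets);
    }
    return 0; /* prints bucket 0 ten times */
}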
Supposition
===========
The hash function used (multiply by 13) is insufficient to distribute
the resulting values across multiple buckets because they will still
fall in a relatively narrow band. Therefore all the threads end up in a
single bucket which makes the performance of the hash map poor. This in
turn makes calls to thread_get_item() slow because it does a hash map
lookup. This lookup is performed with the read lock held for the hash
map which in turn will slow down the calls that require the write lock.
Proposal
========
The analysis and supposition above need to be checked by someone with a
better understanding of C than me. Assuming my work is correct, the next
step is to look at possible fixes. I do not believe that patching
OpenSSL is a viable option.
The OpenSSL API needs to be reviewed to see if there is a way to avoid
the calls that require the write lock.
If the write lock cannot be avoided then we need to see if there is a
better place to call ERR_remove_thread_state(). I'd like to fix this
entirely in tc-native but that may mean calling
ERR_remove_thread_state() more frequently which could create its own
performance problems.
Nate - I may have some patches for you to test in the next few days.
Mark
Great analysis. I was really wondering what could make the hash map so
huge and hadn't thought about the hash function as the problem.
Before OpenSSL 1.1.0 there's a callback for applications to provide
their own thread IDs:
https://www.openssl.org/docs/man1.0.2/crypto/CRYPTO_THREADID_set_callback.html
So we could probably work around the problem of the poor hashing
function by passing in IDs that work for hashing (pre-hashed ID?). Of
course then we lose the direct association of the OpenSSL thread ID
with the real platform thread ID.
Currently our callback in tcnative is ssl_set_thread_id() which refers
to ssl_thread_id(), which on Linux gets the ID from the APR function
apr_os_thread_current(). So we could add some hashing formula in
ssl_thread_id().
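A sketch of that idea, keeping the callback structure described above
but feeding OpenSSL a mixed ID. The mixer is MurmurHash3's 64-bit
finalizer, an illustrative pick rather than anything tc-native does
today, and it assumes unsigned long is 64 bits and that the APR thread
handle casts cleanly to an integer:

#include <apr_portable.h>
#include <openssl/crypto.h>

/* Illustrative "hashing formula": MurmurHash3 64-bit finalizer. */
static unsigned long mix64(unsigned long x)
{
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdUL;
    x ^= x >> 33;
    x *= 0xc4ceb9fe1a85ec53UL;
    x ^= x >> 33;
    return x;
}

static unsigned long ssl_thread_id(void)
{
    /* On Linux apr_os_thread_current() yields a pthread_t that fits
     * in an unsigned long (an assumption for other platforms). */
    return mix64((unsigned long)apr_os_thread_current());
}

/* Registered elsewhere via CRYPTO_THREADID_set_callback(). */
static void ssl_set_thread_id(CRYPTO_THREADID *id)
{
    CRYPTO_THREADID_set_numeric(id, ssl_thread_id());
}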
In OpenSSL 1.1.0 this thread ID callback is gone. We'll see whether the
hash map locking problem is gone too ...
I think we could add some logging to the hash impl to verify the
assumption that many thread IDs end up in the same hash bucket. The
lhash impl already seems to maintain quite a few statistics:
https://www.openssl.org/docs/man1.0.2/crypto/lh_stats.html
The source code shows the following counters in OpenSSL lhash.c:
ret->num_nodes = MIN_NODES / 2;
ret->num_alloc_nodes = MIN_NODES;
ret->num_items = 0;
ret->num_expands = 0;
ret->num_expand_reallocs = 0;
ret->num_contracts = 0;
ret->num_contract_reallocs = 0;
ret->num_hash_calls = 0;
ret->num_comp_calls = 0;
ret->num_insert = 0;
ret->num_replace = 0;
ret->num_delete = 0;
ret->num_no_delete = 0;
ret->num_retrieve = 0;
ret->num_retrieve_miss = 0;
ret->num_hash_comps = 0;
For example: add a global atomic counter to a special OpenSSL build
that counts calls to ERR_clear_error(), and whenever it reaches a
multiple of 100 (or some other value) dump the statistics.
Alternatively one could use the debugger and try finding the error state
lhash and its structure members.
WDYT?
Regards,
Rainer