nickva opened a new issue, #5166:
URL: https://github.com/apache/couchdb/issues/5166
Saw this in the logs:
```
exit:{{function_clause,[
{gb_trees,delete_1,[1395375226,nil],[{file,"gb_trees.erl"},{line,408}]},
{gb_trees,delete_1,2,[{file,"gb_trees.erl"},{line,412}]},
{gb_trees,delete_1,2,[{file,"gb_trees.erl"},{line,409}]},
{gb_trees,delete_1,2,[{file,"gb_trees.erl"},{line,412}]},
{gb_trees,delete,2,[{file,"gb_trees.erl"},{line,404}]},
{couch_lru,close_int,2,[{file,"src/couch_lru.erl"},{line,56}]},
{couch_server,maybe_close_lru_db,1,[{file,"src/couch_server.erl"},{line,455}]},
{couch_server,handle_call,3, [{file,"src/couch_server.erl"},{line,609}]}]},
{gen_server,call,[couch_server_10,{open,<<"shards/00000000-3fffffff ...
[{gen_server,call,3,[{file,"gen_server.erl"},{line,385}]},
{couch_server,open_int,2,[{file,"src/couch_server.erl"},{line,130}]},
{couch_server,open,2,[{file,"src/couch_server.erl"},{line,113}]},
{mem3_util,get_or_create_db_int,2,[{file,"src/mem3_util.erl"},{line,619}]},
{fabric_rpc,with_db,3,[{file,"src/fabric_rpc.erl"},{line,356}]},
{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,138}]}]
```
Our LRU has a bug in it. While we're traversing the tree we're also
updating/deleting entries. In fact, we're keeping two separate views of the tree
going: one in the `Iter`, created when we start `close_int`
https://github.com/apache/couchdb/blob/83658d06d12447b7d1abccc1a64b84e020899ba4/src/couch/src/couch_lru.erl#L39
and one in the `{Tree, _} = Cache` variable.
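To illustrate in isolation why two diverging views are fragile, here is a minimal sketch. It is not CouchDB code and not a reproduction of the failure above; the module name `stale_iter_demo` and the toy keys are made up for the example. It only shows that an iterator keeps walking the tree version it was created from, and that `gb_trees:delete/2` crashes with `function_clause` when the key is no longer in the current tree, which matches the shape of the top frames in the trace.

```erlang
-module(stale_iter_demo).
-export([run/0]).

%% Hypothetical demo, not CouchDB code: an iterator taken from an old
%% version of a gb_tree can hand out keys the current tree no longer has.
run() ->
    Tree0 = gb_trees:from_orddict([{1, a}, {2, b}, {3, c}]),
    Iter0 = gb_trees:iterator(Tree0),
    %% Advance the iterator past key 1.
    {1, a, Iter1} = gb_trees:next(Iter0),
    %% Meanwhile the "current" tree drops key 2.
    Tree1 = gb_trees:delete(2, Tree0),
    %% The iterator still walks the old snapshot and yields key 2.
    {2, b, _Iter2} = gb_trees:next(Iter1),
    %% gb_trees:delete/2 assumes the key is present, so this call fails.
    Strict =
        try gb_trees:delete(2, Tree1) of
            _ -> ok
        catch
            error:function_clause -> crashed
        end,
    %% gb_trees:delete_any/2 tolerates a missing key.
    Lenient = gb_trees:size(gb_trees:delete_any(2, Tree1)),
    {Strict, Lenient}.
```

Calling `stale_iter_demo:run()` returns `{crashed, 2}`. In the trace the error shows up as an `exit` because it propagated through `gen_server:call`.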
When we iterate over the tree, we do one of three things:
1. We try to lock the entry. If that returns `false` (the entry is not found
or is locked), we delete the entry from the cache and restart the iteration
from the beginning.
2. If we find and lock the entry, and it's idle, we evict it and return.
That stops the iteration: `gb_trees:delete(Lru, Tree)...`
3. If we find and lock the entry, and it's not idle, we delete/re-insert
it and continue iterating.
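The three cases can be sketched as a simplified model. This is not the code in `couch_lru.erl`: the module name `lru_walk_sketch`, the `Probe` fun (standing in for the lock and idle checks) and the integer stamps are assumptions made for the example.

```erlang
-module(lru_walk_sketch).
-export([close/2, demo/0]).

%% Hypothetical model: Tree maps an LRU stamp to a name. Probe(Name)
%% returns missing_or_locked | idle | busy.
close(Tree, Probe) ->
    walk(gb_trees:next(gb_trees:iterator(Tree)), Tree, Probe).

walk(none, Tree, _Probe) ->
    %% Nothing could be evicted.
    {false, Tree};
walk({Stamp, Name, Iter}, Tree, Probe) ->
    case Probe(Name) of
        missing_or_locked ->
            %% Case 1: drop the entry and restart with a fresh iterator.
            Tree1 = gb_trees:delete(Stamp, Tree),
            walk(gb_trees:next(gb_trees:iterator(Tree1)), Tree1, Probe);
        idle ->
            %% Case 2: evict the entry and stop iterating.
            {true, gb_trees:delete(Stamp, Tree)};
        busy ->
            %% Case 3: re-insert under a newer stamp, but keep walking the
            %% old iterator, which still reflects the original tree.
            {Newest, _} = gb_trees:largest(Tree),
            Tree1 = gb_trees:insert(Newest + 1, Name,
                                    gb_trees:delete(Stamp, Tree)),
            walk(gb_trees:next(Iter), Tree1, Probe)
    end.

%% With every entry busy, one close call rewrites the whole tree.
demo() ->
    Tree = gb_trees:from_orddict([{1, db_a}, {2, db_b}, {3, db_c}]),
    {false, Tree1} = close(Tree, fun(_) -> busy end),
    gb_trees:to_list(Tree1).
```

In this model `lru_walk_sketch:demo()` returns `[{4,db_a},{5,db_b},{6,db_c}]`: no entry was evicted, yet every entry was deleted and re-inserted.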
There are a few odd things that jump out:
* We manage two views of the Tree, one in the iterator while traversing
and one in the Cache variable.
* We treat a missing entry and a locked entry the same way in the result
of the try_lock
* We always restart the iteration when the entry is missing/locked
* If we can't find any idle entries, we always end up re-inserting all the
entries in the cache. So a single close call ends up traversing and
rewriting the whole cache. If there are thousands of concurrent calls, they
all end up doing that work.