nfsantos opened a new pull request, #1394:
URL: https://github.com/apache/jackrabbit-oak/pull/1394
The current implementation of filtering excluded paths and custom regex is
using a condition like
` _id: { $nin: [ /^[0-9]{1,3}:\/content\/dam\/.*$/ ]`
Mongo cannot evaluate this condition without retrieving the full document,
because a value of _null would also match this condition and the index does not
contain null values. Therefore, when the index contains excluded paths, the
download will be much slower because Mongo has to retrieve every single
document to evaluate the condition.
As a workaround, we can transform the regex on an equivalent one that
matches the complement of the original regex using [negative
lookahead](https://stackoverflow.com/questions/1240275/how-to-negate-specific-word-in-regex).
This allows rewriting the filter condition using only positive conditions,
which can be evaluated using only the index.
## Performance
As an example of the difference between the two approaches, the two
following queries count the number of documents that match the regex filter.
Using `$not` in the regex:
```
> db.nodes.find({ $and: [{ "_modified": { "$gte": 0 } }, { _id: { $not: {
$regex: /^[0-9]{1,3}:\/content\/dam\/.*$/ } } }, { _id: { $not: { $regex:
/^[0-9]{1,3}:\/oak:index\/.*$/ } } }] }).sort({ "_modified": 1 }).count()
15338210
```
4m18s
Using a negated regex to turn the Mongo filter into a positive filter:
```
> db.nodes.find( { $and: [ {"_modified": {"$gte": 0}}, { _id: { $regex:
/^(?![0-9]{1,3}:\/content\/dam\/)/ } }, { _id: { $regex:
/^(?![0-9]{1,3}:\/oak:index\/)/ } }]} ).sort({ "_modified":1 }).count()
15338210
```
39s
The plan for the query with `$not` fetches every document and then applies
the regex filters:
```
stage: 'FETCH',
filter: {
'$and': [
{
_id: {
'$not': { '$regex': '^[0-9]{1,3}:\\/content\\/dam\\/.*$' }
}
},
{
_id: { '$not': { '$regex': '^[0-9]{1,3}:\\/oak:index\\/.*$' } }
}
]
},
inputStage: {
stage: 'IXSCAN',
keyPattern: { _modified: 1, _id: 1 },
indexName: '_modified_1__id_1',
isMultiKey: false,
multiKeyPaths: { _modified: [], _id: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { _modified: [ '[0, inf.0]' ], _id: [ '[MinKey, MaxKey]' ] }
}
```
While the plan for the query with the negated regex applies the filter
during the index scan, fetching only the documents that match:
```
stage: 'FETCH',
inputStage: {
stage: 'IXSCAN',
filter: {
'$and': [
{
_id: { '$regex': '^(?![0-9]{1,3}:\\/content\\/dam\\/)' }
},
{ _id: { '$regex': '^(?![0-9]{1,3}:\\/oak:index\\/)' } }
]
},
keyPattern: { _modified: 1, _id: 1 },
indexName: '_modified_1__id_1',
isMultiKey: false,
multiKeyPaths: { _modified: [], _id: [] },
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: { _modified: [ '[0, inf.0]' ], _id: [ '["", {})' ] }
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]