[Kernel-packages] [Bug 1774336] Re: FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false

Daniel Axtens Wed, 01 Aug 2018 21:11:46 -0700

** Description changed:

  == SRU Justification ==
  
  [Impact]
  Oops during heavy NFS + FSCache use:
  
- [81738.886634] FS-Cache: 
+ [81738.886634] FS-Cache:
  [81738.888281] FS-Cache: Assertion failed
  [81738.889461] FS-Cache: 6 == 5 is false
  [81738.890625] ------------[ cut here ]------------
  [81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!
  
  6 == 5 represents an operation being DEAD when it was not expected to
  be.
  
  [Cause]
- There is a race in fscache and cachefiles. 
+ There is a race in fscache and cachefiles.
  
  One thread is in cachefiles_read_waiter:
-  1) object->work_lock is taken.
-  2) the operation is added to the to_do list.
-  3) the work lock is dropped.
-  4) fscache_enqueue_retrieval is called, which takes a reference.
+  1) object->work_lock is taken.
+  2) the operation is added to the to_do list.
+  3) the work lock is dropped.
+  4) fscache_enqueue_retrieval is called, which takes a reference.
  
  Another thread is in cachefiles_read_copier:
-  1) object->work_lock is taken
-  2) an item is popped off the to_do list.
-  3) object->work_lock is dropped.
-  4) some processing is done on the item, and fscache_put_retrieval() is called, dropping a reference.
+  1) object->work_lock is taken
+  2) an item is popped off the to_do list.
+  3) object->work_lock is dropped.
+  4) some processing is done on the item, and fscache_put_retrieval() is called, dropping a reference.
  
  Now if this sequence in cachefiles_read_copier takes place *between*
  steps 3 and 4 in cachefiles_read_waiter, a reference will be dropped
  before it is taken, which leads to the object's reference count hitting
  zero, which leads to lifecycle events for the object happening too soon,
  leading to the assertion failure later on.
  
  (This is simplified and clarified from the original upstream analysis
  for this patch at
  https://www.redhat.com/archives/linux-cachefs/2018-February/msg00001.html
  and from a similar patch with a different approach to fixing the bug at
  https://www.redhat.com/archives/linux-cachefs/2017-June/msg00002.html)
  
  [Fix]
- Move fscache_enqueue_retrieval under the lock in cachefiles_read_waiter. This means that the object cannot be popped off the to_do list until it is in a fully consistent state with the reference taken.
+ 
+ 
+  (Old sauce patch being reverted) Move fscache_enqueue_retrieval under the lock in cachefiles_read_waiter. This means that the object cannot be popped off the to_do list until it is in a fully consistent state with the reference taken.
+ 
+  (New upstream patch) Explicitly take a reference to the object while it
+ is being enqueued. Adjust another part of the code to deal with the
+ greater range of object states this exposes.
  
  [Testcase]
  A user has run ~100 hours of NFS stress tests and not seen this bug recur.
  
  [Regression Potential]
-  - Limited to fscache/cachefiles. 
-  - The change makes things more conservative (doing more under lock) so that's reassuring.
-  - There may be performance impacts but none have been observed so far.
+  - Limited to fscache/cachefiles.
+  - The change makes things more conservative (taking more references) so that's reassuring.
+  - There may be performance impacts but none have been observed so far.


-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1774336

Title:
  FS-Cache: Assertion failed: FS-Cache: 6 == 5 is false

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Trusty:
  Fix Released
Status in linux source package in Xenial:
  Fix Released
Status in linux source package in Artful:
  Fix Released
Status in linux source package in Bionic:
  Fix Released

Bug description:
  == SRU Justification ==

  [Impact]
  Oops during heavy NFS + FSCache use:

  [81738.886634] FS-Cache:
  [81738.888281] FS-Cache: Assertion failed
  [81738.889461] FS-Cache: 6 == 5 is false
  [81738.890625] ------------[ cut here ]------------
  [81738.891706] kernel BUG at /build/linux-hVVhWi/linux-4.4.0/fs/fscache/operation.c:494!

  6 == 5 represents an operation being DEAD when it was not expected to
  be.

  [Cause]
  There is a race in fscache and cachefiles.

  One thread is in cachefiles_read_waiter:
   1) object->work_lock is taken.
   2) the operation is added to the to_do list.
   3) the work lock is dropped.
   4) fscache_enqueue_retrieval is called, which takes a reference.

  Another thread is in cachefiles_read_copier:
   1) object->work_lock is taken
   2) an item is popped off the to_do list.
   3) object->work_lock is dropped.
   4) some processing is done on the item, and fscache_put_retrieval() is
      called, dropping a reference.

  Now if this sequence in cachefiles_read_copier takes place
  *between* steps 3 and 4 in cachefiles_read_waiter, a reference will be
  dropped before it is taken, which leads to the object's reference count
  hitting zero, which leads to lifecycle events for the object happening
  too soon, leading to the assertion failure later on.
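
  The interleaving above can be replayed deterministically in user space.
  The following is a minimal single-threaded sketch, not kernel code:
  struct toy_op, toy_get(), toy_put() and replay_race() are hypothetical
  names, the starting reference count of 1 is an assumption, and locking
  is represented only by comments.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a retrieval operation; not the kernel struct. */
struct toy_op {
    int usage;      /* reference count (assumed to start at 1) */
    bool dead;      /* set once usage reaches zero */
    bool on_to_do;  /* whether the op is visible on the to_do list */
};

static void toy_get(struct toy_op *op) { op->usage++; }

static void toy_put(struct toy_op *op)
{
    if (--op->usage == 0)
        op->dead = true;   /* lifecycle teardown would begin here */
}

/* Replays the bad interleaving; returns true if the op died early. */
static bool replay_race(void)
{
    struct toy_op op = { .usage = 1, .dead = false, .on_to_do = false };

    /* waiter steps 1-3: publish under the lock, then drop the lock */
    op.on_to_do = true;

    /* copier steps 1-4 run in the gap: pop, process, put */
    op.on_to_do = false;
    toy_put(&op);
    bool died_early = op.dead;

    /* waiter step 4: the reference arrives too late */
    toy_get(&op);
    return died_early;
}
```

  In this model replay_race() returns true: the count reaches zero before
  the waiter's reference is taken, which corresponds to the premature
  lifecycle events described above.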

  (This is simplified and clarified from the original upstream analysis
  for this patch at
  https://www.redhat.com/archives/linux-cachefs/2018-February/msg00001.html
  and from a similar patch with a different approach to fixing the bug at
  https://www.redhat.com/archives/linux-cachefs/2017-June/msg00002.html)

  [Fix]

  
   (Old sauce patch being reverted) Move fscache_enqueue_retrieval under
  the lock in cachefiles_read_waiter. This means that the object cannot be
  popped off the to_do list until it is in a fully consistent state with
  the reference taken.

   (New upstream patch) Explicitly take a reference to the object while
  it is being enqueued. Adjust another part of the code to deal with the
  greater range of object states this exposes.
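
  As a rough illustration of the "take a reference while enqueueing"
  idea, here is a single-threaded user-space sketch. It is not the
  upstream patch: struct sketch_op, sketch_get(), sketch_put() and
  replay_with_temp_ref() are hypothetical names, the starting count of 1
  is an assumption, and the second half of the fix (handling the wider
  range of object states) is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a retrieval operation; not the kernel struct. */
struct sketch_op {
    int usage;  /* reference count (assumed to start at 1) */
    bool dead;  /* set once usage reaches zero */
};

static void sketch_get(struct sketch_op *op) { op->usage++; }

static void sketch_put(struct sketch_op *op)
{
    if (--op->usage == 0)
        op->dead = true;
}

/*
 * Replays the racy interleaving with a temporary reference held by the
 * waiter. Returns the final usage count, or -1 if the op died on the way.
 */
static int replay_with_temp_ref(void)
{
    struct sketch_op op = { .usage = 1, .dead = false };

    sketch_get(&op);   /* waiter: temporary reference, taken first */
    /* waiter: add to to_do under the lock, then drop the lock */

    sketch_put(&op);   /* copier: pop, process, put (runs in the gap) */
    if (op.dead)
        return -1;

    sketch_get(&op);   /* waiter: enqueue takes its reference */
    sketch_put(&op);   /* waiter: drop the temporary reference */
    return op.dead ? -1 : op.usage;
}
```

  With the temporary reference held, the copier's put in the gap cannot
  take the count to zero, so replay_with_temp_ref() returns 1 in this
  model rather than -1.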

  [Testcase]
  A user has run ~100 hours of NFS stress tests and not seen this bug recur.

  [Regression Potential]
   - Limited to fscache/cachefiles.
   - The change makes things more conservative (taking more references) so
     that's reassuring.
   - There may be performance impacts but none have been observed so far.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1774336/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
