On Mon, Apr 18, 2011 at 12:49:38PM -0700, [email protected] wrote: > On Mon, Apr 18, 2011 at 12:28 PM, Kostik Belousov <[email protected]> wrote: > > On Mon, Apr 18, 2011 at 04:32:22PM +0000, Matthew D Fleming wrote: > >> Author: mdf > >> Date: Mon Apr 18 16:32:22 2011 > >> New Revision: 220791 > >> URL: http://svn.freebsd.org/changeset/base/220791 > >> > >> Log: > >> Add the posix_fallocate(2) syscall. The default implementation in > >> vop_stdallocate() is filesystem agnostic and will run as slow as a > >> read/write loop in userspace; however, it serves to correctly > >> implement the functionality for filesystems that do not implement a > >> VOP_ALLOCATE. > >> > >> Note that __FreeBSD_version was already bumped today to 900036 for any > >> ports which would like to use this function. > >> > >> Also reserve space in the syscall table for posix_fadvise(2). > > >> +#ifdef __notyet__ > >> + /* > >> + * Check if the filesystem sets f_maxfilesize; if not use > >> + * VOP_SETATTR to perform the check. > >> + */ > >> + error = VFS_STATFS(vp->v_mount, &sfs, td); > >> + if (error != 0) > >> + goto out; > >> + if (sfs.f_maxfilesize) { > >> + if (offset > sfs.f_maxfilesize || len > sfs.f_maxfilesize || > >> + offset + len > sfs.f_maxfilesize) { > >> + error = EFBIG; > >> + goto out; > >> + } > >> + } else > >> +#endif > >> + if (offset + len > vap->va_size) { > >> + VATTR_NULL(vap); > >> + vap->va_size = offset + len; > >> + error = VOP_SETATTR(vp, vap, td->td_ucred); > >> + if (error != 0) > >> + goto out; > >> + } > > I still do not see a reason to do VOP_SETATTR() there. VOP_WRITE() will > > do auto-extend as needed. Also, see below. > > The need is, as commented, to return EFBIG when the new file size will > be larger than the FS supports. Without this code, passing in > something like posix_fallocate(fd, 0, OFF_MAX) will run the filesystem > out of space. Handling max file size and not overflowing the fs are different things. VOP_WRITE() will handle file size on its own too. I see no problem with exhausting free space if this is what user asked for.
>
> >> +
> >> + while (len > 0) {
> >> + if (should_yield()) {
> >> + VOP_UNLOCK(vp, 0);
> >> + locked = 0;
> >> + kern_yield(-1);
> > Please note that, despite dropping the vnode lock, the snapshot creation
> > is still blocked at this point, due to previous vn_start_write().
> >
> > If doing vn_finished_write() there, then bwillwrite() before
> > next iteration is desired.
> >> + error = vn_lock(vp, LK_EXCLUSIVE);
> >> + if (error != 0)
> >> + break;
> >> + locked = 1;
> >> + error = VOP_GETATTR(vp, vap, td->td_ucred);
> >> + if (error != 0)
> >> + break;
> >> + }
> >> +
> >> + /*
> >> + * Read and write back anything below the nominal file
> >> + * size. There's currently no way outside the filesystem
> >> + * to know whether this area is sparse or not.
> >> + */
> >> + cur = iosize;
> >> + if ((offset % iosize) != 0)
> >> + cur -= (offset % iosize);
> >> + if (cur > len)
> >> + cur = len;
> >> + if (offset < vap->va_size) {
> >> + aiov.iov_base = buf;
> >> + aiov.iov_len = cur;
> >> + auio.uio_iov = &aiov;
> >> + auio.uio_iovcnt = 1;
> >> + auio.uio_offset = offset;
> >> + auio.uio_resid = cur;
> >> + auio.uio_segflg = UIO_SYSSPACE;
> >> + auio.uio_rw = UIO_READ;
> >> + auio.uio_td = td;
> >> + error = VOP_READ(vp, &auio, 0, td->td_ucred);
> >> + if (error != 0)
> >> + break;
> >> + if (auio.uio_resid > 0) {
> >> + bzero(buf + cur - auio.uio_resid,
> >> + auio.uio_resid);
> >> + }
> >> + } else {
> >> + bzero(buf, cur);
> >> + }
> > Wouldn't VOP_SETATTR() at the start of the function mostly prevent
> > this bzero from executing ?
>
> Yes. If struct statfs had a member indicating the file system's max
> file size, then the extend wouldn't be necessary. We have that
> feature locally, but it's only implemented for ufs and our custom file
> system, and it requires an ABI change so it's a bit of work to
> upstream. And as with most of those things, it's hard to find the
> time to upstream it outside of work hours.
>
> > I estimated what it would take to do the optimized implementation for UFS,
> > and I think that the following change would allow to lessen the code
> > duplication much.
> >
> > What if the vnode lock drop and looping be handled by the syscall, instead
> > of the vop implementation ? In other words, allow the VOP_ALLOCATE()
> > to allocate less then requested, and return the allocated amount to
> > the caller. The loop would be centralized then, freeing fs from doing
> > the dance. Also, if fs considers that suitable, it would do a whole
> > allocation in one run.
>
> I like this idea. Perhaps input to the vop should be pointers to
> offset and len, and the vop can change them as it iterates? Otherwise
> the return code must be overloaded to distinguish between an error
> code and the len that has been handled so far. And, I think the VOP
> interface must return int, so there would be no way to indicate that
> more than 2GB had been allocated.
>
> So something in the kernel like:
> while ((error = VOP_ALLOCATE(vp, &offset, &len)) == EAGAIN) {
> VOP_UNLOCK(vp, 0);
> /* XXX unlock other resources */
> maybe_yield();
> bwillwrite();
> /* XXX vn_start_write dance */
> error = VOP_LOCK(vp, LK_EXCLUSIVE);
> }
Exactly. I would not even bother to return EAGAIN, just 0, and the need
to retry the loop can be determined by len being non-zero.
Also, by rearranging the loop, we can avoid the duplication of
calls to bwillwrite, vn_start_write, vn_lock etc.
for (;;) {
/* All error handling is removed for convenience */
bwillwrite();
vn_start_write();
vn_lock();
VOP_ALLOCATE(vp, &offset, &len);
VOP_UNLOCK();
vn_finished_write();
if (len == 0)
break;
yield();
}
>
> Cheers,
> matthew
pgpIKtLaACft6.pgp
Description: PGP signature
