Writing Filesystems - Reading/Writing files

From Genunix

This article has been identified as a draft. It is currently undergoing a community review. Please add your comments to the discussion page.

Do not quote any text on this page! It is still a draft!


Implementing VOP_READ()

As the code samples will show, much of the generic validation code, which checks the I/O request against the permissions and size limits of the file, is common to VOP_READ() and VOP_WRITE(). Many filesystems have therefore provided a unified xxx_rdwr() or rwxxx function that is called from both the read and write backends and branches internally. That approach has advantages and disadvantages, and which one you choose for your filesystem is up to you. For this article series I've chosen the split approach, as it makes the two code paths clearer.

Let's look at the simpler one of these two first - VOP_READ().

The code looks like this:

static int
fat_read(
	struct fatnode *fip,
	struct uio *uio,
	enum uio_rw rw,
	int ioflag,
	struct cred *cr,
	caller_context_t *ct)	/* passed through from VOP_READ() for chklock() */
{
	struct vnode *vp = FAT2V(fip);
	struct fatfs *fsp = VFS2FATFS(vp->v_vfsp);
	int error = 0;
	int flags;
	fat_rl_t *rl;

	ASSERT(rw == UIO_READ);
	ASSERT(vp->v_type == VREG);	/* directories ? later ! */

	if (MANDLOCK(vp, FAT_POSIXMODE(fip))) {
		/*
		 * chklock may end up calling VOP_GETATTR() - hence first.
		 */
		error = chklock(vp, FREAD, uio->uio_loffset,
				uio->uio_resid, uio->uio_fmode, ct);
		if (error)
			return (error);
	}

	/*
	 * Generic consistency checks on the uio request.
	 * This code is as good as identical for all filesystems,
	 * except for the checks against the maximum supported file
	 * size which of course depends on the filesystem's limits.
	 */
	if (uio->uio_loffset < 0)	/* framework misses this case ! */
		return (EINVAL);

	/*
	 * ZFS sourcecode claims "access beyond end" sets atime.
	 * Where does POSIX say that ?
	 */
	if (uio->uio_resid == 0 ||			/* nothing to read */
	    uio->uio_loffset >= FAT_MAXOFFSET_T ||	/* beyond max filesize */
	    uio->uio_loffset >= (offset_t)fip->f_size)	/* beyond end of file */
		return (0);

	FAT_ENTER(fsp, FAT_ENTER_SHARED);

	/*
	 * ensure POSIX O_RSYNC/O_SYNC semantics.
	 */
	fat_commit_node(fsp, fip, (ioflag & FRSYNC));

	rl = fat_range_lock(fip, uio->uio_loffset, uio->uio_resid, FAT_RL_READER);

	do {
		offset_t off, mapon, diff;
		caddr_t base;
		size_t n;

		off = uio->uio_loffset & MAXBMASK;
		mapon = (int)(uio->uio_loffset & MAXBOFFSET);
		diff = (offset_t)fip->f_size - uio->uio_loffset;
		if (diff <= 0)
			break;		/* read beyond the end of the file */

		n = MIN(uio->uio_resid, MIN(diff, MAXBSIZE - mapon));

		ASSERT(n <= MAXBSIZE);	/* the MIN() calls should've made sure of that */

		base = segmap_getmapflt(segkmap, vp, (u_offset_t)off, n, 1, S_READ);

		if ((error = uiomove(base + mapon, n, rw, uio)) != 0) {
			(void)segmap_release(segkmap, base, SM_DONTNEED);
		} else {
			flags = 0;
			/*
			 * We won't need this block again if all of it has been copied
			 * out, or if the end of the file has been reached.
			 */
			if ((mapon + n) == MAXBSIZE || uio->uio_loffset == fip->f_size)
				flags = SM_DONTNEED;
			error = segmap_release(segkmap, base, flags);
		}
	} while (!error && uio->uio_resid > 0);	/* n is out of scope here, and never 0 anyway */

out:
	fat_range_unlock(fip, rl);
	FAT_ACCESSTIME_STAMP(fsp, fip);
	FAT_EXIT(fsp);

	return (error);
}

Now that's not difficult, is it?

The copy loop uses segmap to provide temporary kernel mappings for the data pages during I/O. Traditionally, one would have called segmap_getmap(), but as you can see from the source, that just passes through to segmap_getmapflt() anyway, so the code calls the latter directly and explicitly requests that the pages be pre-faulted. If your code doesn't do this, uiomove() will trigger page faults that fault the pages in as well, but at the cost of trap dispatch overhead, which pre-faulting avoids. segmap processes data in chunks of MAXBSIZE, and the loop breaks larger I/O requests down into such chunks (if that code isn't obvious to you, check hsfs_read() for a comment).
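
To make the chunking arithmetic concrete, here is a minimal userland sketch, not kernel code. It assumes MAXBSIZE is 8192 (check <sys/param.h> on your platform); the SK_ names, struct chunk, next_chunk() and walk_request() are made up for this illustration. It only mirrors how the loop above derives off, mapon and n on each pass, and advances the offset by hand where the kernel relies on uiomove().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed values; the real ones come from <sys/param.h>. */
#define	SK_MAXBSIZE	((int64_t)8192)
#define	SK_MAXBOFFSET	(SK_MAXBSIZE - 1)
#define	SK_MAXBMASK	(~SK_MAXBOFFSET)
#define	SK_MIN(a, b)	((a) < (b) ? (a) : (b))

struct chunk {
	int64_t off;	/* MAXBSIZE-aligned start of the mapping */
	int64_t mapon;	/* offset of the data within the mapping */
	size_t n;	/* bytes to transfer in this iteration */
};

/*
 * Compute the next chunk for a read at 'loffset' with 'resid' bytes
 * left, on a file of 'fsize' bytes. Returns 0 if nothing is left.
 */
static int
next_chunk(int64_t loffset, size_t resid, int64_t fsize, struct chunk *c)
{
	int64_t diff = fsize - loffset;

	if (resid == 0 || diff <= 0)
		return (0);	/* done, or read beyond the end of the file */
	c->off = loffset & SK_MAXBMASK;
	c->mapon = loffset & SK_MAXBOFFSET;
	c->n = (size_t)SK_MIN((int64_t)resid,
	    SK_MIN(diff, SK_MAXBSIZE - c->mapon));
	assert(c->n > 0 && (int64_t)c->n <= SK_MAXBSIZE);
	return (1);
}

/* Walk a whole request, printing each chunk; returns the chunk count. */
static int
walk_request(int64_t loffset, size_t resid, int64_t fsize)
{
	struct chunk c;
	int count = 0;

	while (next_chunk(loffset, resid, fsize, &c)) {
		printf("off=%lld mapon=%lld n=%zu\n",
		    (long long)c.off, (long long)c.mapon, c.n);
		loffset += (int64_t)c.n;	/* uiomove() does this in the kernel */
		resid -= c.n;
		count++;
	}
	return (count);
}
```

For example, walk_request(5000, 20000, 100000) splits the request into four chunks: a leading partial one of 3192 bytes, two full ones of 8192 bytes, and a trailing one of 424 bytes.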

Note that OpenSolaris recently got a new facility, VPM, that can be used instead of segmap if support for it is enabled. Be aware that neither segmap nor VPM is considered a committed/stable interface, and changes do occur over time. The segmap interface is mature and settled, so whatever its stability classification, large changes are unlikely. The same isn't true for VPM yet, so if you decide to use it, either closely follow OpenSolaris development (or better yet, integrate your filesystem into the main OpenSolaris source tree!), or get in contact with Sun about it, so that code changes Sun makes to VPM won't break your filesystem.

With VPM, the code in the copy loop changes slightly:

...
	/*
	 * VOP_READ() main copy loop with VPM support.
	 */
	do {
		offset_t off, mapon, diff;
		caddr_t base;
		size_t n;

		off = uio->uio_loffset & MAXBMASK;
		mapon = (int)(uio->uio_loffset & MAXBOFFSET);
		diff = (offset_t)fip->f_size - uio->uio_loffset;
		if (diff <= 0)
			break;		/* read beyond the end of the file */

		n = MIN(uio->uio_resid, MIN(diff, MAXBSIZE - mapon));

		ASSERT(n <= MAXBSIZE);	/* the MIN() calls should've made sure of that */

		if (vpm_enable) {
			error = vpm_data_copy(vp, (off + mapon), (uint_t)n,
				uio, 1, NULL, 0, S_READ);
		} else {
			base = segmap_getmapflt(segkmap, vp, (u_offset_t)off, n, 1, S_READ);
			error = uiomove(base + mapon, n, rw, uio);
		}

		if (error) {
			if (vpm_enable)
				(void)vpm_sync_pages(vp, off, n, SM_DONTNEED);
			else
				(void)segmap_release(segkmap, base, SM_DONTNEED);
		} else {
			flags = 0;
			/*
			 * We won't need this block again if all of it has been copied
			 * out, or if the end of the file has been reached.
			 */
			if ((mapon + n) == MAXBSIZE || uio->uio_loffset == fip->f_size)
				flags = SM_DONTNEED;

			if (vpm_enable)
				error = vpm_sync_pages(vp, off, n, flags);
			else
				error = segmap_release(segkmap, base, flags);
		}
	} while (!error && uio->uio_resid > 0);	/* n is out of scope here, and never 0 anyway */
...

The fallback support for segmap is still required - VPM may not be active on all architectures, and it can be switched off by a kernel tunable. The future may see Sun retire segmap, but then it may not. You might want to ask the Oracle for a quote about the future...

To sum up what VOP_READ() does in sequence:

  1. Check for mandatory file locking (if your fs type doesn't support it, you can actually take this code out).
  2. Check limits on the uio request against those on your filesystem/file.
  3. Perform forced umount synchronization / op refcounting.
  4. Sync outstanding I/O for the node to disk if O_SYNC / O_RSYNC semantics are requested.
  5. Lock the file range covered by the read request against changes. Whether this is done in the way ZFS does it in zfs_range_lock(), or simply via a flag/mutex/cv mechanism per-file, doesn't matter for correctness.
    Only be aware that a simple file reader/writer lock does not suffice - that lock cannot be held over the calls to segmap_getmapflt() and uiomove(). See comments in ufs_vnops.c about that.
  6. Execute the main copy loop. The VM interfaces, whether segmap or VPM, operate on units of MAXBSIZE bytes aligned at the same value, so the I/O operation is broken down into such chunks, and, on the leading/trailing block, properly truncated. The actual update of the loop state variables (uio_loffset and uio_resid) is done within uiomove() so we don't have to care. Just make sure to release resources if an error occurred.
  7. After finishing the loop, unlock the file (allow file size changes again, aka writes).
  8. Request an accesstime update.

In the next section we'll see how VOP_WRITE() differs.

Implementing VOP_WRITE()

The differences between VOP_READ() and VOP_WRITE() mostly come from the fact that the latter needs to deal with extending writes, and the implementation of write must therefore know how to perform data block allocation. Without the code for extending the file (or filling in holes, if the filesystem supports sparse files), VOP_WRITE() would look just like VOP_READ().

The steps that VOP_WRITE() needs to perform are:

  1. Synchronization with forced umount.
  2. Check for active mandatory file locks.
  3. Optionally, do a fastpath for rewrite *) (that requires no block allocation).
  4. Adjust the write offset depending on whether this is an appending write (offset relative to end of file) or a regular write (offset relative to start of file).
  5. Lock the node against concurrent (write) access.
    The simplified rangelock code **) that this sample code uses will be shown below; for a much better version see zfs_rlock.c.
  6. Perform checks of the I/O request limits against filesystem and resource control (ulimit) restrictions.
  7. Do the main "copy loop", which, for every unit of MAXBSIZE, does:
    • Check whether new filesystem blocks must be allocated, and if so allocate these.
    • Use segmap/VPM and uiomove() to perform the actual I/O. See below for notes.
    • Adjust the on-disk metadata information about the current filesize, update POSIX mtime.
    • On error, undo block allocation.
    There's significant potential for optimization here. We'll show below.
  8. If the I/O request was synchronous, make sure all data is written.
  9. If the I/O request could only be satisfied partially, discard any error. The framework may or may not want to retry the I/O for the part that wasn't satisfied; from the point of view of the filesystem, writing one byte successfully makes the whole I/O request successful - except for ENOSPC.
  10. Unlock the node (allow concurrency again).
  11. Release "hold" for forced umount synchronization.
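
The limit checks in step 6, and the recheck that each pass of the copy loop has to repeat, can be sketched in userland C as follows. This is an illustration under assumptions, not the kernel code: effective_limit() and clamp_to_limit() are hypothetical helpers, SK_RLIM_INFINITY and SK_MAXOFFSET_T stand in for RLIM64_INFINITY and MAXOFFSET_T, and the sketch merges the ulimit check and the filesystem check into one limit, whereas a real VOP_WRITE() must check ulimit first so that it can raise the resource-control action.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define	SK_MAXBSIZE		8192u
#define	SK_RLIM_INFINITY	UINT64_MAX		/* stand-in for RLIM64_INFINITY */
#define	SK_MAXOFFSET_T		((uint64_t)INT64_MAX)	/* stand-in for MAXOFFSET_T */

/*
 * Combine the process file size limit with the largest file size the
 * filesystem type supports ('fs_max'); the smaller one wins.
 */
static uint64_t
effective_limit(uint64_t rlimit, uint64_t fs_max)
{
	if (rlimit == SK_RLIM_INFINITY || rlimit > SK_MAXOFFSET_T)
		rlimit = SK_MAXOFFSET_T;
	return (rlimit < fs_max ? rlimit : fs_max);
}

/*
 * Clamp the length of one copy-loop iteration against 'limit'.
 * Returns 0 and stores the (possibly shortened) length in *np, or
 * EFBIG if the iteration would start at or beyond the limit.
 */
static int
clamp_to_limit(uint64_t uoff, size_t *np, uint64_t limit)
{
	assert(*np <= SK_MAXBSIZE);
	if (uoff + *np >= limit) {
		if (uoff >= limit)
			return (EFBIG);
		*np = (size_t)(limit - uoff);	/* partial write up to the limit */
	}
	return (0);
}
```

Because a partial write counts as success, a request that crosses the limit is shortened on the pass that reaches it; only a pass that starts at or beyond the limit fails with EFBIG.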

What makes the write code flow so much more complicated than read is the block allocation, and the need to undo it if an I/O error occurs while writing the actual blocks. How exactly this is done (and when exactly the on-disk node gets updated with the new file size) depends on the filesystem implementation and its metadata structure. In addition, the filesystem code should optimize the case of full-block writes over partial rewrites, and it should apply some I/O device bandwidth management (so that write system calls cannot swamp the device with I/O requests; generically called 'throttling'). See comments in the code.
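
The distinction between extending writes, full-block writes and partial rewrites comes down to a small decision per chunk. The following userland sketch shows one way to express it; plan_write() and struct write_plan are hypothetical names, MAXBSIZE is assumed to be 8192, and hole detection for sparse files is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SK_MAXBSIZE	8192u
#define	SK_MAXBOFFSET	(SK_MAXBSIZE - 1)

struct write_plan {
	int size_changed;	/* the chunk extends the file */
	int pagecreate;		/* pages can be created instead of read in */
};

/*
 * Decide how to treat one chunk of 'n' bytes at offset 'uoff' of a
 * file that is currently 'fsize' bytes long.
 */
static struct write_plan
plan_write(uint64_t uoff, size_t n, uint64_t fsize)
{
	struct write_plan p = { 0, 0 };
	uint64_t mapon = uoff & SK_MAXBOFFSET;

	assert(mapon + n <= SK_MAXBSIZE);	/* a chunk never spans mappings */
	if (uoff + n > fsize) {
		/* Extending write: no old data if we start a fresh mapping. */
		p.size_changed = 1;
		p.pagecreate = (mapon == 0);
	} else if (n == SK_MAXBSIZE) {
		/* Full-block rewrite: everything gets overwritten anyway. */
		p.pagecreate = 1;
	}
	/* Otherwise: partial rewrite, existing data must be read in first. */
	return (p);
}
```

A chunk that only overwrites part of an existing block is the expensive case, since the old contents have to be faulted in before the new bytes are copied over them.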

Enough mumbling - let's get at the code !

#define	CLEAR_TAIL(base, mapon, off, n, uio)				\
do {									\
	if ((uio)->uio_loffset < roundup((off) + (mapon) + (n), PAGESIZE)) { \
		offset_t nzero, nmoved;					\
									\
		nmoved = (uio)->uio_loffset - ((off) + (mapon));	\
		nzero = roundup((mapon) + (n), PAGESIZE) - nmoved;	\
		(void) kzero((base) + (mapon) + nmoved, (size_t)nzero);	\
	}								\
} while (0)

static int
fat_write(
	struct fatnode *fip,
	struct uio *uio,
	enum uio_rw rw,
	int ioflag,
	struct cred *cr,
	caller_context_t *ct)	/* passed through from VOP_WRITE() for chklock() */
{
	struct vnode *vp = FAT2V(fip);
	struct fatfs *fsp = VFS2FATFS(vp->v_vfsp);
	int error = 0;
	int flags;
	int size_changed;
	ssize_t premove_resid;
	rlim64_t limit;
	offset_t eoff;
	fat_rl_t *rl;

	ASSERT(rw == UIO_WRITE);

	if (uio->uio_resid == 0)		/* nothing to do - fasttrack ... */
		return (0);

	/*
	 * For FAT filesystems, we could actually drop this piece of code
	 * since the necessary permission bits for mandatory file locking
	 * do not exist on FAT; keep the code because:
	 *	a) this is sample code and other fs'ses may well have this
	 *	b) we may add some non-persistent state support for this ?
	 *
	 */
	if (MANDLOCK(vp, FAT_POSIXMODE(fip))) {
		/*
		 * chklock may end up calling VOP_GETATTR() - hence first.
		 */
		error = chklock(vp, FWRITE, uio->uio_loffset,
				uio->uio_resid, uio->uio_fmode, ct);
		if (error)
			return (error);
	}

	/*
	 * This returns with EIO if a forced umount request is pending.
	 * Otherwise, it bumps up the filesystem instance refcount.
	 */
	FAT_ENTER(fsp, FAT_ENTER_SHARED);

	/*
	 * Generic consistency checks on the uio request.
	 * This code is as good as identical for all filesystems,
	 * except for the checks against the maximum supported file
	 * size which of course depends on the filesystem's limits.
	 *
	 * XXX - to be sorted out: Some of these are supposed (?) to
	 * create mtime/atime updates even though no I/O is done ?
	 * Need a POSIX 'nitpicker' to tell me ...
	 */

	if (vp->v_type != VREG) {		/* XXX - maybe ASSERT() ? as in read() ? */
		FAT_EXIT(fsp);
		return (EISDIR);
	}

	eoff = uio->uio_loffset + (offset_t)uio->uio_resid;

	if (uio->uio_loffset < 0 || eoff < 0) {	/* framework misses this case ! */
		FAT_EXIT(fsp);
		return (EINVAL);
	}

	/*
	 * This is the check against ulimit for filesize, and against the maximum
	 * filesize supported by this filesystem type. Don't ask me why this looks
	 * like it does. If you can explain it, change this comment ...
	 */
	limit = uio->uio_llimit;
	if (limit == RLIM64_INFINITY || limit > MAXOFFSET_T)
		limit = MAXOFFSET_T;
	/*
	 * I/O request offsets in APPEND mode are relative to the end of the file.
	 * Adjust uio_loffset accordingly.
	 */
	if (ioflag & FAPPEND)
		uio->uio_loffset = fip->f_size;

	if (uio->uio_loffset > limit) {
		proc_t *p = ttoproc(curthread);

		mutex_enter(&p->p_lock);
		(void) rctl_action(rctlproc_legacy[RLIMIT_FSIZE],
		    p->p_rctls, p, RCA_UNSAFE_SIGINFO);
		mutex_exit(&p->p_lock);
		FAT_EXIT(fsp);
		return (EFBIG);
	}

	/*
	 * Now that 'ulimit -f' has been checked, validate against the
	 * actual limits of our filesystem type.
	 */
	limit = MIN(limit, FAT_MAXOFFSET_T);

	if (uio->uio_loffset > limit) {
		FAT_EXIT(fsp);
		return (EFBIG);
	}


	/*
	 * XXX - write throttling ? Apply some threshold checks in order not
	 * to swamp the device with I/O requests ?
	 */



	/*
	 * We're lazy here - writelock the entire file, aka singlethread writers.
	 * For more elaborate locking protocols allowing concurrent nonoverlapping
	 * writes to the same file, feel free to peek at the ZFS sources !
	 */
	rl = fat_range_lock(fip, 0, FAT_MAXOFFSET_T, FAT_RL_WRITER);

	/*
	 * The main "copy loop" for VOP_WRITE(). Operates on MAXBSIZE units of
	 * data (that's what segmap/VPM use). We always map aligned chunks of
	 * MAXBSIZE and use uiomove() to transfer the overlap between the current
	 * block and the uio request from temporary in-kernel mappings directly
	 * into userspace.
	 *
	 * If we 
	 */
	do {
		u_offset_t uoff = uio->uio_loffset;
		offset_t off, mapon;
		caddr_t base;
		size_t n;
		int pagecreate;

		off = uoff & MAXBMASK;
		mapon = (uoff & MAXBOFFSET);

		/*
		 * Unlike read, a write must not stop at the end of the file:
		 * going beyond it is how the file grows. Only the size of the
		 * mapping bounds the chunk here.
		 */
		n = MIN(uio->uio_resid, MAXBSIZE - mapon);

		/*
		 * Since we're writing in MAXBSIZE chunks and partial writes
		 * are actually a success, we need to recheck against the file
		 * size limit once per iteration of this loop ...
		 */
		if (uoff + n >= limit) {
			if (uoff >= limit) {
				error = EFBIG;
				break;
			}
			n = (size_t)(limit - (rlim64_t)uoff);
		}

		ASSERT(n <= MAXBSIZE);	/* the MIN() calls should've made sure of that */

		/*
		 * See whether the file must be grown.
		 */

		size_changed = 0;

		if ((size_t)uoff + n > fip->f_size) {
			/*
			 * This is an optimization - if we start this write at the
			 * beginning of a mapping, then we don't need to zero-fill
			 * in advance but can write the new data directly into the
			 * freshly-allocated blocks.
			 */
			size_changed = 1;
			pagecreate = (mapon == 0);
		} else if (n == MAXBSIZE) {
			/*
			 * We write a full mapping and can therefore skip the need
			 * to read in pages; precreate them so that a pagecache hit
			 * is guaranteed and no calls to VOP_GETPAGE() will be done
			 * by segmap_getmapflt()/vpm_data_copy() below.
			 * XXX - fs/VM deadlock ...
			 */
			/*
			 * XXX - FAT does not support holes. Filesystems that do need
			 * to check for a hole at the current offset and fill it.
			 */
			pagecreate = 1;
		} else {
			pagecreate = 0;
		}

		premove_resid = uio->uio_resid;

		/*
		 * Perform the actual data copy. VPM does this all in a "simple"
		 * way, one call, but segmap usage requires multiple steps:
		 *
		 * 1. A segmap mapping is created via segmap_getmapflt() and directly
		 *    populated if we found this write to be a partial rewrite (i.e.
		 *    if 'pagecreate' wasn't set above).
		 * 2. If a full-block rewrite or a write to a new block is done, then
		 *    there's no need to call VOP_GETPAGE() via segmap_getmapflt()
		 *    since any data brought in would be overwritten anyway; instead,
		 *    we'll directly populate the mapping by segmap_pagecreate().
		 *    XXX - fs/vm deadlock ?!
		 * 3. uiomove() copies the user input buffer into the segmap mapping.
		 * 4. Trailing parts of the last block must be zero-filled, both for
		 *    EOF conditions and I/O errors from uiomove().
		 * 5. If segmap_pagecreate() was called, the pages must be unlocked.
		 */
		if (vpm_enable) {
			error = vpm_data_copy(vp, (off + mapon), n,
			    uio, (pagecreate == 0), NULL, 0, S_WRITE);
		} else {
			int newpage = 0;

			base = segmap_getmapflt(segkmap, vp, (off + mapon),
			    n, (pagecreate == 0), S_WRITE);
			if (pagecreate)
				newpage = segmap_pagecreate(segkmap, base, n, 0);
			error = uiomove(base + mapon, n, UIO_WRITE, uio);
			uoff = uio->uio_loffset;
			if (newpage) {
				ASSERT(pagecreate);
				CLEAR_TAIL(base, mapon, off, n, uio);
				segmap_pageunlock(segkmap, base, n, S_WRITE);
			}
		}

		if (size_changed) {
			/*
			 * XXX - need to commit the node metadata ...
			 */
		}

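		if (error) {
			flags = size_changed ? SM_INVAL : SM_DESTROY;
			if (size_changed) {
				/*
				 * Rewind the uio to the start of this chunk so
				 * that the failed extension leaves no trace.
				 */
				uio->uio_loffset = off + mapon;
				uio->uio_resid = premove_resid;
			}
			if (vpm_enable)
				(void) vpm_sync_pages(vp, off, n, flags);
			else
				(void) segmap_release(segkmap, base, flags);
		} else {
			/*
			 * Release the mapping on success as well.
			 * XXX - synchronous writes need to force the data out here.
			 */
			if (vpm_enable)
				error = vpm_sync_pages(vp, off, n, 0);
			else
				error = segmap_release(segkmap, base, 0);
		}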

	} while (!error && uio->uio_resid > 0);	/* n is out of scope here, and never 0 anyway */

	fat_range_unlock(fip, rl);
	FAT_MODIFYTIME_STAMP(fsp, fip);
out:
	FAT_EXIT(fsp);
	return (error);
}

*) rewrite
rewrite means that all I/O is done into existing, preallocated disk blocks. A partial block write can never be a rewrite, and a pure copy-on-write filesystem implementation like ZFS will also never perform rewrites. But most "classical" filesystems do have the concept of re-using existing data blocks that "belong" to that file, and hence do support a form of rewrite.

**) rangelock
The following simple sample code performs reader/writer synchronization without holding a lock across the I/O. We could opt for simple rwlocks, but these would need to be dropped during uiomove(); flagging the node as "sharedlocked", "writewanted" or "writelocked" and using a mutex/cv pair avoids this. A much better implementation is in ZFS, zfs_rlock.c: it allows concurrent nonoverlapping writes, unlike this sample, which singlethreads writes and starves reads in favour of writes.
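
As a stand-in until the article's own code is filled in, here is a minimal userland sketch of such a flag-based scheme, using POSIX threads instead of the kernel's kmutex_t/kcondvar_t. Everything in it is an assumption made for illustration: struct sk_nodelock and the sk_lock_*() functions are hypothetical and are not the fat_range_lock() interface used above, and the sketch locks the whole node rather than a byte range. The point it shows is that the mutex is held only while the flags change, never across the I/O itself.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical per-node lock state; names are illustrative only. */
struct sk_nodelock {
	pthread_mutex_t mtx;	/* protects the fields below */
	pthread_cond_t cv;	/* signalled whenever the state changes */
	int readers;		/* "sharedlocked": number of active readers */
	int writewanted;	/* number of writers waiting */
	int writelocked;	/* a writer is active */
};

static void
sk_lock_init(struct sk_nodelock *l)
{
	(void) pthread_mutex_init(&l->mtx, NULL);
	(void) pthread_cond_init(&l->cv, NULL);
	l->readers = l->writewanted = l->writelocked = 0;
}

static void
sk_lock_enter(struct sk_nodelock *l, int writer)
{
	(void) pthread_mutex_lock(&l->mtx);
	if (writer) {
		l->writewanted++;
		while (l->writelocked || l->readers > 0)
			(void) pthread_cond_wait(&l->cv, &l->mtx);
		l->writewanted--;
		l->writelocked = 1;
	} else {
		/* Readers yield to waiting writers, so writes can starve reads. */
		while (l->writelocked || l->writewanted > 0)
			(void) pthread_cond_wait(&l->cv, &l->mtx);
		l->readers++;
	}
	(void) pthread_mutex_unlock(&l->mtx);	/* not held during the I/O */
}

static void
sk_lock_exit(struct sk_nodelock *l, int writer)
{
	(void) pthread_mutex_lock(&l->mtx);
	if (writer) {
		assert(l->writelocked);
		l->writelocked = 0;
	} else {
		assert(l->readers > 0);
		l->readers--;
	}
	(void) pthread_cond_broadcast(&l->cv);	/* wake readers and writers alike */
	(void) pthread_mutex_unlock(&l->mtx);
}
```

A reader would bracket its copy loop with sk_lock_enter(&l, 0) and sk_lock_exit(&l, 0), a writer with the same calls and writer set to 1. Since only flags and counters are held while segmap and uiomove() run, page faults taken during the copy cannot deadlock against a held lock.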

to be continued ...
