Writing Filesystems - Mapped I/O Backends
From Genunix
One of the big deficiencies in the Solaris filesystem framework is that there is no framework service function for the glue logic of VOP_GETPAGE() and VOP_PUTPAGE(). This makes these two vnode ops unnecessarily complicated.
For simplicity, this code makes extensive use of functions from the paged vnode support code in vm_pvn.c.
VOP_GETPAGE()
Before we investigate the actual implementation, let's look at the arguments and understand how the framework calls VOP_GETPAGE(), and what it expects the function to do.
The prototype for an implementation of VOP_GETPAGE() can be found in fop_getpage(), and looks like this:
int fop_getpage( vnode_t *vp, offset_t off, size_t len, uint_t *protp, page_t **plarr, size_t plsz, struct seg *seg, caddr_t addr, enum seg_rw rw, cred_t *cr)
The arguments and their meanings are:
- vnode_t *vp
- This is the vnode for which the framework requests a fault to be handled.
- offset_t off, size_t len
- Offset and size describe the fault location; the framework requests that the bytes in the range [off, off + len] be brought in.
A special meaning is attributed to len == 0: such a request means 'from off to EOF' (end of file).
- uint_t *protp, page_t *pl[], size_t plsz
- The VM framework passes arrays for pages and per-page protection information as arguments to VOP_GETPAGE(). The size_t plsz parameter gives the number of entries in both the protp[] and the *pl[] array. protp[] is optional and need not be filled in if not provided.
A special case is pl == NULL and plsz == 0, which is used for a readahead request on the requested byte range. On such a request, an implementation may choose to just return from VOP_GETPAGE() with a return code of 0, which means 'success'.
The calling framework guarantees that plsz is large enough to accommodate the requested fault range [off, off + len], but it will often be larger than that. XXX - explain purpose !!!!
- struct seg *seg
- A pointer to the virtual memory subsystem's segment structure. This is the segment mapped to the [off, off + len] fault range.
- caddr_t addr
- The target virtual address (guaranteed to be valid in kernel mode while VOP_GETPAGE() is processing) where the data is supposed to be written to.
- enum seg_rw rw
- VOP_GETPAGE() does not only handle read faults. It is also called by the framework on initialization faults, if a page in a mapping is accessed for the first time - even if that access is actually a store. The possible values for enum seg_rw can be found in <vm/seg_enum.h>.
- cred_t *cr
- Credentials associated with the calling process. XXX - actually ignored ?!
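The (off, len) convention described above can be sketched in plain userland C. This is a minimal sketch under stated assumptions, not kernel code: the sk_ names and the fixed 4096-byte page size are made up for illustration, and the real kernel provides PAGESIZE and PAGEOFFSET itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch only; the kernel supplies PAGESIZE. */
#define SK_PAGESIZE   4096u
#define SK_PAGEOFFSET (SK_PAGESIZE - 1)

/*
 * Resolve the byte length of a request. len == 0 stands for
 * "from off to end of file"; a start at or beyond EOF yields 0.
 */
static uint64_t
sk_effective_len(uint64_t off, uint64_t len, uint64_t file_size)
{
    if (len != 0)
        return (len);
    return (off < file_size ? file_size - off : 0);
}

/*
 * Number of pages touched by a request of len bytes that starts at
 * the page-aligned offset off. A partial last page counts as a page.
 */
static size_t
sk_pages_needed(uint64_t off, uint64_t len)
{
    assert((off & SK_PAGEOFFSET) == 0);
    return ((size_t)((len + SK_PAGEOFFSET) / SK_PAGESIZE));
}
```

For example, a request with off == 8192 and len == 0 against a 10000-byte file resolves to the remaining 1808 bytes, which still occupy one page.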
Actual code:
static int
fat_getpage(
    struct vnode *vp,
    offset_t off,
    size_t len,
    uint_t *protp,
    struct page *pl[],
    size_t plsz,
    struct seg *seg,
    caddr_t addr,
    enum seg_rw rw,
    struct cred *cred)
{
    struct fatnode *fip = VTOF(vp);
    struct fatfs *fsp = VFSTOFATFS(vp->v_vfsp);
    int err;

    if (vp->v_flag & VNOMAP) {
        return (ENOSYS);
    }

    ASSERT(off <= FAT_MAXOFFSET_T);
    ASSERT((off & PAGEOFFSET) == 0);

    FAT_ENTER(fsp, FAT_ENTER_SHARED);

    /*
     * An attempt to fault in pages from beyond the end of the file
     * must fail if the target is userspace.
     */
    if ((off + len) > (offset_t)(fip->f_size + PAGEOFFSET) &&
        seg != segkmap) {
        FAT_EXIT(fsp);
        return (EFAULT);    /* beyond EOF */
    }

    if (protp != NULL)
        *protp = PROT_ALL;

    /*
     * This is a small optimization. A fault on a single page does not
     * need to call the iterator.
     */
    if (len <= PAGESIZE) {
        err = fat_getapage(vp, (u_offset_t)off, len, protp, pl, plsz,
            seg, addr, rw, cred);
    } else {
        err = pvn_getpages(fat_getapage, vp, off, len, protp,
            pl, plsz, seg, addr, rw, cred);
    }

    FAT_EXIT(fsp);
    return (err);
}
fat_getpage() thus breaks down the fault-in request against the byte range [off, off + len] into requests to fault in single pages, each of which is handled by fat_getapage().
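The shape of that breakdown can be modelled outside the kernel. The following is a simplified userland sketch of a per-page iterator in the spirit of pvn_getpages(); it is not the vm_pvn.c implementation, it leaves out page lists, protections and segments entirely, and every sk_ name is hypothetical. One plausible benefit it illustrates is that the per-page callback only ever has to reason about a single page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch only. */
#define SK_PAGESIZE   4096u
#define SK_PAGEOFFSET (SK_PAGESIZE - 1)

/* Hypothetical per-page callback: handle the page at offset off. */
typedef int (*sk_getapage_t)(uint64_t off, void *arg);

/*
 * Walk [off, off + len) one page at a time and hand each page to the
 * callback. The walk stops at the first non-zero (error) return.
 */
static int
sk_getpages(sk_getapage_t getapage, uint64_t off, uint64_t len, void *arg)
{
    uint64_t o;
    uint64_t eoff = off + len;
    int err = 0;

    assert((off & SK_PAGEOFFSET) == 0);
    for (o = off; err == 0 && o < eoff; o += SK_PAGESIZE)
        err = getapage(o, arg);
    return (err);
}

/* Example callback: counts its invocations and records the last offset. */
struct sk_trace {
    size_t   calls;
    uint64_t last_off;
};

static int
sk_trace_page(uint64_t off, void *arg)
{
    struct sk_trace *t = arg;

    t->calls++;
    t->last_off = off;
    return (0);
}
```

Calling sk_getpages(sk_trace_page, 8192, 3 * 4096 + 1, &t) invokes the callback four times, because the single extra byte spills into a fourth page.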
XXX - need to explain why this is a good thing
XXX - need to give the getapage sample code !!!!
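One detail of fat_getpage() above that is easy to misread is the end-of-file test, which adds PAGEOFFSET to the file size before comparing. The userland sketch below isolates that comparison. The function name, the fixed page size and the is_segkmap flag (standing in for the seg != segkmap test) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for this sketch only. */
#define SK_PAGESIZE   4096
#define SK_PAGEOFFSET (SK_PAGESIZE - 1)

/*
 * Same comparison as in fat_getpage(): returns 1 if the request ends
 * beyond the last page that holds file data and the target is not the
 * kernel's segkmap segment, i.e. the case that yields EFAULT.
 * Adding SK_PAGEOFFSET lets a fault on a partially filled last page
 * through; for page-aligned off and page-multiple len this is the same
 * as comparing against the file size rounded up to a page boundary.
 */
static int
sk_beyond_eof(int64_t off, int64_t len, int64_t file_size, int is_segkmap)
{
    return ((off + len) > (file_size + SK_PAGEOFFSET) && !is_segkmap);
}
```

With a 5000-byte file, a one-page fault at offset 4096 is accepted because the second page holds the file's tail, while the same fault at offset 8192 is rejected unless it comes through segkmap.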
VOP_PUTPAGE()
And the same for VOP_PUTPAGE():
int fop_putpage( struct vnode *vp, offset_t off, size_t len, int flags, struct cred *cr)
Being the counterpart to VOP_GETPAGE(), this vnode operation's primary task is to write dirty pages associated with the given byte range [off, off + len] into the on-disk representation of vnode_t * vp. But like VOP_GETPAGE(), which also does zero-fill and readahead, VOP_PUTPAGE() does more than just writing dirty pages out - it also must support invalidation and freeing of pages associated with the vnode in the specified byte range. The parameters for VOP_PUTPAGE() and their possible values are:
- vnode_t *vp
- obvious
- offset_t off, size_t len
- same meaning as with VOP_GETPAGE(), including len == 0 marking a request to flush from off to EOF.
- int flags
- XXX - explain
- struct cred *cr
- XXX - as with getpage, unused ...
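Among other things, the flags decide which page lock fat_putpage() asks for and whether the page lookup is allowed to block. The selection made in the code that follows can be restated as a pure function, which makes it easy to enumerate the cases. This is a userland sketch: the SK_ bit values are placeholders chosen for the example and are not the real values from <sys/buf.h>, and the struct is hypothetical.

```c
#include <assert.h>

/* Placeholder flag bits for this sketch; NOT the <sys/buf.h> values. */
#define SK_B_ASYNC 0x1
#define SK_B_FREE  0x2
#define SK_B_INVAL 0x4

enum sk_se { SK_SE_SHARED, SK_SE_EXCL };

struct sk_lookup {
    enum sk_se se;    /* lock mode to request on the page */
    int may_block;    /* 1: page_lookup(), 0: page_lookup_nowait() */
};

/*
 * Mirrors the if/else in fat_putpage(): invalidation and synchronous
 * requests use the blocking lookup; purely asynchronous ones do not.
 * Freeing or invalidating a page needs the exclusive lock.
 */
static struct sk_lookup
sk_lookup_mode(int flags)
{
    struct sk_lookup m;

    if ((flags & SK_B_INVAL) || (flags & SK_B_ASYNC) == 0) {
        m.se = (flags & (SK_B_FREE | SK_B_INVAL)) ?
            SK_SE_EXCL : SK_SE_SHARED;
        m.may_block = 1;
    } else {
        m.se = (flags & SK_B_FREE) ? SK_SE_EXCL : SK_SE_SHARED;
        m.may_block = 0;
    }
    return (m);
}
```

Note that an asynchronous request that also carries the invalidate flag still takes the blocking path with an exclusive lock, since the first branch tests for invalidation before it looks at the asynchronous bit.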
/*
 * Flags are composed of {B_INVAL, B_FREE, B_DONTNEED, B_FORCE}
 * If len == 0, do from off to EOF.
 *
 * The normal cases should be len == 0 & off == 0 (entire vp list),
 * len == MAXBSIZE (from segmap_release actions), and len == PAGESIZE
 * (from pageout).
 */
/*ARGSUSED*/
static int
fat_putpage(
    struct vnode *vp,
    offset_t off,
    size_t len,
    int flags,
    struct cred *cr)
{
    struct fatnode *fip = VTOF(vp);
    struct fatfs *fsp = VFSTOFATFS(vp->v_vfsp);
    page_t *pp;
    int err = 0;
    u_offset_t io_off;
    size_t io_len;
    se_t se;
    int synchronous;

    if (vp->v_flag & VNOMAP)
        return (ENOSYS);

    FAT_ENTER(fsp, FAT_ENTER_SHARED);

    ASSERT(off <= FAT_MAXOFFSET_T);
    ASSERT((off & PAGEOFFSET) == 0);

    /*
     * An attempt to "flush" data if there's none cached, or an
     * attempt to write data beyond the end of the file, succeeds
     * immediately - there's nothing to do for the filesystem.
     */
    if (!vn_has_cached_data(vp) || off >= fip->f_size) {
        FAT_EXIT(fsp);
        return (0);
    }

    if (len == 0) {
        /*
         * Search the entire vp list for pages >= off
         */
        err = pvn_vplist_dirty(vp, off, fat_putapage, flags, cr);
        FAT_EXIT(fsp);
        return (err);
    }

    /*
     * If we are not invalidating, synchronously freeing or writing pages,
     * use the routine page_lookup_nowait() to prevent reclaiming them from
     * the free list.
     */
    if ((flags & B_INVAL) || ((flags & B_ASYNC) == 0)) {
        se = (flags & (B_FREE | B_INVAL)) ? SE_EXCL : SE_SHARED;
        synchronous = 1;
    } else {
        se = (flags & B_FREE) ? SE_EXCL : SE_SHARED;
        synchronous = 0;
    }

    io_off = off;
    while (err == 0 && io_off < MIN(off + len, fip->f_size)) {
        if (synchronous)
            pp = page_lookup(vp, io_off, se);
        else
            pp = page_lookup_nowait(vp, io_off, se);

        /*
         * Skip just the found page by default. But if it is dirty,
         * give putapage() the ability to cluster multiple consecutive
         * pages, and adjust io_len accordingly.
         */
        io_len = PAGESIZE;
        if (pp && pvn_getdirty(pp, flags))
            err = fat_putapage(vp, pp, &io_off, &io_len, flags, cr);
        io_off += io_len;
    }

    FAT_EXIT(fsp);
    return (err);
}
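The interplay of io_off and io_len in the loop above can be tried out with a toy model. In this userland sketch a byte array stands in for the page cache, and a hypothetical sk_putapage() "writes" the whole run of consecutive dirty pages it finds and reports the run length back through io_len. The real fat_putapage() is not shown in this article, so the clustering policy here (forward only, and not limited to the requested range) is purely an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch only. */
#define SK_PAGESIZE 4096u

/* Toy page cache: dirty[i] != 0 means page i of the file is dirty. */
struct sk_file {
    unsigned char *dirty;
    size_t npages;
    size_t writes;    /* number of I/O requests issued so far */
};

/*
 * Hypothetical putapage: clean the run of consecutive dirty pages that
 * starts at *io_off, count it as one I/O, and return its size in *io_len.
 */
static int
sk_putapage(struct sk_file *f, uint64_t *io_off, uint64_t *io_len)
{
    size_t start = (size_t)(*io_off / SK_PAGESIZE);
    size_t i = start;

    while (i < f->npages && f->dirty[i]) {
        f->dirty[i] = 0;
        i++;
    }
    *io_len = (uint64_t)(i - start) * SK_PAGESIZE;
    f->writes++;
    return (0);
}

/* Same loop structure as fat_putpage(), minus locking and flags. */
static int
sk_putpage(struct sk_file *f, uint64_t off, uint64_t len)
{
    uint64_t io_off = off;
    uint64_t io_len;
    uint64_t end = off + len;
    uint64_t fsize = (uint64_t)f->npages * SK_PAGESIZE;
    int err = 0;

    if (end > fsize)
        end = fsize;
    while (err == 0 && io_off < end) {
        io_len = SK_PAGESIZE;    /* default: step over one page */
        if (f->dirty[io_off / SK_PAGESIZE])
            err = sk_putapage(f, &io_off, &io_len);
        io_off += io_len;        /* skips the whole cluster, if any */
    }
    return (err);
}
```

With eight pages of which pages 1 to 3 and page 6 are dirty, flushing the whole range issues two writes rather than four, because the first call covers a three-page cluster and the loop then resumes at page 4.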
to be continued...
