# Lecture 13: File Systems
# Slide Notes
Most systems need to store data persistently so that it is still there after a reboot or power-down. This is typically a core piece of functionality for the system; even the operating system itself must be stored this way.
Our options for doing this are:
- Use raw storage blocks to store the data (NO)
  - raw blocks make little sense to users and are not even easy for OS developers to work with
- Use a database to store the data (NO)
  - probably more structure than we need or can afford
- Use a file system (YES)
  - an organized way of structuring persistent data that makes sense to users and programmers
## Basic File System Concept
Organize data into natural, coherent units (like a spreadsheet), then store each unit as its own self-contained entity (a file). Finally, provide a simple, powerful organizing principle for the collection of files that makes them easy to find and organize.
### File Systems and Hardware
File systems are typically stored on hardware that provides persistent memory, with the expectation that a file put in one "place" will still be there when we look again.
Performance considerations require us to match the implementation to the hardware, but ideally the same user-visible file system should work on any reasonable hardware.
### Flash Drives
These are solid-state persistent storage devices, meaning no moving parts. Reads and writes are fairly fast.
But a given block can only be written once before it must be erased, and erasing (which happens in large sectors of the drive) is much slower.
### Data and Metadata
File systems deal with 2 kinds of information:
- **Data**: the information the file is actually supposed to store (e.g. an image)
- **Metadata**: information about the information the file stores
Note that both data and metadata must be stored persistently and usually on the same piece of hardware.
### A Further Wrinkle
We also need to consider that we want our file system to be agnostic to the storage medium. The same program should access the file system the same way, regardless of medium.
### Desirable File System Properties
- Persistence
- Easy use model for accessing one file and organizing a collection of files
- Flexibility
- Portability across hardware device types
- Performance
- Reliability
- Suitable security
### The Performance Issue
How fast does our file system need to be? Ideally, as fast as everything else (CPU, memory, bus).
However, those devices operate at nanosecond speeds today, while flash drives are roughly 1000x slower, so we have to work to be as fast as possible.
### The Reliability Issue
Persistence implies reliability: we want our files to be there when we check, no matter what, so our file systems must be free of errors.
Remember the discussions of concurrency, race conditions, etc., where we had to reason carefully about correctness.
## Basics of File System Design

### The File System API
It is highly desirable to provide a single API to programmers and users for all files, regardless of how the file system underneath is actually implemented.
Program portability requires:
1. File container operations
2. Directory Operations
3. File I/O operations
### File Container Operations
### Directory Operations
Directories provide the organization of a file system. At their core, directories translate a name into a lower-level file pointer.
Directory operations tend to involve finding a file by name, creating a new name/file mapping, and listing a set of known names.
### File I/O operations
- Open: use a name to set up an open instance
- Read/write: we read data from and write data to a file using logical block fetches, copying data between user space and a file buffer
- Seek: change the logical offset associated with an open instance
- Map file into address space: file block buffers are just pages of physical memory, so we can map a file into the address space and page it to and from the file system
### The Virtual File System (VFS) Layer
It is the federation layer that generalizes file systems, permitting the rest of the OS to treat all file systems as the same.
It is a plug-in interface for file system implementations, where all implement the same basic methods (create, ...)
### The File Systems Layer
It is desirable to support multiple different file systems, all implemented on top of block I/O and independent of the underlying devices.
All file systems perform the same basic functions: map names, manage free space, allocate it to files, create and destroy files, get and set file attributes, and manipulate the file name space.
### Why Multiple File Systems
There may be multiple storage devices (e.g. a hard disk and a flash drive), which means we need to provide different services (performance, correctness, permissions, etc.) behind the same interface. Different file systems are used for different purposes.
### File Systems and Block I/O Devices
File Systems typically sit on a general block I/O layer.
Implementations of standard operations on each block device include asynchronous read and write (physical block #, buffer, byte count).
We also want to map logical block numbers to device addresses (e.g. <cylinder, head, sector>).
And we want to encapsulate all the particulars of device support: I/O scheduling, initiation, completion, and error handling, as well as size and alignment limitations.
### Why Device Independent Block I/O
It allows a unified LRU buffer cache for device data, which can hold frequently used data until it is needed again.
It also provides buffers for data re-blocking, adapting the file system block size to the device block size and user request sizes.
It also handles automatic buffer management (allocation, deallocation, and automatic write-back of changed buffers).
### Why do we need that cache?
File access exhibits a high degree of reference locality at multiple levels, so a common cache eliminates many disk accesses, which are slow.
## File Systems Control Structures
A **file** is a named collection of information.
Primary roles of **file systems** are to store and retrieve data as well as to manage the media/space where data is stored.
Typical operations include finding the first block of a file (or the next block, ... or the 35th block), allocating a new block to the end of a file, and freeing all blocks associated with a file.
### Finding Data on Devices
The key question is how to manage the space on your device. This management is very complex (millions of files, continuously created and destroyed, extendable, with performance effects); poor management leads to poor performance.
### On-Device File Control Structures
An on-device description of the important attributes of a file (metadata). Virtually all file systems have such data structures, with differing implementations, performance, and abilities.
These structures are a core design element of a file system and will be paired with some kind of in-memory representation of the same information.
### The In-memory Representation
There is an on-disk structure pointing to device blocks (and holding other information). When a file is opened, an in-memory structure is created.
It is not an exact copy of the device version: the device version points to device blocks, while the in-memory version points to RAM pages. It also has to keep track of which blocks have been written and which haven't.
There are some questions we have to address with this.
- what if multiple processes have the same file open?
- should they share one control structure or have one each?
- In-memory structures typically contain a cursor pointer
- etc.
### Per-Process or Not?
What if cooperating processes are working with the same file? And how can we know when all processes are finished with an open file?
These questions imply a two-level solution:
1. A structure shared by all
2. A structure shared by cooperating processes

## Basics of File System Structure
Most file systems live on block-oriented devices: volumes are divided into fixed-size blocks. Most blocks will be used to store user data.
Some will be used to store "metadata" that organizes the file system (a description of the file system, file control blocks, lists of free blocks).
All file systems have such data structures.
### The Boot Block
The 0th block of a device is usually reserved for the **boot block**, which stores the code that lets the machine boot the OS.
It is not usually under the control of a file system, and not all devices are bootable (the block is reserved just in case).
So file systems start work at block 1.
### Managing Allocated Space
A core activity for a file system, with various choices and many questions:
- what if we give each file the same amount of space? (internal fragmentation)
- what if we allocate just as much as the file needs? (external fragmentation)
- perhaps we should allocate space in "pages"
  - how many pages should a file contain?
The file control data structure determines this: it only has room for so many pointers before the file is "full".
The ultimate question: how do we want to organize the space in a file?
## Linked Extents
A simple approach to designing a file system.
The file control block contains exactly one pointer, which points to the first chunk of the file. Each chunk contains a pointer to the next chunk, allowing us to add arbitrarily many chunks to each file.
The pointers can be stored in the chunks themselves, which takes a little space away from every chunk; to find chunk N you must read the first N-1 chunks.
Or the pointers can be in an auxiliary "chunk linkage" table, for faster searches, especially if the table is kept in memory.

BIOS parameter block (BPB) tells you about the current layout.
### DOS File System Overview
DOS file systems divide space into "clusters" (a multiple of 512 bytes, fixed for each file system), numbered 1 through N.
The file control structure points to the first cluster of a file.
The **File Allocation Table (FAT)**, the linkage table in this scheme, has one entry per cluster:
- the entry contains the number of the next cluster in the file; a 0 entry means the cluster is not allocated, and a -1 entry means end of file
The file system is often called "FAT," after the name of this key data structure.

### DOS File System Characteristics
To find a particular block of a file, you get the number of the first cluster from the directory entry and follow the chain of pointers through the File Allocation Table.
The entire File Allocation Table is kept in memory, so no disk I/O is required to find a cluster. For very large files, though, the in-memory search can still be long.
The width of a FAT entry determines the maximum file system size, based on how many bits describe a cluster address (originally 8 bits, eventually expanded to 32).
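The chain-following logic can be sketched as below; the table contents and sizes are made up for illustration, and the helper name is hypothetical:

```c
#include <stdint.h>

#define FAT_FREE 0     /* cluster not allocated  */
#define FAT_EOF  (-1)  /* last cluster of a file */

/* Hypothetical in-memory FAT: fat[c] holds the number of the cluster
 * that follows cluster c in its file. */
static int32_t fat[16] = {
    [3] = 5, [5] = 7, [7] = FAT_EOF,   /* one file: clusters 3 -> 5 -> 7 */
};

/* Return the n-th cluster (0-based) of the file whose first cluster is
 * 'first', or -1 if the file has fewer than n+1 clusters.  Each step is
 * a table lookup, not a disk read, since the FAT is cached in memory --
 * but for huge files the walk can still be long. */
int32_t nth_cluster(int32_t first, int n) {
    int32_t c = first;
    while (n-- > 0) {
        if (c == FAT_EOF) return -1;
        c = fat[c];
    }
    return c == FAT_EOF ? -1 : c;
}
```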
### How Big a File Can the DOS File System Handle?
Remember, there is only one entry in the FAT per cluster.
The FAT has some maximum size, as it is kept in memory on a machine with little RAM; originally 4096 entries were allowed.
This means the original maximum file size was 4 MBytes.
## File Index Blocks
A different way to keep track of where a file's data blocks are on the device.
The file control block points to all blocks in the file, which makes it very fast to access any desired block; however, there is a limit to how many pointers a file control block can hold.
The file control block could instead point at extent descriptors (bigger than block size), but this still gives us a fixed number of extents.
### Hierarchically Structured File Index Blocks
The solution: the basic file index block points to multiple other blocks. Specifically, some pointers point to data blocks, and others point to index blocks that in turn point to further data blocks.
The inode holds 13 pointers: some for data blocks, and others for index blocks that contain further pointers.


Single, double, and triple indirect blocks are blocks that are pointed to and that contain further pointers.
### How Big a File Can Unix Handle?
The on-disk inode contains 13 block pointers: the first 10 point to the first 10 blocks of the file, the 11th points to an indirect block, the 12th to a double indirect block, and the 13th to a triple indirect block.
At maximum, a triple indirect block can support about 4 TBytes.
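The 4 TB figure can be checked with a quick calculation; the 4 KB block size and 4-byte pointer size below are assumptions (typical classic-Unix values), not stated in the slides:

```c
#include <stdint.h>

/* With 4 KB blocks and 4-byte block pointers, an indirect block holds
 * 1024 pointers, so the triple indirect tree alone reaches
 * 1024^3 blocks * 4 KB = 4 TB. */
uint64_t max_file_bytes(uint64_t block_size, uint64_t ptr_size) {
    uint64_t p = block_size / ptr_size;           /* pointers per block */
    uint64_t blocks = 10 + p + p * p + p * p * p; /* direct + 3 levels  */
    return blocks * block_size;
}
```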
### Unix Inode Performance Issues
The inode is in memory whenever the file is open, so the first ten blocks can be found with no extra I/O.
After that, we must read indirect blocks. The real pointers are in the indirect block, but sequential file processing keeps referencing it, so block I/O will keep it in the buffer cache.
That is 1-3 extra I/O operations per thousand blocks.
# Chapter 40 Interlude: Files and Directories
A **persistent-storage device**, such as a classic **hard disk drive** or a more modern **solid-state storage device**, stores information permanently.
- unlike the RAM we discussed before, which is temporary
## Files and Directories
The first key abstraction is the **file**, which is simply a linear array of bytes, each of which you can read or write.
Each file has some kind of **low-level name,** usually a number of some kind; often, the user is not aware of this name. For historical reasons, the low-level name of a file is often referred to as its **inode number (i-number)**.
In most systems, the OS does not know much about the structure of the file; its responsibility is simply to store the bytes persistently.
The second key abstraction is the **directory**, which, like a file, has an inode number, but it contains a list of (user-readable name, low-level name) pairs. Each entry in a directory refers to either a file or another directory.
By placing directories within other directories, users can build an arbitrary **directory tree / hierarchy**, under which all files and directories are stored.
- starts at the **root directory (/)** and uses a **separator** to name subsequent **sub-directories**
**Absolute pathname** is the path to a file starting at the root (e.g. */foo/bar.txt*).
Directories and files can have the same name so long as they are in different locations in the file-system tree.
File names often have two parts separated by a period, where a suffix (e.g. .c) indicates that the file's contents are C code. However, this is just a **convention**; there is no guarantee the file actually contains C code.
Ultimately, the file system structure is designed to name and identify data in an easy-to-understand way.
## The File System Interface
Will start with basics of creating, accessing, and deleting files.
### Creating Files
This can be done simply with the `open()` system call and a passed `O_CREAT` flag.
`int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, S_IRUSR|S_IWUSR);`
More specifically, `O_CREAT` creates the file if it does not exist, `O_WRONLY` opens the file for writing only, and `O_TRUNC`, if the file already exists, truncates it to a size of zero bytes, removing any existing content. `S_IRUSR|S_IWUSR` specifies permissions, here making the file readable and writable by the owner.
`open()` returns a **file descriptor**: just an integer, private per process, used in UNIX systems to access files. You use the file descriptor to read or write the file; in this way a file descriptor is a **capability** (it gives you the power to perform certain operations).
- another way to think of a file descriptor is as a pointer to an object of type file where you can use other methods such as `read()` and `write()` to access the file
Inside the OS is a **`proc structure`** which is a simple array, indexed by the file descriptor, and tracks which files are opened on a per-process basis. Each entry of the array is actually just a pointer to a `struct file`, which will be used to track information about the file being read or written.
### Reading and Writing Files
**`strace`** (**dtruss** on a Mac) is a way to see what programs are up to: it lets you trace which system calls a program makes and see their arguments and return codes.
- For example, -f follows any fork'd children too, -t reports the time of day at each call, and -e trace=open,close,read,write only traces calls to those system calls and ignores all others.
Let's analyze the `cat` Linux command.
First it calls `open()` with the `O_RDONLY` flag to open the file read-only; `O_LARGEFILE` makes sure a 64-bit offset is used. It also checks for a successful open (the return value is 3).
- the return is 3 because each process already has 3 files open: standard input (0), standard output (1), and standard error (2)
Upon success, the next system call is `read()`, which uses the file descriptor to tell the file system and OS which file to read. Its first argument is the return value of `open()`, its second argument points to a buffer that will receive the data, and its third argument is the size of the buffer. For a file that only contains "hello" the return will be 6 (5 for the letters and one for the end-of-line marker).
`write()` takes 3 parameters: the file descriptor to write to (usually 1 for standard output), the buffer to output, and the number of bytes to write (here 6, the return value of `read()`). If `cat` is not highly optimized, it may instead call the library routine `printf()` to achieve the desired output.
The `cat` program then tries to read more from the file (the next line), but since there are no bytes left, `read()` returns 0 and the program calls `close()` to finish up.
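The traced sequence can be sketched as a minimal cat-like helper; the output descriptor is a parameter here so the copy can go anywhere (real `cat` writes to fd 1), and the function name is made up:

```c
#include <fcntl.h>
#include <unistd.h>

/* Copy a file to out_fd: open it read-only, read chunks into a buffer
 * until read() returns 0 (end of file), writing each chunk out, then
 * close the descriptor. */
int cat_to(const char *path, int out_fd) {
    char buf[4096];
    ssize_t n;
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        if (write(out_fd, buf, (size_t)n) != n) { close(fd); return -1; }
    close(fd);
    return n == 0 ? 0 : -1;   /* 0 means we hit a clean end of file */
}
```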

### Reading and Writing, But Not Sequentially
**Sequential** access is when we either read a file from beginning to end or write a file out from beginning to end.
However, sometimes it is useful to read or write at a specific offset within a file.
`off_t lseek(int fildes, off_t offset, int whence);`
The first arg is a file descriptor, the second arg is the offset to a particular location within the file, and `whence` determines exactly how the seek is performed.
The current offset is updated in one of two ways. The first is when a read or write of N bytes takes place, N is added to the current offset; thus each read or write implicitly updates the offset. The second is explicitly with lseek, which changes the offset as specified above.
The offset is kept in the `struct file`.
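A small sketch of using `lseek()` to read at an arbitrary offset (the helper name is made up; POSIX `pread()` does the same job without moving the shared offset):

```c
#include <fcntl.h>
#include <unistd.h>

/* Read up to 'len' bytes starting at absolute offset 'off'.  lseek()
 * only updates the offset stored in the open-file-table entry; the
 * following read() then starts from that position. */
ssize_t read_at(int fd, off_t off, void *buf, size_t len) {
    if (lseek(fd, off, SEEK_SET) < 0)   /* SEEK_SET: offset from start */
        return -1;
    return read(fd, buf, len);
}
```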

`struct file` and `struct proc` are sometimes referred to as the **open file table**. The kernel just keeps these as an array, with one lock for the entire table.
Here is how the offset would operate:

Here is another example where you can see the open file table working independently for two processes, each with its own offset.

`lseek()` by design repositions the current offset before reading.

### Shared File Table Entries: `fork()` and `dup()`
In many cases the mapping of a file descriptor to an entry in the open file table is one-to-one. Even if some other process reads the same file at the same time, each process has its own entry in the open file table; each logical reading or writing of a file is independent, and each has its own current offset while it accesses the file.
There are cases, though, where an open file table entry is **shared**:
- when a parent process creates a child process with `fork()`, each keeps its own private descriptor array, but the two share the open file table entry and the reference from it to the underlying file-system inode
	- a **reference count** tracks this sharing; only when both processes have closed the file is the entry removed
	- useful when you want processes to write to the same output file without any extra coordination
- `dup()` (and `dup2()`, `dup3()`) is another case: it allows a process to create a new file descriptor that refers to the same underlying open file as an existing descriptor
	- useful for the UNIX shell when performing operations like output redirection
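The redirection case can be sketched as below ("echo hello > out.txt" style): after `fork()`, the child points fd 1 at a file with `dup2()` and then execs the command. The output file name and helper name are made up:

```c
#include <fcntl.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run "echo hello" with standard output redirected into 'outfile'.
 * Returns the child's exit status, or -1 on abnormal exit. */
int run_redirected(const char *outfile) {
    pid_t pid = fork();
    if (pid == 0) {                              /* child */
        int fd = open(outfile, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) _exit(127);
        dup2(fd, STDOUT_FILENO);   /* fd 1 now shares fd's file entry */
        close(fd);
        execlp("echo", "echo", "hello", (char *)NULL);
        _exit(127);                              /* exec failed */
    }
    int status;
    waitpid(pid, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```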
### Writing Immediately With `fsync()`
`write()` in more detail simply tells the file system: please write this data to persistent storage at some point in the future. For performance, the file system buffers writes in memory for some time, and they are later issued to the storage device.
- this is risky: in that window the machine can crash before the data gets stored on disk
Some applications, like a **database management system (DBMS)**, need a correct recovery protocol and therefore require the ability to force writes to disk from time to time.
`fsync()` is called on a particular file descriptor, and the file system responds by forcing all dirty data to disk for the file referred to by that file descriptor.

In some cases, you also need to fsync() the directory that contains the file foo. Adding this step ensures not only that the file itself is on disk, but that the file, if newly created, also is durably a part of the directory. Not surprisingly, this type of detail is often overlooked, leading to many application-level bugs.
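The pattern just described can be sketched as below; the path names and helper name are illustrative, and the directory fsync matters only for newly created files:

```c
#include <fcntl.h>
#include <unistd.h>

/* Write data and force it to persistent storage: fsync() the file,
 * then fsync() the containing directory so that the new directory
 * entry is durable too. */
int write_durably(const char *dir, const char *path,
                  const void *buf, size_t len) {
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0) return -1;
    int ok = write(fd, buf, len) == (ssize_t)len && fsync(fd) == 0;
    close(fd);
    if (!ok) return -1;
    int dfd = open(dir, O_RDONLY);   /* fsync the directory as well */
    if (dfd < 0) return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```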
### Renaming Files
Once we have a file, it is sometimes useful to give it a different name (e.g. `prompt> mv foo bar`).
`strace` reveals the `rename(char *old, char *new)` system call, whose arguments are self-explanatory. It is implemented as an **atomic** call with respect to system crashes: it swaps the new file into place while concurrently deleting the old version of the file.
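That atomicity enables the classic safe-update pattern: write the new contents to a temporary file, then rename it over the old name, so readers see either the old version or the new one, never a half-written mix. A sketch, with illustrative file names:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Atomically replace the contents of 'path' via a temporary file. */
int replace_file(const char *path, const char *tmp,
                 const void *buf, size_t len) {
    int fd = open(tmp, O_CREAT | O_WRONLY | O_TRUNC, 0600);
    if (fd < 0) return -1;
    int ok = write(fd, buf, len) == (ssize_t)len && fsync(fd) == 0;
    close(fd);
    return ok ? rename(tmp, path) : -1;   /* the atomic swap */
}
```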
### Getting Information About Files
We want to keep additional information about each file in the file system; this is known as **metadata**.
To see the metadata for a certain file, we can use the `stat()` or `fstat()` system calls.
```c=
struct stat {
    dev_t     st_dev;     // ID of device containing file
    ino_t     st_ino;     // inode number
    mode_t    st_mode;    // protection
    nlink_t   st_nlink;   // number of hard links
    uid_t     st_uid;     // user ID of owner
    gid_t     st_gid;     // group ID of owner
    dev_t     st_rdev;    // device ID (if special file)
    off_t     st_size;    // total size, in bytes
    blksize_t st_blksize; // blocksize for filesystem I/O
    blkcnt_t  st_blocks;  // number of blocks allocated
    time_t    st_atime;   // time of last access
    time_t    st_mtime;   // time of last modification
    time_t    st_ctime;   // time of last status change
};
```

Each file system usually keeps this type of information in a structure called an **inode**. All inodes reside on disk; copies of active ones are usually cached in memory to speed up access.
### Making Directories
Note that you can never write to a directory directly, because the format of the directory is considered file system metadata; the file system considers itself responsible for the integrity of directory data.
The `mkdir()` system call is used to create a directory, which starts with 2 entries: (.) refers to itself and (..) refers to the parent.
### Reading Directories
Let's take a look at the `ls` command. This program uses 3 calls, `opendir()`, `readdir()`, and `closedir()`, and the interface is just a simple loop that reads one directory entry at a time and prints out the name and inode number of each file in the directory.
The declaration below shows the information available in each directory entry, the `struct dirent` data structure:
```c=
struct dirent {
    char           d_name[256]; // filename
    ino_t          d_ino;       // inode number
    off_t          d_off;       // offset to the next dirent
    unsigned short d_reclen;    // length of this record
    unsigned char  d_type;      // type of file
};
```
When you pass `-l` to `ls`, it uses the `stat()` information rather than this lighter version.
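The `ls` loop from the text can be sketched as below; it also returns the entry count so the behavior is checkable (the function name is made up):

```c
#include <dirent.h>
#include <stdio.h>

/* Open the directory, read one struct dirent at a time, and print the
 * inode number and name of each entry.  Returns the entry count, or
 * -1 on error; every directory holds at least "." and "..". */
int list_dir(const char *path) {
    DIR *dp = opendir(path);
    if (dp == NULL) return -1;
    struct dirent *d;
    int n = 0;
    while ((d = readdir(dp)) != NULL) {
        printf("%llu %s\n", (unsigned long long)d->d_ino, d->d_name);
        n++;
    }
    closedir(dp);
    return n;
}
```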
### Deleting Directories
This is done with the system call `rmdir()`, which requires that the directory be empty, to avoid unintended data loss.
### Hard Links
This is the key to understanding file removal: first understand the `link()` system call. It takes 2 args, an old pathname and a new one; when you link a new file name to an old one, you essentially create another way to refer to the same file. This is done with the `ln` command.
- the new name has the same inode number as the original file
- this is not copying, just pointer logic
```bash=
prompt> ls -i file file2
67158084 file
67158084 file2
```
When you create a file, you are really doing two things: making a structure (the inode) that tracks the file's information, and **linking** a human-readable name to that file, putting that link in a directory.
### Removing Files
Consider the UNIX command `rm`:
```bash=
prompt> strace rm foo
...
unlink("foo") = 0
...
```
`unlink()` takes the name of the file to be removed and returns zero upon success. As noted under hard links, the human-readable name is just a link to the metadata, and calling `unlink()` simply removes that link.
This also decrements the **reference count / link count**, which is how the file system tracks the number of links to a particular inode. Only when it reaches 0 do the inode and all the file's data truly disappear.
### Symbolic Links
Hard links are somewhat limited: you can't create one to a directory (for fear that you will create a cycle in the directory tree), and you can't hard link to files in other disk partitions (because inode numbers are only unique within a particular file system, not across file systems).
- thus **symbolic / soft links** were created (`ln -s`)
The first major difference is that a symbolic link is actually a file itself, of a different type (a symlink).
Another difference is the danger of a **dangling reference**: removing the original file leaves the link pointing to a pathname that no longer exists.
## Permission Bits and Access Control Lists
The abstraction for files is noticeably different from that of the CPU and memory, in that files are commonly shared among different users and processes and are not always private.
**Permission bits** are used to define specific privacy rules for files (e.g. **-rw-r--r--** 1 remzi wheel 0 Aug 24 16:29 foo.txt).
'-' indicates a regular file (d for directory, l for symlink). The next 9 characters are the permission bits, in three groupings: what the **owner** of the file can do to it, what someone in the file's **group** can do to it, and finally what **anyone / other** can do to it.
- (rw-) shows both readable and writable by the owner
- (r--) only readable by the members of the group `wheel`
- (r--) anyone / other in the system can only read it
The `chmod` command is used to change these permissions. There is also the execute bit, which determines whether the file may be executed as a program.
Some file systems, such as the distributed file system known as AFS, include more sophisticated controls using an **access control list (ACL)** per directory, which controls who is given a resource. In a file system this can create a very specific list of who can and cannot read a set of files.
- list can be given with `fs listacl <file>` command
- will list both system admins and users permissions
- to allow someone to access the directory use `fs setacl private/ andrea rl`
### Making and Mounting a File System
To make a file system, most file systems provide a tool, usually referred to as `mkfs`, that performs this task.
The idea is as follows: give the tool, as input, a device (such as a disk partition, e.g., /dev/sda1) and a file system type (e.g., ext3), and it simply writes an empty file system, starting with a root directory, onto that disk partition. And mkfs said, let there be a file system!
`mount()` system call is used to take an existing directory as a target **mount point** and essentially paste a new file system onto the directory tree at that point.
e.g.
- `mount -t ext3 /dev/sda1 /home/users`
- mounts /dev/sda1 at /home/users
With this, instead of having a number of separate file systems, mount unifies all file systems into one tree, making naming uniform and convenient.
A whole number of different file systems, including **ext3** (a standard disk-based file system), the **proc file system** (a file system for accessing information about current processes), tmpfs (a file system just for temporary files), and **AFS** (a distributed file system), can all be glued together onto one machine's file-system tree.
# File-System Implementation
This chapter introduces **vsfs (the Very Simple File System)** to cover basic on-disk structures, access methods, and various policies that you will find in many file systems today.
Note that the file system is pure software.
To understand how to think about file systems, we should first understand their **data structures**, which organize the file system's data and metadata (mostly arrays of blocks or other objects).
We will also try to understand the **access methods**, such as open(), read(), write(), etc.
## Overall Organization
The first thing we'll need to do is divide the disk into blocks. Let's call the region of the disk we use for user data the **data region**.
We will also need to store **metadata**, which goes in a structure known as the **inode**; we therefore need a portion of the disk for the **inode table**, a simple array of on-disk inodes.
We also need some way to track free or allocated space (**allocation structures**), for which we use a **bitmap**: one for the data region (the **data bitmap**) and one for the inode table (the **inode bitmap**).
One remaining block is reserved for the **superblock**, which contains information about this particular file system, including how many inodes and data blocks are in the file system, where the inode table begins, and so forth.

### File Organization: The Inode
**Inode** is short for **index node**: inodes were originally arranged in an array, and the index was used when accessing a particular one.
Each inode has an **i-number**, which is the low-level name of a file. We can use this i-number to calculate where on the disk the corresponding inode is located:

$blk = (inumber * sizeof(inode\_t)) / blockSize$
$sector = ((blk * blockSize) + inodeStartAddr) / sectorSize$
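The two formulas, directly in code; the inode size, block size, sector size, and table start below are illustrative values, not fixed by the text:

```c
#include <stdint.h>

enum { INODE_SZ = 256, BLOCK_SZ = 4096, SECTOR_SZ = 512,
       INODE_START = 12 * 1024 };   /* byte offset of the inode table */

/* Which disk sector holds inode 'inumber'?  First find which block of
 * the inode table contains it, then convert that byte address into a
 * sector number. */
uint64_t inode_sector(uint64_t inumber) {
    uint64_t blk = (inumber * INODE_SZ) / BLOCK_SZ;
    return ((blk * BLOCK_SZ) + INODE_START) / SECTOR_SZ;
}
```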
Once the **sector** is found, the inode holds virtually all the information you need about a file (**metadata**):
- type (directory, regular file, ...)
- size
- the number of blocks allocated to it
- protection information
- some time information, such as when it was created, modified, or last accessed
- where its data blocks reside on disk (pointers)
One of the most important decisions in the design of the inode is how it refers to data blocks; the simplest scheme is one or more **direct pointers** (disk addresses), each pointing to one disk block that belongs to the file.
- this approach is limited: a file cannot grow bigger than the block size multiplied by the number of direct pointers in the inode
#### The Multi-Level Index
To support bigger files, you introduce **indirect pointers**, which point to a block that contains more pointers, each of which points to user data.
This solves the problem of inodes having a fixed number of direct pointers: we simply make one of them an indirect pointer.
If a file grows too big, an indirect block is allocated and the indirect pointer is set to point at it.
As the file grows further, you may need **double or triple indirect pointers**, where multiple levels of indirect blocks are allocated.
This imbalanced tree is referred to as the **multi-level index** approach.
It optimizes for the most common case, *small* files, while still being prepared to handle the occasional large one.

### Directory Organization
A directory is simply a list of (entry name, inode number) pairs. For each file or directory in a given directory, there is a string and a number in the data blocks of the directory.

Deleting a file (e.g., calling unlink()) can leave an empty space in the middle of the directory, and hence there should be some way to mark that as well (e.g., with a reserved inode number such as zero). Such a delete is one reason the record length is used: a new entry may reuse an old, bigger entry and thus have extra space within.
A directory is like a special file: it has an inode marked as "directory" and data blocks pointed to by that inode.
### Free Space Management
A file system must track which inodes and data blocks are free and which are not; **free space management** accomplishes this with 2 simple bitmaps.
When we create a file, we allocate an inode: the file system searches the bitmap for a free inode, allocates it to the file, marks it as used, and eventually updates the on-disk bitmap with the correct information.
Some file systems also use a **pre-allocation** policy to find a sequence of free blocks, thereby guaranteeing that a portion of the file will be contiguous on disk.
### Access Paths: reading and Writing
#### Reading a File From Disk

Consider `open("/foo/bar")`. The traversal starts at the root inode, reads the root directory's data to find foo's inode number, reads foo's inode, reads foo's data to find bar's inode number, and finally reaches bar's inode, the reference we want for subsequent operations.
The final step of `open()` is to read bar's inode into memory; the FS does a final permissions check, allocates a file descriptor, and returns it. Each subsequent `read()` consults the inode to find the right block, reads it into memory, and updates the file offset for the next read, until the reads finish.
To close the file, the file descriptor is simply deallocated; that is all closing really needs to do, and no disk I/O takes place.
Note that the amount of I/O generated by the open is proportional to the length of the pathname.
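The path walk can be sketched as a toy `namei()`-style loop (hypothetical inode numbering, with inode 2 as the root by convention; the read counter illustrates why open's I/O grows with path length):

```python
ROOT_INUM = 2
inodes = {
    2: {"type": "dir",  "data": {"foo": 3}},    # root directory
    3: {"type": "dir",  "data": {"bar": 4}},    # /foo
    4: {"type": "file", "data": b"hello"},      # /foo/bar
}

def namei(path):
    """Resolve a path one component at a time: read an inode, read its
    directory data, find the next component's inode number, repeat.
    Returns (inode number, number of simulated disk reads)."""
    inum = ROOT_INUM
    reads = 1                                   # read the root inode
    for part in path.strip("/").split("/"):
        directory = inodes[inum]["data"]        # read directory data block(s)
        inum = directory[part]                  # find next inode number
        reads += 2                              # one data read + one inode read
    return inum, reads
```

Each extra path component costs at least two more reads, matching the observation that open's I/O is proportional to the pathname's length.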
#### Writing a File To Disk
Unlike reading, writing to a file may also **allocate** a block (unless an existing block is being overwritten), so the file system must decide which block to allocate to the file and update the other on-disk structures accordingly (the data bitmap and the inode).

The amount of write traffic is even worse when one considers a simple and common operation such as file creation. To create a file, the file system must not only allocate an inode, but also allocate space within the directory containing the new file.
With this much I/O, you have to wonder how the system can accomplish any of it with reasonable efficiency.
### Caching and Buffering
To remedy the expensive read and writes, most file systems use **system memory (DRAM)** to cache important blocks.
Early file systems introduced a **fixed-size cache** to hold popular blocks (usually around 10% of total memory), using **LRU** to decide which blocks to keep.
- this is **static partitioning**, which can be wasteful when the file system does not need 10% of memory at a given point in time
Modern systems implement **dynamic partitioning**, integrating file system pages with virtual memory pages in a **unified page cache** that allocates memory flexibly between the two.
Caching lets reads hit in memory and avoid I/O entirely. It does not serve the same purpose for writes, which must eventually reach the disk.
Therefore we use **write buffering**: by delaying writes, the file system can batch several updates into a smaller set of I/Os and schedule those I/Os more effectively, which increases performance. Additionally, if a write is buffered and the file is then deleted, the write never happens, avoiding that I/O entirely.
- laziness is a virtue
This comes with a tradeoff: if the system crashes before buffered updates reach the disk, those writes are lost, even though batching improves performance. Some applications (e.g. databases) do not like this trade-off; they can:
- call `fsync()` to force writes to disk, or avoid the file system altogether (e.g. by using the raw disk).
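The effect of write buffering can be sketched with a toy model (not any real file system's code): dirty blocks sit in memory, a delete cancels pending writes, and a flush issues the batch.

```python
class WriteBuffer:
    """Toy sketch of write buffering: writes are absorbed in memory,
    deleting a file before the flush cancels its pending I/O entirely."""
    def __init__(self):
        self.dirty = {}                      # (file, block#) -> data

    def write(self, file, blockno, data):
        self.dirty[(file, blockno)] = data   # buffered; no disk I/O yet

    def delete(self, file):
        # drop all pending writes for the file -- that I/O never happens
        self.dirty = {k: v for k, v in self.dirty.items() if k[0] != file}

    def flush(self):
        """Like fsync(): issue the batched I/Os; returns how many."""
        issued = len(self.dirty)
        self.dirty.clear()
        return issued
```

Writing a block twice before the flush also costs only one I/O, since the second write just overwrites the buffered copy.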
# File Types and Attributes
Files are most commonly discussed as byte streams or one-dimensional arrays, but many of those byte streams contain highly structured data (e.g. a load module, an MPEG-4 video, or a database).
Some sources of data (e.g. directories and key-value stores) while serializable are not intended to be processed as byte streams.
## Ordinary Files
Even the simplest files are understood by imposing structure and semantics on a binary byte stream:
- A **text file** is a byte stream, but when we process it we generally break it into lines (delimited by \n or \r\n), and render it as characters (e.g. according to the ASCII or ISO 8859 character set).
- An **archive (e.g. zip or tar)** is a single file that contains many others. It is an alternating sequence of headers (that describe the next file in the archive) and data blobs (that are the contents of the described file).
- A **load module** is similar to an archive (in that it is an alternating sequence of section headers and contents) but the different sections represent different parts of a program (code, initialized data, symbol table, ...).
- an **MPEG stream** is a sequence of audio and video frames, containing compressed (e.g. Discrete Cosine Transform) program information, which requires considerable processing in order to reconstruct the encoded program.
An ordinary file is just a blob of ones and zeroes. They only have meaning when rendered by a program that understands the underlying data format.
### Data Types and Associated Applications
If a file can only be interpreted by a program that understands the meaning that has been encoded in the byte stream, our first problem is finding the right program to interpret each file (or byte stream). There are a few general approaches:
- Require the user to specifically invoke the correct command to process the data. This is common in Unix-derived systems:
- to edit a file you type vi filename
- to compile a program you type gcc filename
- Consult a registry that associates a program with a file type:
- there may be a system-wide registry that associates programs with file types (e.g. Windows).
- there may be a program-specific registry that associates programs with file types (e.g. configuring browser plug-ins).
- the owning program may be an attribute of the file (e.g. classic Mac OS).
A registry solution presupposes that we know what the type of the file is. There are a few approaches to **classing** files:
- The simplest approach is based on file name suffix (e.g. .c, .png, .txt). This may be simply an organizing convention (e.g. in Unix-derived systems) or a hard-rule (in Windows it may be impossible to process a file that has been renamed to the wrong suffix).
- Another common approach (very popular with Unix-derived systems) is a magic number at the start of the file. Each type of file (or even file system) begins with a reserved and registered magic number that identifies the file's type. For more information on this approach, look at the file(1) command.
- In systems that support extended attributes the file type can be an attribute of the file, not dependent on a suffix or magic-number registry.
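The magic-number approach can be sketched with a few well-known signatures (this handful of magics is real, but the table is a tiny fraction of what file(1) actually knows):

```python
# Leading bytes that identify common file types.
MAGIC = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\x7fELF":           "ELF executable",
    b"PK\x03\x04":        "ZIP archive",
    b"%PDF":              "PDF document",
}

def classify(data: bytes) -> str:
    """Classify a byte stream by its leading magic number,
    the way file(1) does (greatly simplified)."""
    for magic, kind in MAGIC.items():
        if data.startswith(magic):
            return kind
    return "unknown"
```

Because the type is read from the contents rather than the name, renaming the file cannot fool this check, unlike suffix-based classing.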
### File Structure and Operations
Even ASCII text byte streams have structure, and some streams (e.g. an MPEG-4 video) have a very rich structure. In these cases the file can be viewed as a serialized representation of data that is intended to be viewed by a particular program (e.g. a text editor or video player). It is meaningful to talk about doing read(2) and write(2) operations on these files. But not all files are intended to be accessed via these operations. In some cases, however, the structure of the data is not merely an implementation decision, but fundamental to the manner in which the data is intended to be used:
- the earliest databases were indexed sequential files. These were organized into records, each with a unique index key. While these files could be processed sequentially (one record at a time), it was more common to get and put records based on their keys.
- these evolved into Relational databases, accessed via Structured Query Languages.
- the complexity (and non-scalability) of SQL databases gave rise to much simpler (and more scalable) Key-Value Stores accessed only by get, put, and delete operations.
While all of these wind up being implemented (usually by some middle-ware layer) on top of byte streams, that is certainly not how they are accessed by their clients.
## Other types of files
It can be argued that all of the above types of ordinary files are still just blobs of data, differing in their binary representations and the operations in terms of which they are written and read back. But there are other types of files that are not merely blobs of data, to be written and re-read.
### Directories
Directories do not contain blobs of client data. Rather they represent name-spaces, the association of names with blobs of data. They are, in this respect, somewhat similar to **key-value stores**, but the namespace is much more highly structured, and the operations are quite different.
One very important difference is that a key-value store, as a single file, typically contains data owned by a single user. The namespace implemented by directories includes files owned by numerous users, each of whom wants to impose different sharing/privacy constraints on the access to each referenced file.
Because of how important directories are to the system operation and integrity, all directory operations tend to be implemented within the **operating system**. To ensure the correctness and security of the directory structure there are only a few supported update operations (e.g. mkdir(2), rmdir(2), link(2), unlink(2), create(2) and open(2)), each of which involves considerable permission checking and integrity assurance.
But despite these differences, directories exist in the same namespace as files, and have the same notions of user/group ownership and file protection. They can even be accessed (if you know what you are doing) with open(2) and read(2).
### Inter-Process Communications Ports
An **inter-process communications port** (e.g. a pipe) is not so much a container in which data is stored, as it is a channel through which data is passed. That data is exchanged via write(2) and read(2) system calls on file descriptors that can be manipulated with the dup(2) and close(2) operations. And in the case of named pipes they can be accessed via the open(2) system call.
With directories we saw something that was implemented as on-disk byte streams, but accessed with very different operations. With inter-process communications ports, we have something with a very different implementation, but that is accessed (as a byte stream) with normal file I/O operations.
### I/O Devices
I/O devices connect a computer to the outside world. Many operating systems put devices in the file namespace. In Unix/Linux systems the special file associated with a device can be anywhere and have any name.
Many sequential access devices (e.g. keyboards and printers) are fit naturally into the byte-stream read(2)/write(2) model. Other random access devices (e.g. disks) easily fit into a read(2)/write(2)/seek(2) access model.
Communications interfaces often behave like byte streams (or perhaps message streams) that also support additional control functions (like controlling line speed, setting MAC address, etc), which we can handle with ioctl(2) operations.
But not all I/O devices can be fit into a byte-stream access model. Consider a 3D rendering engine comprised of a few thousand GPUs. Rather than deal with this as a byte stream, we simply map gigabytes of display memory and control registers into our address space and manipulate them directly.
With more primitive devices, we may choose to deal directly with digital and analog signals that sample the state of the external world and control external actuators. Beyond the fact that we used the open(2) system call to get access to them, these devices may bear no relation to persistent byte streams.
## File Attributes
Most of the above have treated files as containers (or access portals) for data. But even a file of ASCII text (e.g. this HTML) is more than the bytes it contains. In addition to data, files also have **meta-data** ... data that describes data.
### System Attributes
Unix/Linux files all have a standard set of attributes:
- type: regular, directory, pipe, device, symbolic link, ...
- ownership: identity of the owning user and owning group
- protection: permitted access (read, write, execute), by the owner, the owning group, and others
- when the file was created, last updated, last accessed
- size (for regular files) ... the number of bytes in the file (which may be sparse).
Other operating systems may support more or different attributes. What is important about this list is:
- all files have these attributes
- the operating system depends on these attributes (e.g. to correctly implement access control)
- the operating system maintains these attributes
### Extended Attributes
There may be other information (beyond the basic system attributes) that is vitally important to correct file processing. Examples might be:
- if the file has been encrypted or compressed, by what algorithm(s)?
- if a file has been signed, what is the associated certificate?
- if a file has been check-summed, what is the correct check-sum?
- if a program has been internationalized, where are its localizations?
All of the above examples are meta-data. They are not part of the file's contents ... but rather descriptive information that may be necessary to properly process the file's contents. There have been two basic approaches taken to supporting extended attributes:
- associate a limited number/size of name=value attributes with each file.
- pair each file with one or more shadow files (sometimes called resource forks) that contain additional resources and information.
The operating system may even need to make use of extended attributes. If we were required to extend Unix/Linux owner/group/other protection to support generalized Access Control Lists, we might choose to store the additional information in (protected) extended attributes.
## Diversity of Semantics
The Posix operations for file and directory operations are standardized and relatively universal (although some file systems do not support some types of links). The Posix standards even include a readdir(3) operation for file-system and operating system independent scanning of directories. It is not difficult to write portable software that operates on ordinary files and directories.
The operations on and behavior of other types of files (e.g. inter-process communications ports and devices) are not at all standardized.
While the need for extended attributes is widely recognized, there are many different implementations, and not (yet) any widely accepted standard.
# An Introduction to DOS FAT Volume and File Structure
When the first personal computers with disks became available, they were very small (a few megabytes of disk and a few dozen kilobytes of memory). A file system implementation for such machines had to impose very little overhead on disk space, and be small enough to fit in the BIOS ROM. BIOS stands for BASIC I/O Subsystem.
Note that the first word is all upper-case. The purpose of the BIOS ROM was to provide run-time support for a BASIC interpreter (which is what Bill Gates did for a living before building DOS). DOS was never intended to provide the features and performance of real timesharing systems.
Disk and memory sizes have increased enormously in the last thirty years, and people now demand state-of-the-art power and functionality from their PCs. Despite the evolution the last decades have seen, old standards die hard. Much as European train tracks maintain the same wheel spacing used by Roman chariots, most modern OSs still support DOS FAT file systems. Nor are DOS file systems around merely for legacy reasons: the ISO 9660 CDROM file system format is a descendent of the DOS file system.
The DOS FAT file system is worth studying because:
- It is heavily used all over the world, and is the basis for more modern file systems (like ISO 9660).
- It provides reasonable performance (large transfers and well clustered allocation) with a very simple implementation.
- It is a very successful example of "linked list" space allocation.
## Structural Overview
All file systems include a few basic types of data structures:
- **bootstrap:** code to be loaded into memory and executed when the computer is powered on. MVS volumes reserve the entire first track of the first cylinder for the boot strap.
- **volume descriptors:** information describing the size, type, and layout of the file system ... and in particular how to find the other key meta-data descriptors.
- **file descriptors:** information that describes a file (ownership, protection, time of last update, etc.) and points where the actual data is stored on the disk.
- **free space descriptors:** lists of blocks of (currently) unused space that can be allocated to files.
- **file name descriptors:** data structures that associate user-chosen names with each file.
DOS FAT file systems divide the volume into fixed-sized (physical) blocks, which are grouped into larger fixed-sized (logical) block clusters.
The first block of a DOS FAT volume contains the bootstrap, along with some volume description information. After this comes a much longer File Allocation Table (the FAT from which the file system takes its name). The File Allocation Table serves both as a free list and as a record of which blocks have been allocated to which files.
The remainder of the volume is data clusters, which can be allocated to files and directories. The first file on the volume is the root directory, the top of the tree from which all other files and directories on the volume can be reached.

## Boot block BIOS Parameter Block and FDISK Table
Most file systems separate the first block (pure bootstrap code) from volume description information. DOS file systems often combine these into a single block. The format varies between (partitioned) hard disks and (unpartitioned) floppies, and between various releases of DOS and Windows ... but conceptually, the boot record:
- begins with a branch instruction (to the start of the real bootstrap code).
- followed by a volume description (BIOS Parameter Block)
- followed by the real bootstrap code
- followed by an optional disk partitioning table
- followed by a signature (for error checking).

### BIOS Parameter Block
After the first few bytes of the bootstrap comes the BIOS parameter block, which contains a brief summary of the device and file system. It describes the device geometry:
- number of bytes per (physical) sector
- number of sectors per track
- number of tracks per cylinder
- total number of sectors on the volume
It also describes the way the file system is laid out on the volume:
- number of sectors per (logical) cluster
- the number of reserved sectors (not part of file system)
- the number of Alternate File Allocation Tables
- the number of entries in the root directory
These parameters enable the OS to interpret the remainder of the file system.
### FDISK Table
As disks got larger, the people at Microsoft figured out that their customers might want to put multiple file systems on each disk. This meant they needed some way of partitioning the disk into logical sub-disks. To do this, they added a small partition table (sometimes called the FDISK table, because of the program that managed it) to the end of the boot strap block.
This FDISK table has four entries, each capable of describing one disk partition. Each entry includes
- A partition type (e.g. Primary DOS partition, UNIX partition).
- An ACTIVE indication (is this the one we boot from).
- The disk address where that partition starts and ends.
- The number of sectors contained within that partition.

The addition of disk partitioning also changed the structure of the boot record. The first sector of a disk contains the **Master Boot Record (MBR)** which includes the FDISK table, and a bootstrap that finds the active partition, and reads in its first sector (**Partition Boot Record**).
Most people (essentially everyone but Bill Gates :-) make their MBR bootstrap ask what system you want to boot from, and boot the active one by default after a few seconds. This gives you the opportunity to choose which OS you want to boot. Microsoft makes this decision for you ... you want to boot Windows.
The structure of the Partition Boot Record is entirely operating system and file system specific ... but for DOS FAT file system partitions, it includes a BIOS Parameter block as described above.
## File Descriptors (directories)
In keeping with their desire for simplicity, DOS file systems combine both file description and file naming into a single file descriptor (directory entries). A DOS directory is a file (of a special type) that contains a series of fixed sized (32 byte) directory entries. Each entry describes a single file:
- an 11-byte name (8 characters of base name, plus a 3 character extension).
- a byte of attribute bits for the file, which include:
- Is this a file, or a sub-directory.
- Has this file changed since the last backup.
- Is this file hidden.
- Is this file read-only.
- Is this a system file.
- Does this entry describe a volume label.
- times and dates of creation and last modification, and date of last access.
- a pointer to the first logical block of the file. (This field is only 16 bits wide, and so when Microsoft introduced the FAT32 file system, they had to put the high order bits in a different part of the directory entry).
- the length (number of valid data bytes) in the file.

If the first character of a file's name is NULL (0x00), the directory entry is unused. The special character 0xE5 in the first character of a file name indicates that the entry describes a deleted file.
DOS stores file modification times and dates as a pair of 16-bit numbers:
- 7 bits of year, 4 bits of month, 5 bits of day of month, 5 bits of hour, 6 bits of minute, 5 bits of seconds (x2).
A seven-bit field for years means that the DOS calendar only runs until 2107. Hopefully, nobody will still be using DOS file systems by then :-)
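Unpacking the two 16-bit stamps is straightforward bit manipulation. This follows the field widths listed above (year since 1980 in the top 7 bits of the date, seconds stored in 2-second units at the bottom of the time):

```python
def decode_dos_datetime(date16, time16):
    """Unpack DOS's pair of 16-bit stamps:
    date16 = 7 bits year (since 1980) | 4 bits month | 5 bits day
    time16 = 5 bits hour | 6 bits minute | 5 bits seconds/2"""
    year   = 1980 + (date16 >> 9)       # 7-bit offset from 1980
    month  = (date16 >> 5) & 0x0F
    day    = date16 & 0x1F
    hour   = time16 >> 11
    minute = (time16 >> 5) & 0x3F
    second = (time16 & 0x1F) * 2        # only even seconds are representable
    return (year, month, day, hour, minute, second)

# The largest encodable year is 1980 + 127 = 2107.
```

Note the two-second granularity: DOS can never record an odd number of seconds.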
## Links and Free Space (File Allocation Table)
Many file systems have very compact (e.g. bitmap) free lists, but most use some per-file data structure to keep track of which blocks are allocated to which file. The DOS File Allocation Table is an unusual design: it contains one entry for each logical block in the volume.
If a block is free, this is indicated by the FAT entry. If a block is allocated to a file, the FAT entry gives the logical block number of the next logical block in the file.
### Cluster Size and Performance
Space is allocated to files not in (physical) blocks but in (logical) multi-block clusters. The number of blocks per cluster is determined when the file system is created.
Allocating space to files in larger chunks improves I/O performance, by reducing the number of operations required to read or write a file. This comes at the cost of higher internal fragmentation (since, on average, half of the last cluster of each file is left unused).
- As disks have grown larger, people have become less concerned about internal fragmentation losses, and cluster sizes have increased.
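The internal-fragmentation cost is easy to quantify (a 512-byte sector size is assumed here for illustration):

```python
SECTOR = 512   # assumed physical sector size in bytes

def wasted_bytes(file_size, sectors_per_cluster):
    """Internal fragmentation: bytes allocated to the file's last
    cluster but never used by the file's data."""
    cluster = SECTOR * sectors_per_cluster
    allocated = -(-file_size // cluster) * cluster   # round up to whole clusters
    return allocated - file_size
```

A 1000-byte file wastes 24 bytes with 1-sector clusters but 3096 bytes with 8-sector (4 KB) clusters, which is the trade-off against fewer, larger I/Os described above.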
The maximum number of clusters a volume can support depends on the width of the FAT entries. Microsoft compromised and adopted 12-bit wide FAT entries (two entries in three bytes). These were called **FAT-12 file systems**. As disks got larger, they created 2-byte wide (FAT-16) and 4-byte wide (FAT-32) file systems.
### Next Block Pointers
A file's directory entry contains a pointer to the first cluster of that file. The File Allocation Table entry for that cluster gives the cluster number of the next cluster in the file. When we finally reach the last cluster of the file, its FAT entry contains an end-of-chain marker (-1), indicating that there is no next cluster.
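Following a file's chain through the FAT can be sketched with a toy table (a Python dict stands in for the on-disk array; -1 models the end-of-chain marker, which on a real volume is a reserved entry value):

```python
FREE = 0    # this cluster is unallocated
EOF  = -1   # end-of-chain marker (a reserved FAT value on a real volume)

# Toy FAT: a file starting at cluster 2 occupies clusters 2 -> 5 -> 3.
fat = {2: 5, 5: 3, 3: EOF, 4: FREE, 6: FREE}

def file_clusters(first_cluster):
    """Follow next-cluster links from the directory entry's
    first-cluster pointer until the end-of-chain marker."""
    chain = []
    cluster = first_cluster
    while cluster != EOF:
        chain.append(cluster)
        cluster = fat[cluster]
    return chain
```

Note that reaching byte N of a file requires walking the chain from the start, which is why FAT handles sequential access much better than random access.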
