Reading notes
Concepts
Raw I/O: issued directly to disk offsets, bypassing the file system altogether. Used especially by databases that can manage and cache their own data better than the file system cache.
Direct I/O: use file system but bypass the file system cache. Used by applications that perform file system backups, to avoid polluting the file system cache with data that will be read only once.
Memory-mapped files: map files into the process address space and access memory offsets directly. This avoids the syscall execution and context switch overheads incurred when calling the read() and write() syscalls to access file data. It can also avoid double copying of data, if the kernel supports direct copying of the file data buffer to the process address space. A disadvantage of mappings on multiprocessor systems can be the overhead of keeping each CPU's MMU in sync, specifically the CPU cross calls to remove mappings (TLB shootdowns).
Logical metadata: file statistics, timestamp updates, etc.
Physical metadata: on-disk layout metadata necessary to record all file system information — superblocks, inodes, blocks of data pointers and free lists.
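A minimal sketch of the memory-mapped access described above, using Python's standard mmap module. The function name read_via_mmap is hypothetical; the point is only that, once the file is mapped, fetching bytes is a memory access (a slice) rather than a read() call.

```python
import mmap


def read_via_mmap(path, offset, length):
    """Return `length` bytes starting at `offset`, read through a file mapping."""
    with open(path, "rb") as f:
        # A length of 0 maps the whole file; ACCESS_READ makes the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Slicing touches mapped memory; pages are faulted in on demand
            # instead of being copied by an explicit read() syscall.
            return mapped[offset:offset + length]
```

Note that mapping an empty file raises an error, so a real caller would check the file size first.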
Architecture
VFS: virtual file system interface provides common interface for different file system types.
Application
    |  (POSIX)
System Libraries
    |
[ System Calls ]
    |            \
   VFS            \
    |              |
File System     raw I/O
    |              |
Volume Manager     |
    |              |
Disk Device Subsystem
    |
Disk Devices
Cache
Buffer cache: stored in page cache since Linux 2.4.
Page cache: caches virtual memory pages, including file system pages.
Dentry cache: remembers mappings from directory entries to VFS inodes.
Inode cache: this cache contains VFS inodes.
File System Features
Block vs Extent
Block based file systems store data in fixed-size blocks.
Extent based file systems preallocate contiguous space for files, growing them as needed.
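To make the difference concrete, here is a small illustration (not any particular file system's on-disk format): a block-based layout needs one pointer per block, while an extent-based layout can describe the same contiguous runs as (start, length) pairs. The function name blocks_to_extents is made up for this sketch.

```python
def blocks_to_extents(blocks):
    """Collapse a sorted list of block numbers into (start, length) extents."""
    extents = []
    for block in blocks:
        # Extend the last extent if this block directly follows it.
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            extents.append((block, 1))
    return extents
```

Seven block pointers such as [100, 101, 102, 103, 200, 201, 500] shrink to three extents, which is why extents keep metadata small for large, mostly contiguous files.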
Journaling
A file system journal records changes to the file system so that in the event of a system crash, changes can be replayed atomically.
Methodology
Strategy order: latency analysis, performance monitoring, workload characterization, micro-benchmarking, static analysis and event tracing.
Disk Analysis
Common mistake: ignoring the file system and focusing on disk performance instead. That approach works reasonably well for simpler file systems and smaller caches, but it can miss entire classes of issues, since the file system matters more and more nowadays.
Latency Analysis
4 layers: Application -> Syscall interface -> VFS -> Top of file system.
Workload Characterization
- Operation rate and operation types.
- File IO throughput
- File IO size
- Read/write ratio
- Synchronous write ratio
- Random versus sequential file offset access
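A rough sketch of how the attributes above could be computed from a captured trace. The FileOp record and its fields are hypothetical (real tracers expose different fields), and the trace is assumed to be the ordered operations on a single file.

```python
from collections import namedtuple

# Hypothetical trace record: op is "read" or "write"; sync marks a synchronous write.
FileOp = namedtuple("FileOp", ["op", "offset", "size", "sync"])


def characterize(ops, interval_seconds):
    """Summarize a list of FileOp records captured over interval_seconds."""
    reads = [o for o in ops if o.op == "read"]
    writes = [o for o in ops if o.op == "write"]
    total_bytes = sum(o.size for o in ops)
    # An operation counts as sequential if it starts where the previous one ended.
    sequential = sum(
        1 for prev, cur in zip(ops, ops[1:]) if cur.offset == prev.offset + prev.size
    )
    sync_writes = sum(1 for o in writes if o.sync)
    return {
        "ops_per_sec": len(ops) / interval_seconds,
        "throughput_bytes_per_sec": total_bytes / interval_seconds,
        "avg_io_size": total_bytes / len(ops) if ops else 0,
        "read_fraction": len(reads) / len(ops) if ops else 0,
        "sync_write_fraction": sync_writes / len(writes) if writes else 0,
        "sequential_fraction": sequential / (len(ops) - 1) if len(ops) > 1 else 0,
    }
```

The sequential test here is deliberately strict; a real analysis might tolerate small gaps or track offsets per file descriptor.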
Checklist:
- What is the fs cache hit rate?
- What are the fs cache capacity and current usage?
- What other caches are present (directory, inode, buffer) and what are their statistics?
- Which applications or users are using the fs?
- What files and dirs are being accessed? Created and deleted?
- Have any errors been encountered? Were they due to invalid requests, or issues from the fs?
- Why is fs IO issued (user level call path)?
- To what degree is the fs IO application synchronous?
- What is the distribution of IO arrival times?
Monitoring
Key metrics for file system performance are:
- operation rate
- operation latency
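A minimal sketch of collecting both metrics at the application level by timing each read() call. The names time_reads and summarize are hypothetical, and this measures latency as seen by the caller (cache hits included), not disk latency.

```python
import os
import time


def time_reads(path, io_size=65536):
    """Read a file sequentially and return the latency of each data read, in seconds."""
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            start = time.perf_counter()
            data = os.read(fd, io_size)
            elapsed = time.perf_counter() - start
            if not data:  # end of file: stop without recording the empty read
                break
            latencies.append(elapsed)
    finally:
        os.close(fd)
    return latencies


def summarize(latencies, elapsed_seconds):
    """Reduce per-operation latencies to an operation rate and latency summary."""
    return {
        "ops": len(latencies),
        "ops_per_sec": len(latencies) / elapsed_seconds,
        "avg_latency": sum(latencies) / len(latencies),
        "max_latency": max(latencies),
    }
```

Averages hide outliers, so keeping the full latency list (or a histogram of it) is usually more useful than the summary alone.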
Event Tracing
A last resort, since capturing and saving the details of every event adds performance overhead.
Analysis
Tools
$ strace -ttT -p 123
TIME_STAMP read(12, "asdas"..., 65536) = 65536 <system call time>
pi@pi-aw:~$ free -m    # -m: report values in mebibytes
              total        used        free      shared  buff/cache   available
Mem:           7924         895        6038           4         990        6767
Swap:          2047           0        2047
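The buff/cache column can be approximated from /proc/meminfo. The sketch below assumes, per the procps free documentation, that it is the sum of the Buffers, Cached and SReclaimable fields; the function name buff_cache_mib is made up, and the exact composition may differ between procps versions.

```python
def buff_cache_mib(meminfo_text):
    """Estimate free's buff/cache column (in MiB) from /proc/meminfo contents."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # memory fields are reported in kB
    # Assumption: buff/cache = Buffers + Cached + SReclaimable.
    kib = fields.get("Buffers", 0) + fields.get("Cached", 0) + fields.get("SReclaimable", 0)
    return kib // 1024
```

On a Linux host this could be called as buff_cache_mib(open("/proc/meminfo").read()).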
Tuning
Application Calls
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
Advice includes:
POSIX_FADV_SEQUENTIAL : data range will be accessed sequentially.
POSIX_FADV_RANDOM : will be accessed randomly.
POSIX_FADV_NOREUSE : will not be reused.
POSIX_FADV_WILLNEED : will be accessed in the near future.
POSIX_FADV_DONTNEED : will not be accessed in the near future.
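Python exposes the same call as os.posix_fadvise on platforms that provide it. A sketch of a read-once scan (the backup-style case) that hints sequential access up front and then tells the kernel the cached pages are no longer needed; the function name read_sequential_once is hypothetical, and the kernel is free to ignore the advice.

```python
import os


def read_sequential_once(path, chunk_size=1 << 20):
    """Read a whole file once and return the number of bytes read."""
    have_fadvise = hasattr(os, "posix_fadvise")  # not available on every platform
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        if have_fadvise:
            # offset=0 and len=0 apply the advice to the whole file.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            total += len(chunk)
        if have_fadvise:
            # The data will not be read again, so its cached pages can be dropped.
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
    finally:
        os.close(fd)
    return total
```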
Another library call to operate on a memory mapping:
int madvise(void *addr, size_t length, int advice);
MADV_RANDOM
MADV_SEQUENTIAL
MADV_WILLNEED
MADV_DONTNEED
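In Python (3.8 and later) the equivalent is the madvise() method on an mmap object. A small sketch that advises sequential access before scanning a mapping; checksum_mapped is a hypothetical name, and the hasattr guards are there because the method and constants only exist on platforms with madvise support.

```python
import mmap


def checksum_mapped(path, step=65536):
    """Sum all bytes of a file through a read-only mapping, advising sequential access."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            if hasattr(mapped, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
                # Hint that pages will be touched in order, so read-ahead can be aggressive.
                mapped.madvise(mmap.MADV_SEQUENTIAL)
            total = 0
            for start in range(0, len(mapped), step):
                total += sum(mapped[start:start + step])
            return total
```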
tune2fs / e2fsck
tune2fs -O dir_index /dev/hdX // uses hashed B-trees to speed up lookups in large directories.
e2fsck -D -f /dev/hdX // reindex directories in a fs.