Saturday, June 12, 2021

Unix File I/O - Beyond read(2) and write(2) - Part I

The read(2) and write(2) system calls are well known for doing file I/O. They are usually covered at the very beginning of C or Unix programming courses. In practice, though, programmers more commonly use the stdio functions fread(3), fwrite(3), fprintf(3) et al. One of the advantages of this latter set of functions is that they are buffered in user space and thus often offer better performance. The read(2) and write(2) system calls, on the other hand, are unbuffered from the program's point of view: every write(2) is a trip into the kernel that hands the data over to the file right away, whereas with fwrite(3) the data might sit in a buffer inside the process for a while before eventually being written. Deferring the write gives better performance because fwrite(3) can return without making a system call, and many small writes get batched into one larger write(2) later (e.g. when the buffer gets full, or when the stream is flushed or closed).
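To make the buffering visible, here is a minimal sketch rather than anything from a real codebase. It writes a few bytes with fwrite(3) into a temporary file and checks the file size, as the kernel reports it, before and after fflush(3). The names stdio_buffering_demo and kernel_file_size and the temporary file path are made up for illustration, and the sketch assumes the message is smaller than BUFSIZ.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Size of the file behind fd as the kernel currently sees it, or -1. */
static long kernel_file_size(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? (long)st.st_size : -1;
}

/*
 * Hypothetical demo: writes msg through stdio into a temporary file and
 * reports the file size before and after fflush(3).
 * Returns 0 on success, -1 on error.
 */
int stdio_buffering_demo(const char *msg, size_t len, long *before, long *after)
{
    char path[] = "/tmp/bufdemoXXXXXX";
    int fd, ok;
    FILE *fp;

    assert(len < BUFSIZ);          /* the message must fit in the stdio buffer */
    fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                  /* the file disappears once it is closed */
    fp = fdopen(fd, "w");
    if (fp == NULL) {
        close(fd);
        return -1;
    }
    ok = setvbuf(fp, NULL, _IOFBF, BUFSIZ) == 0   /* force full buffering */
         && fwrite(msg, 1, len, fp) == len;       /* lands in the stdio buffer */
    *before = kernel_file_size(fd);               /* no write(2) has been issued yet */
    ok = ok && fflush(fp) == 0;                   /* now stdio calls write(2) */
    *after = kernel_file_size(fd);
    fclose(fp);                                   /* also closes fd */
    return ok ? 0 : -1;
}
```

Calling stdio_buffering_demo("hello", 5, &before, &after) should leave before at 0 and after at 5: the five bytes only reach the file once the buffer is flushed.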

But there are situations where using unbuffered I/O, such as the read(2) and write(2) system calls, makes sense. For example, database systems prefer to have control over their reads and writes: they want to be sure that when they write data it actually reaches the disk, and that the data they read from the disk is not stale. For this reason many databases avoid the buffered I/O functions and stick to the unbuffered system calls (usually combined with fsync(2) or a flag such as O_SYNC, since write(2) on its own only hands the data to the kernel), while managing the cache themselves so that they have control over the consistency of the data.
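As a rough illustration of "make sure the data is actually written", here is a minimal sketch that appends a buffer to a file with write(2) and then calls fsync(2) before reporting success. The function name append_durably is hypothetical, and real databases do considerably more (they also sync directories, use checksums, and so on). Whether the data truly survives a power failure also depends on the file system and the hardware.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: appends len bytes from buf to the file at path and
 * returns only after fsync(2) has been called on it.
 * Returns 0 on success, -1 on error.
 */
int append_durably(const char *path, const void *buf, size_t len)
{
    const char *p = buf;
    int fd, rc;

    assert(path != NULL && buf != NULL);
    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1)
        return -1;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* may write fewer bytes than asked */
        if (n == -1) {
            if (errno == EINTR)
                continue;                /* interrupted by a signal: retry */
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    rc = fsync(fd);                      /* ask the kernel to flush to the device */
    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```

The loop is there because write(2) is allowed to write fewer bytes than requested. The fsync(2) call is what separates "the kernel has the data" from "the kernel was asked to push it to storage".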

Using unbuffered I/O for these reasons is fine but the read(2) and write(2) system calls are cumbersome to use in complex applications such as databases. Let's see why.

When reading or writing data, database systems almost always want to access specific offsets in the data file. For instance, when looking up a record, the database first searches for the record in the index file. The index file usually provides the offset in the data file where the actual data for that record is stored. The database then needs to do a read from that offset in the data file. This could translate to something like the following imaginary code:

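#include <err.h>        /* warn() */
#include <stdlib.h>     /* free() */
#include <sys/types.h>  /* off_t, size_t, ssize_t */
#include <unistd.h>     /* lseek(), read() */

typedef struct record_metadata {
    off_t rec_offset;   /* offset of the record in the data file */
    size_t rec_size;    /* size of the record in bytes */
} record_metadata;

record_metadata *
search_index(db_t *db, void *key)
{
    // returns a record_metadata object which contains the offset of the
    // record in the data file and the record size (body omitted; assumed
    // here to return NULL when the key is not found)
}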

record_t *
lookup_record(db_t *db, void *key)
{
    record_metadata *record_meta = search_index(db, key);
    // assumes search_index() returns NULL when the key is not in the index
    if (record_meta == NULL) {
        return NULL;
    }
    record_t *rec = allocate_record(record_meta);
    // error handling is kept minimal for simplicity
    // (e.g. short reads are not handled)
    // first system call: move the file offset to the start of the record
    off_t ret = lseek(db->datafd, record_meta->rec_offset, SEEK_SET);
    if (ret == -1) {
        goto ERROR_HANDLE;
    }
    // second system call: read the record from the current offset
    ssize_t bytes_read = read(db->datafd, rec->data, record_meta->rec_size);
    if (bytes_read == -1) {
        goto ERROR_HANDLE;
    }
    free(record_meta);
    return rec;
ERROR_HANDLE:
    warn("record lookup failed");
    free(record_meta);
    free(rec);
    return NULL;
}

What we see here is that we need to issue two system calls to read from the data file, once to seek to the right offset and then to actually do the read. Similar patterns repeat throughout the system whether reading/writing the index or the data file.

Another problem with this arises if the system is multi-threaded. For every open file, the kernel maintains a current offset, stored with the open file description that the file descriptor refers to. With every read or write of n bytes, the kernel advances the current offset by n bytes, so that the next read or write happens at that position. Since the threads within a process share the same file descriptors, they also share the same file offsets. This means that after one thread seeks to a particular offset to read something, another thread may seek to a different offset before the first thread gets to issue its read. The first thread then reads from (or writes to) the wrong position, which can result in all kinds of chaos and data corruption.
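A real race between threads is hard to reproduce on demand, so the following sketch replays the unlucky interleaving in a single thread: "thread A" seeks, "thread B" seeks, then A reads. The names make_digits_file and interleaved_read_byte and the temporary file path are made up for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Creates an unlinked temporary file containing "0123456789"; returns fd or -1. */
int make_digits_file(void)
{
    char path[] = "/tmp/offsetdemoXXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                   /* the file disappears once it is closed */
    if (write(fd, "0123456789", 10) != 10) {
        close(fd);
        return -1;
    }
    return fd;
}

/*
 * Replays the steps two threads sharing fd could interleave into.
 * Returns the byte that "thread A" ends up reading, or -1 on error.
 */
int interleaved_read_byte(int fd, off_t a_offset, off_t b_offset)
{
    unsigned char byte;

    assert(fd >= 0);
    if (lseek(fd, a_offset, SEEK_SET) == -1)   /* A positions itself */
        return -1;
    if (lseek(fd, b_offset, SEEK_SET) == -1)   /* B moves the shared offset */
        return -1;
    if (read(fd, &byte, 1) != 1)               /* A's read happens at b_offset */
        return -1;
    return byte;
}
```

With the digits file, interleaved_read_byte(fd, 2, 7) returns '7' even though "thread A" asked for offset 2, because there is only one offset per open file and the last seek wins.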

To avoid the above two problems, POSIX provides two system calls: pread(2) and pwrite(2). Their signatures and behavior are very similar to those of read(2) and write(2), but with one important difference. Following is their synopsis:


#include <unistd.h>

ssize_t pread(int fd, void *buf, size_t count, off_t offset);

ssize_t pwrite(int fd, const void *buf, size_t count, off_t offset);

The difference is that these system calls take an additional fourth argument, called offset. They explicitly ask for the offset from which to start reading or writing the data, so there is no need to seek to that offset first, and they leave the file descriptor's current offset unchanged. This not only makes the code simpler and saves one system call, but also makes the reads and writes safe to issue from multiple threads on the same file descriptor. Using these, the above sample code will change to something like this:

record_t *
lookup_record(db_t *db, void *key)
{
    record_metadata *record_meta = search_index(db, key);
    // assumes search_index() returns NULL when the key is not in the index
    if (record_meta == NULL) {
        return NULL;
    }
    record_t *rec = allocate_record(record_meta);
    // a single system call: read rec_size bytes starting at rec_offset
    ssize_t bytes_read = pread(db->datafd, rec->data, record_meta->rec_size,
                               record_meta->rec_offset);
    if (bytes_read == -1) {
        goto ERROR_HANDLE;
    }
    free(record_meta);
    return rec;
ERROR_HANDLE:
    warn("record lookup failed");
    free(record_meta);
    free(rec);
    return NULL;
}
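Like read(2), pread(2) may return fewer bytes than requested, which the sample above glosses over. Here is a minimal sketch of a wrapper that keeps calling pread(2) until the whole range has been read. The names pread_full and pread_full_selfcheck are hypothetical, and the self-check is only there to show that the descriptor's own offset stays where it was.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: reads exactly len bytes starting at offset.
 * Returns 0 on success, -1 on error or if end of file arrives first.
 */
int pread_full(int fd, void *buf, size_t len, off_t offset)
{
    char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = pread(fd, p, len, offset);
        if (n == -1) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return -1;
        }
        if (n == 0)
            return -1;             /* end of file before len bytes were read */
        p += n;
        offset += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Returns 0 if pread_full() behaves as described on a small temporary file. */
int pread_full_selfcheck(void)
{
    char path[] = "/tmp/preadfullXXXXXX";
    char buf[4];
    int ok;
    int fd = mkstemp(path);

    if (fd == -1)
        return -1;
    unlink(path);
    if (write(fd, "0123456789", 10) != 10) {
        close(fd);
        return -1;
    }
    off_t before = lseek(fd, 0, SEEK_CUR);       /* offset left by the write */
    ok = pread_full(fd, buf, 4, 3) == 0
         && memcmp(buf, "3456", 4) == 0
         && lseek(fd, 0, SEEK_CUR) == before     /* pread did not move it */
         && pread_full(fd, buf, 4, 8) == -1;     /* only 2 bytes left: EOF */
    close(fd);
    return ok ? 0 : -1;
}
```

In lookup_record, a call such as pread_full(db->datafd, rec->data, record_meta->rec_size, record_meta->rec_offset) could stand in for the bare pread(2), with a non-zero return treated as a failed lookup.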

Apart from pread(2) and pwrite(2), there are two more exciting (😬) system calls, readv(2) and writev(2), which provide ways to do vectorized I/O. Database systems tend to exploit these as well. I will talk about them with specific examples in another post. Stay tuned! ⏳
