3-Database Storage

2022-08-30

字数统计: 3.1k字 | 阅读时长≈ 19分

Source From: https://15445.courses.cs.cmu.edu/fall2020/schedule.html

Storage

We will focus on a “disk-oriented” DBMS architecture that assumes that primary storage location of the database is on non-volatile disk. At the top of the storage hierarchy, you have the divices that are closest to the CPU. This is the fastest storage, but it is also the smallest and most expensive. The further you get away from the CPU, the larger but slower the storage devices get. These devices also get cheaper per GB.

Volatile Devices:

Volatile means that if you pull the power from the machine, then the data is lost.
Volatile storage supports fast random access with byte-addressable locations. This means that the program can jump to any byte address and get the data that is there.
For our purpose, we will always refer to this storage class as “memory”.

Non-Volatile Devices:

Non-volatile means that the storage device does not need to be provided continuous power in order for the device to retain the bits that it is storing.
It is also block/page addressable. This means that in order to read a value at a particular offset, the program first has to load the 4KB page into memory that holds the value the program wants to read.
Non-volatile storage is traditionally better at sequential access (reading mutiple continguous chunks of data at the same time).
We will refer to this as “disk”. We will not make a (major) distinction between solid-state storage(SSD) and spinning hard devices(HDD).

There is also a new class of storage devices that are becoming more popular called persistant memory. These devices are designed to be the best of both worlds: almost as fast as DRAM with the persistence of disk. We will not cover these devices in this course.

Since the system assumes that the databse is stored on disk, the components of the DBMS are responsible for figuring out how to move data between non-volatile disk and volatile memory since the system cannot operate on the data directly on disk.

We will focus on how we can hide the latency(隐藏) of the disk rather than focusing on optimizations with registers and caches since getting data from disk is so slow. If reading data from the L1 cache reference took half a second, reading from an SSD would take 1.7 days, and reading from an HHD would take 16.5 weeks.

Disk-Oriented DBMS Overview

The database is all on disk, and the data in database files is organized into pages, with the first page being the directory page. In order to operate on the data, the DBMS needs to bring the data into memory. It does this by having buffer pool that manages the movement of data back and forth between disk and memory. The DBMS also has an execution engine that will execute queries.The execution engine will ask the buffer pool for a specific page, and the buffer pool will take care of bringing that page into memory and giving the execution engine a pointer to that page in memory. The buffer pool mamager will ensure that the page is there while the execution engine is operating on thar part of memory.

DBMS vs. OS

A high-level design goal of the DBMS is to support databases that exceed the amount of memory available. Since reading/writing to disk is expensive, it must be managed carefully. We do not want large stalls from fetching something from disk to slow down everything else. We want the DBMS to be able to process other queries while it is waiting to get the data from disk.

This high-level design goal is like virtual memory, where there is a large address space and a place for the OS to bring in pages from disk.

One way to achieve this virtual memory is by using mmap to map the contents of a file in a process’ address space, which makes the OS responsible for moving pages back and forth between disk and memory. Unfortunately, this means that if mmap hits a page fault, the process will be blocked.

You never want to use mmap in your DBMS if you need to write.
The DBMS(almost) always wants to control things itself and can do a better job at it since it knows more about the data being accessed and the queries being processed.
The operating system is not your friend.

It is possible to use the OS by using:

madvise: Tells the OS know when you are planning on reading certain pages.
mlock: Tells the OS to not swap memory ranges out to disk.
msync: Tells the OS to flush memory ranges out to disk.

We do not advise using mmap in a DBMS for correctness and performance reasons.

Even though the system will have functionalities that seem like something the OS can provide, having the DBMS implement these procedures itself gives it better control and performance.

File Storage

In its most basic form, a DBMS stores a database as files on disk. Some may use a file hierarchy, others may use a single file(e.g., SQLite).

The OS does not know anything about the contents of these files. Only the DBMS knows how to decipher their contents, since it is encoded in a way specific to the DBMS.

The DBMS’s storage manager is responsible for managing a database’s files. It represents the files as a collection of pages. It also keeps track of what data has been read and written to pages as well how much free space there is in these pages.

Database Pages

The DBMS organizes the database across one or more files in fixed-size blocks of data called pages. Pages can contain different kinds of data (tuples, indexes, etc). Most systems will not mix these types within pages. Some systems will require that pages are self-contained, meaning that all the information needed to read each page is on the page itself.

Each page is given a unique identifier. If the database is a single file, then the page id can just be the file offset. Most DBMSs have an indirection layer that maps s page id to a file path and offset. The upper levels of the system will ask for a specific page number. Then, the storage manager will have to turn that page number into a file and an offset to find the page.

Most DBMSs use fixed-size pages to avoid the engineering overhead needed to support variable-sized pages. For example, with variable-size pages, deleting a page could create a hole in file that the DBMS cannot easily fill with new pages.

There are three concepts of pages in DBMS:

Hardware page (usually 4KB)
OS page (4KB)
Database page (1-16KB)

The storage device guarantees an atomic write of the size of the hardware page. If the hardware page is 4KB and the system tries to write 4KB to the disk, either all 4KB will be written, or none of it will. This means that if our database page is larger than our hardware page, the DBMS will have to take extra measures to ensure that the data gets written out safely since the program can get partway through writing a database page to disk when the system crashes.

Database Heap

There are a couple ways to find the location of the page a DBMS wants on the disk, and a heap file organization is one of those ways.

A heap file is an unordered collection of pages where tuples are stored in random order.

The DBMS can locate a page on disk given a page_id by using a linked list of pages or a page directory.

Linked List: Header page holds pointers to a list of free pages and a list of data pages. However, if the DBMS is looking for a specific page, it has to do s sequential scan on the data page list until it finds the page it is looking for.
Page Directory: DBMS maintains special pages that track locations of data pages along with the amount of free space on each page.

Page Layout

Every page includes a header that records meta-data about the page’s contents:

Page size.
Checknum.
DBMS version.
Transcation visibility.
Self-containment. (Some systems like Oracle require this.)

A strawman approach to laying out data is to keep track of how many tuples the DBMS has stored in a page and then append to the end every time a new tuple is added. However, problems arise when tuples are deleted or when tuples have variable-length attributes.

There are two main approaches to laying out data in pages: (1) slotted-pages; (2) log-structured.

Slotted Pages: Page maps slots to offsets.

Most common approach used in DBMSs today.
Header keeps track of the number of used slots, the offset of the starting location of the last used slot, and a slot array, which keeps track of the location of the start of each tuple.
To add a tuple, the slot array will grow from the beginning to the end, and the data of the tuples will grow from end to the beginning. The page is considered full when the slot array and the tuple data meet.

Log-Structured: Instead of storing tuples, the DBMS only stores log records.

Stores records to file of how the database was modified(insert, update, delete).
To read a record, the DBMS scans the log file backwards and “recreates” the tuple.
Fast writes, potentially slow reads.
Works well on append-only storage because the DBMS cannot go back and update the data.
To avoid long reads, the DBMS can have indexes to allow it to jump to specific locations in the log. It can also periodically compact the log. (If it had a tuple and then made an update to it, it could compact it down to just inserting the updated tuple.) The issue with compaction is that the DBMS ends up with write amplification. (It re-writes the same data over and over again.)

Tuple Layout

A tuple is essentially a sequence of bytes. It is the DBMS’s job to interpret those bytes into attribute types and values.

Tuple Header: Contains meta-data about the tuple.

Visibility information for DBMS’s concurrency control protocol(i.e. information about which transcation created/modified that tuple).
Bit Map for NULL values.
Note that the DBMS does not need to store meta-data about the schema for the database here.

Tuple Data: Actual data for attributes.

Attributes are typically stored in the order that you specify them when you create the table.
Most DBMSs do not allow a tuple to exceed the size of a page.

Unique Identifier:

Each tuple in the database is assigned a unique identifier.
Most common: page_id + (offset or slot).
An application cannot rely on these ids to mean anything.

Denormalized Tuple Data: If two tables are related, the DBMS can “pre-join” them, so the tables end up on the same page. This makes reads faster since the DBMS only has to load in one page rather two separate pages. However, it makes updates more expensive since the DBMS needs more space for each tuple.

Data Representation

The data in a tuple is essentially just byte arrays. It is up to the DBMS to know how to interpret those bytes to derive the values for attributes. A data represenetation scheme is how a DBMS stores the bytes for a value.

There are five high level datatypes that can be stored in tuples: integers, variable-precision numbers, fixed-point precision numbers, variable length values, and dates/times.

Intergers: Most DBMSs store integers using their “native” C/C++ types as specified by the IEEE-754 standard. These values are fixed length. (Example: INTERGER, BIGINT, SMALLINT, TINYINT).

Variable Precision Numbers: These are inexact, variable-precision numeric types that use the “native” C/C++ types specified by IEEE-754 standard. These values are also fixed length.

Operations on variable-precision numbers are faster to compute than arbitrary precision numbers because the CPU can execute instructions on them directly. However, there may be rounding errors when performing computations due to the fact that some numbers cannot be represented precisely. (Example: FLOAT, REAL)

Fixed-Point Precision Numbers: These are numberic data types with arbitrary precision and scale. They are typically stored in exact, variable-length binary representation (almost like a string) with additional meta-data that will tell the system things like the length of the data and where the decimal should be.

These data types are used when rounding errors are unacceptable, but the DBMS pays a performance penalty to get this accuracy. (Example: NUMERIC, DECIMAL)

Variable-Length data: These represent data types of arbitrary length. They are typically stored with a header that keeps track of the length of the string to make it easy to jump to the next value. It may also contain a checksum for the data.

Most DBMSs do not allow a tuple to exceed the size of a single page. The ones that do store the data on a special “overflow” page and have the tuple contain a reference to that page. These overflow pages can contain pointers to additional overflow pages until all the data can be stored.

Some systems will let you store these large values in an external file, and then the tuple will contain a pointer to that file. For example, if the database is storing photo information, the DBMS can store thr photos in the external files rather than having them take up large amounts of space in the DBMS. One downside of this is that the DBMS cannot manipulate the contents of this file. Thus, there are no durability or transaction protections. (Example: VARCHAR, VARBINARY, TEXT, BLOB)

Dates and Times: Representations for date/time vary for different systems. Typically, these are represented as some unit time (micro/milli) seconds since the unix epoch. (Example: TIME, DATE, TIMESTAMP)

System Catalogs: In order for the DBMS to be able to deciphter the contents of tuples, it maintains an internal catalog to tell it meta-data about the databases. The meta-data will contain information about what tables and columns the databases have along with their types and the orderings for the values.

Most DBMSs store their catalog inside of themselves in the format that they use for their tables. They use special code to “bootstrap” these catalog tables.

Workloads

There are many different workloads for database systems. By workload, we are referring to the general nature of requests a system will have to handle. This course will focus on two types: Online Transaction Processing and Online Analytical Processing.

OLTP: Online Transaction Processing
An OLTP workload is characterized by fast, short running operations, simple queries that operate on single entity at a time, and repetitive opertations. An OLTP workload will typically handle more writes than reads.

An example of an OLTP workload is the Amazon sotrefront. User can add things to their cart, they can make purchases, but the action only affect their account.

OLAP: Online Analytical Processing
An OLAP workload is characterized by long running, complex queries, reads on large portions of the database. In OLAP workloads, the database system is analyzing and deriving new data from existing data collected on the OLTP side.

An example of an OLAP workload would be Amazon computing the five most bought items over a one month period for these geographical locations.

HTAP: Hybrid Transaction + Analytical Processing
A new type of workload which has become popular recently is HTAP, which is like a combination which tries to do OLTP and OLAP together on the same database.

Storage Models

There are different ways to store tuples in pages. We have assumed the n-ary storage model so far.

N-Ary Storage Model(NSM)

In the n-ary storage model, the DBMS stores all of the attributes for a single tuple contiguously in a single page, so NSM is also known as a “row store”. This approach is ideal for OLTP workloads where requests are insert-heavy and transactions tend to operate only an individual entity. It is ideal because it takes only one fetch to be able to get all of the attributes for a single tuple.

Advantages

Fast inserts, updates, and deletes.
Good for queries that need the entire tuple.

Disadvantages

Not good for scanning large portions of the table and/or a subset of the attributes. This is because it pollutes the buffer pool by fetching data that is not needed for processing the query.

Decomposition Storage Model(DSM)

In the decomposition storage model, the DBMS stores a single attribute (column) for all tuples contiguously in a block of data. Thus, it is also known as a “column store”. This model is ideal for OLAP workloads with many read-only queries that perform large scans over a subset of the table’s attributes.

Advantages

Reduces the amount of wasted work during query execution because the DBMS only read the data that it needs for that query.
Enables better compresion because all of the values for the same attributes are stored contiguously.

Disadvantages

Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching.

To put the tuples back together when using a column store, there are two common approaches: The most commonly used approach is fixed-length offsets. Assuming the attributes are all fixed-length, the DBMS can compute the offset of the attributes for each tuple. Then when the system wants the attribute for a specific tuple, it knows how to jump to that spot in the file from the offset. To accommodate the variable-length fields, the system can either pad fields so that they are all the same length or use a dictionary that takes a fixed-size integer and maps the integer to the value.

A less common approach is to use embedded tuple ids. Here, for every attribute in the columns, the DBMS stores a tuple id (ex: a primary key) with it. The system the would also store a mapping to tell it how to jump to every attribute that has that id. Note that this method has a large storage overhead because it needs to store a tuple id for every attribute entry.

本文作者： 夏花
本文链接： http://xiahua19.github.io/2022/08/30/cmu15-3-Database-Storage/
版权声明： 本博客所有文章除特别声明外，均采用 MIT 许可协议。转载请注明出处！