What is Database Management System(DBMS)? In short, a DBMS should support the following functionalities:

  • store data (large amount of data!)
  • query data
  • modify data (add, update, delete, …)
  • enable durability (recover from failure)
  • control access of many users
    • isolation (each transaction appears to execute in isolation from other transactions)
    • atomicity (prevents updates to the database occurring only partially)
  • Allow users to create new databases and specify their schemas using a data definition language

Concurrency and Transactions

A transaction is a single logical unit of work which accesses and possibly modifies the contents of a database. In order to maintain consistency in a database, before and after the transaction, each transaction follows ACID properties:

  • Atomicity : prevents updates to the database occurring only partially (all or nothing)
  • Consistentcy : regarding constraints & transactions preserve them
  • Isolation : appear to be executed as if by itself
  • Durability : transaction never lost once completed (even if system failure occurs)

In a real-world scenario of DBMS operation,

  • The clients can open one or more parallel connections to a SQL DBMS server
  • Within each connection, we can pass statements consisting of SQL queries, modifications, transactions,…
  • We use the concept of a cursor to iterate over results from individual queries, allowing us parse sequentially through large query results even when it does not fit into memory

When any of the clients is passing statements, the isolation level of a transaction can be one of:

Levels Dirty Reads Nonrepeatable Reads Phantoms Lost Updates
Read Uncommitted v v v v
Read Committed   v v v
Repeatable Read     v v
Serializable        

Every transaction has its isolation level set to one of these when it is created. For exmaple, the default level in PostgreSQL is “read committed”.

Dirty Reads

A dirty read (uncommitted dependency) occurs when a transaction is allowed to read data from a row that has been modified by another running transaction and not yet committed.

When one (unfinished) transaction inserts rows in a table and the other (also unfinished) transaction tries to read all rows in the table.

The first transaction can rollback and the second transaction would have read “phantom” rows that never existed.

Nonrepeatable Reads

A non-repeatable read occurs, when during the course of a transaction, a row is retrieved twice and the values within the row differ between reads.

Phantoms

A phantom read occurs when, in the course of a transaction, new rows are added or removed by another transaction to the records being read.

Lost Updates

Updates performed in one transaction can be “lost”, or overwritten by another transaction that happens to run concurrently

For example, the first transaction’s change is lost, because the second one “overwrote” the row

Data Model

To organize and manage elements of data, a DBMS leverage Data Model to relate data to the properties of real-world entities, where a Data Model consists of

  1. a mathematical representation of data
    • tree-based
    • gragh-based (semi-structures model)
    • table-based (relational model)
    • custom structures…
  2. operations on data
    • insert, delete, update, query
  3. constraints
    • uniqueness
    • key / value contraints

In 1970, the concept of the Relational Model was proposed (Codd, E. F. A Relational Model of Data for Large Shared Data Banks). With Relational Model, activities of users at terminals and most application programs should remain unaffected when the internal representation of data is changed and even when some aspects of the external representation are changed. In addition, changes in data representation will often be needed as a result of changes in query, update, and report traffic and natural growth in the types of stored information.

Here are the key definitions in the Relational Model

  • “contents” is a set of tuples
  • a tuple $=$ a row (e.g., (student, enrollment), (student, ID), …)
  • attributes $=$ column names
  • relation name $=$ table name (e.g., Students, Cources, Grades, …)
  • relation schema (e.g., Students(_studentId_: integer, name: string, enrollment: date))
  • database schema $=$ a set of all relation schemas in the database

For example, figures below are instances of two relations that might constitute part of a banking database.

1. The attributes of each relation

Relation name Attributes
Accounts accNo, type, balance
Customers firstName, lastName, idNo, account

2. The tuples of each relation

Relation name Tuples
Accounts (12345, savings, 12000),
(23456, checking, 1000),
(34567, savings, 25)
Customers (Robbie, Banks, 901-222, 12345),
(Lena, Hand, 805-333, 12345),
(Lena, Hand, 805-333, 23456)

3. The components of one tuple from each relation

Relation name Components
Accounts 12345 $\rightarrow$ accNo,
savings $\rightarrow$ type,
12000 $\rightarrow$ balance
Customers Robbie $\rightarrow$ firstName,
Banks $\rightarrow$ lastName,
901-222 $\rightarrow$ idNo,
12345 $\rightarrow$ account

4. The relation schema for each relation

Relation name Schema
Accounts Accounts(accNo: integer, type: string , balance: integer)
Customers Customers(firstName: string, LastName: string, idNo: string, account: integer)

5. The database schema

  • Accounts(accNo: integer, type: string , balance: integer)
  • Customers(firstName: string, LastName: string, idNo: string, account: integer)

And below are some key concepts in the Relational Model:

Bag Semantics

A bag is an unordered collection of items where identical items may appear multiple times. It generalizes the notion of a set of items.

Suppose some element $t$ is contained in bag $A$ $n$ times and in bag $B$ $m$ times, then

  • $A ⋃ B$ contains $t$ exactly $n+m$ times
  • $A ⋂ B$ contains $t$ exactly $\min(n,m)$ times
  • $A-B$ contains $t$ exactly $\max(0, n-m)$

Consider $A = {1,1,2}$, $B = {1, 2, 3}$.

  • $A ⋃ B$ = {1,1,1,2,2,3}
  • $A ⋃ A$ = {1,1,2,1,1,2}
  • $A ⋂ B$ = {1,2}
  • $A – B$ = {1}

Null Values

Null is a special value denoted $NULL$ in SQL and $\bot$ in relational algebra.

Use cases:

  • Value unknown
    • e.g unknown name in student tuple (11232, NULL, 2019-01-01)
  • Value Inapplicable
    • e.g. a course without capacity constraints (“DD0001”, “Welcome to KTH”, NULL)
  • Value witheld
    • masked grades

In Aggregations, Null Values do not contribute to SUM/AVG etc, but contributes to COUNT (e.g., SUM({1, NULL, 2}) = 3, COUNT({1, NULL, 4}) = 3

3-valued Logic

3-valued logic is a consequence of supporting null to mark absent data. Logic conditions of SQL use three-valued logic with values TRUE, FALSE and UNKNOWN:

  1. Comparing($>,<, \geq, \leq$) any value to NULL (including NULL) returns UNKNOWN
  2. A tuple is in the result of a Selection if and only if the condition evaluates to TRUE on that tuple.
  3. We think of TRUE as 1, FALSE as 0 and UNKNOWN as 0.5, then
    • A AND B = MIN(A,B)
    • A OR B = MAX(A, B)
    • NOT A = 1 - A

However, not all classical logic identities hold in 3-valued logic. The exceptions are

  • P or NOT P = true (does not hold when P is unknown)
  • A or B = B or A
  • (A or B) or C = A or (B or C)

How relations are stored: “Column-oriented” vs. “Row-based”

The differences between “Column-oriented DBMS” and “Row-based DBMS” are illustrated in the table below:

  Row-based DBMS Column-oriented DBMS
Explanation Data is stored and retrieved one row at a time Data is stored and retrieved one column at a time
Advantages Records in Row Oriented Data stores are easy to read and write (1)Efficient in performing operations applicable to the entire dataset and hence enables aggregation over many rows and columns (2)Basically permits high compression rates due to little distinct or unique values in columns.
Disadvantages (1)Not efficient in performing operations applicable to the entire datasets and hence aggregation in row-oriented is an expensive job or operations. (2)Typical compression mechanisms which provide less efficient result than what we achieve from column-oriented data stores. Read and write operations are slower as compared to row-oriented.
Best suited application online transaction system online analytical processing

Consequently, if I’m tasked with setting up a new database for analytics, I may prefer using column-oriented DBMS since the data itself will not undergoing many modifications but I’ll need to handle operations applicable to the entire datasets and aggregation over many rows and columns, which is efficient in a column-oriented DBMS but inefficient in a row-based DBMS.

References