

Why database systems?
Provide means for user to manage data
Allow users to
Specify data to be stored (data modeling)
Specify operation on the data (query language)
Structurally language is not stupid language
Ensure consistency and integrity of data (integrity checking)
Manage how data is being stored (indexing, file organization)
Manage how data is actually retrieved (query processing, optimization)
Manage how data is being shared/not shared (concurrency control)
Recover data after a failure (recovery)
No power. Database can save the information.
Ensure data are not accessible by people who should not (security)
Access data from a variety of sources/locations (distribution processing)
Data are represented as tuples in relations
Represented as tables

Relation can be viewed as SETS of attributes (set in a mathematical sense)
Constraints on the table
if you specify a table to have a primary key, the database will automatically ensure that actually is unique
So let’s say you have a table with a student SSN and and you call a student id a primary key. It does not automatically ensure the SSN is unique. You may have to post a different student ID, but they have the same SSN and they’re different student. And you have this in your table, you are dead.
A typical SQL query has the form:
select A1, A2, ..., An
from r1, r2, ..., rm
where P
A i A_i Ai represents an attribute
R i R_i Ri represents a relation
P P P is a predicate.
The result of an SQL query is a relation.
API available for a variety of programming languages to interact (e.g. ODBC, JDBC)
Also database specific APIs are available
An abstract query language on relations
A set of operations on a relations, returning another relation
Basic operations:
Selection: σ c o n d i t i o n ( R ) \sigma_{condition}(R) σcondition(R)
Projection: ∏ a t t r i b u t e l i s t ( R ) \prod_{attribute\ list}(R) ∏attribute list(R)
Set operations: ∩ , ∪ , − \cap, \cup, - ∩,∪,−,
There are corresponding SQL commands for these (not often used)
-cut the diiference not to set.
SQL basically tell the database system what I want. But it doesn’t tell the database system how to find.
There are two kind of bosses. One boss says, hey go to finish the project, I don’t care. Cannot finish the project in two days, you’ll fire. And then you ask him, How do I finish the project? They’ll say, You’ll figure it out. If you are the boss, it is super cool, but if you are employee not cool.
So relational algebra is the query language that kind of keep in more detail how to do it. It’s not totally procedural, it’s not totally every detail, but it goes towards.
Student table
… … … … … … Find all students that have gpa at least 3.7
σ g p a ≥ 3.7 ( S t u d e n t ) \sigma_{gpa \geq 3.7}(Student) σgpa≥3.7(Student)
Projection:
This is student table, so I have a lot of academic, all I want just SSN
∏ S S N ( S t u d e n t ) \prod_{SSN}(Student) ∏SSN(Student)
Basic operations:
Cartesian product: r × s r\times s r×s
$A = {1, 2, 3} $ B = { a , b , c } B = \{a, b, c\} B={a,b,c}
What is A × B A \times B A×B ?
$ A \times B = {(1,a), (1,b),(1,c),(2,a),(2,b),(2,c),(3,a),(3,b),(3,c)}$
Now you can link to tables by doing a Cartesian product, followed by selection
Join: r ⋈ s r\Join s r⋈s
$\sigma_{A \times B} = \Join $
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-bEyTO9YB-1663289939496)(https://tva1.sinaimg.cn/large/e6c9d24egy1h685y0ojqxj21k10u0785.jpg)]
Very straight forward way of converting SQL statements into a list of operations
Now executing a query becomes executing a program of such operations
Implementing them correctly and efficiently will ensure the database system perform well
NoSQL (from data modelling perspective)
Internals of a DBMS
Query execution and optimization
SElECT name From Student WHERE gpa > 3.9
- 1
- 2
- 3
Concurrency Control
Recovery
Distributed databases / Big Data
At least most of the mainstream database systems like mysql, SQL Server, DB2, all these are still running on relational model. Now, there’s some modification here. Some people will tell you they are running the object relational model. But underlying it is still a relational model. And relational model has been very popular, has been very successful.
At least the second part of the class will be focused on the internals of a database system, a relational database system.
So this class, we’re going to look more inside the internal operations.We’re going to look at the clear execution optimization.
if you understand underlying how the database system work now we can flip the table back and let’s say you IDBA, let’s say you the administrator of a database.
Murphy’s Law: Anything that can go wrong will go wrong, and at the worst possible time.
So let’s say an example. Let’s say I want to transfer money from a savings account to a checking account. I want to transfer $100 from me to you. It basically need to take to subtract $100 from my account. And they need to add $100 to your account
From my account: -100
---------------------------------------- Murphy’s law: the system will crash here.
To your account: +100
If the system crashed here for 5 seconds and the system restart. But how do you make sure thing doesn’t go haywire?
ACID doesn’t cover full recovery, It also cover an issue called Concurrency Control.
What should the ATM do? The first thing is to check whether your have $100. If you does have 100$. Then put a cut out. The will minus 100$ from my account.
You put your card in ATM. And you to check if you have a $100
if subtract $100 from your account and then give you 100 bucks


From: Guy Harrison, “[Next generation databases : NoSQL and Big Data](https://smu.primo.exlibrisgroup.com/discovery/fulldisplay?docid=alma9952401872403716&context=L&vid=01SMU_INST:01SMU&lang=en&adaptor=Local Search Engine)”, Apress, 2015

Think about a database represented as a tree. And each piece of data if they knock on the tree.
tree structure is also very rigid
Relational Model
Advantages
Well-defined mathematical background
**Normal forms. ** // Normalizaiton
Separation of physical and logical layers
Make things like optimization possible/manageable
SELECT name FROM Student WHERE gpa > 3.6
- 1
- 2
- 3
Do you have any clue how the data is actually physically organized in the database?
You have no clue. Do you need to have any clue?
NO. That’s the key point. That’s where the religion model becomes very successful.
If I am a user, do I care how the data is being stored? No, I don’t. But if you are a DBA, you might want to care because you want to create to return faster. but the total separation of physical and logical layer is important.
Full transaction model (ACID)
For concurrency and recovery
ACID 表示原子性(atomicity)、一致性(consistency)、隔离性(isolation)、持久性(durability)。ACID 是数据库事务(database transaction)需要向用户确保有效的属性——即使在出现崩溃、错误、硬件故障等情况时也需要保证这些属性。
Well suited for client-server systems
You have a database that sits somewhere on a computer, constantly running
You have other people have either web browser machine, they send a query, hey, I want to ask this school query. He wants to assess the centralized database, collect the query, execute it, and send the results back to the user.
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-6EUwaIhk-1663289939499)(https://tva1.sinaimg.cn/large/e6c9d24egy1h685yi7k86j20wc03iaah.jpg)]
First normal form
No compound value (set/list/sequence etc.)
1 7330, 7445 2 7330, 7555, 7688 This might be what you want to do right, but in the relation model, this is seriously discouraged.
Why is it being discouraged in a traditional relational database?
So that means a human being had to encode this.
I can encode this way 7330, 7555, 7688 or 7330 # 7555 # 7688
Then let’s say find all student who asks who is taking 7555? Your sql will be very ugly
Because you basically are giving up all the advantage to a relational database give you so you are forced to do something like this. Which is fine. It’s just that people don’t like it.
To store this require multiple tuples:
| Student ID | Course |
|---|---|
| 1 | CS 7330 |
| 1 | CS 7445 |
| 2 | CS 7330 |
| 2 | CS 7555 |
| 2 | CS 7688 |
a student was taking two classes, I need to have two separate tuples
| Edge ID | Node 1 | Node2 |
|---|---|---|
| 1 | 1 | 2 |
| 2 | 1 | 4 |
| 3 | 3 | 2 |
| 4 | 5 | 1 |
| 5 | 4 | 6 |
Can you think of some real life data that is very nicely suited to be model by graph?
Can you think of application where the data is better suited to be as a graph?
Do we need to duplicate edge?
Either query will be complicated (e.g. node1 = 3 or node2 = 3)
If I want to see Node1 and Node5 is connected, how will the sql looks like?
select count(*) from t where ((node1 = "1") AND (node2 = "5")) or ((node1 = '5') AND (node2 = "1"))
- 1
- 2
- 3
- 4
- 5
Or inconsistency may occur (deleted (1, 3) but forget to delete (3,1)
If you delete one five and forget to delete five one, then you’re screwed.
Very high potential inconsistency.

Development of new applications
Large amount of data
Not necessarily structured
Required high availability and fault tolerance
SQL does not do recursion, because you don’t know how many tables you will jump.
More flexible data models
Weak consistency model
Different storage management
Designed to allow replication
Targeted for Big Data / Web applications
事务具有4个特性:原子性(Atomicity)、一致性(Consistency)、隔离性(Isolation)和持续性(Durability)。这4个特性简称为ACID特性(ACID properties)。
Key-Value stores
Wide Columns stores
Document stores
Graph stores
Data are represented as a set of (key, value) pairs
Key
Value
Treated as a BLOB (Binary large object)
No structure (schemaless)
Some system provide a little bit of structure
Typically provide efficient operations on
Suitable for system that want maximum flexibility
Example: Redis

But the whole thing the whole point said the key value store call value very efficient organization based on the key and the key only.
Example: doctor -> patient treatment records
Rows and column sets
A set of rows
Each row has a fixed number of column sets (family)
Many such system allow timestamps
Data can be group by rows or by column sets
Index structure available for columns
Support big data operations such as map reduce
Example: Apache HBase

Extended key-value structure where value is represented as a “document”
Data are grouped into buckets (like tables)
Relationship between documents are modelling via embedding or referencing documents
Embedding: attributes are documents themselves
duplicating the way
References: attributes are keys of other documents
Provide indexes for document attributes
Provide query language support to retrieve (part of) a document
Example: MongoDB, Amazon DynamoDB

Data Represented as graphs
Potential different types of graphs
Two types of storage
Graph Query
Example: Neo4j, Titan


Most NoSQL database are aim at large, distributed data with long transactions
Different consistency/isolation levels then ACID
Distributed data issues
Data may be distributed and replicated
Multiple copies of the database running on different sites
Network may fail without any of the database failing
The CAP issue
CAP Theorem: You can only choose 2 out of the 3 to maintain correctly.