• Review Intro to NoSQL


    Review / Intro to NoSQL

    Screen Shot 2022-09-11 at 13.26.01

    Screen Shot 2022-09-11 at 13.23.48

    Database Systems

    • Why database systems?

      • Provide means for user to manage data

      • Allow users to

        • Specify data to be stored (data modeling)

        • Specify operation on the data (query language)

          Structurally language is not stupid language

        • Ensure consistency and integrity of data (integrity checking)

        • Manage how data is being stored (indexing, file organization)

        • Manage how data is actually retrieved (query processing, optimization)

        • Manage how data is being shared/not shared (concurrency control)

        • Recover data after a failure (recovery)

          No power. Database can save the information.

        • Ensure data are not accessible by people who should not (security)

        • Access data from a variety of sources/locations (distribution processing)

    Relational Model

    • Data are represented as tuples in relations

    • Represented as tables

      • Rows (tuples): each unit of data
      • Columns (attributes) : attributes of each unit
        Screen Shot 2022-09-11 at 12.44.45
    • Relation can be viewed as SETS of attributes (set in a mathematical sense)

    • Constraints on the table

      • First Normal Form
        • But NULL values allowed
      • No duplicate tuples
        • Thus (primary) keys
      • Domain values
      • Referential Integrity
        • Foreign keys

    if you specify a table to have a primary key, the database will automatically ensure that actually is unique

    So let’s say you have a table with a student SSN and and you call a student id a primary key. It does not automatically ensure the SSN is unique. You may have to post a different student ID, but they have the same SSN and they’re different student. And you have this in your table, you are dead.

    SQL

    • A typical SQL query has the form:

      select A1, A2, ..., An
      from r1, r2, ..., rm
      where P
      
      • 1
      • 2
      • 3

      A i A_i Ai represents an attribute

      R i R_i Ri represents a relation

      P P P is a predicate.

    • The result of an SQL query is a relation.

    • API available for a variety of programming languages to interact (e.g. ODBC, JDBC)

    • Also database specific APIs are available

    Relational Algebra

    • An abstract query language on relations

    • A set of operations on a relations, returning another relation

    • Basic operations:

      • Selection: σ c o n d i t i o n ( R ) \sigma_{condition}(R) σcondition(R)

        • Pick tuples from the relation based on a condition, keeping all attributes
      • Projection: ∏ a t t r i b u t e   l i s t ( R ) \prod_{attribute\ list}(R) attribute list(R)

        • Select all tuples, but only keep attributes on the attribute list
      • Set operations: ∩ , ∪ , − \cap, \cup, - ,,,

        • There are corresponding SQL commands for these (not often used)

          - cut the diiference not to set.

    SQL basically tell the database system what I want. But it doesn’t tell the database system how to find.

    There are two kind of bosses. One boss says, hey go to finish the project, I don’t care. Cannot finish the project in two days, you’ll fire. And then you ask him, How do I finish the project? They’ll say, You’ll figure it out. If you are the boss, it is super cool, but if you are employee not cool.

    So relational algebra is the query language that kind of keep in more detail how to do it. It’s not totally procedural, it’s not totally every detail, but it goes towards.

    Student table

    Find all students that have gpa at least 3.7

    σ g p a ≥ 3.7 ( S t u d e n t ) \sigma_{gpa \geq 3.7}(Student) σgpa3.7(Student)

    Projection:

    This is student table, so I have a lot of academic, all I want just SSN

    ∏ S S N ( S t u d e n t ) \prod_{SSN}(Student) SSN(Student)

    • Basic operations:

      • Cartesian product: r × s r\times s r×s

        • Create a table such that every pair of tuples in r , s r, s r,s is match to a topic
        • Basis of merging multiple tables

        $A = {1, 2, 3} $ B = { a , b , c } B = \{a, b, c\} B={a,b,c}

        What is A × B A \times B A×B ?

        $ A \times B = {(1,a), (1,b),(1,c),(2,a),(2,b),(2,c),(3,a),(3,b),(3,c)}$

        Now you can link to tables by doing a Cartesian product, followed by selection

      • Join: r ⋈ s r\Join s rs

        • Cartesian Product followed by a selection
        • The selection usually (not always) match the corresponding attributes (foreign keys) of the two tables
        • Most common operation for merging tables.

        $\sigma_{A \times B} = \Join $

    [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-bEyTO9YB-1663289939496)(https://tva1.sinaimg.cn/large/e6c9d24egy1h685y0ojqxj21k10u0785.jpg)]

    Extension

    • Each query input is a table (or set of tables)
    • Each query output is a table.
    • All data in the output table appears in one of the input tables
    • There are many things that relational algebra (and SQL) cannot do
      • E.g. things that require recursion
    • Extension of relational algebra to incorporate SQL statements such as
      • SUM
      • AVG
      • MAX
      • MIN
      • GROUP BY…HAVING

    Why?

    • Very straight forward way of converting SQL statements into a list of operations

    • Now executing a query becomes executing a program of such operations

    • Implementing them correctly and efficiently will ensure the database system perform well

    Course outline

    • NoSQL (from data modelling perspective)

    • Internals of a DBMS

      • Query execution and optimization

        SElECT name
        From Student
        WHERE gpa > 3.9
        
        • 1
        • 2
        • 3
      • Concurrency Control

      • Recovery

    • Distributed databases / Big Data

      • NoSQL (from a performance perspective)

    At least most of the mainstream database systems like mysql, SQL Server, DB2, all these are still running on relational model. Now, there’s some modification here. Some people will tell you they are running the object relational model. But underlying it is still a relational model. And relational model has been very popular, has been very successful.

    At least the second part of the class will be focused on the internals of a database system, a relational database system.

    So this class, we’re going to look more inside the internal operations.We’re going to look at the clear execution optimization.

    if you understand underlying how the database system work now we can flip the table back and let’s say you IDBA, let’s say you the administrator of a database.

    Murphy’s Law: Anything that can go wrong will go wrong, and at the worst possible time.

    So let’s say an example. Let’s say I want to transfer money from a savings account to a checking account. I want to transfer $100 from me to you. It basically need to take to subtract $100 from my account. And they need to add $100 to your account

    From my account: -100

    ---------------------------------------- Murphy’s law: the system will crash here.

    To your account: +100

    If the system crashed here for 5 seconds and the system restart. But how do you make sure thing doesn’t go haywire?

    ACID doesn’t cover full recovery, It also cover an issue called Concurrency Control.

    What should the ATM do? The first thing is to check whether your have $100. If you does have 100$. Then put a cut out. The will minus 100$ from my account.

    Screen Shot 2022-09-11 at 15.39.14

    ATM:you x = 100$
    Your spouse

    You put your card in ATM. And you to check if you have a $100

    if subtract $100 from your account and then give you 100 bucks

    IMG_A4D7DB51E1DC-1

    NoSQL databases

    • A history of database systems
      image-20220911162319193

    From: Guy Harrison, “[Next generation databases : NoSQL and Big Data](https://smu.primo.exlibrisgroup.com/discovery/fulldisplay?docid=alma9952401872403716&context=L&vid=01SMU_INST:01SMU&lang=en&adaptor=Local Search Engine)”, Apress, 2015
    IMG_9BD8D57FEB0B-1

    First Generation Databases

    • Network Model/Hierarchical Model
    • Works on mainframes
    • Navigation based (i.e. you need to tell the DBMS “where” the data is)
      • E.g. following pointers and links (parent-child etc.)
    • Drawbacks
      • Inflexible schema structure (next to impossible to change mid-stream)
      • Navigation based means complex query equal to complex program
        • User have to specify how to get to the data

    Think about a database represented as a tree. And each piece of data if they knock on the tree.

    tree structure is also very rigid

    Second Generation Database

    • Relational Model

    • Advantages

      • Well-defined mathematical background

      • **Normal forms. ** // Normalizaiton

        • “all non-key attributes must be dependent on “the key, the whole key, and nothing but the key—So Help Me Codd”
      • Separation of physical and logical layers

        • Make things like optimization possible/manageable

          SELECT name 
          FROM Student
          WHERE gpa > 3.6
          
          • 1
          • 2
          • 3

          Do you have any clue how the data is actually physically organized in the database?

          You have no clue. Do you need to have any clue?

          NO. That’s the key point. That’s where the religion model becomes very successful.

          If I am a user, do I care how the data is being stored? No, I don’t. But if you are a DBA, you might want to care because you want to create to return faster. but the total separation of physical and logical layer is important.

      • Full transaction model (ACID)

        • For concurrency and recovery

          ACID 表示原子性(atomicity)、一致性(consistency)、隔离性(isolation)、持久性(durability)。ACID 是数据库事务(database transaction)需要向用户确保有效的属性——即使在出现崩溃、错误、硬件故障等情况时也需要保证这些属性。

      • Well suited for client-server systems

      You have a database that sits somewhere on a computer, constantly running

      You have other people have either web browser machine, they send a query, hey, I want to ask this school query. He wants to assess the centralized database, collect the query, execute it, and send the results back to the user.

      A
      Database
      C
      D
      E
      ...

    [外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-6EUwaIhk-1663289939499)(https://tva1.sinaimg.cn/large/e6c9d24egy1h685yi7k86j20wc03iaah.jpg)]

    Limit of relational models

    • First normal form

    • No compound value (set/list/sequence etc.)

      17330, 7445
      27330, 7555, 7688

      This might be what you want to do right, but in the relation model, this is seriously discouraged.

      Why is it being discouraged in a traditional relational database?

      So that means a human being had to encode this.

      I can encode this way 7330, 7555, 7688 or 7330 # 7555 # 7688

      Then let’s say find all student who asks who is taking 7555? Your sql will be very ugly

      Because you basically are giving up all the advantage to a relational database give you so you are forced to do something like this. Which is fine. It’s just that people don’t like it.

    • To store this require multiple tuples:

    Student IDCourse
    1CS 7330
    1CS 7445
    2CS 7330
    2CS 7555
    2CS 7688

    a student was taking two classes, I need to have two separate tuples

    • Modeling of (undirected) graph can be very tricky
    Edge IDNode 1Node2
    112
    214
    332
    451
    546

    Can you think of some real life data that is very nicely suited to be model by graph?

    Can you think of application where the data is better suited to be as a graph?

    • Do we need to duplicate edge?

    • Either query will be complicated (e.g. node1 = 3 or node2 = 3)

      If I want to see Node1 and Node5 is connected, how will the sql looks like?

      select count(*)
      from t
      where ((node1 = "1") AND (node2 = "5"))
      			 or
      			 ((node1 = '5') AND (node2 = "1"))
      
      • 1
      • 2
      • 3
      • 4
      • 5
    • Or inconsistency may occur (deleted (1, 3) but forget to delete (3,1)

      If you delete one five and forget to delete five one, then you’re screwed.

      Very high potential inconsistency.

    Screen Shot 2022-09-12 at 00.15.15

    • Development of new applications

      • Large amount of data

      • Not necessarily structured

        • Potentially evolving
      • Required high availability and fault tolerance

        • ACID may not be the best options for transactions

    SQL does not do recursion, because you don’t know how many tables you will jump.

    NoSQL databases

    • More flexible data models

      • Semi-structured / Non-structured
      • Allow for evolving (non-fixed schema)
    • Weak consistency model

      • Not requiring ACID properties
    • Different storage management

      • E.g. storing via columns
    • Designed to allow replication

      • Need to handle consistency (or allow limited level of inconsistency)
    • Targeted for Big Data / Web applications

    事务具有4个特性:原子性(Atomicity)、一致性(Consistency)、隔离性(Isolation)和持续性(Durability)。这4个特性简称为ACID特性(ACID properties)。

    NoSQL Data Models

    • Key-Value stores

    • Wide Columns stores

    • Document stores

    • Graph stores

    Key-Value Store

    • Data are represented as a set of (key, value) pairs

    • Key

      • Can be simple (single attributes) or structured (composition of attributes)
      • DBMS provide efficient means of indexing them
    • Value

      • Treated as a BLOB (Binary large object)

      • No structure (schemaless)

        • User has to interpret the structure him/herself
      • Some system provide a little bit of structure

        • List/tables of values
    • Typically provide efficient operations on

      • Get(Key)
      • Put(Key, value)
      • Delete(Key
    • Suitable for system that want maximum flexibility

      • Cost: no database support of handling/interpreting values
    • Example: Redis
      image-20220912004000132

    But the whole thing the whole point said the key value store call value very efficient organization based on the key and the key only.

    Example: doctor -> patient treatment records

    Wide Column Store

    • Rows and column sets

    • A set of rows

      • Each with a row-key
    • Each row has a fixed number of column sets (family)

      • Each column set has an arbitrary set of columns
      • No need to be the same for different rows
      • Each column has a column name and value associated with it.
    • Many such system allow timestamps

      • Each column has a timestamp indicating when it is added to the database
    • Data can be group by rows or by column sets

    • Index structure available for columns

    • Support big data operations such as map reduce

    • Example: Apache HBase

    image-20220912004000132

    Document Store

    • Extended key-value structure where value is represented as a “document”

      • E.g. XML, JSON
      • Basically a document has a set of attributes
      • Each attribute has a name and a set of values
      • Attributes can be different for different keys
    • Data are grouped into buckets (like tables)

    • Relationship between documents are modelling via embedding or referencing documents

      • Embedding: attributes are documents themselves

        duplicating the way

      • References: attributes are keys of other documents

    • Provide indexes for document attributes

    • Provide query language support to retrieve (part of) a document

    • Example: MongoDB, Amazon DynamoDB

    image-20220912004506174

    Graph Store

    • Data Represented as graphs

    • Potential different types of graphs

      • Undirected/Directed graphs
      • Labeled graph: vertex/edge have scalar values
      • Attribute graph: (key, value) pairs associated
      • Multigraphs: Multiple edges between vertices
      • Hypergraph: Each edge is a set connecting multiple vertices
    • Two types of storage

      • Non-native: build on top of other storage system (e.g. Document store)
      • Native: Build from scratch
    • Graph Query

      • Path query: check if a path exist between two nodes
        • May specify only certin type of paths (e.g. only contain certain labels)
      • Pattern matching query: find all the path that isomorphic with the query graph (maybe allow wildcards in some places)
      • Some system provide more complicated graph queryies
    • Example: Neo4j, Titan

    image-20220912004832874

    image-20220912004841245

    Other major issues

    • Most NoSQL database are aim at large, distributed data with long transactions

      • Different consistency/isolation levels then ACID

      • Distributed data issues

        • Data may be distributed and replicated

        • Multiple copies of the database running on different sites

        • Network may fail without any of the database failing

        • The CAP issue

          • Consistency – (close to) serializability
          • Availability – How likely the system can withstand failure of multiple sites
          • Parition tolerance – will network failure causes different copy of the database to become inconsisant
        • CAP Theorem: You can only choose 2 out of the 3 to maintain correctly.

  • 相关阅读:
    java对象四大引用案例
    Ubuntu中USB端口与外设绑定,ROS读取IMU模块数据
    笔试强训第一天
    虚拟环境创建、配置及激活
    企业数据治理的下一步是数据资产管理?
    Bloc入门之Bloc详解
    掌握这四步,月收入1万+的自媒体人可能就是你
    简单了解:什么是低代码?
    doc转html后添加style和导航
    Java笔记——网络原理03
  • 原文地址:https://blog.csdn.net/weixin_40184069/article/details/126883523