Word vectors are obtained from a co-occurrence matrix $M$, where $M_{ab}$ is the number of times word $a$ appears within the context window of word $b$.
Because the co-occurrence matrix is very high-dimensional, SVD (in fact Truncated SVD, which keeps only the $k$ largest singular values) is used to project it down, yielding word embeddings that are then normalized.
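The definition of $M$ can be checked by hand on a toy example (the sentence and the window size of 1 below are chosen purely for illustration):

```python
import numpy as np

# Toy document with window_size = 1: each word co-occurs only with its
# immediate neighbors.
sentence = ["<START>", "all", "that", "glitters", "<END>"]
vocab = sorted(set(sentence))
word2ind = {w: i for i, w in enumerate(vocab)}

M = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
for i, center in enumerate(sentence):
    for j in range(max(0, i - 1), min(len(sentence), i + 2)):
        if j != i:
            M[word2ind[center], word2ind[sentence[j]]] += 1

# Counting both directions for every center word makes M symmetric.
print(M[word2ind["all"], word2ind["that"]])  # 1
```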
distinct_words
import numpy as np
from sklearn.decomposition import TruncatedSVD

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    n_corpus_words = -1
    # ------------------
    # Write your implementation here.
    corpus = [word for document in corpus for word in document]  # flatten the corpus into one list of words
    corpus_words = sorted(set(corpus))  # deduplicate and sort
    n_corpus_words = len(corpus_words)
    # ------------------
    return corpus_words, n_corpus_words
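A quick sanity check on a toy corpus (the helper below is a condensed copy of the function above, redefined so the snippet runs on its own):

```python
def distinct_words(corpus):
    """Flatten the corpus and return sorted unique words with their count."""
    flattened = [word for document in corpus for word in document]
    corpus_words = sorted(set(flattened))
    return corpus_words, len(corpus_words)

corpus = [["<START>", "all", "that", "glitters", "<END>"],
          ["<START>", "all", "is", "well", "<END>"]]
words, n_words = distinct_words(corpus)
print(words)    # ['<END>', '<START>', 'all', 'glitters', 'is', 'that', 'well']
print(n_words)  # 7
```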
compute_co_occurrence_matrix
Compute the co-occurrence matrix with a fixed window size.
def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
        number of co-occurring words.
        For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
        "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)
    M = None
    word2ind = {}
    # ------------------
    # Write your implementation here.
    M = np.zeros(shape=(n_words, n_words), dtype=np.int32)
    for i in range(n_words):
        word2ind[words[i]] = i
    for sent in corpus:
        for i in range(len(sent)):
            tmp = word2ind[sent[i]]  # row index of the current (center) word
            # words before the center, within the window
            for w in sent[max(0, i - window_size):i]:
                M[tmp][word2ind[w]] += 1
            # words after the center, within the window
            for w in sent[i + 1:min(len(sent), i + window_size + 1)]:
                M[tmp][word2ind[w]] += 1
    # ------------------
    return M, word2ind
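The example in the docstring can be verified with a compact re-implementation (same logic as above, condensed so the snippet is self-contained):

```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    # Build the vocabulary and index, then count neighbors within the window.
    words = sorted({w for doc in corpus for w in doc})
    word2ind = {w: i for i, w in enumerate(words)}
    M = np.zeros((len(words), len(words)), dtype=np.int32)
    for sent in corpus:
        for i, center in enumerate(sent):
            window = sent[max(0, i - window_size):i] + sent[i + 1:i + window_size + 1]
            for w in window:
                M[word2ind[center], word2ind[w]] += 1
    return M, word2ind

doc = "<START> All that glitters is not gold <END>".split()
M, word2ind = compute_co_occurrence_matrix([doc], window_size=4)
# "All" should co-occur with <START>, that, glitters, is, not.
row = M[word2ind["All"]]
neighbors = sorted(w for w, i in word2ind.items() if row[i] > 0)
print(neighbors)  # ['<START>', 'glitters', 'is', 'not', 'that']
```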
reduce_to_k_dim
Use Truncated SVD to reduce the co-occurrence matrix (N×N) to shape (N×2), i.e., project the N-dimensional word vectors into a two-dimensional space to form 2-D word embeddings.
def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)
    # ------------------
    print("Done.")
    return M_reduced
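The docstring's note that this returns U·S can be checked against NumPy's exact SVD, which sorts singular values in descending order, so truncation keeps the k largest (a sketch; the random symmetric matrix below merely stands in for a real co-occurrence matrix):

```python
import numpy as np

# Random symmetric integer matrix standing in for the co-occurrence counts.
rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=(6, 6))
M = (A + A.T).astype(float)

k = 2
U, S, Vt = np.linalg.svd(M)   # S is sorted in descending order
M_reduced = U[:, :k] * S[:k]  # keep the k LARGEST singular values: U * S
print(M_reduced.shape)        # (6, 2)
```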
Run plot_embeddings

Analyzing the co-occurrence matrix
Compute the co-occurrence matrix, obtain two-dimensional word embeddings via Truncated SVD, and normalize the embeddings.
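Normalizing here means rescaling each embedding row to unit length, so that plotted directions (rather than raw magnitudes) are compared; a minimal sketch:

```python
import numpy as np

M_reduced = np.array([[3.0, 4.0],
                      [0.6, 0.8]])
# Divide each row by its L2 norm so every embedding lies on the unit circle.
lengths = np.linalg.norm(M_reduced, axis=1, keepdims=True)
M_normalized = M_reduced / lengths
print(M_normalized)  # each row now has norm 1
```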

Generate word vectors with GloVe and explore them.
This part examines polysemy, synonyms and antonyms, analogies, and bias.
GloVe word embeddings
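The analogy test (e.g. king − man + woman ≈ queen) reduces to a nearest-neighbor search under cosine similarity. A self-contained sketch with hand-made toy vectors (not real GloVe values, which are 50–300 dimensional and learned from corpus statistics):

```python
import numpy as np

# Toy 3-d vectors chosen by hand so the analogy works out.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def analogy(a, b, c):
    # Solve a : b :: c : ?  by finding the word closest to b - a + c,
    # excluding the three query words themselves.
    target = vecs[b] - vecs[a] + vecs[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(vecs[w], target))

print(analogy("man", "king", "woman"))  # queen
```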
![GloVe word embeddings](https://1000bd.com/contentImg/2022/06/28/171059221.png)
Subsequent run outputs are omitted.