• cs224n-2022-assignment1


    Count-based word vectors

    Word vectors are obtained from the co-occurrence matrix M, where M_{ab} is the number of times word a appears within the context window of word b.

    Because the co-occurrence matrix is very high-dimensional, SVD (in fact truncated SVD, which keeps only the few largest singular values) is used to project it down, yielding word embeddings that are then normalized.
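    As a minimal illustration of the idea (using plain numpy rather than the scikit-learn class the assignment calls for), truncated SVD keeps only the k largest singular values, and the k-dimensional embeddings are the corresponding rows of U scaled by those singular values; the matrix here is a made-up toy example:

```python
import numpy as np

# Toy "co-occurrence" matrix (4 words x 4 words), symmetric by construction.
M = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 1.],
              [1., 0., 0., 3.],
              [0., 1., 3., 0.]])

# Full SVD: M = U @ diag(s) @ Vt, with s sorted in descending order.
U, s, Vt = np.linalg.svd(M)

# Truncated SVD keeps the k LARGEST singular values; the k-dim
# embeddings are the rows of U[:, :k] scaled by s[:k] (one row per word).
k = 2
M_reduced = U[:, :k] * s[:k]
print(M_reduced.shape)  # (4, 2)
```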

    1. distinct_words

      def distinct_words(corpus):
          """ Determine a list of distinct words for the corpus.
              Params:
                  corpus (list of list of strings): corpus of documents
              Return:
                  corpus_words (list of strings): sorted list of distinct words across the corpus
                  n_corpus_words (integer): number of distinct words across the corpus
          """
          corpus_words = []
          n_corpus_words = -1
          
          # ------------------
          # Write your implementation here.
          corpus = [y for x in corpus for y in x]
          corpus_words = sorted(set(corpus))
          n_corpus_words = len(corpus_words)
      
          # ------------------
      
          return corpus_words, n_corpus_words
      
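      A quick sanity check on a two-document toy corpus (the helper below is a compact restatement of the function above, redefined so the snippet is self-contained):

```python
def distinct_words(corpus):
    # Flatten the list of documents, deduplicate, and sort.
    flat = [w for doc in corpus for w in doc]
    words = sorted(set(flat))
    return words, len(words)

corpus = [["<START>", "all", "that", "glitters", "<END>"],
          ["<START>", "all", "is", "well", "<END>"]]
words, n = distinct_words(corpus)
# '<' sorts before lowercase letters, so the special tokens come first.
print(words)  # ['<END>', '<START>', 'all', 'glitters', 'is', 'that', 'well']
print(n)      # 7
```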
    2. compute_co_occurrence_matrix

      Compute the co-occurrence matrix for a fixed window size.

      def compute_co_occurrence_matrix(corpus, window_size=4):
          """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
          
              Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
                    number of co-occurring words.
                    
                    For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
                    "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
          
              Params:
                  corpus (list of list of strings): corpus of documents
                  window_size (int): size of context window
              Return:
                  M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): 
                      Co-occurence matrix of word counts. 
                      The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
                  word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
          """
          words, n_words = distinct_words(corpus)
          M = None
          word2ind = {}
          
          # ------------------
          # Write your implementation here.
          M = np.zeros(shape=(n_words, n_words), dtype=np.int32)
          for i in range(n_words):
              word2ind[words[i]] = i
          
          for sent in corpus:
              for i in range(len(sent)):
                  center = word2ind[sent[i]]  # row index of the current word
                  
                  # words before the current word
                  for w in sent[max(0, i - window_size):i]:
                      M[center][word2ind[w]] += 1
                  
                  # words after the current word (slicing clamps to len(sent))
                  for w in sent[i + 1:i + window_size + 1]:
                      M[center][word2ind[w]] += 1
      
          # ------------------
      
          return M, word2ind
      
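      A quick check on a one-sentence corpus (the helper below mirrors the implementation above in condensed form): with window size 1, each occurrence of a word counts its immediate neighbors, and the resulting matrix is symmetric:

```python
import numpy as np

def co_occurrence(corpus, window_size=1):
    words = sorted({w for doc in corpus for w in doc})
    word2ind = {w: i for i, w in enumerate(words)}
    M = np.zeros((len(words), len(words)), dtype=np.int32)
    for sent in corpus:
        for i, w in enumerate(sent):
            row = word2ind[w]
            # neighbors on both sides of position i, within the window
            for c in sent[max(0, i - window_size):i] + sent[i + 1:i + window_size + 1]:
                M[row, word2ind[c]] += 1
    return M, word2ind

# "a" neighbors "b" twice (once from each side), and vice versa.
M, w2i = co_occurrence([["a", "b", "a"]], window_size=1)
print(M)
# [[0 2]
#  [2 0]]
```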
    3. reduce_to_k_dim

      Use truncated SVD to reduce the co-occurrence matrix (N×N) to N×2, i.e. project each N-dimensional word vector into two dimensions, forming 2-D word embeddings.

      def reduce_to_k_dim(M, k=2):
          """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
              to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
                  - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
          
              Params:
                  M (numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): co-occurence matrix of word counts
                  k (int): embedding size of each word after dimension reduction
              Return:
                  M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensioal word embeddings.
                          In terms of the SVD from math class, this actually returns U * S
          """    
          n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
          M_reduced = None
          print("Running Truncated SVD over %i words..." % (M.shape[0]))
          
          # ------------------
          # Write your implementation here.
          svd = TruncatedSVD(n_components=k, n_iter=n_iters)
          M_reduced = svd.fit_transform(M)
          
          # ------------------
      
          print("Done.")
          return M_reduced
      
    4. Run plot_embeddings


    5. Analyzing the co-occurrence matrix

      Compute the co-occurrence matrix, obtain 2-D word embeddings via SVD, and normalize the embeddings.
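      The normalization step divides each embedding by its L2 norm, so that every word vector has unit length and comparisons reduce to cosine similarity; a minimal sketch with a made-up 2×2 embedding matrix:

```python
import numpy as np

M_reduced = np.array([[3.0, 4.0],
                      [1.0, 0.0]])

# Divide each row by its L2 norm (keepdims=True so broadcasting lines up).
lengths = np.linalg.norm(M_reduced, axis=1, keepdims=True)
M_normalized = M_reduced / lengths
print(M_normalized)  # rows [0.6, 0.8] and [1.0, 0.0], each of unit length
```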


    Prediction-based word vectors

    Use GloVe to generate word vectors and explore them.

    We examine polysemy, synonyms and antonyms, analogies, and bias.

    • Polysemy: in practice, when taking the 10 words most similar to a polysemous word, near-synonyms of its different senses usually all appear, so GloVe handles polysemy fairly well;
    • Synonyms and antonyms: for example, "good" and "fantastic" are synonyms while "good" and "bad" are antonyms, yet the antonym pair is closer in the embedding space and the synonym pair farther apart. This may be because the synonyms rarely appear in the same contexts, while the antonyms do so frequently;
    • Analogies: already covered in my lecture notes;
    • Bias: https://www.sohu.com/a/108071041_114877
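    The synonym/antonym observation comes down to cosine distance. A sketch on hypothetical toy vectors (the 2-D vectors below are made up purely for illustration, not real GloVe embeddings) shows how an antonym can end up closer than a synonym when it shares more contexts:

```python
import numpy as np

def cosine_distance(u, v):
    # Cosine distance = 1 - cosine similarity.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical vectors: "good" and "bad" share many contexts
# ("the food was ___"), so they may point in similar directions
# despite having opposite meanings.
good = np.array([1.0, 0.2])
bad = np.array([0.9, 0.3])
fantastic = np.array([0.3, 1.0])

print(cosine_distance(good, bad))        # small: frequent shared contexts
print(cosine_distance(good, fantastic))  # larger: rarer shared contexts
```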
    1. GloVe word embeddings


    All remaining run outputs are omitted.

  • Original post: https://blog.csdn.net/rd142857/article/details/125493967