Word vectors are obtained from a co-occurrence matrix $M$, where $M_{ab}$ is the number of times word $a$ appears within the context window of word $b$.
Because the co-occurrence matrix is very high-dimensional, SVD (in fact Truncated SVD, which keeps only the $k$ largest singular values) is used to project it down, yielding word embeddings that are then normalized.
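The definition of $M$ can be checked by hand on a toy example (the sentence and the window size of 1 below are chosen purely for illustration):

```python
import numpy as np

# Toy document with window_size = 1: each word co-occurs only with its
# immediate neighbors.
sentence = ["<START>", "all", "that", "glitters", "<END>"]
vocab = sorted(set(sentence))
word2ind = {w: i for i, w in enumerate(vocab)}

M = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
for i, center in enumerate(sentence):
    for j in range(max(0, i - 1), min(len(sentence), i + 2)):
        if j != i:
            M[word2ind[center], word2ind[sentence[j]]] += 1

# Counting both directions for every center word makes M symmetric.
print(M[word2ind["all"], word2ind["that"]])  # 1
```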
distinct_words
import numpy as np
from sklearn.decomposition import TruncatedSVD

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    n_corpus_words = -1
    # ------------------
    # Write your implementation here.
    corpus = [word for document in corpus for word in document]  # flatten the corpus into one list of words
    corpus_words = sorted(set(corpus))  # deduplicate and sort
    n_corpus_words = len(corpus_words)
    # ------------------
    return corpus_words, n_corpus_words
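A quick sanity check on a toy corpus (the helper below is a condensed copy of the function above, redefined so the snippet runs on its own):

```python
def distinct_words(corpus):
    """Flatten the corpus and return sorted unique words with their count."""
    flattened = [word for document in corpus for word in document]
    corpus_words = sorted(set(flattened))
    return corpus_words, len(corpus_words)

corpus = [["<START>", "all", "that", "glitters", "<END>"],
          ["<START>", "all", "is", "well", "<END>"]]
words, n_words = distinct_words(corpus)
print(words)    # ['<END>', '<START>', 'all', 'glitters', 'is', 'that', 'well']
print(n_words)  # 7
```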
compute_co_occurrence_matrix
Compute the co-occurrence matrix with a fixed window size.
def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
        number of co-occurring words.
        For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
        "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)):
                Co-occurrence matrix of word counts.
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)
    M = None
    word2ind = {}
    # ------------------
    # Write your implementation here.
    M = np.zeros(shape=(n_words, n_words), dtype=np.int32)
    for i in range(n_words):
        word2ind[words[i]] = i
    for sent in corpus:
        for i in range(len(sent)):
            tmp = word2ind[sent[i]]  # row index of the current (center) word
            # words before the center, within the window
            for w in sent[max(0, i - window_size):i]:
                M[tmp][word2ind[w]] += 1
            # words after the center, within the window
            for w in sent[i + 1:min(len(sent), i + window_size + 1)]:
                M[tmp][word2ind[w]] += 1
    # ------------------
    return M, word2ind
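The example in the docstring can be verified with a compact re-implementation (same logic as above, condensed so the snippet is self-contained):

```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    # Build the vocabulary and index, then count neighbors within the window.
    words = sorted({w for doc in corpus for w in doc})
    word2ind = {w: i for i, w in enumerate(words)}
    M = np.zeros((len(words), len(words)), dtype=np.int32)
    for sent in corpus:
        for i, center in enumerate(sent):
            window = sent[max(0, i - window_size):i] + sent[i + 1:i + window_size + 1]
            for w in window:
                M[word2ind[center], word2ind[w]] += 1
    return M, word2ind

doc = "<START> All that glitters is not gold <END>".split()
M, word2ind = compute_co_occurrence_matrix([doc], window_size=4)
# "All" should co-occur with <START>, that, glitters, is, not.
row = M[word2ind["All"]]
neighbors = sorted(w for w, i in word2ind.items() if row[i] > 0)
print(neighbors)  # ['<START>', 'glitters', 'is', 'not', 'that']
```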
reduce_to_k_dim
Use Truncated SVD to reduce the co-occurrence matrix (N×N) to shape (N×2), i.e., project the N-dimensional word vectors into a two-dimensional space to form 2-D word embeddings.
def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurrence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
        Params:
            M (numpy matrix of shape (number of unique words in the corpus, number of unique words in the corpus)): co-occurrence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensional word embeddings.
                In terms of the SVD from math class, this actually returns U * S
    """
    n_iters = 10  # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)
    # ------------------
    print("Done.")
    return M_reduced
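The docstring's note that this returns U·S can be checked against NumPy's exact SVD, which sorts singular values in descending order, so truncation keeps the k largest (a sketch; the random symmetric matrix below merely stands in for a real co-occurrence matrix):

```python
import numpy as np

# Random symmetric integer matrix standing in for the co-occurrence counts.
rng = np.random.default_rng(0)
A = rng.integers(0, 5, size=(6, 6))
M = (A + A.T).astype(float)

k = 2
U, S, Vt = np.linalg.svd(M)   # S is sorted in descending order
M_reduced = U[:, :k] * S[:k]  # keep the k LARGEST singular values: U * S
print(M_reduced.shape)        # (6, 2)
```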
Run plot_embeddings

Analyzing the co-occurrence matrix
Compute the co-occurrence matrix, obtain two-dimensional word embeddings via Truncated SVD, and normalize the embeddings.
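Normalizing here means rescaling each embedding row to unit length, so that plotted directions (rather than raw magnitudes) are compared; a minimal sketch:

```python
import numpy as np

M_reduced = np.array([[3.0, 4.0],
                      [0.6, 0.8]])
# Divide each row by its L2 norm so every embedding lies on the unit circle.
lengths = np.linalg.norm(M_reduced, axis=1, keepdims=True)
M_normalized = M_reduced / lengths
print(M_normalized)  # each row now has norm 1
```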

Generate word vectors with GloVe and explore them.
This part examines polysemy, synonyms and antonyms, analogies, and bias.
GloVe word embeddings
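The analogy test (e.g. king − man + woman ≈ queen) reduces to a nearest-neighbor search under cosine similarity. A self-contained sketch with hand-made toy vectors (not real GloVe values, which are 50–300 dimensional and learned from corpus statistics):

```python
import numpy as np

# Toy 3-d vectors chosen by hand so the analogy works out.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.5, 0.5, 0.5]),
}

def analogy(a, b, c):
    # Solve a : b :: c : ?  by finding the word closest to b - a + c,
    # excluding the three query words themselves.
    target = vecs[b] - vecs[a] + vecs[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(vecs[w], target))

print(analogy("man", "king", "woman"))  # queen
```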
![GloVe word embeddings](https://1000bd.com/contentImg/2022/06/28/171059221.png)
Subsequent run outputs are omitted.