Machine Learning: The K-means Algorithm Explained, with a Python Implementation


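1. Concepts

1. Clustering

Put simply, clustering groups similar items together.

2. Clusters (K)

K is the number of groups we want to divide the data into.

3. Centroid (O)

The point obtained by averaging the sample vectors along each dimension. It is recomputed at every iteration.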
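
As a small illustration of the centroid definition above, the following sketch averages three made-up 2-D samples column by column. The values are hypothetical and chosen only so the result is easy to check by hand.

```python
import numpy as np

# Three hypothetical 2-D samples assumed to belong to one cluster.
points = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 9.0]])

# The centroid is the mean of each dimension (column) over the samples.
centroid = points.mean(axis=0)
print(centroid)  # [3. 5.]
```

Note that the centroid does not have to coincide with any actual sample.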

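4. Distance metric

The distance between two samples in feature space. It is usually measured with Euclidean distance on standardized data, or with cosine similarity.

Standardization: rescaling the values of each feature.

Put simply, it brings features with large values and features with small values to the same order of magnitude, so that no single feature dominates the distance.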
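
The sketch below shows why scaling matters for a distance-based method. It assumes z-score standardization (zero mean, unit variance per column), which is one common choice and not necessarily the one the article has in mind; the sample values and the helper names `euclidean_distance` and `cosine_similarity` are made up for illustration.

```python
import numpy as np

# Hypothetical samples: column 0 is in the thousands, column 1 is below ten.
X = np.array([[1000.0, 1.0],
              [2000.0, 3.0],
              [3000.0, 5.0]])

# Z-score standardization (an assumed choice): zero mean, unit variance per column.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

raw = euclidean_distance(X[0], X[1])             # dominated almost entirely by column 0
scaled = euclidean_distance(X_std[0], X_std[1])  # both columns contribute equally
print(raw, scaled)
print(cosine_similarity(X[0], X[1]))
```

Before scaling, the distance is essentially the difference in column 0; after scaling, both columns carry the same weight.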

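5. Optimization:

K-means minimizes the sum of squared distances between each sample $x$ and the centroid $c_i$ of the cluster $C_i$ it is assigned to, over all K clusters:
$\min\sum_{i=1}^{K}\sum_{x\in C_i}\mathrm{dist}(c_i,x)^2$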
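
To make the objective concrete, here is a minimal sketch that evaluates it for a given assignment, assuming `dist` is the Euclidean distance. The function name `kmeans_objective` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_objective(data, centroids, labels):
    """Sum of squared Euclidean distances from each sample to its assigned centroid."""
    diffs = data - centroids[labels]  # labels[j] is the cluster index of sample j
    return float(np.sum(diffs ** 2))

# Toy data: two clusters of two points each; every point is at distance 1 from its centroid.
data = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

print(kmeans_objective(data, centroids, labels))  # 4.0
```

Swapping the labels of the two clusters would make the value much larger, which is the sense in which the assignment above is the better one.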

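2. Pros and cons

Pros:

Simple and fast; it does not require heavy mathematics.

Cons:

The value of K (the number of clusters) is hard to choose.

Clusters of arbitrary shape are hard to separate. Samples are assigned by distance to a centroid, so each cluster ends up as a roughly round region around its centroid.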

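3. Python code structure

The code implements the steps of the K-means algorithm one by one:

1. Decide on the number of clusters, K.

2. Randomly pick K samples as the initial centroids.

3. Keep updating the centroid positions until each one settles at the center of a group of samples (here, for a fixed number of iterations).
k_means.py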

import numpy as np


class KMeans:
    def __init__(self, data, num_clusters):
        self.data = data
        # Number of clusters (K)
        self.num_clusters = num_clusters

    # max_iterations: maximum number of iterations
    def train(self, max_iterations):
        # Pick K initial centroids
        centroids = KMeans.centroids_init(self.data, self.num_clusters)
        num_examples = self.data.shape[0]
        closest_centroids_ids = np.empty((num_examples, 1))
        # Iterate
        for _ in range(max_iterations):
            # Assign every sample to its nearest centroid
            closest_centroids_ids = KMeans.centroids_find_closest(self.data, centroids)
            # Update the centroid positions
            centroids = KMeans.centroids_compute(self.data, closest_centroids_ids, self.num_clusters)
        return centroids, closest_centroids_ids

    # Centroid initialization: use K randomly chosen samples
    @staticmethod
    def centroids_init(data, num_clusters):
        num_examples = data.shape[0]
        # Random ordering of the sample indices
        random_ids = np.random.permutation(num_examples)
        # The first K shuffled samples become the initial centroids
        centroids = data[random_ids[:num_clusters], :]
        return centroids

    # For each sample, find the index of the nearest of the K centroids
    @staticmethod
    def centroids_find_closest(data, centroids):
        # Number of samples
        num_examples = data.shape[0]
        # Number of centroids (clusters)
        num_centroids = centroids.shape[0]
        closest_centroids_ids = np.zeros((num_examples, 1))
        for example_index in range(num_examples):
            # The shape must be passed as a tuple
            distance = np.zeros((num_centroids, 1))
            for centroid_index in range(num_centroids):
                # Squared Euclidean distance (the square root does not change the argmin)
                distance_diff = data[example_index, :] - centroids[centroid_index, :]
                distance[centroid_index] = np.sum(distance_diff ** 2)
            closest_centroids_ids[example_index] = np.argmin(distance)
        return closest_centroids_ids

    # Centroid update: move each centroid to the mean of its assigned samples
    @staticmethod
    def centroids_compute(data, closest_centroids_ids, num_clusters):
        num_features = data.shape[1]
        centroids = np.zeros((num_clusters, num_features))
        for centroid_id in range(num_clusters):
            closest_ids = closest_centroids_ids == centroid_id
            # Mean over each dimension; write into the row of this centroid.
            # Note: a cluster with no samples yields NaN here.
            centroids[centroid_id] = np.mean(data[closest_ids.flatten(), :], axis=0)
        return centroids
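
The nested loops in the assignment step can also be written with NumPy broadcasting, and the loop can stop early once the centroids no longer move. The following is a minimal standalone sketch of that idea, not a replacement for the class above: the names `assign_clusters` and `fit`, the `tol` and `seed` parameters, and the choice to keep the old centroid when a cluster is empty are all assumptions made here. Unlike `train`, it returns labels of shape `(n,)` and not `(n, 1)`.

```python
import numpy as np

def assign_clusters(data, centroids):
    """Index of the nearest centroid for every sample, shape (n,)."""
    # Broadcasting: (n, 1, d) - (1, k, d) -> (n, k, d); sum over d gives (n, k).
    sq_dist = np.sum((data[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    return np.argmin(sq_dist, axis=1)

def fit(data, num_clusters, max_iterations=50, tol=1e-6, seed=0):
    """K-means with random-sample initialization and early stopping."""
    rng = np.random.default_rng(seed)
    # Pick K distinct sample indices as the initial centroids.
    centroids = data[rng.permutation(len(data))[:num_clusters]].astype(float)
    for _ in range(max_iterations):
        labels = assign_clusters(data, centroids)
        new_centroids = centroids.copy()
        for k in range(num_clusters):
            members = data[labels == k]
            if len(members) > 0:  # keep the old centroid if the cluster is empty
                new_centroids[k] = members.mean(axis=0)
        # Total movement of the centroids in this iteration.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, assign_clusters(data, centroids)

# Two well-separated hypothetical groups of points.
toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
toy_centroids, toy_labels = fit(toy, num_clusters=2)
print(toy_centroids)
print(toy_labels)
```

The broadcast version builds an intermediate array of shape `(n, k, d)`, so it trades memory for speed; for very large datasets the explicit loop or a chunked variant may be preferable.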
        


4. Case study: clustering the IRIS dataset

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from k_means import KMeans

# Load the dataset
data = pd.read_csv('iris.csv')
iris_types = ['setosa', 'versicolor', 'virginica']

x = 'Petal_Length'
y = 'Petal_Width'

# Plot the data with the known labels
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
for iris_type in iris_types:
    plt.scatter(data[x][data['Species'] == iris_type], data[y][data['Species'] == iris_type], label=iris_type)

plt.title('label known')
plt.legend()

# Plot the same data without labels
plt.subplot(1, 2, 2)
plt.scatter(data[x][:], data[y][:])
plt.title('label unknown')
plt.show()

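The resulting plots are shown below:

The dataset contains three species of iris. When we do not know how many classes the data contains (that is, when K is unknown), clustering becomes difficult.

For example, with K=2 we would split the data into the lower-left group and everything else; with K=3 we get the grouping shown in the figure; and so on.

So when labels are available, prefer classification over clustering. K-means works best when the number of classes is already known.

Below, we run k-means on the "label unknown" data from above, with the number of clusters set to 3.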

# Run the clustering

# Prepare the data
num_examples = data.shape[0]
x_train = data[[x, y]].values.reshape(num_examples, 2)

# Number of clusters (K)
num_clusters = 3
# Run 50 iterations
max_iterations = 50

k_means = KMeans(x_train, num_clusters)
# Train
centroids, closest_centroids_ids = k_means.train(max_iterations)

# Compare the results
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
for iris_type in iris_types:
    plt.scatter(data[x][data['Species'] == iris_type], data[y][data['Species'] == iris_type], label=iris_type)
plt.title('label known')
plt.legend()

plt.subplot(1, 2, 2)
# Plot the samples of each cluster
for centroids_id, centroid in enumerate(centroids):
    current_examples_index = (closest_centroids_ids == centroids_id).flatten()
    plt.scatter(data[x][current_examples_index], data[y][current_examples_index], label=centroids_id)

# Mark the centroids with black crosses
for centroids_id, centroid in enumerate(centroids):
    plt.scatter(centroid[0], centroid[1], c='black', marker='x')

plt.legend()
plt.title('label kmeans')
plt.show()

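The clustering result is shown below:

As the figure shows, after 50 iterations the centroids have been found, and each one sits at the center of its cluster.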
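
Besides comparing the two plots by eye, one could count how the cluster ids line up with the known species. The sketch below is a hypothetical helper (`contingency_table` is not part of the article's code), and the six labels are made-up stand-ins for `data['Species']` and for the `(n, 1)` array returned by `train`, not real results.

```python
import numpy as np

def contingency_table(true_labels, cluster_ids, num_clusters):
    """Rows: known classes (sorted); columns: cluster ids; cells: sample counts."""
    classes = np.unique(true_labels)
    ids = np.asarray(cluster_ids).flatten().astype(int)
    table = np.zeros((len(classes), num_clusters), dtype=int)
    for row, cls in enumerate(classes):
        # Count how many samples of this class fall into each cluster.
        table[row] = np.bincount(ids[true_labels == cls], minlength=num_clusters)
    return classes, table

# Made-up stand-ins, not real output of the clustering above.
species = np.array(['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica'])
cluster_ids = np.array([[2.0], [2.0], [0.0], [0.0], [1.0], [0.0]])

classes, table = contingency_table(species, cluster_ids, num_clusters=3)
print(classes)
print(table)
```

Cluster ids are arbitrary, so a good result is one where each row has most of its count in a single column, regardless of which column it is.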


Original article: https://blog.csdn.net/tianhai12/article/details/126415694