K-Means Clustering Algorithm (K-means algorithm) - 白红宇's Personal Blog

Published: 2021-07-01 05:05:03 Category: Technical articles

k-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. k-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells.
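
Stated as a formula, the "nearest mean" goal above is usually written as minimizing the within-cluster sum of squared distances. The notation here is an assumption introduced for illustration and does not appear in the original post: S_1, ..., S_k are the clusters and \mu_i is the mean of the points in S_i.

```latex
\underset{S_1,\dots,S_k}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2,
\qquad
\mu_i = \frac{1}{|S_i|} \sum_{x \in S_i} x
```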

The problem is computationally difficult (NP-hard); however, there are efficient heuristic algorithms that are commonly employed and converge quickly to a local optimum. These are usually similar to the expectation-maximization algorithm for mixtures of Gaussian distributions via an iterative refinement approach employed by both algorithms. Additionally, they both use cluster centers to model the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the expectation-maximization mechanism allows clusters to have different shapes.

The algorithm has a loose relationship to the k-nearest neighbor classifier, a popular machine learning technique for classification that is often confused with k-means because of the k in the name. One can apply the 1-nearest neighbor classifier to the cluster centers obtained by k-means to classify new data into the existing clusters. This is known as the nearest centroid classifier or the Rocchio algorithm.
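
As a minimal sketch of that last idea, the snippet below assigns new points to whichever existing center is closest. The function name `nearest_center` and the hard-coded centers are hypothetical, standing in for the output of an earlier k-means run, and Euclidean distance is assumed.

```python
import math

def nearest_center(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

# Hypothetical centers, as if produced by an earlier k-means run.
centers = [(0.0, 0.0), (10.0, 10.0)]

# Classify two new points into the existing clusters.
new_points = [(1.0, 2.0), (9.0, 8.5)]
labels = [nearest_center(p, centers) for p in new_points]
print(labels)  # [0, 1]
```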

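What the algorithm does, in short: given many points on a screen, it groups each point with the cluster center nearest to it.

The k-means algorithm is a clustering algorithm that partitions n objects into k groups according to their attributes, with k < n. It closely resembles the expectation-maximization algorithm for mixtures of normal distributions, because both try to find the centers of the natural clusters in the data.

Clustering is, in essence, finding things that are closely related and separating them into groups. When there are only a few items, a person can do this by hand. With large-scale data, manual effort falls short, so we rely on computers to find the relationships within massive data sets.

A key quantity in clustering is the value that expresses how closely two items are related, called the distance metric. The two earlier posts, on similarity measures and the Pearson correlation coefficient, both describe ways of computing such a measure. Choose the metric that fits the situation at hand. This choice matters a great deal, because it directly determines whether the clustering result matches the actual need.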
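
To show how much the choice of metric can matter, here is a small sketch comparing two measures on the same pair of vectors. The function names are hypothetical, and "1 minus the Pearson correlation" is only one common way, assumed here, of turning the correlation into a distance.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_distance(a, b):
    """1 - Pearson correlation: near 0 when the vectors rise and fall together."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    norm_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    norm_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # assumption: treat a constant vector as uncorrelated
    return 1.0 - cov / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same trend as `a`, but twice the magnitude

print(euclidean_distance(a, b))  # clearly non-zero: the magnitudes differ
print(pearson_distance(a, b))    # approximately 0: the trends are identical
```

The two vectors are far apart by Euclidean distance yet practically identical by the Pearson-based measure, so the same data could cluster quite differently under each metric.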

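K-Means Clustering

This is a classic clustering algorithm with fairly good time and space complexity. As the name suggests, its core purpose is to group the data into K clusters. The procedure is:

1. Randomly pick K points from the data set to serve as the center of each cluster (you may also choose the K points by some other method).

2. Using the distance metric, assign every point in the data set to the nearest of the K centers. The points assigned to one center form a cluster, which gives K clusters.

3. Compute the mean of each of the K clusters to obtain a new center (the kind of mean depends on the model; typically the center is the point that minimizes the sum of squared distances to all points in the cluster, which for Euclidean distance is the arithmetic mean), and replace each cluster's previous center with the new one.

4. Check whether the new set of K centers is the same as the previous set. If not, go back to step 2; if so, clustering is complete.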
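
The four steps can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the original author's implementation: the function name `kmeans`, the `max_iters` and `seed` parameters, the use of Euclidean distance, and the handling of empty clusters are all choices made here for the example.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Cluster a list of equal-length tuples into k groups (assumes max_iters >= 1)."""
    rng = random.Random(seed)
    # Step 1: pick k data points at random as the initial centers.
    centers = rng.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each center as the arithmetic mean of its cluster.
        new_centers = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean = tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                new_centers.append(mean)
            else:
                # Assumption: an empty cluster simply keeps its old center.
                new_centers.append(centers[i])
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

# Two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, clusters = kmeans(points, k=2)
print(centers)
print([len(c) for c in clusters])
```

Because the starting centers are random, the algorithm may settle on different local optima for different seeds on harder data. A common remedy is to run it several times and keep the result with the smallest total squared distance.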

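Reposted from: https://mysoft.blog.csdn.net/article/details/60585015 If this infringes your copyright, please leave a comment with the address of the original article and we will delete this post. We apologize for any inconvenience.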
