Created by 马明
K-Means
Two failure cases of K-Means
Gaussian Mixture Models (GMMs) give us more flexibility than K-Means
With GMMs we assume that the data points are Gaussian distributed
Two parameters to describe the shape of the clusters: the mean and the standard deviation
Expectation–Maximization (EM) is used to find the parameters of each cluster
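As a minimal sketch (assuming scikit-learn and synthetic two-blob data, not part of the original slides), fitting a GMM by EM looks like:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical 2-D data: two Gaussian blobs, for illustration only
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=1.0, size=(200, 2)),
    rng.normal(loc=(5, 5), scale=2.0, size=(200, 2)),
])

# EM fits one mean vector and one covariance matrix per component
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)   # hard assignments
probs = gmm.predict_proba(X)  # soft per-cluster responsibilities
print(gmm.means_, gmm.covariances_)
```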
How to evaluate a clustering algorithm?
Core idea: intra-cluster distances should be as small as possible; inter-cluster distances should be as large as possible
Most evaluation methods mainly measure three aspects:
$Index = \frac{\alpha \times \text{Separation}}{\beta \times \text{Compactness}}$
Silhouette coefficient
Silhouette analysis measures how well an observation is clustered and estimates the average distance between clusters
For each observation $i$ , the silhouette width $s_i$ is calculated as follows:
For each observation $i$ , calculate the average dissimilarity $a_i$ between $i$ and all other points of the cluster to which $i$ belongs
For all other clusters $C$, to which $i$ does not belong, calculate the average dissimilarity $d(i,C)$ of $i$ to all observations of $C$. The smallest of these $d(i,C)$ is defined as $b_i= \min_C d(i,C)$. The value of $b_i$ can be seen as the dissimilarity between $i$ and its “neighbor” cluster
Finally, the silhouette width of observation $i$ is defined by the formula: $s_i = (b_i - a_i)/\max(a_i, b_i)$
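A hedged sketch of silhouette analysis with scikit-learn (the blob data set and k = 4 are assumptions for illustration):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, silhouette_samples

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# Mean silhouette width over all observations; range [-1, 1], higher is better
print("mean s_i:", silhouette_score(X, labels))

# Per-observation widths s_i = (b_i - a_i) / max(a_i, b_i)
s = silhouette_samples(X, labels)
print("worst-clustered point:", s.argmin(), s.min())
```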
Dunn index
For each cluster, compute the distance between each of the objects in the cluster and the objects in the other clusters
Use the minimum of this pairwise distance as the inter-cluster separation (min.separation)
For each cluster, compute the distance between the objects in the same cluster.
Use the maximal intra-cluster distance (i.e., the maximum diameter) as the intra-cluster compactness
Calculate the Dunn index (D) as follows: $D = \frac{min.separation}{max.diameter}$
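scikit-learn has no built-in Dunn index, so here is a small NumPy/SciPy sketch that follows the definition above directly:

```python
import numpy as np
from scipy.spatial.distance import cdist

def dunn_index(X, labels):
    """Dunn index D = min inter-cluster separation / max intra-cluster diameter."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    # max diameter: largest pairwise distance within any single cluster
    max_diameter = max(cdist(c, c).max() for c in clusters)
    # min separation: smallest distance between points of two different clusters
    min_separation = min(
        cdist(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return min_separation / max_diameter
```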
Elbow method (see the sketch after this list)
Average silhouette method
Gap statistic method
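Of these three ways to choose the number of clusters, the elbow method is the easiest to sketch (assuming K-Means inertia as the compactness measure):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# inertia_ is the within-cluster sum of squared distances; it always
# decreases with k, so look for the "elbow" where the drop levels off
ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia (within-cluster SSE)")
plt.show()
```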
Top-down and Bottom-up
How do we define "distance"?
Linkage
Complete
Group Average
Ward's Method
$$d(A,B)=SSE_{A\cup B}-(SSE_A+SSE_B)$$
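A short SciPy sketch comparing the linkage rules above (the toy data set is an assumption):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=3, random_state=0)
d = pdist(X)  # condensed pairwise (Euclidean) distance matrix

# Each linkage rule defines the inter-cluster "distance" differently
for method in ["complete", "average", "ward"]:
    Z = linkage(d, method=method)
    labels = fcluster(Z, t=3, criterion="maxclust")
    print(method, labels[:10])
```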
Advanced Hierarchical clustering
Balanced Iterative Reducing and Clustering using Hierarchies
Clustering Feature: $CF=(N,\overrightarrow{LS},SS)$
$N$ : Number of data points
$\overrightarrow{LS}$ : Linear sum of the points, $\sum_{i=1}^N \overrightarrow{X_i}$
$SS$ : Sum of the squared points, $\sum_{i=1}^N \overrightarrow{X_i}^2$
The nonleaf nodes store sums of the CFs of their children
Branching factor: maximum number of children
Threshold: max diameter of sub-clusters stored at the leaf nodes
CF Tree
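scikit-learn's Birch exposes the same two knobs, named threshold and branching_factor (note: sklearn's threshold is a sub-cluster radius rather than a diameter); a minimal sketch:

```python
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)

# threshold: max radius of a leaf sub-cluster; branching_factor: max children per node
birch = Birch(threshold=0.5, branching_factor=50, n_clusters=5)
labels = birch.fit_predict(X)
print(len(birch.subcluster_centers_), "leaf sub-clusters")
```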
CHAMELEON algorithm
$$RI(C_i,C_j)=\frac{2\times |EC(C_i,C_j)|}{|EC(C_i)|+|EC(C_j)|}$$
$$RC(C_i,C_j)=\frac{(|C_i|+|C_j|)EC(C_i,C_j)}{|C_j|EC(C_i)+|C_i|EC(C_j)}$$
$|C_i|$ denotes the number of data points in cluster $i$, $EC(C_i)$ denotes the sum of the edge weights inside $C_i$, and $EC(C_i,C_j)$ denotes the sum of the weights of the edges connecting the two clusters.
Merge the pair of clusters that maximizes $RI(C_i,C_j) \times RC(C_i,C_j)^\alpha$
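A tiny pure-Python sketch of this merge score, following the two formulas above; all input values are hypothetical:

```python
def chameleon_score(nA, nB, ecA, ecB, ecAB, alpha=2.0):
    """Merge score RI * RC^alpha for clusters A and B.
    nA, nB: cluster sizes; ecA, ecB: internal edge-cut weights;
    ecAB: weight of the edges connecting A and B (all hypothetical inputs)."""
    ri = 2.0 * ecAB / (ecA + ecB)                  # relative interconnectivity
    rc = (nA + nB) * ecAB / (nB * ecA + nA * ecB)  # relative closeness
    return ri * rc ** alpha

# Merge the candidate pair with the highest score
print(chameleon_score(nA=40, nB=60, ecA=12.0, ecB=20.0, ecAB=8.0))
```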
Mean-Shift Clustering
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
In both cases (the point joins a cluster or is labeled as noise) that point is marked as “visited”
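A minimal DBSCAN sketch with scikit-learn (the eps and min_samples values are assumptions for the toy two-moons data):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: density threshold for a core point
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print("clusters:", len(set(db.labels_)) - (1 if -1 in db.labels_ else 0))
print("noise points (label -1):", (db.labels_ == -1).sum())
```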
Fuzzy clustering (also referred to as soft clustering or soft k-means) is a form of clustering in which each data point can belong to more than one cluster
Fuzzy c-means (FCM) clustering
Compute the centroid for each cluster
For each data point, compute its coefficients of being in the clusters
Centroid
Any point $x$ has a set of coefficients $w_k(x)$ giving its degree of membership in the $k$th cluster; the centroid is defined as: $c_k=\frac{\sum_xw_k(x)^mx}{\sum_xw_k(x)^m}$
$m$ is the hyper-parameter that controls how fuzzy the clusters will be
Given a finite set of data, the algorithm returns a list of $c$ cluster centres $C=\{c_1,\cdots,c_c\}$ and a partition matrix $W=(w_{ij})$, $w_{ij} \in [0,1]$, $i=1,\cdots,n$, $j=1,\cdots,c$
$w_{ij}$ tells the degree to which element $x_i$ belongs to cluster $c_j$
The FCM aims to minimize the objective function: $\arg\min_C \sum_{i=1}^n \sum_{j=1}^c w_{ij}^m\,\|x_i-c_j\|^2$
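A compact NumPy sketch of FCM that simply alternates the two updates above (random initialization and the stopping tolerance are assumptions):

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal FCM sketch; X is (n_samples, n_features)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # random initial partition matrix W, each row sums to 1
    W = rng.random((n, c))
    W /= W.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Wm = W ** m
        # centroid update: c_k = sum_i w_ik^m x_i / sum_i w_ik^m
        centers = (Wm.T @ X) / Wm.sum(axis=0)[:, None]
        # membership update: w_ij proportional to ||x_i - c_j||^(-2/(m-1))
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)  # avoid division by zero at a centroid
        inv = d ** (-2.0 / (m - 1))
        W_new = inv / inv.sum(axis=1, keepdims=True)
        if np.abs(W_new - W).max() < tol:
            W = W_new
            break
        W = W_new
    return centers, W
```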
Partition Clustering Algorithm
K-medoids
CLARA
CLARA (Clustering Large Applications; Kaufman and Rousseeuw, 1990) extends k-medoids (PAM) to data sets with a large number of objects (more than several thousand observations), reducing computing time and RAM usage
This is achieved using a sampling approach: PAM is run on random samples of the data, and the best resulting set of medoids is kept
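A hedged NumPy sketch of the CLARA idea: k-medoids on random samples, keeping the medoid set that is cheapest on the full data (the inner k-medoids is a simple Voronoi-iteration variant, not full PAM; sample sizes are assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_medoids(X, k, n_iter=50, seed=0):
    """Simple k-medoids (Voronoi-iteration variant, not full PAM)."""
    rng = np.random.default_rng(seed)
    medoids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        labels = cdist(X, medoids).argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                # new medoid: the point minimizing total distance in its cluster
                medoids[j] = pts[cdist(pts, pts).sum(axis=1).argmin()]
    return medoids

def clara(X, k, n_samples=5, sample_size=200, seed=0):
    """CLARA idea: k-medoids on random samples; keep the medoid set
    with the lowest total dissimilarity on the FULL data set."""
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for s in range(n_samples):
        idx = rng.choice(len(X), min(sample_size, len(X)), replace=False)
        medoids = k_medoids(X[idx], k, seed=seed + s)
        cost = cdist(X, medoids).min(axis=1).sum()
        if cost < best_cost:
            best, best_cost = medoids, cost
    return best, best_cost
```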
CLARANS (Clustering Large Applications based upon RANdomized Search)
CLIQUE (CLustering In QUEst)
CLIQUE can be considered as both density-based and grid-based
Graph-based Clustering
3 Major steps for Spectral Clustering
We generally describe a graph by a set of vertices $V$ and a set of edges $E$, written $G(V,E)$
Here $V$ consists of all the data points $(v_1, v_2,\ldots,v_n)$. Define the weight $w_{ij}$ as the weight between points $v_i$ and $v_j$; the $w_{ij}$ form the adjacency matrix $W_{n\times n}$ of the graph
The degree of each vertex is defined as $d_i = \sum\limits_{j=1}^{n}w_{ij}$
The degree matrix is the diagonal matrix $\mathbf{D} = \left( \begin{array}{cccc} d_1 & & & \\ & d_2 & & \\ & & \ddots & \\ & & & d_n \end{array} \right)$
Step 1 — Compute a similarity graph
$$w_{ij}=s_{ij}=\exp\left(-\frac{||x_i-x_j||_2^2}{2\sigma^2}\right)$$
Step 2 — Project the data onto a low-dimensional space
Use the eigenvectors corresponding to the $k$ smallest eigenvalues of the graph Laplacian: $0=\lambda_1 \le \lambda_2 \le \cdots \le \lambda_k$
Laplacian Matrix: $L = D - W$
Step 3 — Create clusters
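Putting the three steps together, a minimal NumPy/scikit-learn sketch of unnormalized spectral clustering ($\sigma$, $k$, and the two-moons data are assumptions):

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
k, sigma = 2, 0.2  # assumed hyper-parameters

# Step 1: fully connected similarity graph with the Gaussian kernel above
W = np.exp(-cdist(X, X, "sqeuclidean") / (2 * sigma**2))
np.fill_diagonal(W, 0)
D = np.diag(W.sum(axis=1))

# Step 2: embed the points using the eigenvectors of L = D - W
L = D - W
eigvals, eigvecs = np.linalg.eigh(L)
U = eigvecs[:, :k]  # eigenvectors for the k smallest eigenvalues

# Step 3: cluster the rows of the embedding with K-Means
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
print(np.bincount(labels))
```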
Advantages
Disadvantages
Graph Partitioning
Graph cut
An intuitive goal is to find the partition that minimizes the cut
$cut(A,B)=\sum_{i\in A,j\in B}W_{ij}$ ($=0.3$ for the partition in the example graph)
Normalized Cut
Consider the connectivity between groups relative to the volume of each group
For a fixed cut value, it is minimized when $Vol(A)$ and $Vol(B)$ are equal, thus encouraging balanced cuts
$\min_x Ncut(x)=\min_y y^T(D-W)y$ subject to $y^TDy=1$
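The relaxed problem above is a generalized eigenproblem $(D-W)y = \lambda Dy$; a small SciPy sketch on a hypothetical two-community graph:

```python
import numpy as np
from scipy.linalg import eigh

# Tiny weighted graph (hypothetical adjacency): two loosely linked triangles
W = np.array([
    [0,   1, 1, 0.1, 0, 0],
    [1,   0, 1, 0,   0, 0],
    [1,   1, 0, 0,   0, 0],
    [0.1, 0, 0, 0,   1, 1],
    [0,   0, 0, 1,   0, 1],
    [0,   0, 0, 1,   1, 0],
], dtype=float)
D = np.diag(W.sum(axis=1))

# Solve (D - W) y = lambda * D y; the second-smallest eigenvector
# (the Fiedler vector) gives the relaxed normalized-cut bipartition
eigvals, eigvecs = eigh(D - W, D)
partition = eigvecs[:, 1] > 0  # threshold at 0 to recover a two-way cut
print(partition)  # expect the two triangles to separate
```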