Never knowing enough (足ることを知らず)

Data Science, global business, management and MBA

Day 63 in MIT Sloan Fellows Class 2023, Advanced Data Analytics and Machine Learning in Finance 4, clustering

Clustering 101

Clustering is the most typical form of unsupervised machine learning.

This means we have no training data and no correct labels; instead, the algorithm automatically assigns labels by grouping the dataset.


In today's article, I will briefly introduce the typical clustering techniques and their pros and cons.


Basic concepts of clustering

When we consider clustering, two components are crucially important.

  1. How to define "distance" between data points?
  2. Soft or hard clustering 

When defining distance, two criteria matter: internal cohesion and external isolation. Internal cohesion measures how similar data points are to the other points in the same cluster; external isolation measures how different they are from points in other clusters.

Also, soft and hard clustering refer to whether a single data point can belong to multiple clusters (soft) or to exactly one (hard).
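The hard/soft distinction is easy to see in code. Below is a minimal sketch using scikit-learn (the library choice is my assumption; the post names no specific tool): k-means is a hard clusterer that assigns one label per point, while a Gaussian mixture model is a soft clusterer that assigns per-cluster membership probabilities.

```python
# Minimal sketch: hard clustering (k-means) vs. soft clustering (GMM)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs, 50 points each
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[4, 4], scale=0.5, size=(50, 2)),
])

# Hard clustering: k-means assigns each point exactly one cluster id
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: a Gaussian mixture gives each point a probability
# of membership in every cluster; each row of `probs` sums to 1
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gmm.predict_proba(X)

print("hard labels (first 3 points):", hard_labels[:3])
print("soft memberships (first point):", probs[0])
```

With well-separated blobs the soft memberships are close to 0 or 1, but on overlapping data they would spread out, which is exactly what hard labels cannot express.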


Clustering methodologies

I skip hierarchical clustering here because it builds clusters through deterministic merge-and-split rules rather than by fitting a model.

  • k-means
    • Pros: fast and efficient (linear complexity, O(n))
    • Cons: the number of clusters must be specified in advance; results depend on the initial centroids (lack of consistency); clusters are restricted to roughly circular shapes
  • Mean-Shift Clustering
    • Pros: the number of clusters is determined automatically
    • Cons: choosing the window size (radius) can be non-trivial
  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
    • Pros: the number of clusters is determined automatically, and outliers are detected automatically
    • Cons: it struggles when clusters have varying densities
  • Expectation-Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
    • Pros: it considers both the mean and the standard deviation, so clusters can take elliptical shapes; as a soft method, a data point can belong to multiple clusters
    • Cons: relatively slow; depending on initialization, a single Gaussian may end up covering what should be several distinct clusters
  • Affinity propagation
    • Pros: cluster centers (exemplars) are actual data points; it only requires pre-computed pairwise similarities between data points
    • Cons: it becomes very slow when the number of data points is large

If you want to learn more about affinity propagation, please see the paper linked below.

https://utstat.toronto.edu/reid/sta414/frey-affinity.pdf

I referred to the articles below when writing this post.

The 5 Clustering Algorithms Data Scientists Need to Know | by George Seif | Towards Data Science

Clustering | Types Of Clustering | Clustering Applications