tailieunhanh - Lecture Applied data science: Clustering

Lecture "Applied data science: Clustering" covers: Exemplary technique - K-means clustering; Exemplary technique - Hierarchical clustering; Practical issues in clustering; Case study.

Course overview
1. Introduction
2. Application
3. EDA
4. Learning Process
5. Bias-Variance Tradeoff
6. Regression review
7. Classification
8. Validation
9. Regularisation
10. Clustering
11. Evaluation
12. Deployment
13. Ethics

Lecture outline
- Exemplary technique - K-means clustering
- Exemplary technique - Hierarchical clustering
- Practical issues in clustering
- Case study

Unsupervised learning and clustering
- Unsupervised learning tends to be more subjective than supervised learning
- It is often part of the exploratory data analysis
- There is no universally accepted mechanism to validate the results
- Clustering: partition a data set into distinct, non-overlapping groups

Exemplary technique - K-means clustering
- Assign each observation to exactly one of K clusters; K must be chosen in advance
- A good clustering is one for which the within-cluster variation is as small as possible
- There are roughly K^n ways to partition n observations into K clusters, so an approximating algorithm is used instead of an exhaustive search
- The algorithm is repeated until the membership of the K clusters is stable
- The algorithm only gives a local optimum
- Run the algorithm multiple times and select the best solution, i.e. the one that has the smallest total within-cluster variation over all clusters

Exemplary technique - Agglomerative hierarchical clustering
- The dendrogram
- Hierarchical means that clusters obtained by cutting the dendrogram at a given height are nested within the clusters obtained at any greater height
- Because of this nesting assumption, it is not a suitable approach for all data sets
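The K-means points above (assign, repeat until stable, rerun and keep the best solution) can be sketched in a few lines. This is only an illustrative sketch, not the lecture's own code: the algorithm steps did not survive in the text, so the standard assign-then-recompute-centroids iteration is assumed, and within-cluster variation is assumed to be the sum of squared Euclidean distances to the cluster centroid. The function names (`kmeans_once`, `kmeans`) and the parameters `n_restarts` and `seed` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def kmeans_once(data, k, rng, max_iter=100):
    """One run of K-means from a random start; returns (labels, centroids, wcss)."""
    # Assumed initialisation: pick k distinct observations as starting centroids.
    centroids = rng.sample(data, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each observation joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in data
        ]
        if new_labels == labels:  # cluster membership is stable: stop
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = centroid(members)
    # Within-cluster variation (assumed: squared distance to centroid).
    wcss = sum(squared_distance(p, centroids[lab]) for p, lab in zip(data, labels))
    return labels, centroids, wcss


def kmeans(data, k, n_restarts=10, seed=0):
    """Run K-means several times and keep the run with the smallest variation."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_restarts):
        result = kmeans_once(data, k, rng)
        if best is None or result[2] < best[2]:
            best = result
    return best


# Toy data: two visually obvious groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids, wcss = kmeans(data, k=2)
print(labels, wcss)
```

Each single run can stop at a different local optimum depending on the random start, which is why `kmeans` compares the total within-cluster variation across restarts rather than trusting any one run.
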
Choice of dissimilarities
- Euclidean distance
- Manhattan distance
- Jaccard distance
- Cosine distance
- Correlation-based distance

Choice of dissimilarity
- Euclidean distance: similar items have a shorter distance between them
- Correlation-based distance: similar items are more strongly correlated

Practical issues in clustering
- Standardising features before clustering
- Hierarchical clustering - dissimilarity
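To make the difference between these dissimilarities concrete, here is a minimal sketch using only the Python standard library. The definitions are assumptions, since the lecture text names the measures without giving formulas: cosine and correlation-based distance are taken as one minus the cosine similarity and one minus the Pearson correlation, Jaccard distance is computed on non-empty sets, and `standardise` scales a feature to mean 0 and sample standard deviation 1. All function names are hypothetical.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def correlation_distance(a, b):
    """1 - Pearson correlation, i.e. cosine distance of the centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_distance([x - mean_a for x in a], [y - mean_b for y in b])


def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two non-empty collections of items."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b)


def standardise(column):
    """Scale one feature to mean 0 and sample standard deviation 1."""
    mean = sum(column) / len(column)
    sd = math.sqrt(sum((x - mean) ** 2 for x in column) / (len(column) - 1))
    return [(x - mean) / sd for x in column]


# Two profiles with the same shape but very different magnitudes.
small = [1.0, 2.0, 3.0, 4.0]
large = [10.0, 20.0, 30.0, 40.0]

print(euclidean(small, large))             # large: magnitudes differ
print(correlation_distance(small, large))  # close to 0: shapes agree
print(standardise(small), standardise(large))
```

The two toy profiles are far apart under Euclidean distance but essentially identical under correlation-based distance. Standardising removes the difference in scale, which is why it matters before any distance-based clustering.
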