K-Means Clustering & Its Real Use Cases in the Security Domain

Dolly Mehra
Jul 19, 2021

KNN stands for K-Nearest Neighbors.

K-means is an unsupervised learning algorithm, while KNN is a supervised, lazy learning algorithm. KNN can be used to solve classification problems, including binary classification.

Euclidean distance is a mathematical formula used by many machine learning algorithms, such as k-means clustering and KNN. It gives the straight-line distance between two points: square the difference along each feature, add the squares, and take the square root of the sum.

Lazy learning is a method under supervised learning, and KNN is based on it. Instead of training a model to find the right weights and biases, we store the historical data, calculate the distance from the point to be predicted to the stored points, and assign it the label that is most common among its k nearest neighbors. KNN is well suited to datasets with multiple numeric features and a categorical target, such as a binary class. Unlike a traditional linear model, the decision boundary in KNN can be non-linear, because it depends on how the data points are grouped and what those groups represent.
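To make the distance formula and the lazy-learning idea concrete, here is a minimal sketch in plain Python. The names `euclidean` and `knn_predict`, the toy `history` data, and the "benign"/"malicious" labels are assumptions made up for illustration; they do not come from any particular library or dataset.

```python
import math
from collections import Counter

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest stored points.

    `train` is a list of (features, label) pairs. Nothing is fitted in
    advance: all the work happens at prediction time (lazy learning).
    """
    nearest = sorted(train, key=lambda item: euclidean(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical data: two numeric features, binary label.
history = [
    ((1.0, 1.0), "benign"),
    ((1.5, 2.0), "benign"),
    ((2.0, 1.5), "benign"),
    ((8.0, 8.0), "malicious"),
    ((8.5, 9.0), "malicious"),
    ((9.0, 8.5), "malicious"),
]

print(euclidean((0, 0), (3, 4)))         # 5.0
print(knn_predict(history, (2.0, 2.0)))  # benign
print(knn_predict(history, (7.5, 8.0)))  # malicious
```

Note that the "model" here is just the stored `history` list, which is why prediction cost grows with the amount of historical data.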

Figure: Flow chart explaining how K-Means clustering works
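The usual k-means loop (pick k starting centroids, assign every point to its nearest centroid, move each centroid to the mean of its points, repeat until nothing changes) can be sketched in plain Python as follows. This is an illustrative sketch of the standard algorithm, not a reproduction of the original flow chart; the function name `kmeans`, its parameters, and the toy points are assumptions.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to the index of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop when no point changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels

# Hypothetical toy data: two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
centroids, labels = kmeans(points, k=2)
print(sorted(centroids))  # [(1.5, 1.5), (8.5, 8.5)]
```

Cluster indices themselves are arbitrary (which group is called 0 or 1 depends on the random start), so only the grouping should be compared between runs.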

Types of Clustering

The various types of clustering are:

  • Hierarchical clustering
  • Partitioning clustering

Hierarchical clustering is further subdivided into:

  • Agglomerative clustering
  • Divisive clustering

Partitioning clustering is further subdivided into:

  • K-Means clustering
  • Fuzzy C-Means clustering

Where can we apply k-means?

K-means can typically be applied to data that has a smaller number of dimensions, is numeric, and is continuous. Think of a scenario in which you want to make groups of similar things from a randomly distributed collection of things; k-means is very suitable for such scenarios.

1. Document classification

Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a standard grouping problem (often loosely called classification), and k-means is well suited to it. The documents first need to be processed so that each one is represented as a vector, using term frequency to identify commonly used terms that help group the documents.
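As a rough illustration of the preprocessing step, the sketch below turns short documents into term-frequency vectors over a shared vocabulary. The function `term_frequency_vectors`, the whitespace tokenizer, and the sample documents are assumptions chosen for brevity; a real pipeline would normally also remove stop words and handle punctuation.

```python
import math
from collections import Counter

def term_frequency_vectors(docs):
    """Represent each non-empty document as relative term frequencies."""
    tokenized = [doc.lower().split() for doc in docs]
    # One vector position per distinct term across all documents.
    vocab = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] / len(tokens) for term in vocab])
    return vocab, vectors

# Hypothetical documents: two about security events, two about sales.
docs = [
    "firewall blocked port scan",
    "port scan detected by firewall",
    "quarterly sales report",
    "sales report for the quarter",
]
vocab, vectors = term_frequency_vectors(docs)

# Documents that share terms end up closer together in vector space.
same_topic = math.dist(vectors[0], vectors[1])
other_topic = math.dist(vectors[0], vectors[2])
print(same_topic < other_topic)  # True
```

Vectors like these are what a clustering algorithm such as k-means would then group.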

2. Delivery store optimization

Optimize the process of goods delivery using truck-launched drones by combining k-means, to find the optimal number of launch locations, with a genetic algorithm that solves the truck route as a traveling salesman problem.

3. Identifying crime localities

With data related to crimes available for specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.

4. Customer segmentation

Clustering helps marketers improve their customer base, work on target areas, and segment customers based on purchase history, interests, or activity monitoring. For example, telecom providers can cluster pre-paid customers to identify patterns in terms of money spent on recharging, sending SMS, and browsing the internet. The segmentation helps the company target specific clusters of customers with specific campaigns.

5. Fantasy league stats analysis

Analyzing player stats has always been a critical element of the sporting world, and with increasing competition, machine learning has a critical role to play here. As an interesting exercise, if you would like to create a fantasy draft team and identify similar players based on player stats, k-means can be a useful option.

6. Insurance fraud detection

Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to flag new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
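The "proximity to clusters" idea can be sketched as a nearest-centroid lookup. Everything in this example is hypothetical: the centroid values, the two features (assumed to be already scaled to the 0 to 1 range), and the cluster names "typical" and "fraud_like" are invented for illustration and would in practice come from clustering real historical claims.

```python
import math

def nearest_centroid(point, centroids):
    """Return (name, distance) of the closest named centroid."""
    name = min(centroids, key=lambda n: math.dist(point, centroids[n]))
    return name, math.dist(point, centroids[name])

# Hypothetical centroids found earlier by clustering past claims.
# Assumed features, both scaled to [0, 1]:
#   (claim amount, how soon after policy start the claim was filed)
centroids = {
    "typical": (0.2, 0.3),
    "fraud_like": (0.9, 0.9),
}

new_claim = (0.85, 0.95)
name, distance = nearest_centroid(new_claim, centroids)
print(name)  # fraud_like
```

A claim landing near a fraud-like cluster is only a signal for closer review, not proof of fraud.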

7. Rideshare data analysis

The publicly available Uber ride information dataset provides a large amount of valuable data around traffic, transit time, peak pickup localities, and more. Analyzing this data is useful not just in the context of Uber but also in providing insight into urban traffic patterns and helping us plan for the cities of the future.

8. Cyber-profiling criminals

Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiling, which provides information to the investigation division to classify the types of criminals who were at the crime scene.

9. Call detail record analysis

A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and internet activity of a customer. This information provides greater insight into the customer’s needs when used with customer demographics. Using the unsupervised k-means clustering algorithm, customer activity can be clustered over a 24-hour period to understand segments of customers with respect to their usage by hour.

Advantages of k-means

Relatively simple to implement.

Scales to large data sets.

Guarantees convergence (to a local optimum, which is not necessarily the best possible clustering).

Can warm-start the positions of centroids.

Easily adapts to new examples.

Can be generalized to clusters of different shapes and sizes, such as elliptical clusters.

Disadvantages of k-means

Choosing k manually.

Being dependent on initial values.

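Struggles to cluster data where clusters have varying sizes and densities.

Sensitive to outliers, which can drag centroids away or end up as their own cluster.

Scales poorly as the number of dimensions grows, because distances become less informative.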
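Two of these drawbacks, choosing k by hand and depending on the initial centroids, are often handled by running k-means several times and comparing the within-cluster sum of squared distances (commonly called inertia). The sketch below shows that heuristic under stated assumptions: the helper names `kmeans` and `inertia`, the number of restarts, and the toy points are all illustrative choices, not part of the original post.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Compact k-means returning only the final centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # different seed, different start
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        updated = [
            tuple(sum(x) / len(members) for x in zip(*members))
            if members else centroids[c]  # keep centroid of an empty cluster
            for c, members in enumerate(clusters)
        ]
        if updated == centroids:  # converged: centroids stopped moving
            break
        centroids = updated
    return centroids

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)

# Hypothetical toy data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]

for k in (1, 2, 3):
    # Restart from several random initializations and keep the best run.
    best = min(inertia(points, kmeans(points, k, seed=s)) for s in range(5))
    print(k, round(best, 3))
```

Inertia always shrinks as k grows, so the usual reading is to look for the value of k after which the improvement becomes small; on this toy data the large drop happens between k = 1 and k = 2.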

Thanks for reading my blog.

LinkedIn: https://www.linkedin.com/in/dollymehra/
