๐ŸŽ“K-Means Clustering and Its Use Cases in the Security Domain

Rishabh Jain
6 min readJul 19, 2021

๐Ÿ”ฐ What is Clustering ?

โ€œClustering is a process of dividing the datasets into groups, consisting of similar data pointsโ€. It means grouping of objects based on the information found in the data, describing the objects based on the information found in the data, describing the objects or their relationship.

Clustering is dividing data points into homogeneous classes or clusters:

  • Points in the same group are as similar as possible
  • Points in the different groups are as dissimilar as possible

When a collection of objects is given, we put objects into groups based on similarity.

๐Ÿ“Œ Why is Clustering Used?

The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. Sometimes, Partitioning is the goal.

๐Ÿ“‘ Types of Clustering?

  1. Exclusive Clustering (K-Means)
  2. Overlapping Clustering (C-Means)
  3. Hierarchical Clustering

๐ŸŽฏ Application of Clustering:

Clustering is used in almost all fields. You can infer some ideas from Example 1 to come up with a lot of clustering applications that you would have come across.

Listed here are few more applications, which would add to what you have learned.

  • Clustering helps marketers improve their customer base and work on the target areas. It helps group people (according to different criteria such as willingness, purchasing power, etc.) based on their similarity in many ways related to the product.
  • Clustering helps in the identification of groups of houses based on their value, type, and geographical locations.
  • Clustering is used to study earth-quake. Based on the areas hit by an earthquake in a region, clustering can help analyze the next probable location where an earthquake can occur.

โœ๐Ÿป What is K-Means Clustering?

K-means (Macqueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.

K-Mean clustering is an algorithm used to solve the unsupervised machine learning datasets which only have historical data containing only the input variables. Unsupervised learning doesnโ€™t depend upon the known outputs.

K-Means clustering algorithm is defined as an unsupervised learning method having an iterative process in which the dataset are grouped into k number of predefined non-overlapping clusters or subgroups, making the inner points of the cluster as similar as possible while trying to keep the clusters at distinct space it allocates the data points to a cluster so that the sum of the squared distance between the clusters centroid and the data point is at a minimum, at this position the centroid of the cluster is the arithmetic mean of the data points that are in the clusters.

K-Means Clustering

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of distances between the data point and their corresponding clusters.

๐Ÿ‘จ๐Ÿปโ€๐Ÿ’ป Algorithm steps Of K Means โ€”

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the value of K, to decide the number of clusters to be formed.

Step-2: Select random K points which will act as centroids.

Step-3: Assign each data point, based on their distance from the randomly selected points (Centroid), to the nearest/closest centroid which will form the predefined clusters.

Step-4: place a new centroid of each cluster.

Step-5: Repeat step no.3, which reassign each data point to the new closest centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to Step 7.

Step-7: FINISH

Steps Of K-Means Algorithm

๐Ÿง  Advantages of K-Means Clustering :

K-Means offer many advantages while doing unsupervised data mining. The various advantages it may offer the user are :

๐Ÿ‘๐Ÿผ Relatively simple to implement.

๐Ÿ‘๐Ÿผ Scales to large data sets.

๐Ÿ‘๐Ÿผ Guarantees convergence.

๐Ÿ‘๐Ÿผ Can warm-start the positions of centroids.

๐Ÿ‘๐Ÿผ Easily adapts to new examples.

๐Ÿ‘๐Ÿผ Generalizes to clusters of different shapes and sizes, such as elliptical clusters.

๐ŸŽฏ Disadvantages of K-Means Clustering :

Some Disadvantages of K-Means are :

๐Ÿ‘Ž๐Ÿผ Choosing k manually.

๐Ÿ‘Ž๐Ÿผ Being dependent on initial values.

๐Ÿ‘Ž๐Ÿผ Clustering data of varying sizes and density.

๐Ÿ‘Ž๐Ÿผ Clustering outliers.

๐Ÿ‘Ž๐Ÿผ Scaling with number of dimensions.

๐Ÿ“‘ Use Cases in the Security Domain

Here is a list of some of the interesting use cases of K-Means in Security Domain:

1. Identifying Crime Localities

With data related to crimes available in specific localities in a city, the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.

Identifying Crime Localities

2. Crime Document Classification

Cluster documents in multiple categories based on tags, topics, and the content of the document. This is a very standard classification problem and k-means is a highly suitable algorithm for this purpose. The initial processing of the documents is needed to represent each document as a vector and uses term frequency to identify commonly used terms that help classify the document. the document vectors are then clustered to help identify similarity in document groups.

Crime Document Classification

3. Delivery Store Optimization

optimize the process of good delivery using truck drones by using a combination of k-means to find the optimal number of launch locations and a genetic algorithm to solve the truck route as a traveling salesman problem.

4. Insurance and Fraud Detection

Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. utilizing past historical data on fraudulent claims, it is possible to isolate new claims based on its proximity to clusters that indicate fraudulent patterns. since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect frauds is crucial.

Insurance and Fraud Detection

5. Automatic Clustering of it alerts

Large enterprise it infrastructure technology components such as network, storage, or database generate large volumes of alert messages. because alert messages potentially point to operational issues, they must be manually screened for prioritization for downstream processes. Clustering of Data can provide insight into categories of alerts and mean time to repair, and help in failure predictions.

Automatic Clustering of it alerts

Thatโ€™s All Guysโ€ฆ

Thanks for patient reading my article๐Ÿค—

See you soon with new article..

For any help or suggestions find me on LinkedIn.

--

--