K-Means Clustering and Its Use Cases in the Security Domain
What is Clustering?
"Clustering is a process of dividing the datasets into groups, consisting of similar data points". It means grouping objects based on the information found in the data that describes the objects or their relationships.
Clustering is dividing data points into homogeneous classes or clusters:
- Points in the same group are as similar as possible
- Points in different groups are as dissimilar as possible
When a collection of objects is given, we put objects into groups based on similarity.
Why is Clustering Used?
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. Sometimes, partitioning the data is itself the goal.
Types of Clustering
- Exclusive Clustering (K-Means)
- Overlapping Clustering (Fuzzy C-Means)
- Hierarchical Clustering
Applications of Clustering:
Clustering is used in almost all fields, and you have probably come across many of its applications already.
Listed here are a few more applications, which add to what you have learned.
- Clustering helps marketers improve their customer base and work on target areas. It groups people by their similarity across criteria related to the product, such as willingness and purchasing power.
- Clustering helps identify groups of houses based on their value, type, and geographical location.
- Clustering is used to study earthquakes. Based on the areas hit by earthquakes in a region, clustering can help analyze the next probable location where an earthquake can occur.
What is K-Means Clustering?
K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms for solving the well-known clustering problem. K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining.
K-Means clustering is used on unlabeled datasets, that is, historical data containing only the input variables. Unsupervised learning doesn't depend on known outputs.
K-Means is an iterative unsupervised learning method that groups a dataset into k predefined, non-overlapping clusters or subgroups. It tries to make the points inside each cluster as similar as possible while keeping the clusters as distinct as possible. It allocates the data points to clusters so that the sum of the squared distances between each data point and its cluster's centroid is at a minimum. The centroid of a cluster is the arithmetic mean of the data points in that cluster.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim of this algorithm is to minimize the sum of squared distances between the data points and the centroids of their corresponding clusters.
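Written out, the quantity being minimized is usually called the within-cluster sum of squares. The symbols below are notation introduced for this sketch and are not from the article: C_j is the set of points in cluster j, mu_j is its centroid, and x_i is a data point.

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

The second formula is just the "arithmetic mean" statement: for a fixed assignment of points to clusters, the mean is the centroid position that makes J smallest.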
Algorithm Steps of K-Means
The K-Means algorithm works as follows:
Step-1: Choose the value of K, the number of clusters to be formed.
Step-2: Select K random points, which will act as the initial centroids.
Step-3: Assign each data point to its nearest centroid, based on its distance from each centroid. This forms the K clusters.
Step-4: Compute the new centroid of each cluster as the mean of the points assigned to it.
Step-5: Repeat Step-3, reassigning each data point to the closest of the new centroids.
Step-6: If any reassignment occurred, go to Step-4; otherwise go to Step-7.
Step-7: Finish.
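The steps above can be sketched in plain Python. This is a minimal illustration rather than a production implementation: the function name `kmeans`, its parameters, and the toy data are all made up for this example, and it assumes Python 3.8+ for `math.dist`.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain-Python sketch of the K-Means steps listed above."""
    rng = random.Random(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Toy data, made up for illustration: two well-separated groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, labels = kmeans(data, k=2)
print(labels)
```

The first three points should end up sharing one label and the last three the other. Which group gets label 0 depends on the random start.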
Advantages of K-Means Clustering:
K-Means offers many advantages for unsupervised data mining:
- Relatively simple to implement.
- Scales to large data sets.
- Guarantees convergence (to a local optimum, which is not necessarily the best possible clustering).
- Can warm-start the positions of centroids.
- Easily adapts to new examples.
- Can be generalized to clusters of different shapes and sizes, such as elliptical clusters.
Disadvantages of K-Means Clustering:
Some disadvantages of K-Means are:
- K must be chosen manually.
- Results depend on the initial centroid values.
- It struggles with clusters of varying sizes and density.
- It is sensitive to outliers, which can drag centroids away from the bulk of the data.
- It scales poorly with the number of dimensions.
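Two common workarounds for the first two points are to run K-Means from several random starts and keep the best run, and to compare the within-cluster sum of squares across several values of k (often called the elbow method). The sketch below assumes both techniques, neither of which is described in the article. All names (`kmeans`, `wcss`, `best_of_restarts`) and the toy data are made up for illustration.

```python
import math
import random


def kmeans(points, k, rng, max_iter=100):
    """Compact K-Means; returns (centroids, labels)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


def wcss(points, centroids, labels):
    """Within-cluster sum of squared distances (lower means tighter clusters)."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))


def best_of_restarts(points, k, restarts=10, seed=0):
    """Run K-Means from several random starts and keep the lowest WCSS."""
    rng = random.Random(seed)
    return min(wcss(points, *kmeans(points, k, rng)) for _ in range(restarts))


# Toy data, made up for illustration: three loose groups.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 9), (2, 9), (1, 8)]
scores = {k: best_of_restarts(data, k) for k in range(1, 6)}
for k, score in scores.items():
    print(k, round(score, 2))
```

On data like this, the score typically drops sharply up to the true number of groups and flattens afterwards. That bend is the "elbow" used to pick k.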
Use Cases in the Security Domain
Here is a list of some of the interesting use cases of K-Means in the security domain:
1. Identifying Crime Localities
Given crime data for specific localities in a city, clustering on the category of crime, the area of the crime, and the association between the two can give quality insight into crime-prone areas within a city or a locality.
2. Crime Document Classification
Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a very standard document-grouping problem, and k-means is a highly suitable algorithm for this purpose. Initial processing is needed to represent each document as a vector, using term frequency to identify commonly used terms that help characterize the document. The document vectors are then clustered to identify groups of similar documents.
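The "represent each document as a vector" step can be sketched with plain term frequencies. This is a simplified illustration: the function name `term_frequency_vectors` and the report snippets are made up, tokenization is a naive whitespace split, and a real pipeline would usually add stop-word removal and some weighting scheme.

```python
from collections import Counter


def term_frequency_vectors(documents):
    """Turn each document into a vector of relative term frequencies."""
    tokenized = [doc.lower().split() for doc in documents]
    # One vector position per distinct term across all documents.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors


# Made-up report snippets, for illustration only.
reports = [
    "burglary reported at night",
    "night burglary at warehouse",
    "phishing email reported",
]
vocabulary, vectors = term_frequency_vectors(reports)
print(vocabulary)
```

Each row of `vectors` has one entry per vocabulary term, so the rows can be passed to any K-Means implementation as data points.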
3. Delivery Store Optimization
Optimize the process of goods delivery using truck drones by combining k-means, to find the optimal number of launch locations, with a genetic algorithm that solves the truck route as a traveling salesman problem.
4. Insurance and Fraud Detection
Machine learning has a critical role to play in fraud detection and has numerous applications in automobile, healthcare, and insurance fraud detection. Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Since insurance fraud can potentially have a multi-million dollar impact on a company, the ability to detect fraud is crucial.
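The "proximity to clusters" idea can be sketched as a nearest-centroid check. Everything in this example is hypothetical: the two features, the centroid values, the cluster names, and the distance threshold are invented for illustration and would in practice come from running K-Means on real historical claims.

```python
import math

# Hypothetical centroids, as if learned by K-Means from past claims.
# Assumed features, already scaled to 0..1: claim amount, policy age.
CENTROIDS = {
    "typical": (0.2, 0.7),
    "fraud-like": (0.9, 0.1),
}


def nearest_cluster(claim, centroids=CENTROIDS):
    """Return (name, distance) of the centroid closest to the claim."""
    name = min(centroids, key=lambda n: math.dist(claim, centroids[n]))
    return name, math.dist(claim, centroids[name])


def needs_review(claim, max_distance=0.3):
    """Flag a claim that falls close to the fraud-like centroid."""
    name, distance = nearest_cluster(claim)
    return name == "fraud-like" and distance <= max_distance


# Two made-up incoming claims.
new_claims = [(0.85, 0.15), (0.25, 0.65)]
flags = [needs_review(c) for c in new_claims]
print(flags)
```

A flag here only means "send this claim to a human reviewer". Closeness to a cluster is a hint, not proof of fraud.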
5. Automatic Clustering of IT Alerts
Large enterprise IT infrastructure components such as network, storage, or database systems generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened and prioritized for downstream processes. Clustering the alert data can provide insight into categories of alerts and mean time to repair, and help with failure prediction.
That's all, guys...
Thanks for patiently reading my article.
See you soon with a new article.
For any help or suggestions, find me on LinkedIn.