Designing Neural Networks   
Clustering with the SOM/Kohonen Map, Part II

This is part 2 of a two-part article. Part 1 is included below.

Determining Clusters in the SOM
After the inputs have been mapped to the 2D SOM (i.e., after training), three visual displays are used to determine where the natural cluster boundaries lie in the map: the Histogram, U-matrix, and Component Plane displays. An important concept in interpreting these displays is the interaction of the SOM's two properties, the neighborhood relationship and density matching. Neighboring PEs in the SOM cannot be too far from each other (in order to maintain their similarity), but the SOM also tends to place more PEs in areas of high input density (i.e., logical clusters). As a result, some PEs end up in the areas between natural clusters, where input density is typically low, so that the map can "stretch" between clusters.

The Histogram and U-matrix Displays
The Histogram display shows the percentage of times each PE wins a competition, which is proportional to the number of data points clustered in that PE. In the SOM, the Histogram display will show clusters as areas of high-density PEs surrounded by low-density PEs that are "stretching" between clusters. The U-matrix shows the distance between each PE and its neighbors. As in the Histogram, PEs that lie in a natural data cluster will be close to each other (because of the higher input density) and will be surrounded by PEs that are farther apart (because of the lower input density between clusters). Thus, the U-matrix display will show low values inside a cluster and high values between clusters. Inspected together, these two displays reveal the natural clustering in the SOM.
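
As a rough illustration of how these two displays could be computed outside NeuroSolutions, the following Python sketch assumes the trained map is available as a grid of weight vectors (weights[row][col]), uses Euclidean distance, and averages over the four grid neighbors for the U-matrix. The function names and the toy 1 x 5 map are hypothetical and are not part of the NeuroSolutions software.

```python
import math

def winner(weights, x):
    """Return (row, col) of the PE whose weight vector is closest to x."""
    best, best_d = None, math.inf
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = math.dist(w, x)
            if d < best_d:
                best, best_d = (r, c), d
    return best

def hit_histogram(weights, data):
    """Fraction of the inputs won by each PE."""
    rows, cols = len(weights), len(weights[0])
    counts = [[0] * cols for _ in range(rows)]
    for x in data:
        r, c = winner(weights, x)
        counts[r][c] += 1
    return [[k / len(data) for k in row] for row in counts]

def u_matrix(weights):
    """Mean distance from each PE's weights to its 4-connected grid neighbors."""
    rows, cols = len(weights), len(weights[0])
    u = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dists = [
                math.dist(weights[r][c], weights[nr][nc])
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                if 0 <= nr < rows and 0 <= nc < cols
            ]
            u[r][c] = sum(dists) / len(dists) if dists else 0.0
    return u

# Toy map (assumed values): two PEs near x=0, two near x=10, one "stretching" PE at x=5.
weights = [[(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (9.0, 0.0), (10.0, 0.0)]]
data = [(0.0, 0.0), (0.2, 0.0), (1.1, 0.0), (9.1, 0.0), (10.0, 0.0), (9.9, 0.0)]

hist = hit_histogram(weights, data)
umat = u_matrix(weights)
print(hist[0])
print(umat[0])
```

In this toy map the middle PE wins no inputs and has the largest U-matrix value, which is the signature of a PE stretching between two clusters.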

The Component Plane Representation
Lastly, the component plane representation shows how the individual input features (i.e., the columns of the input data) vary throughout the map. As you vary the feature number in the 2D Kohonen synapse inspector, you will see a different map for each feature. Each map shows the average value of that feature in each PE. For instance, you may find that the upper left corner of the map has high values of feature 1 and low values of feature 3 as its primary distinctive characteristics. This analysis allows you to determine what type of data was mapped to each region of the map.
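
The idea can be sketched in a few lines of Python. The sketch assumes that each PE's weight vector stands in for the average feature values of the inputs mapped to it, and that the map is stored as weights[row][col]. The helper names and the 2 x 2 example values are hypothetical. Note that the code counts features from 0, so "feature 1" and "feature 3" in the text correspond to indices 0 and 2.

```python
def component_plane(weights, feature):
    """Return the 2D grid of one feature's value across all PEs."""
    return [[w[feature] for w in row] for row in weights]

def distinctive_features(weights, row, col):
    """Rank features by how far PE (row, col) deviates from the map-wide mean.

    Returns (feature_index, deviation) pairs, largest absolute deviation first.
    """
    cells = [w for r in weights for w in r]
    dim = len(cells[0])
    means = [sum(w[k] for w in cells) / len(cells) for k in range(dim)]
    devs = [(k, weights[row][col][k] - means[k]) for k in range(dim)]
    return sorted(devs, key=lambda kd: abs(kd[1]), reverse=True)

# Assumed 2 x 2 map with three features per PE.
weights = [
    [(0.9, 0.5, 0.1), (0.6, 0.5, 0.3)],
    [(0.4, 0.5, 0.6), (0.1, 0.5, 0.9)],
]

plane0 = component_plane(weights, 0)          # one grid per feature
ranked = distinctive_features(weights, 0, 0)  # what stands out in the upper left PE
print(plane0)
print(ranked)
```

Here the upper left PE is characterized by a high value of the first feature and a low value of the third, while the second feature is constant across the map and carries no information about the region.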

Designing Neural Networks
Clustering with the SOM/Kohonen Map, Part I

Clustering algorithms can be used to group together objects or conditions with similar characteristics. Unlike classification, the groupings produced by clustering are typically more abstract and not easily defined in advance. Examples of where clustering has been used include identifying shopping patterns among website visitors and grouping types of web page or e-mail content.

Clustering in NeuroSolutions
The NeuroSolutions Neural Expert uses the two-dimensional SOM (a.k.a. Kohonen map) for clustering. Clustering with the SOM requires some work, but it is also much more powerful than many other clustering methods. The SOM has a few unique properties that make it very effective for clustering, including: 1) density matching: the number of SOM processing elements (PEs) placed in an area of input space is proportional to the density of inputs in that area, and 2) neighborhood relationships: the SOM processing elements have an intrinsic neighborhood relationship, so that inputs mapped to PEs that are close on the map (e.g. PE (1,1) and PE (1,2)) are also close in input space (i.e., they are similar inputs).
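
To make the two properties more concrete, here is a minimal sketch of a generic Kohonen-style training loop in Python. It is not the NeuroSolutions implementation. The Gaussian neighborhood, the linearly decaying learning rate and radius, and all parameter names and defaults are assumptions chosen for illustration. Because the winner's map neighbors are pulled toward each input along with the winner, adjacent PEs end up with similar weights, and because densely populated areas of input space produce more updates, more PEs settle there.

```python
import math
import random

def train_som(data, rows, cols, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Train a rows x cols map on data (a list of equal-length tuples)."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = radius0 or max(rows, cols) / 2
    # Start each PE at a randomly chosen input (one common initialization).
    weights = [[list(rng.choice(data)) for _ in range(cols)] for _ in range(rows)]
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # learning rate decays to ~0
        radius = max(radius0 * (1 - frac), 0.5)  # neighborhood shrinks over time
        for x in rng.sample(data, len(data)):    # visit inputs in random order
            # Competition: the PE with the closest weight vector wins.
            br, bc = min(
                ((r, c) for r in range(rows) for c in range(cols)),
                key=lambda rc: math.dist(weights[rc[0]][rc[1]], x),
            )
            # Cooperation: the winner and its map neighbors move toward x.
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    g = math.exp(-grid_d2 / (2 * radius ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * g * (x[k] - w[k])
    return weights

# Assumed toy data: two small groups of 2D points.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (10.0, 9.8), (9.7, 10.1), (10.2, 10.0)]
weights = train_som(data, rows=3, cols=3)
for row in weights:
    print([tuple(round(v, 2) for v in w) for w in row])
```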

The Basics of Clustering
The first important concept in SOM clustering is that a single PE does not normally define a cluster, and that the clustering is not predefined by the number of PEs. Typically, a SOM of size N x N is created (where N depends on the number of data points and the "resolution" of your desired mapping), and each logical cluster of input data is located in a region (subset) of PEs in that N x N map. For instance, a cluster can be in the top left region of the map (say PEs (1,1), (1,2), (1,3), (2,1) and (2,2)). Since clustering is unsupervised, there is no predefined number of clusters in the dataset, and the clustering is left to the interpretation of the user. In one application it may be beneficial to split the top left region of the map into three smaller clusters, whereas in another it may be beneficial to consider it a single cluster.

Using the Clustering Information
Once you have completed your analysis of the map, new inputs can be mapped by simply running the new data through the map as a "testing" or "production" set and finding the winning processing element for each input. The location of the winning processing element determines which natural cluster the input belongs to, and the user can then use this information in whatever manner they choose. For example, if a new website visitor is clustered with visitors who have purchased certain products, then the new visitor could be shown advertisements for those types of products.
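
A minimal Python sketch of this lookup is shown below. It assumes the trained weights are available as weights[row][col] and that the user has recorded the outcome of the map analysis as a dictionary from PE location to cluster name. The names regions, winning_pe, and assign_cluster, and the example values, are hypothetical. PE locations are 0-based in the code.

```python
import math

def winning_pe(weights, x):
    """Return (row, col) of the PE whose weight vector is closest to x."""
    cells = ((r, c) for r in range(len(weights)) for c in range(len(weights[0])))
    return min(cells, key=lambda rc: math.dist(weights[rc[0]][rc[1]], x))

def assign_cluster(weights, regions, x):
    """Map a new input to the user-defined cluster of its winning PE."""
    return regions.get(winning_pe(weights, x), "unassigned")

# Assumed 2 x 2 trained map with two features per PE.
weights = [
    [(0.0, 0.0), (0.0, 1.0)],
    [(5.0, 5.0), (9.0, 9.0)],
]
# The user's interpretation of the map: PE (1, 0) sits between clusters
# and is deliberately left out of both regions.
regions = {(0, 0): "A", (0, 1): "A", (1, 1): "B"}

for visitor in [(0.2, 0.8), (8.5, 9.2), (5.0, 4.5)]:
    print(visitor, assign_cluster(weights, regions, visitor))
```

Leaving the in-between PEs unassigned is one possible design choice; they could equally be merged into the nearest region, depending on the application.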
