Recently, we designed a Web-based application to visualize the spatiotemporal evolution of Web-related research topics (see the previous blog post here). In that application, a word-cloud-based timeline is used to visualize how research topics on the Web have evolved. While word clouds are a popular way to display prominent terms, they cannot reveal the relations among research topics. This blog post introduces a cartographic visualization approach, based on the self-organizing map (SOM) and hierarchical clustering, for discovering the hidden relations among research topics. The approach was proposed by Prof. Andre Skupin in the paper "Skupin, A. (2002). A Cartographic Approach to Visualizing Conference Abstracts. IEEE Computer Graphics and Applications, 22(1): 50-58."
Let's first take a look at the final visualization result:
The above visualization shows two levels of topic clusters in the WWW 2013 conference. The higher level (labeled in red with a bigger font) consists of 7 clusters, including "Twitter", "Web", "Data", and so forth. The lower level (labeled in orange with a smaller font) contains 27 clusters and shows the subclusters belonging to each higher-level cluster. As you can see from the figure above, the bigger research topic "Twitter" contains subtopics about "Tweet", "Group", and "Network". Similarly, the general research topic "Data" can be divided into the subtopics "Linked Data", "Model", "Retrieval", "RDF", and "Social Machine". The landscape-like visualization underneath shows the dominance value of the displayed key topic in each cluster. A high dominance value (shown in a brownish color in the figure) indicates that the label terms are highly dominant in the papers belonging to the cluster (examples include the clusters "Twitter", "Web", and "Social Network"). A low dominance value (shown in blue) indicates that the papers in that cluster also mention many other terms (examples include the clusters "Task", "Classification", and "Review").
So how was this visualization created? Below we show the step-by-step process:
1. Extracting key concepts from WWW papers. This work uses WWW conference data retrieved from the Scopus API. Keywords come from two sources. One source is the keywords provided directly by the authors in their papers. The other is the key concepts automatically extracted using a customized Wikification program (a recursive Wikification algorithm was designed for this step).
2. Generating the paper-term-count matrix. The keywords extracted in the previous step are matched against the papers' abstracts to count how many times each keyword appears in each abstract. A paper-term-count matrix is then generated, with papers as rows and keywords as columns.
3. Self-organizing map classification. A SOM is a type of neural network that consists of only one output layer. The neurons are arranged in a regular 2D layout, and this characteristic makes the SOM a good tool for dimension reduction. In this work, a 25-by-25 SOM is employed, and the paper-term-count matrix is fed into the SOM to find the neuron to which each paper is assigned. This process gives each paper a 2D coordinate on the hexagonal map. Below is a figure of the 25-by-25 SOM.
4. A randomization process. This process randomizes the coordinates of each paper around the neuron to which it was assigned. This step helps differentiate papers that fall on the same neuron. Below is the randomized visualization of the papers' coordinates.
5. Voronoi tessellation. A Voronoi tessellation is then created from the papers' coordinates. This tessellation divides the entire space into polygons, so that every location belongs to the polygon of its nearest paper, measured by Euclidean distance.
6. Hierarchical clustering. A hierarchical clustering algorithm is applied based on the similarity of the papers, and a dendrogram (see below) of the clustering is created. Two levels of clusters are selected to show the final results. Voronoi polygons are then merged according to the hierarchical clusters.
7. Cluster label selection. After papers are classified into clusters, we need to label each cluster with suitable terms. Term frequency and inverse document frequency (TF-IDF) can be employed for this term selection process. However, since the IDF significantly reduces the weights of some common terms (such as "Web"), we use only term frequency in this work. Papers belonging to the same cluster are merged into one big document, and the frequency of each term is counted. The term with the highest frequency is selected as the label for that cluster.
8. Creating a dominance value for each paper. For each paper, we find its corresponding clusters and the labels of those clusters. We then calculate the percentage of the cluster labels with regard to the terms of the paper (i.e., what percentage of the paper's terms are the cluster labels: 21%, 50%, or 60%, for example).
9. Interpolating dominance value into a landscape. The dominance value is used to interpolate a continuous surface to show whether the papers in one cluster are dominated by a small number of terms or many terms.
10. Hillshade creation. To give the landscape a 3D appearance, a hillshade is created for the dominance surface, with simulated sunlight applied to cast shadows.
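To make steps 2, 7, and 8 more concrete, here is a minimal Python sketch of the counting, labeling, and dominance calculations. It is an illustration under stated assumptions, not the code used for the website: the function names (`count_terms`, `cluster_label`, `dominance`), the toy abstracts, and the single cluster `c1` are all hypothetical, keywords are matched by naive case-insensitive substring counting, and only one cluster level is used even though the actual visualization has two.

```python
from collections import Counter

def count_terms(abstract, keywords):
    """Step 2: count how often each keyword occurs in one abstract.

    Assumption: naive case-insensitive substring matching, which would
    also count "web" inside "website". A real pipeline would tokenize.
    """
    text = abstract.lower()
    return Counter({k: text.count(k.lower()) for k in keywords})

def cluster_label(rows):
    """Step 7: merge a cluster's papers into one document and return
    the term with the highest raw frequency (no IDF weighting)."""
    merged = Counter()
    for row in rows:
        merged.update(row)
    return merged.most_common(1)[0][0]

def dominance(row, label):
    """Step 8: share of a paper's term occurrences taken by the label."""
    total = sum(row.values())
    return row[label] / total if total else 0.0

# Hypothetical toy data standing in for the Scopus abstracts.
keywords = ["twitter", "tweet", "network"]
abstracts = {
    "p1": "Twitter users and Twitter bots form a network.",
    "p2": "Each tweet on Twitter spreads through the network of Twitter followers.",
}
clusters = {"c1": ["p1", "p2"]}  # assumed output of the clustering step

# Paper-term-count matrix: one row (Counter) per paper.
matrix = {pid: count_terms(text, keywords) for pid, text in abstracts.items()}

for cid, members in clusters.items():
    label = cluster_label([matrix[pid] for pid in members])
    for pid in members:
        print(cid, label, pid, round(dominance(matrix[pid], label), 2))
```

In this toy example both papers land in a cluster labeled "twitter", but the second paper gets a lower dominance value because it also mentions "tweet". That per-paper value is what steps 9 and 10 would then interpolate into a surface and shade.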
This cartographic visualization has already been integrated into the Web@25 website (a Web application we built to celebrate the 25th birthday of the Web). After accessing the website, you can click the "SOM Vis" to see the visualization.
Your comments and feedback on this work are very welcome. For any questions, please feel free to contact firstname.lastname@example.org .
To learn more about this landscape-metaphor visualization, here are some related papers:
Fabrikant, S.I., Montello, D. R., and Mark, D. M. (2010). The Natural Landscape Metaphor in Information Visualization: The Role of Commonsense Geomorphology. Journal of the American Society for Information Science and Technology, vol. 61, no. 2: 253-270.
Fabrikant, S.I. and Skupin, A. (2005). Cognitively Plausible Information Visualization. In: Dykes, J., MacEachren, A.M. & Kraak, M. J. (eds). Exploring GeoVisualization. Amsterdam, The Netherlands, Elsevier: 667-690.
Fabrikant, S.I., Montello, D. R., and Mark, D. M. (2006). The Distance-Similarity Metaphor in Region-Display Spatializations. IEEE Computer Graphics and Applications, July/August 2006: 34-44.
Skupin, A. (2002). A Cartographic Approach to Visualizing Conference Abstracts. IEEE Computer Graphics and Applications, 22(1): 50-58.