NicheWorks - Interactive Visualization of Very Large Graphs

Graham J Wills
Room 1u334, Lucent Technologies (Bell Laboratories),
1000 E. Warrenville Road, Naperville IL 60566, USA
Email: gwills@research.bell-labs.com
Web: www.bell-labs.com/~gwills

Abstract

The difference between displaying networks with 100-1000 nodes and displaying ones with 10,000-100,000 nodes is not merely quantitative, it is qualitative. Layout algorithms suitable for the former are too slow for the latter, requiring new algorithms or modified (often relaxed) versions of existing algorithms to be invented. The density of nodes and edges displayed per inch of screen real estate requires special visual techniques to filter the graphs and focus attention. Compounding the problem is that large real-life networks are often weighted graphs and usually have additional data associated with the nodes and edges. A system for investigating and exploring such large, complex data sets needs to be able to display both graph structure and node and edge attributes so that patterns and information hidden in the data can be seen. In this paper we describe a tool that addresses these needs, the NicheWorks tool. We describe and comment on the available layout algorithms and the linked views interaction system, and detail two examples of the use of NicheWorks for analyzing web sites and detecting international telephone fraud.

Introduction

NicheWorks is a visualization tool for the investigation of very large graphs. By Îvery largeÌ we mean graphs for which we cannot look at the complete set of labeled nodes and edges on one static display. Typical analyses performed using NicheWorks have between 20,000 and 1,000,000 nodes. On current mid-range workstations a network of around 50,000 nodes and edges can be visualized and manipulated in real time with ease. Increased size decreases interactive performance linearly. NicheWorks allows the user to examine a variety of node and edge attributes in conjunction with their connectivity information. Categorical, textual and continuous attributes can be explored with a variety of one-way, two-way and multi-dimensional views.

NicheWorks was designed to examine large telecommunications networks, and has been applied extensively to understanding calling patterns among telephone customers both for customer understanding and fraud detection. In this domain, information about the customer (type of service, geographical location, etc.) and the calls they make (date, time of day, duration) need to be understood in the context of who calls whom. NicheWorks was developed to examine such inter-relationships. Typical questions that NicheWorks was designed to answer deal with clustering users based on their calling patterns and collected statistical attributes; detecting and characterizing atypical calling patterns; understanding how an interesting subsetÌs calling patterns differ from those of the whole; and identifying individuals who have been classified into a given category by a purely data- driven algorithm, but who do not appear to fit that category based on their calling patterns.

Since its inception, a number of other data sources have been analyzed by NicheWorks and NicheWorks has been adapted and modified into a general purpose tool. It has been applied to a variety of different problem areas, including:

This paper is intended to describe both the methodology behind NicheWorks and our general approach to visualizing large complex data sets; in this case, weighted graphs. Section 2 will give a brief overview of the tool, with section 3 detailing layout algorithms and section 4 the interactive interface. Sections 5 and 6 present examples of the tool being used to analyze some real-life data and section 7 summarizes our findings.

Overview

The NicheWorks tool is part of a suite of visualization views that have been created by the Bell Labs Visualization Group for interactive analysis of large data sets. It shares a number of features with its sibling tools (e.g. SeeSoft [Ei94]), including: These capabilities are shared by all tools in the linked views environment and their use is discussed in section 4. There are also methods specific to graph analysis that are incorporated into the NicheWorks view. These include: A typical session with a new data set starts with the data definition stage. Tables of nodes and edges with associated data attributes are loaded into the tool. An initial layout algorithm is selected and the user can map data variables to the available node/edge attributes. The weight attribute for edges is the only one that will affect the layout algorithm; the others provide visual information only. The user can also tell the program to regard edges between nodes as directed or undirected.

The user can then select one of several available iterative algorithms to use to improve the layout; the user also allocates the amount of time allowed for these to run. While laying out the data, the user can continue to change attributes, re-draw the graph at intermediate stages, show or hide parts of the graph, and even change the weighting. Finally, the user can save the positioning results for later use.

Layout

Among others, Coleman [Co96] gives a list of properties towards which good graph layout algorithms should strive. The list includes notions of clarity, generality and ability to produce satisfying layouts for a fairly general class of graphs. Speed is also a criterion. We want our algorithms to lay out very large general weighted graphs, producing a straight- edge layout that reflects the edge weightings and that places nodes close to other nodes to which they are similar. Di Battista [Ba94] gives four aesthetics that are important for the general graph case. Since our layout is straight-edge, we trivially satisfy the aesthetic of avoiding bends in edges. Instead of keeping edge lengths uniform, we wish them to reflect the edge weights; our algorithms generally try to set the edge lengths inversely proportional to the weights, so that the strongest linked nodes are closest together in the ideal layout. Due to the computational cost of multiple edge-crossings detection, we elect to ignore the criterion that edge crossings should be minimized, trusting that our algorithms will produce good results without directly involving a measure of edge- crossings. Since we wish to show clusters of nodes and discriminate nodes far from such clusters, the aesthetic of distributing nodes evenly is not obviously useful, and we relax it considerably; usually adopting only a final polishing algorithm to move overlapping nodes a small amount. A last point worth making is that algorithms that are particularly good at displaying symmetric or planar / near-planar graphs [e.g. HaSa96, Ka93] are of limited value in our domain. Large weighted graphs are rarely close to planar or symmetric.

There are two types of algorithms used for laying out graphs in NicheWorks. First, an initial layout (section 3.1) is chosen. This initial layout should be fast, capable of laying out up to a million node graphs in a few minutes, so only simple algorithms should be used. The user may then chose to improve the graph layout with one or more incremental algorithms (section 3.2).

In the following discussion, algorithms are run on each connected component of the graph. The final layout is achieved by placing the components close to each other, with the largest in the center. The algorithm currently in place for this stage is rather naŒve: Components are represented by a circle sufficiently large to encompass the component. The circles are then laid out using a greedy algorithm that places the circles as close as possible to the center of the display in decreasing order of size. Figure 8(a) shows the limitations of this approach. A better approach would be to represent the components by their convex hulls and run an annealing algorithm that would move and rotate the polygons until they fitted together more compactly. Since there are usually relatively few components compared to the number of nodes, this stage should take only a small percentage of the total fitting time. This is not yet performed in the current version of the tool.

On an implementational note, we use any available parallel machine architecture to process each component separately. This is trivial to implement as no synchronization is necessary until all the components are placed together at the end.

3.1 Initial Layout

The currently available initial component layout algorithms are: The first two layout algorithms simply place the nodes at random locations either on the circle or on the grid. The tree layout algorithm was inspired by the cone-tree 3D visualization method for large hierarchies [Ro91], but avoids the occlusion problem induced by a 3D visual system while still easily coping with the size graphs typically displayed in cone tree examples. [Ba94] gives other examples of radial layout algorithms of which this is an example. The tree layout algorithm is designed to work for (surprise) trees, but also works well for DAGs and has proved to be useful for both general directed and undirected graphs. In the latter cases we find the source of the graph by working inward from the leaves until we find the center-most node(s). If there are multiple sources, the algorithm creates a fake root node as parent to all the real root nodes and lays out this enhanced graph.

Figure 1.
Radial Placement

The algorithm creates a tree from the enhanced graph by creating a subgraph G', initially consisting of just the root node. An iterative scheme is performed whereby all nodes that are one step away from G' are added to G', along with the strongest weighted edge from that node to G'. This builds up a tree on the graph and terminates when all the nodes are added. A naive implementation of this algorithm runs in O(DE) time . Each node is then labeled with the size of its subtree (leaf nodes only) A variable indicating subtree angle is also attached to each non-leaf node.

The root node is positioned at the center and given an angle of 360 degrees. This indicates the angular span of its subtree. We then perform the following iterative layout method:

For each of the leaf nodes of the positioned graph, we divide up the angular span available to it subtree using the size of each of its children's subtrees as weights. Thus if we had a node with angle 20' and three children with subtree sizes 3, 2 and 5, the respective angles allotted to them would be 6, 4 and 10 degrees respectively. The children are placed on a circle with radius proportional to their distance from the root node and are placed at the midpoint of their individual angular ranges, with their parent in the center of the overall range (complying with a common criterion for hierarchical layouts mentioned in, for example, [Co96]). An example is shown in figure 1. The root node (R) is drawn at the center, with its children on a circle centered at R of radius l. R has a subtree of size 20 and its child S has a subtree of size 10, so S acquires an angular span of 360*10/20 = 180 degrees. Its child T with subtree of size 5 gets a span of 180*5/10 = 90 degrees, U gets 180*2/10 = 36 degrees and V gets 180*3/10 = 54 degrees. This process continues until all the nodes have been positioned. The order of placing subnodes around the circle is the final element requiring definition. We have decided on the approach of placing the strongest weighted edges from a node to a child node in the center of the span, with the weaker ones to the edges. Thus the child with the strongest edge to its parent will be in the center of the range. This works well with a principle used in the iterative algorithms of section 3.2, that the shortest edges should indicate the strongest weights.

In practice we have found that a slight modification to this algorithm which does not use all the available arc to place children improves the layout. The slight loss in available space is more than offset by the improvement in the visible separation between subtrees.

3.2 Incremental algorithms

There are three incremental algorithms available. In each case, the user defines a potential function which describes the disparity between a weighted edge and the length of that edge. The edge length should be inversely proportional to its weight, so that strongly tied nodes are close together. Two of the more useful functions are a sum of terms of the following form:

a) (1-dw)2
b) |1-dw|
where d is the edge length and w is the edge weight. Each potential contribution is minimized when d = 1/w. The difference between (a) and (b) can be seen if we add a small perturbation to the optimal solution, making it instead 1/w+e
Then (a) gives w2 e2
whereas (b) gives we
so for a small absolute perturbation of the distance, (a) is more forgiving of minor variations than (b), assuming a transformation of the weights. Note that one natural layout method, multi-dimensional scaling, MDS [KrWi76] is inappropriate as it minimizes:
(d-1/w)2
for which a perturbationgives potential difference of e2, completely ignoring the weights, so that more irrelevant weak edges require the same precision of fitting as do strong edges. When an MDS potential function was used in the NicheWorks algorithms, it produced consistently worse layouts for graphs with differing edge weights Òthe drawback has practical as well as theoretical problems.

An important point to note is that the potential is a function only of the graph edges - if two nodes do not have an edge between them, then the distance between them is irrelevant to the potential function. This characteristic ensures that the potential calculations are fast, but has the substantial drawback that there is no force repelling nodes from each other. One outcome of this is that nodes that are unconnected may be drawn very close together if they are connected to similar other nodes with similar weights. In some contexts this is a very useful characteristic. One case is when examining large networks of phone calls to detect fraudulent calls; the grouping of telephone numbers that have similar, unusual calling patterns but do not call each other is highly desirable. In other contexts it is not so useful, and we have provided a specific incremental algorithm to ameliorate the overlapping problem (section 3.2.3).

3.2.1 Steepest Descent

For this method we consider the potential of the graph to be a function of the 2N- dimensional vector of locations of its nodes. Moving the location of the graph in this high dimensional space is equivalent to moving every node in the graph simultaneously. We calculate the gradient of this vector and want to move in that direction a suitable amount. To decide how far to move, we take three small trial steps in the direction of the unit gradient, using the potential at those locations to fit a polynomial of degree three to the potential function in that direction. We then solve for the distance that minimizes the fitted polynomial. Then we can move the configuration in 2N-space to the specified point along the gradient direction. We terminate the algorithm when we cannot move in any direction and improve our fit. The basic method is described for one dimensional functions in [BuFa85] and has been modified for our higher dimensional problem.

This method works well when the initial position is close to a local minimum. It does not require the initial configuration to be as close as is required for quadratic methods such as Newton's method, but it can take a long time to reach a good solution, especially for very large graphs. For this reason we suggest that either a good initial placement algorithm should be used, such as the tree layout, or that the simulated annealing swapping algorithm (3.2.2) be run prior to this method.

This method is a relatively slow method, with each step requiring the calculation of several gradient potential functions for offsets from the current location in 2N-space. Although each calculation is of order O(E) so the order of the whole process is O(E), the constant multiplier is quite high and our informal experience suggests that the number of iterations required to achieve good results is around O(—E) giving an overall order of O(L—E). Even for a million edges, this is not an overwhelming number, especially since the gradient calculations can all be performed easily with coarse-grained parallelism, providing up to a four-fold speed increase.

3.2.2 Simulated Annealing Swapping Algorithm

This algorithm is designed to improve a random layout rapidly; it is especially useful for random grid layouts. The algorithm randomly picks a pair of nodes and calculates the difference in potential if the nodes were swapped. If the swap decreases the potential, or if the increase is allowed by the annealing algorithm, then the nodes' positions are swapped. The annealing schedule is based on the amount of time the user allocates for the swapping routine to be run. Details of annealing algorithms in a graph layout context can be found in Davidson and HarelÌs paper [DaHa96]. They use an annealing approach to decide whether to move a node to a new randomly chosen position and we use annealing to decide whether to swap nodes, but the process is conceptually very similar and much of their discussion is appropriate for our algorithm. One important difference is in the number of iterations and the cooling schedule. [DaHa96] suggests 30N iterations, which is impractical for problems of our size. Instead our algorithms prompt the user for the maximal amount of time to run the algorithm, and the annealing algorithm uses the proportion of allotted time taken to reduce the temperature for each iteration.

Calculating the effect on the potential of a swap is linearly dependent on the number of edges involving either node. Sparse graphs are more common for large networks than near-complete graphs, with the average degree of the nodes remaining nearly constant as more nodes are added to the network . Thus the potential calculation is typically very fast and can be performed many times. Since nodes can be moved very long distances with one swap, this method is a powerful way of improving random layouts rapidly.

3.2.3 Repelling algorithm

The descent algorithm of (3.2.1) can produce layouts with nodes placed very close to each other since it only uses inter-node distances if there is a edge between them. To solve this problem, we introduced a last-stage algorithm to be run a few times only which calculates the nearest neighbors for all nodes and then moves the closest ones apart a small distance. Running this a few times will move overlapping nodes apart. This algorithm uses a quad-tree with an implementation as described in [NiHi93] that is O(logN) for all three operations of adding, deleting and calculating nearest neighbors. Thus each step of the algorithm is O(N logN), which is completely acceptable. 4. Interactive interface

The interface to NicheWorks falls under the general classification of a linked views environment, described in detail in [Wi97] and [EiWi95] and implemented at least partially in systems such as [Ve88, Wi90, Sw91, Ti90]. Under this paradigm, each view of the data is required to represent both the data themselves and a state vector that is attached to the data. This state vector indicates how each data point should contribute to the view appearance. In our implementation the possible states are: Deleted Treat the data point as if it were not present Normal Show the data
Highlighted Show the data so it will stand out against normal data Focused Show as much detail as possible on the data Furthermore the user should be allowed to modify the state vector by interacting with the data views. For example, selecting a specific bar from a bar chart view and highlighting it will change the data state vector for items represented by that bar, causing other views of the data immediately to update their representation.

Figure 2. Web site with nodes of type 'link' highlighted

Figure 3. Nodes with degree zero have been deleted and of the rest,
nodes with type 'query' have been highlighted.

In NicheWorks there are a number of different options for displaying the graph using the state vector, and for interacting with the graph. The crucial point is that nodes and edges each have a state vector which must be taken into account when drawing the graph. If an edge is highlighted, but each of its end nodes are deleted, should it be shown, and if so, how? We have not performed solid user trials in this area, but our experience with using the tool has lead us to create a set of possible options, some of which are examined in this section. We use an example data set consisting of a few hundred web pages and links between them to exemplify the approach. This data set was collected by listing all the pages near the top level of the authorÌs directory and feeding the references to them to MOMspider [Fi94] which uses references in those documents to search out new pages on the web.

Figure 2 shows the results of selecting only nodes labeled as 'link' (a standard web page) in the default configuration. Selected nodes are drawn in a highlight color, with unselected nodes in gray. Edges are only drawn from selected nodes to other selected nodes. To create figure 3, we created a histogram of the degree of each node (not shown) and then used the mouse to select those of degree zero. We then set their state to 'deleted' as we are not interested in these degenerate components. In the 'Type' bar chart we then select the 'query' type to see which links called query routines.

There is an interesting component where all the child nodes are queries. In NicheWorks we can move the pointer over those nodes so that as the mouse is moved over the data the labels appear and disappear rapidly. We see that all the query nodes are searches into a film database for various films, and the central node is called 'films96.html' - it looks like a page of film reviews.

Figure 4(a). Option to
show unselected links in
gray on
Figure 4(b) Option to
show unselected links in
gray off

Compare figures 4(a) and 4(b) in which we have used NicheWorks to select only one component and then hidden the others. In (b) we have hidden the unselected or normal data - we see only those data that are highlighted. This can be very useful when visualizing large graphs as it allows us to focus in on subsets of nodes or edges which fulfill certain conditions. One commonly observed use among novice NicheWorks users in the telephony domain is that they will hide all nodes (telephone numbers) except those that make a large number of calls. After working with the usage patterns of these high-runners, they will then broaden the scope of the exploration.

Figure 5. Using edge statistics to highlight nodes and show the
distributions of statistics for those selected nodes

In figure 5, we show how fairly complex queries can be posed naturally through the linked views metaphor. In this figure we have changed the 'Type' bar chart to a Spineplot, where each bar has a fixed height and the width of the bar indicates its count. Within each bar the darker area shows the percentage of selected cases within the bar as a height. We have also created a bar chart of edge counts. This bar chart shows the number of times a given URL refers to another URL. We have selected all counts above one; i.e. all those links to a URL from a URL that occur multiple times. This selection defines a subset of highlighted edges which in turn highlights those nodes that are endpoints of the edges. These are the nodes that either refer to the same URL more than once or are referred to more than once by the same URL. The type bar chart and the NicheWorks view both immediately show this result; we can see that images never have multiple edges to them, regular pages sometimes do and queries do about half the time.

Figure 6. Node appearance details

As well as selection mechanisms, the user can use attributes directly to encode information onto the view. In figure 6 we have zoomed in on a section of the site showing a set of bookmarks and can see the arrowheads as well as the node encoding. This graph has been defined as a directed graph. The nodes' shapes have been coded by type; regular URLs are squares, images are circles and queries are diamonds. The color represents the action taken by the web spider on the node. Although they are hard to distinguish in this grayscale representation of the view, the darker gray represents pages that were successfully accessed, and lighter grays represent ones avoided or ones where access to them failed. We have also used the mouse to selectively label nodes.

4.1 Speed considerations

Research into human-computer systems has shown that if an operation is performed sufficiently fast (under 200 ms for the average person) then it is perceived as being instantaneously connected to the initiating actions. One important application for a graph drawing program is that using the mouse to zoom in or out and to pan around or rotate the graph should have the graph being continuously re-displayed within this time period. Failure to do this means the user will difficulty manipulating the view. Of course, re- drawing a million nodes and edges of varying shapes and colors within 200 ms is impossible on most machines, and by the time it is possible, we'll want to visualize a billion nodes and edges. We need some partial-drawing solution.

A naive approach would be to start drawing at time t0 and continuously check the time until we reach time t0 + 200 ms, then stop drawing. This has two drawbacks. First, and less importantly, polling the time is a notoriously slow operation on most computer systems and checking more than a few times will degrade performance - exactly what we want to avoid. The other failing is critical; if we adopt the naive approach for a very large, dense graph which has a lot of overplotting, then the partially drawn graph is guaranteed to look very different from the completed drawing. This is because the partial drawing shows only those nodes and edges that lie below the ones that are usually drawn later and so appear on top.

For these reasons we adopt a different approach in NicheWorks. Using the results of previous drawing times (number of nodes drawn and time to draw that many) we predict how many items can be drawn within the next 200 ms. We then commence drawing midway through the ordered list of items until we reach the end. This method ensures that the picture is as similar as possible to the complete version. The details of the prediction algorithm are quite complex, as it needs to be robust, especially to the sudden spikes associated with processor sharing, and yet smooth, as we do not want nodes and edges flickering as we zoom and pan. It must also be adaptive, since as we zoom and pan we are actually drawing different numbers of nodes and edges (the major source of delay time). We intend to deal with this algorithm in detail in another paper, but in brief the algorithm uses a robust moving regression model that ignores the most extreme times in the moving average range. It simultaneously calculates several different such models and uses past performance to decide which of the models to choose for the current decision. Empirical results on UNIX X-Windows systems and Windows/DOS boxes indicate that the prediction method works well, rarely overshooting more than 50% of the time in the absence of a spike and without too much flicker.

4.2 Impact on layout algorithms

The state vector can be very useful when laying out large graphs. If we set a node's state to deleted, then it plays no part of the layout process, nor do any edges