| The difference between displaying networks with 100-1000 nodes and displaying ones with 10,000-100,000 nodes is not merely quantitative, it is qualitative. Layout algorithms suitable for the former are too slow for the latter, requiring new algorithms or modified (often relaxed) versions of existing algorithms to be invented. The density of nodes and edges displayed per inch of screen real estate requires special visual techniques to filter the graphs and focus attention. Compounding the problem is that large real-life networks are often weighted graphs and usually have additional data associated with the nodes and edges. A system for investigating and exploring such large, complex data sets needs to be able to display both graph structure and node and edge attributes so that patterns and information hidden in the data can be seen. In this paper we describe a tool that addresses these needs, the NicheWorks tool. We describe and comment on the available layout algorithms and the linked views interaction system, and detail two examples of the use of NicheWorks for analyzing web sites and detecting international telephone fraud. |
NicheWorks was designed to examine large telecommunications networks, and has been applied extensively to understanding calling patterns among telephone customers both for customer understanding and fraud detection. In this domain, information about the customer (type of service, geographical location, etc.) and the calls they make (date, time of day, duration) need to be understood in the context of who calls whom. NicheWorks was developed to examine such inter-relationships. Typical questions that NicheWorks was designed to answer deal with clustering users based on their calling patterns and collected statistical attributes; detecting and characterizing atypical calling patterns; understanding how an interesting subsetÌs calling patterns differ from those of the whole; and identifying individuals who have been classified into a given category by a purely data- driven algorithm, but who do not appear to fit that category based on their calling patterns.
Since its inception, a number of other data sources have been analyzed by NicheWorks and NicheWorks has been adapted and modified into a general purpose tool. It has been applied to a variety of different problem areas, including:
The user can then select one of several available iterative algorithms to use to improve the layout; the user also allocates the amount of time allowed for these to run. While laying out the data, the user can continue to change attributes, re-draw the graph at intermediate stages, show or hide parts of the graph, and even change the weighting. Finally, the user can save the positioning results for later use.
There are two types of algorithms used for laying out graphs in NicheWorks. First, an initial layout (section 3.1) is chosen. This initial layout should be fast, capable of laying out up to a million node graphs in a few minutes, so only simple algorithms should be used. The user may then chose to improve the graph layout with one or more incremental algorithms (section 3.2).
In the following discussion, algorithms are run on each connected component of the graph. The final layout is achieved by placing the components close to each other, with the largest in the center. The algorithm currently in place for this stage is rather naŒve: Components are represented by a circle sufficiently large to encompass the component. The circles are then laid out using a greedy algorithm that places the circles as close as possible to the center of the display in decreasing order of size. Figure 8(a) shows the limitations of this approach. A better approach would be to represent the components by their convex hulls and run an annealing algorithm that would move and rotate the polygons until they fitted together more compactly. Since there are usually relatively few components compared to the number of nodes, this stage should take only a small percentage of the total fitting time. This is not yet performed in the current version of the tool.
On an implementational note, we use any available parallel machine architecture to process each component separately. This is trivial to implement as no synchronization is necessary until all the components are placed together at the end.
|
Radial Placement |
The algorithm creates a tree from the enhanced graph by creating a subgraph G', initially consisting of just the root node. An iterative scheme is performed whereby all nodes that are one step away from G' are added to G', along with the strongest weighted edge from that node to G'. This builds up a tree on the graph and terminates when all the nodes are added. A naive implementation of this algorithm runs in O(DE) time . Each node is then labeled with the size of its subtree (leaf nodes only) A variable indicating subtree angle is also attached to each non-leaf node.
The root node is positioned at the center and given an angle of 360 degrees. This indicates the angular span of its subtree. We then perform the following iterative layout method:
For each of the leaf nodes of the positioned graph, we divide up the angular span available to it subtree using the size of each of its children's subtrees as weights. Thus if we had a node with angle 20' and three children with subtree sizes 3, 2 and 5, the respective angles allotted to them would be 6, 4 and 10 degrees respectively. The children are placed on a circle with radius proportional to their distance from the root node and are placed at the midpoint of their individual angular ranges, with their parent in the center of the overall range (complying with a common criterion for hierarchical layouts mentioned in, for example, [Co96]). An example is shown in figure 1. The root node (R) is drawn at the center, with its children on a circle centered at R of radius l. R has a subtree of size 20 and its child S has a subtree of size 10, so S acquires an angular span of 360*10/20 = 180 degrees. Its child T with subtree of size 5 gets a span of 180*5/10 = 90 degrees, U gets 180*2/10 = 36 degrees and V gets 180*3/10 = 54 degrees. This process continues until all the nodes have been positioned. The order of placing subnodes around the circle is the final element requiring definition. We have decided on the approach of placing the strongest weighted edges from a node to a child node in the center of the span, with the weaker ones to the edges. Thus the child with the strongest edge to its parent will be in the center of the range. This works well with a principle used in the iterative algorithms of section 3.2, that the shortest edges should indicate the strongest weights.
In practice we have found that a slight modification to this algorithm which does not use all the available arc to place children improves the layout. The slight loss in available space is more than offset by the improvement in the visible separation between subtrees.
a) (1-dw)2where d is the edge length and w is the edge weight. Each potential contribution is minimized when d = 1/w. The difference between (a) and (b) can be seen if we add a small perturbation to the optimal solution, making it instead 1/w+e
b) |1-dw|
Then (a) gives w2 e2so for a small absolute perturbation of the distance, (a) is more forgiving of minor variations than (b), assuming a transformation of the weights. Note that one natural layout method, multi-dimensional scaling, MDS [KrWi76] is inappropriate as it minimizes:
whereas (b) gives we
(d-1/w)2for which a perturbationgives potential difference of e2, completely ignoring the weights, so that more irrelevant weak edges require the same precision of fitting as do strong edges. When an MDS potential function was used in the NicheWorks algorithms, it produced consistently worse layouts for graphs with differing edge weights Òthe drawback has practical as well as theoretical problems.
An important point to note is that the potential is a function only of the graph edges - if two nodes do not have an edge between them, then the distance between them is irrelevant to the potential function. This characteristic ensures that the potential calculations are fast, but has the substantial drawback that there is no force repelling nodes from each other. One outcome of this is that nodes that are unconnected may be drawn very close together if they are connected to similar other nodes with similar weights. In some contexts this is a very useful characteristic. One case is when examining large networks of phone calls to detect fraudulent calls; the grouping of telephone numbers that have similar, unusual calling patterns but do not call each other is highly desirable. In other contexts it is not so useful, and we have provided a specific incremental algorithm to ameliorate the overlapping problem (section 3.2.3).
This method works well when the initial position is close to a local minimum. It does not require the initial configuration to be as close as is required for quadratic methods such as Newton's method, but it can take a long time to reach a good solution, especially for very large graphs. For this reason we suggest that either a good initial placement algorithm should be used, such as the tree layout, or that the simulated annealing swapping algorithm (3.2.2) be run prior to this method.
This method is a relatively slow method, with each step requiring the calculation of several gradient potential functions for offsets from the current location in 2N-space. Although each calculation is of order O(E) so the order of the whole process is O(E), the constant multiplier is quite high and our informal experience suggests that the number of iterations required to achieve good results is around O(—E) giving an overall order of O(L—E). Even for a million edges, this is not an overwhelming number, especially since the gradient calculations can all be performed easily with coarse-grained parallelism, providing up to a four-fold speed increase.
Calculating the effect on the potential of a swap is linearly dependent on the number of edges involving either node. Sparse graphs are more common for large networks than near-complete graphs, with the average degree of the nodes remaining nearly constant as more nodes are added to the network . Thus the potential calculation is typically very fast and can be performed many times. Since nodes can be moved very long distances with one swap, this method is a powerful way of improving random layouts rapidly.
The interface to NicheWorks falls under the general classification of a linked views
environment, described in detail in [Wi97] and [EiWi95] and implemented at least
partially in systems such as [Ve88, Wi90, Sw91, Ti90]. Under this paradigm, each view of
the data is required to represent both the data themselves and a state vector that is
attached to the data. This state vector indicates how each data point should contribute to
the view appearance. In our implementation the possible states are:
Deleted Treat the data point as if it were not present
Normal Show the data
Highlighted Show the data so it will stand out against normal data
Focused Show as much detail as possible on the data
Furthermore the user should be allowed to modify the state vector by interacting with the
data views. For example, selecting a specific bar from a bar chart view and highlighting it
will change the data state vector for items represented by that bar, causing other views of
the data immediately to update their representation.
|
|
nodes with type 'query' have been highlighted. |
In NicheWorks there are a number of different options for displaying the graph using the state vector, and for interacting with the graph. The crucial point is that nodes and edges each have a state vector which must be taken into account when drawing the graph. If an edge is highlighted, but each of its end nodes are deleted, should it be shown, and if so, how? We have not performed solid user trials in this area, but our experience with using the tool has lead us to create a set of possible options, some of which are examined in this section. We use an example data set consisting of a few hundred web pages and links between them to exemplify the approach. This data set was collected by listing all the pages near the top level of the authorÌs directory and feeding the references to them to MOMspider [Fi94] which uses references in those documents to search out new pages on the web.
Figure 2 shows the results of selecting only nodes labeled as 'link' (a standard web page) in the default configuration. Selected nodes are drawn in a highlight color, with unselected nodes in gray. Edges are only drawn from selected nodes to other selected nodes. To create figure 3, we created a histogram of the degree of each node (not shown) and then used the mouse to select those of degree zero. We then set their state to 'deleted' as we are not interested in these degenerate components. In the 'Type' bar chart we then select the 'query' type to see which links called query routines.
There is an interesting component where all the child nodes are queries. In NicheWorks we can move the pointer over those nodes so that as the mouse is moved over the data the labels appear and disappear rapidly. We see that all the query nodes are searches into a film database for various films, and the central node is called 'films96.html' - it looks like a page of film reviews.
|
|
show unselected links in gray on | show unselected links in gray off |
Compare figures 4(a) and 4(b) in which we have used NicheWorks to select only one component and then hidden the others. In (b) we have hidden the unselected or normal data - we see only those data that are highlighted. This can be very useful when visualizing large graphs as it allows us to focus in on subsets of nodes or edges which fulfill certain conditions. One commonly observed use among novice NicheWorks users in the telephony domain is that they will hide all nodes (telephone numbers) except those that make a large number of calls. After working with the usage patterns of these high-runners, they will then broaden the scope of the exploration.
|
distributions of statistics for those selected nodes |
In figure 5, we show how fairly complex queries can be posed naturally through the linked views metaphor. In this figure we have changed the 'Type' bar chart to a Spineplot, where each bar has a fixed height and the width of the bar indicates its count. Within each bar the darker area shows the percentage of selected cases within the bar as a height. We have also created a bar chart of edge counts. This bar chart shows the number of times a given URL refers to another URL. We have selected all counts above one; i.e. all those links to a URL from a URL that occur multiple times. This selection defines a subset of highlighted edges which in turn highlights those nodes that are endpoints of the edges. These are the nodes that either refer to the same URL more than once or are referred to more than once by the same URL. The type bar chart and the NicheWorks view both immediately show this result; we can see that images never have multiple edges to them, regular pages sometimes do and queries do about half the time.
|
As well as selection mechanisms, the user can use attributes directly to encode information onto the view. In figure 6 we have zoomed in on a section of the site showing a set of bookmarks and can see the arrowheads as well as the node encoding. This graph has been defined as a directed graph. The nodes' shapes have been coded by type; regular URLs are squares, images are circles and queries are diamonds. The color represents the action taken by the web spider on the node. Although they are hard to distinguish in this grayscale representation of the view, the darker gray represents pages that were successfully accessed, and lighter grays represent ones avoided or ones where access to them failed. We have also used the mouse to selectively label nodes.
A naive approach would be to start drawing at time t0 and continuously check the time until we reach time t0 + 200 ms, then stop drawing. This has two drawbacks. First, and less importantly, polling the time is a notoriously slow operation on most computer systems and checking more than a few times will degrade performance - exactly what we want to avoid. The other failing is critical; if we adopt the naive approach for a very large, dense graph which has a lot of overplotting, then the partially drawn graph is guaranteed to look very different from the completed drawing. This is because the partial drawing shows only those nodes and edges that lie below the ones that are usually drawn later and so appear on top.
For these reasons we adopt a different approach in NicheWorks. Using the results of previous drawing times (number of nodes drawn and time to draw that many) we predict how many items can be drawn within the next 200 ms. We then commence drawing midway through the ordered list of items until we reach the end. This method ensures that the picture is as similar as possible to the complete version. The details of the prediction algorithm are quite complex, as it needs to be robust, especially to the sudden spikes associated with processor sharing, and yet smooth, as we do not want nodes and edges flickering as we zoom and pan. It must also be adaptive, since as we zoom and pan we are actually drawing different numbers of nodes and edges (the major source of delay time). We intend to deal with this algorithm in detail in another paper, but in brief the algorithm uses a robust moving regression model that ignores the most extreme times in the moving average range. It simultaneously calculates several different such models and uses past performance to decide which of the models to choose for the current decision. Empirical results on UNIX X-Windows systems and Windows/DOS boxes indicate that the prediction method works well, rarely overshooting more than 50% of the time in the absence of a spike and without too much flicker.