VISUAL EXPLORATION OF LARGE STRUCTURED DATA SETS

Graham J. Wills
Bell Laboratories (Lucent Technologies), 1000 E. Warrenville Road,
Naperville IL 60532

Abstract: Visualization is a critical technology for understanding complex, data-rich systems. Effective visualizations make important features of the data immediately recognizable and enable the user to discover interesting and useful results by highlighting patterns. Visualization is particularly appropriate for large data sets, where standard statistical methodology can become swamped. Furthermore, such large data sets often have a unique structure associated with them which needs to be understood in the context of the data. Visual exploration provides a way to investigate relationships and patterns in such data which are hard to find by other means.

1. Motivation

Current trends in the growth of data sources and in the growth of computing power are encouraging to the analyst. Even though the amount of data in the world is increasing exponentially, the ability of computers to process the data appears to be at least keeping pace with that growth curve. This is a happy picture as far as the computer scientist is concerned, but the analysis process has another element whose capacity is not growing at such a rate: the analyst. With an increasing amount of data, the standard analytical tools develop problems: with millions of data points every parameter becomes significant (in the statistical sense), outliers become too numerous to deal with individually (if 0.1% of the data is erroneous, a million-point data set has a thousand outliers), and summary information can hide major structure in the data. In a typical small analysis, a model is required to fit all the data; a model that fits only part of the data is rarely considered and its utility is limited. For a large data set, by contrast, a model fitting a small fraction of the data can be very useful.
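The two numerical claims above can be checked with a short calculation. The sketch below is illustrative only: it assumes a simple two-sided z-test with known standard deviation, and the effect size (a shift of 1% of a standard deviation) is a hypothetical value chosen for the example, not a figure from any data set discussed here.

```python
import math


def two_sided_p_value(effect, sigma, n):
    """Two-sided p-value of a z-test for a mean shift `effect`,
    assuming a known standard deviation `sigma` and `n` cases."""
    z = effect / (sigma / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical effect: a shift of 1% of one standard deviation.
effect, sigma = 0.01, 1.0
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p_value(effect, sigma, n))

# Expected number of erroneous points at a 0.1% error rate.
expected_outliers = round(0.001 * 1_000_000)
print(expected_outliers)
```

Under these assumptions the same negligible shift is nowhere near significant at a hundred cases, yet overwhelmingly significant at a million, which is the sense in which every parameter "becomes significant" as the data grow.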

A further difference between large and small data sets is structure. Many big databases are built to deal with a specific type of data that has intrinsic structure. This structure is of vital importance in the analysis process and needs to be incorporated into the exploration and modelling stages. Examples of such databases include:

At Bell Laboratories, our research group has been developing tools which visualize these and similar domains. We integrate interactive forms of a number of statistical views with specific visualizations of the structural component of the data so that the analyst can understand the inter-relationships in a way not possible with standard statistical tools.

2. Large Data Sets

As time progresses, the definition of 'large' increases at an exponential rate. Older statistics texts speak of hundred-case sets as being large. In the mid-80's anything that did not fit on a 640K floppy was large. At the end of the 80's large data sets were several megabytes in size. Now they are measured in gigabytes, and in the future we may expect them to be measured in terabytes. The exploration of such large data sets is a task that combines aspects of computer science and statistics. Information visualization is of key importance in examining large data sets, particularly in the following areas:

A good visualization system allows the user to explore relationships among the data, investigate for what subsets of the data the relations hold true, suggest models for the data and indicate anomalous or potentially interesting data points. Such a system should require the user to make minimal assumptions about the form of the data, and should allow the user to try out hypotheses rapidly, facilitating a "what if" exploration of the data. In the rest of this section we will explore some features that are commonly found in visual exploration systems. Eick and Wills (1995) provide an overview of the field of information visualization, including techniques and operations that are used in this paper. We describe them only briefly here.

2.1. Linked Views, Filtering and Focusing

The linked views paradigm offers a solution to the common problem of identifying relationships in a multivariate data set. The difficulty is that displaying relationships among many variables at once is hard. One variable can be displayed using histograms, bar charts, boxplots, dotplots, rose diagrams and the like. Two variables can be compared with scatterplots, multiple boxplots / dotplots and categorical tables. Although some efforts have been made with three or higher dimensional views [1], they are often hard to navigate and interpret, so the benefits of the view for exploration are lost.

The linked views approach is to allow the user to create a number of simple univariate, bivariate or even multivariate views to examine variables that the analyst believes to be important. The user can interact with a view using a mouse (typically, clicking on a bar or cell or dragging the mouse around a region of interest) to indicate a subset of the data that they find worth further study. That subset will be drawn differently in each plot that the user has created, allowing interesting features of one plot to be seen in the context of other variables.
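The core of the mechanism is a single selection shared by every view. The following is a minimal sketch of that idea, not the implementation of any system cited in this paper: the class name `LinkedViews`, its methods, and the five player records are all hypothetical and invented for illustration. A "view" here is reduced to the counts a bar chart would need to draw.

```python
from collections import Counter


class LinkedViews:
    """Holds one shared selection that every view consults when drawn."""

    def __init__(self, rows):
        self.rows = rows          # one dict per case
        self.selected = set()     # indices of the currently selected cases

    def select(self, predicate):
        """Stand-in for a mouse sweep: select every case matching predicate."""
        self.selected = {i for i, row in enumerate(self.rows) if predicate(row)}

    def bar_chart(self, variable):
        """Return {category: (total, highlighted)} for one bar chart."""
        total = Counter(row[variable] for row in self.rows)
        highlighted = Counter(self.rows[i][variable] for i in self.selected)
        return {k: (total[k], highlighted[k]) for k in sorted(total)}


# Invented records, for illustration only.
rows = [
    {"position": "1B", "years": 2, "salary": 100},
    {"position": "1B", "years": 8, "salary": 900},
    {"position": "SS", "years": 3, "salary": 150},
    {"position": "SS", "years": 10, "salary": 1200},
    {"position": "OF", "years": 1, "salary": 90},
]
views = LinkedViews(rows)
views.select(lambda row: row["salary"] >= 500)   # sweep over the high salaries
by_position = views.bar_chart("position")
by_years = views.bar_chart("years")
print(by_position)
print(by_years)
```

Because each chart reads the same `selected` set, a sweep made in one plot is reflected in all of the others without any view needing to know about the rest.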

This technique has its origins in scatterplot brushing (Cleveland and McGill, 1984a, also in several articles collected in Cleveland and McGill, 1988) but has since been generalized and implemented in many systems, notably Velleman (1988), Haslett et al. (1990), Tierney (1990) and Swayne et al. (1991).

2.2. Example - Correlating hitters' performances with their salaries in baseball

Figure 4. A first look at the ASA baseball data.

Figure 4 shows a system being developed by the author for linked windows analyzing a set of statistics on baseball data provided by the American Statistical Association (ASA). This is only a small data set, but it is highly multivariate and thus a good candidate for linked windows techniques. The fundamental question posed by the ASA was to investigate whether players' salaries were related to their performances. Hoaglin and Velleman (1995) survey a number of approaches to this data. To create figure 4 we focused on seven of the relevant variables and created plots for them. We created a smoothed histogram of the log of the salary variable, bar charts for both number of years playing the game and field position, and scatterplots of average career home runs against average career hits (these measure the batting ability of a player) and of average put-outs vs. assists (these measure defensive fielding ability) [2].

In this figure we have swept the mouse over the high end of the salary data, focusing our attention on the high-salary players. All plots immediately update to show this subset in a different color. We can very rapidly change our selection in this plot and examine how middle and low salary players differ in their statistics. We learn the following:

To explore our hypothesis of a relationship between salary and years in the league, we can select bars of the 'years' bar chart sequentially and see the effects. The results confirm our expectation, with 'young' players getting a narrowly defined (and low) salary range. Plotting years against log salary also demonstrates that there is an interesting non-linear relationship here with a break point between 3 and 4 years. This appears to be related to the rules that govern when players are eligible to become free agents.

Finally, we noted the two groups in the putouts-assists scatterplot. When we select one of those groups (figure 5), we see that they correspond directly with fielding position. To someone familiar with baseball, this would not be surprising, but the author, at least, learned something new from these linked views.

Figure 5. Exploration of the assists-putouts groups

2.3. Minimal Labeling and "non-data ink"

Both Tukey (1977), writing about exploratory data analysis, and Tufte (1983), writing about visual presentation of information, stress the importance of showing the data. Tufte especially emphasizes the importance of maximizing the "data-ink ratio", in other words using as much of the ink as possible to show the data and as little as possible to show ancillary information. He gives amusing examples of poor graphics where this principle is ignored. On a computer screen, where we are limited to a relatively small number of pixels for each plot, it is even more important to make sure the data are the most visible part of the plot. By drawing minimal labels and ensuring that they are less visually prominent we can achieve our goal of focusing attention on the data without losing contextual information. In figure 4 the labels are drawn in a dark shade of gray that stands out against the black background less than the shade for drawing the data. The color for the highlighted points stands out most strongly of all.

This graded attention focus allows our visual system to see the most important aspects of a view immediately, followed by the less important aspects. If we need more detailed labeling or more information about a specific plot, we have the advantage over a paper representation that we can click on the plot and display more detail immediately. Hidden details can rapidly be displayed then hidden again. We can click on a point in a scatterplot and see its exact values, so drawing a grid across a scatterplot is unnecessary. Similarly the shape of a histogram is the most important thing. Knowing how many items fall in a given bin is a less useful piece of information, so it too is hidden by default.

3. Structured Data Sets

Classical statistical methods are based upon a matrix of data, with each variable being a column and each case a row. Most tools and computer packages are designed with this scenario in mind. With the growth in computer usage in recent years, more companies, government agencies and academics are recording data not based on experiment, but on observation of an existing process. The resulting data sets rarely fit into the 'data matrix' paradigm since real-life processes often have an interesting and important structure that must be understood and analyzed in conjunction with the data. Classical tools are often used on subsets of the data, but the basic task of evaluating the data and forming hypotheses requires the analyst to have exploratory tools that aid them in studying the data in the context of the structure. Some examples of structured data that the author has worked with include:

3.1. Example

The software production problem introduced above is an important one to the author's company. Our research team developed a customized view of source code files that allowed up to 100,000 lines of code to be displayed using a reduced representation method. It allows a number of statistics based on the code to be analyzed. These usually include date of addition, reason for addition, category of addition (e.g. fixing an error or adding a feature), who made the alteration and a reference to the other alterations made for the same reason. The program, SeeSoft (Eick, 1994), can also be used to visualize other bodies of text. In figure 6 we show SeeSoft being used to analyze a novel, Rudyard Kipling's The Jungle Book. Here we have shown one statistic: the major character who is referred to on each line.

The main window of figure 6 has three sections. To the top left is a list of values of the statistic, each assigned to a color. In this example the statistic is the name of the major character referred to on that line (if none, '---' is recorded). The width of the bars can be set to encode the frequency of that value, forming a bar chart. In this case we have omitted that option as the large majority of lines mention no character.

At the bottom left is a set of command buttons and the rest of the main window is taken up with the data. Each file in a set of files (here, each chapter in the book) is represented by one or more columns, each containing representations of lines of text as single pixel height lines on the screen. In figure 6, chapter 1 can be seen in its entirety to the left of the display. The width and start positions of the lines mirror the length and indentation of the original data. The color of the line represents the character appearing in that line, as specified by the statistic list in the top left. By examining the display we can see the relative lengths of chapters, who appears in which chapters and the relationships between the characters. For example, we see that Shere Khan (in red) is the first character to appear, followed by Mowgli (green), the hero. Akela, Bagheera and Baloo (shades of blue) then join the narrative. This follows a classical pattern of story-telling: first the villain is introduced, then the hero and finally his friends and supporters. In the program itself, the user can drag the mouse over the color bar and gray out colors immediately. By graying out everything and then selecting only a few characters it is much easier to discover relationships than figure 6 suggests. Other options allow re-coloring of statistics to aid discovery.
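The reduced representation described above amounts to mapping each text line to a small drawing instruction. The sketch below illustrates that mapping only; it is not SeeSoft's code. The function `reduce_lines`, its parameters, the rule that the first matching name decides the color, and the three sample lines are all assumptions made for the example (the sample lines are placeholders, not quotations from the novel).

```python
def reduce_lines(lines, color_of, column_height, no_color="gray"):
    """Map each text line to (column, row, indent, width, color).

    column, row -- where the one-pixel-high mark is drawn; a file wraps
                   into a new column every `column_height` lines
    indent      -- leading spaces, used as the start position of the mark
    width       -- length of the line without surrounding whitespace
    color       -- color of the first name in `color_of` found on the line,
                   or `no_color` if none is mentioned (an assumed rule)
    """
    marks = []
    for i, line in enumerate(lines):
        stripped = line.lstrip(" ")
        indent = len(line) - len(stripped)
        width = len(stripped.rstrip())
        color = next(
            (c for name, c in color_of.items() if name in line), no_color
        )
        marks.append((i // column_height, i % column_height, indent, width, color))
    return marks


# Placeholder lines, invented for illustration.
lines = [
    "Shere Khan roared.",
    "    Mowgli laughed.",
    "The night was quiet.",
]
color_of = {"Shere Khan": "red", "Mowgli": "green"}
marks = reduce_lines(lines, color_of, column_height=2)
for mark in marks:
    print(mark)
```

Graying out a character, as described above, would then correspond to changing one entry of `color_of` and redrawing, with no need to touch the text itself.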

The other window is a browsing window giving the user full access to the text. By clicking on the reduced text representation, the browser window shows the actual text associated with that line, again colored by the statistic. This allows the nature of discovered relationships to