EDV: Baseball analysis

This data set consists of statistics for a number of hitters (non-pitchers) who were in the game when the data were collected several years ago. (Working out the date is left as an exercise. Hint: look at how long Pete Rose has been in the game in the data set.) The object of the analysis is to work out which factors most strongly affect how much a baseball player is paid, but we have also looked at other interesting features of the data set.

A quick description of the interface

In EDV, the user reads data into the program and then selects variables using the mouse. Menu choices allow them to derive new variables, rank variables and otherwise modify them; various options can be set and, of course, data views can be created.

In this example, all of the variables form one data table so all views created of the data are automatically linked. In the following analysis, we concentrate on the career data as opposed to the current year's data (e.g. CHits/Years as opposed to Hits/AtBat) and present a distilled version of our original exploration.

Time is money?

The distribution of salary is clearly skewed. Since salaries are usually increased by a percentage rather than by a fixed amount, we expect this kind of right-skewed distribution, and it is common to try a log transformation to symmetrize the data. It seems to work pretty well, although there appears to be a longer tail at the low salary end - there seems to be an additional effect that gives some players lower salaries than we might expect.
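As a rough illustration of why the log transformation helps, the sketch below compares the skewness of a set of salaries before and after taking log(1 + salary). The salary figures are made up for the example and are not taken from the data set; the skewness function is an ordinary moment-based estimate, not anything specific to EDV.

```python
import math

def skewness(values):
    """Moment-based skewness: third central moment / (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical salaries in thousands of dollars (NOT from the data set).
salaries = [70, 90, 100, 150, 250, 400, 750, 1200, 2400]

# log1p(s) computes log(1 + s), the same form as the Log(1+Salary) variable.
log_salaries = [math.log1p(s) for s in salaries]

raw_skew = skewness(salaries)
log_skew = skewness(log_salaries)
print("raw skewness:", round(raw_skew, 2))
print("log skewness:", round(log_skew, 2))
```

A skewness near zero indicates a roughly symmetric distribution, so a drop in skewness after the transformation is what we are looking for.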

One factor that seems very likely to have an effect is the length of time a player has been playing; we would expect players' salaries to increase over time. We create a scatterplot of salary against years and add a smoothing line through the data. We've also colored the players by their salaries, with low-paid players shown as blue and high-paid ones as red.

This plot has a very clear interpretation: for the first 5-6 years a player's salary increases exponentially (remember the salary data is on a log scale), after which it remains more or less constant, perhaps dipping slightly towards the end of a longer career as a player's increases do not match the average increases.
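The smoothing line can be approximated in many ways. The sketch below uses a simple moving-window average, which is our own stand-in and not necessarily the smoother EDV applies; the function name window_smooth, the half_width parameter and the data points are all assumptions made for illustration.

```python
from statistics import mean

def window_smooth(xs, ys, half_width=1):
    """For each distinct x, average every y whose x lies within half_width."""
    smoothed = {}
    for x0 in sorted(set(xs)):
        nearby = [y for x, y in zip(xs, ys) if abs(x - x0) <= half_width]
        smoothed[x0] = mean(nearby)
    return smoothed

# Hypothetical (years played, log salary) pairs, NOT from the data set.
years = [1, 1, 2, 3, 3, 4, 5, 6, 8, 10, 12, 15]
log_salary = [4.3, 4.5, 4.9, 5.4, 5.6, 6.0, 6.4, 6.6, 6.7, 6.6, 6.7, 6.5]

curve = window_smooth(years, log_salary)
for yr, level in curve.items():
    print(yr, round(level, 2))
```

On a log scale, a straight rising segment of the smoothed curve corresponds to exponential growth in salary, and a flat segment to a constant salary.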

Other factors?

In a set of boxplots of hitting statistics, we selected the highest-paid players. Each of the boxplots and the bar chart shows the results. From the bar chart we see the career-length effect already noted, and in the boxplots we can see that the higher-paid players are indeed better than average. In particular, note that the AtBats and Hits are almost always above the median - only a few outliers are below it. This is not true of the HR - home runs - statistic. There are a significant number of well-paid players who are not big hitters.

For these views, we selected those players with very long careers. We colored them by their average number of AtBats per year and created a spreadsheet-like list view so that we could identify them.

Although most of the hitters are fairly similar, there are two obvious unusual cases:

  • Pete Rose: Not only is he the player with the longest career, he also has a very high AtBat average. He's also clearly not a home run hitter, and his RBIs are lowish too.
  • Rick Dempsey: Low figures for a reasonable salary

What about league or division differences?

A variety of tables and graphs were created to examine differences between the leagues and divisions. Apart from factors attributable to the 'Designated Hitter' rule, there seemed to be no evidence of any overall effect. These plots show that neither league nor division prefers more experienced players or is more active in recruiting younger players.

A digression: Fielding information

This animation shows a triplot of fielding information and a bar chart coding players' fielding positions. Note that we have re-colored by position to accentuate the movie's effect.

The triplot is a plot that allocates variation among players to three variables; in this case Errors, PutOuts, and Assists. Players who have an exactly average ratio of the three variables to each other will be drawn in the center of the triangle. If they have more of one variable, then they will be closer to that variable's corner of the triangle.
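To make the geometry concrete, here is a minimal sketch of how one player could be placed in such a triangle. It assumes each variable is first divided by its mean across players (so that an average player lands in the center) and then converted to shares that sum to one; the exact scaling EDV uses is not stated here, and the function name, corner layout and numbers are our own illustrative choices.

```python
import math

def triplot_point(values, means):
    """Map (Errors, PutOuts, Assists) to x, y inside a unit-side triangle.

    Assumed corner layout: Errors at (0, 0), PutOuts at (1, 0),
    Assists at (0.5, sqrt(3)/2).
    """
    # Scale by the means so a player with an average ratio gets equal shares.
    scaled = [v / m for v, m in zip(values, means)]
    total = sum(scaled)
    if total == 0:
        return None  # a player with no fielding activity cannot be placed
    errors, putouts, assists = (s / total for s in scaled)
    # Barycentric combination of the corners; the Errors corner is the
    # origin, so its share does not appear in the formulas.
    x = putouts + 0.5 * assists
    y = (math.sqrt(3) / 2) * assists
    return x, y

# Hypothetical means across players (NOT from the data set).
means = (8.0, 290.0, 120.0)

average_player = triplot_point((8.0, 290.0, 120.0), means)
putout_only = triplot_point((0.0, 500.0, 0.0), means)
print(average_player)  # close to the centroid (0.5, sqrt(3)/6)
print(putout_only)     # the PutOuts corner
```

Under this scheme a first baseman with many PutOuts and few Assists would sit near the PutOuts corner, while an average player sits at the centroid.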

The separation into two groups shows that there are two different types of fielders; there is a strong distinction between fielders with many PutOuts and those with many Assists. There are a few points in the middle area, but these turn out to be either Utility fielders or Designated Hitters.

In EDV we can animate over the Position bar chart as this movie shows. Not only does this show us how fielding stats are determined by position, but it also shows how these relate to fielding errors.

Correlation graph

How did we know to look for the effects we have displayed in the above sections? The figure here shows how we initially looked at the data. We created a graph that robustly correlates variables of all types and indicates whether or not there is any association. This graph is then displayed via the NicheWorks component of EDV. The resulting figure shows the strength of associations among variables.
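The robust measure behind the graph is not spelled out here. As a stand-in for numeric variables only, the sketch below uses Spearman rank correlation, which is a common outlier-resistant choice but is our assumption and not necessarily what EDV computes. The helper names and the example values are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    """Ordinary product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Hypothetical values with one extreme point (NOT from the data set).
years = [1, 2, 3, 4, 5]
salary = [2, 4, 8, 16, 1000]
print(round(spearman(years, salary), 3))
```

Because only the ordering of the values matters, a single extreme salary does not distort the measure, which is the sense in which a rank-based correlation is robust. Pairs whose absolute correlation exceeds a chosen threshold would become the links of the graph.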

The strongest links to Log(1+Salary) are to the Years variable and then to sets of batting statistics - most strongly to the highly inter-correlated set of Runs/Year, Hits/Year and AtBats/Year. The links to HR (home runs) and to Walks are much weaker. In fact, looking at the graph, we might suspect that RBIs are more important than home runs, except for the fielding statistics.

The fielding statistics form their own group away from the hitting statistics. One weakness of the method we use for correlations is that it cannot detect the interesting pattern in the PutOuts-Assists association, although it flags both of them as strongly dependent on Position.


This analysis has given a good initial picture of the relationships among the variables. We would go on to suggest building quantitative models and confirming the information we have discovered in the data. These results include:
  • The distribution of salary
  • The effect of career length on salary
  • The lack of difference between leagues
  • The relative lack of importance of home runs
  • The 4-way interaction of the fielding stats, errors and position
gwills@research.bell-labs.com