Tuesday, February 10, 2009

Further details on baseball+game theory

Okay, so, gathering the data I need to pull this off is going to take a while, so I'd like to start by introducing the methodology I have in mind. My current plan is to begin with a basic model in which the pitcher's available actions are to throw each of his pitches either in or out of the zone, so the total number of actions is twice the number of pitches he has, and the hitter's available actions are to swing or not to swing. This is, admittedly, a simplified model, but I think it's a good starting point.

If this goes well (once I have real data, that is), there are two areas I might move on to. First, further subdividing the pitcher's actions into throwing pitches at different parts of the strike zone. Second, giving the hitter actions like, "Swing if the pitch is located at a certain position x amount of time after the pitcher's release"; in other words, trying to incorporate the fact that the hitter can decide whether or not to swing based on where the pitch seems to be heading, but only shortly after it leaves the pitcher's hand. This data is all available from pitch-f/x.
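To make the size of the basic model concrete, here is a minimal sketch that enumerates the action sets. The names `pitcher_actions`, `action_pairs`, and `HITTER_ACTIONS` are made up for illustration, "take" is used as shorthand for not swinging, and the two-pitch repertoire is only an example.

```python
from itertools import product

# "take" is shorthand for not swinging.
HITTER_ACTIONS = ("swing", "take")

def pitcher_actions(pitches):
    """Pair every pitch type with a target: in or out of the strike zone."""
    return list(product(pitches, ("in", "out")))

def action_pairs(pitches):
    """Every (pitcher action, hitter action) cell the payoff matrix needs."""
    return list(product(pitcher_actions(pitches), HITTER_ACTIONS))

repertoire = ["fastball", "curveball"]  # hypothetical two-pitch pitcher
print(len(pitcher_actions(repertoire)))  # twice the number of pitches
print(len(action_pairs(repertoire)))     # cells in the payoff matrix
```

Subdividing the strike zone later would only mean replacing the `("in", "out")` tuple with a longer list of targets.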

Once this model is set up, a utility has to be found for every possible combination of actions. The great thing about building this model for baseball is that the utilities are actually very easy to find. They can simply be measured in runs, because each possible outcome of a pitch has a specific run value (this is, essentially, the basis of the runs100 system John Walsh uses at The Hardball Times). Every pitch will end up as some sort of hit (a single, double, triple, or home run), some sort of out (groundout, flyout), a ball, or a strike. (I'm ignoring, for simplicity's sake, things like hit-by-pitch, catcher's interference, etc.) The value of each of these events can be calculated in runs by looking at how the event changes the run-scoring expectancy. Here is a table of the values of the various hits, and the out, from Tom Tango's excellent The Book:



Establishing a value for balls and strikes is slightly more difficult, but still doable. Every count has a different run expectancy; not surprisingly, "hitters' counts", with more balls than strikes, increase the run expectancy, while "pitchers' counts" have the opposite effect. The value of a ball or strike can then be measured as the extent to which it changes the run expectancy of the count. John Walsh used this approach in developing runs100, and here are the values he came up with:





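The idea of valuing a ball or strike by the change in the count's run expectancy can be sketched as a small function. This is only an illustration, not John Walsh's actual procedure: `count_re`, `walk_value`, and `strikeout_value` are hypothetical inputs that would have to come from real data, all measured on the same run scale, and foul balls are ignored.

```python
def pitch_values(count_re, count, walk_value, strikeout_value):
    """Return (value of a ball, value of a strike) at the given count.

    count_re maps (balls, strikes) to the run expectancy of that count.
    Ball four and strike three end the plate appearance, so they are
    valued with the walk and strikeout run values instead.
    """
    balls, strikes = count
    here = count_re[count]
    after_ball = walk_value if balls == 3 else count_re[(balls + 1, strikes)]
    after_strike = strikeout_value if strikes == 2 else count_re[(balls, strikes + 1)]
    return after_ball - here, after_strike - here
```

Calling it for each of the twelve counts would give a ball value and a strike value per count, which is the shape of table described above.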
Now, it would be more useful to build a separate payoff matrix for each count, and that's the sort of thing I plan to do eventually, but for now, I'm going to come up with an average ball/strike value by averaging these numbers. This is technically incorrect, because some counts are rarer than others, but I don't have those relative frequencies offhand, and, again, this post is about methodology rather than correct numbers. This is what I came up with:



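For reference, the difference between the simple average used here and the frequency-weighted average that would be more correct can be sketched as follows. Both function names are made up, and `count_freq` stands for the relative frequencies of each count, which this post does not have.

```python
def simple_average(values_by_count):
    """Unweighted mean across counts: what this post uses."""
    return sum(values_by_count.values()) / len(values_by_count)

def weighted_average(values_by_count, count_freq):
    """Mean weighted by how often each count actually occurs."""
    total = sum(count_freq[c] for c in values_by_count)
    return sum(v * count_freq[c] for c, v in values_by_count.items()) / total
```

The two agree only when every count is equally common, so the simple average gives rare counts more say than they deserve.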
Now that we have a value for each outcome, the next step is to find the probability of each outcome, given each combination of actions from the pitcher and the hitter. This is where pitch-f/x data would come in. As I haven't waded through that swamp yet, I'm going to make up data for a fictional pitcher, Pitcher X, to illustrate the principle. Let's say Pitcher X has two pitches, a fastball and a curveball, and he can aim either of them in or out of the zone. Let's look first at the possibilities when the hitter doesn't swing, as that is simpler.

If the hitter doesn't swing, the pitch is clearly either a strike or a ball, and the outcome depends on where the pitcher meant to throw the ball and how accurately he does so. Let's say that when the pitcher aims his fastball at the strike zone, it ends up there 70% of the time, and when he aims it outside the zone, it ends up outside 90% of the time. Let's also say his curveball ends up in the zone only 50% of the time when he aims it there, and out of the zone 70% of the time when he aims it out. Therefore, if he attempts to throw a fastball in the zone and the batter doesn't swing, it will be a strike 70% of the time, and a ball 30% of the time. The percentages for the other possibilities can be found similarly.
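The no-swing case can be written out as a short sketch using the made-up accuracy numbers above. The names `ACCURACY` and `take_outcomes` are only illustrative.

```python
# Made-up accuracy figures from the post: the chance a pitch ends up
# where the pitcher aimed it, keyed by (pitch, intended target).
ACCURACY = {
    ("fastball", "in"): 0.70,
    ("fastball", "out"): 0.90,
    ("curveball", "in"): 0.50,
    ("curveball", "out"): 0.70,
}

def take_outcomes(pitch, target):
    """Strike/ball probabilities when the hitter does not swing."""
    on_target = ACCURACY[(pitch, target)]
    # Chance the pitch actually ends up in the zone.
    p_in_zone = on_target if target == "in" else 1 - on_target
    return {"strike": p_in_zone, "ball": 1 - p_in_zone}

for pitch, target in ACCURACY:
    print(pitch, target, take_outcomes(pitch, target))
```

So a fastball aimed out of the zone and taken is a ball about 90% of the time, and a curveball aimed in the zone and taken is a coin flip.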

Now, if the batter swings, clearly things get more complicated. First of all, we have to take into account that, as explored above, the pitcher only locates his pitches properly some of the time. Next, we have to find the probability the hitter makes contact at all. I'll say that when Pitcher X throws his fastball in the zone, hitters make contact 90% of the time; when he throws it out of the zone, they make contact 40% of the time; and when he throws his curveball in the zone, contact is made 70% of the time; out of the zone, 20%.

Continuing with this ream of made-up data, suppose that 70% of the time when batters make contact on any of Pitcher X's pitches, the result is a groundout or flyout. Of the remaining 30%, which are all hits, 50% are singles, 22.5% are doubles, 22.5% are home runs, and 5% are triples. This gives us this chart of probabilities:


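The arithmetic behind a chart like this can be sketched as follows, using only the made-up numbers from this post. Three modeling choices here are assumptions rather than anything stated outright: the contact rate depends on where the pitch actually ends up, a swing and a miss counts as a strike, and foul balls are ignored. All names are illustrative.

```python
# Made-up figures from the post.
ACCURACY = {("fastball", "in"): 0.70, ("fastball", "out"): 0.90,
            ("curveball", "in"): 0.50, ("curveball", "out"): 0.70}
# Contact rate keyed by (pitch, ACTUAL location): an assumption of this sketch.
CONTACT = {("fastball", "in"): 0.90, ("fastball", "out"): 0.40,
           ("curveball", "in"): 0.70, ("curveball", "out"): 0.20}
P_OUT_ON_CONTACT = 0.70
HIT_MIX = {"single": 0.50, "double": 0.225, "home_run": 0.225, "triple": 0.05}

def swing_outcomes(pitch, target):
    """Outcome probabilities when the hitter swings at an intended pitch."""
    on_target = ACCURACY[(pitch, target)]
    p_in_zone = on_target if target == "in" else 1 - on_target
    # Average the contact rate over where the pitch actually ends up.
    p_contact = (p_in_zone * CONTACT[(pitch, "in")]
                 + (1 - p_in_zone) * CONTACT[(pitch, "out")])
    outcomes = {"strike": 1 - p_contact,            # swing and miss
                "out": p_contact * P_OUT_ON_CONTACT}
    p_hit = p_contact * (1 - P_OUT_ON_CONTACT)
    for hit, share in HIT_MIX.items():
        outcomes[hit] = p_hit * share
    return outcomes

print(swing_outcomes("fastball", "in"))
```

Under these assumptions, a swing at a fastball aimed in the zone makes contact 0.7 × 0.9 + 0.3 × 0.4 = 75% of the time, so it is a swinging strike 25% of the time and an out 52.5% of the time.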

Now that we have a list of probabilities, we can combine that with the list of values for each outcome to create a payoff matrix. The table looks like this; the value in each cell is the utility (measured in runs) to the hitter; the pitcher's utility is just the same value times (-1).



As you can see, most outcomes are unfavorable to the hitter. This is not surprising, as hitters make an out 60-70% of the time they step to the plate, and an out has a negative run value.
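The step from probabilities and run values to a payoff matrix is a probability-weighted sum, which can be sketched as below. The function names are made up, and the run values in the example are loud placeholders, not the figures from The Book or runs100; only the 70%/30% split comes from this post.

```python
def expected_runs(outcome_probs, run_values):
    """Probability-weighted run value of one (pitcher action, hitter action) cell."""
    return sum(p * run_values[outcome] for outcome, p in outcome_probs.items())

def payoff_matrix(outcome_table, run_values):
    """Hitter's utility for every cell; the pitcher's utility is the negative."""
    return {cell: expected_runs(probs, run_values)
            for cell, probs in outcome_table.items()}

# PLACEHOLDER run values, invented for this sketch only.
placeholder_values = {"strike": -0.1, "ball": 0.1}

# One cell from the post: fastball aimed in the zone, hitter does not swing.
table = {(("fastball", "in"), "take"): {"strike": 0.70, "ball": 0.30}}

for cell, hitter_runs in payoff_matrix(table, placeholder_values).items():
    print(cell, "hitter:", round(hitter_runs, 3), "pitcher:", round(-hitter_runs, 3))
```

Filling `table` with all eight cells and swapping in the real run values would reproduce the full matrix.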

Now that we have a payoff matrix, we want to find the Nash equilibrium, and from it the frequency with which the pitcher should throw each pitch and the hitter should swing. Hopefully I'll get that up in a blog post tomorrow.
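As a rough preview of what that calculation might look like (a sketch, not necessarily the method the follow-up post will use): because the pitcher's utility is the hitter's times (-1), the game is zero-sum, and with only two hitter actions the hitter's equilibrium swing frequency can be found by maximizing his worst-case payoff. The function name and argument layout are hypothetical, and it returns only the hitter's side of the equilibrium, not the pitcher's pitch mix.

```python
def hitter_equilibrium(swing_row, take_row):
    """Return (swing probability, game value) for the hitter.

    swing_row[j] and take_row[j] are the hitter's payoffs against pitcher
    action j. With swing probability q, the payoff against action j is the
    line take_row[j] + q * (swing_row[j] - take_row[j]). The hitter wants
    the q in [0, 1] that maximizes the lowest of these lines; that maximum
    sits at q = 0, q = 1, or where two lines cross.
    """
    def worst_case(q):
        return min(q * s + (1 - q) * t for s, t in zip(swing_row, take_row))

    candidates = [0.0, 1.0]
    n = len(swing_row)
    for i in range(n):
        for j in range(i + 1, n):
            slope_gap = ((swing_row[i] - take_row[i])
                         - (swing_row[j] - take_row[j]))
            if slope_gap != 0:
                q = (take_row[j] - take_row[i]) / slope_gap
                if 0 <= q <= 1:
                    candidates.append(q)
    best_q = max(candidates, key=worst_case)
    return best_q, worst_case(best_q)
```

With four pitcher actions, each row would simply have four entries, one per column of the payoff matrix.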
