Ralph Melton (ralphmelton) wrote,

2000 Games of Pandemic

I have been an enthusiastic fan of the board game Pandemic since I first played it in 2008. I like it because it involves cooperation and careful planning, which makes it a good way to talk to people through a game. For several years, I’ve been playing a lunchtime game once a week with friends. In 2013, Z-Man Games came out with a version of Pandemic for the iPad. It doesn’t have networked multiplayer capabilities, but it works very well as a solitaire game in which one person controls all the players in the game. I’ve played the iPad version extensively, and I started recording my games as a little science project to see if I could use experiments to support our debates about which Roles were strongest. (In many years I judge science fair projects for the Pennsylvania Junior Academy of Science for the school where Lori taught; I think that investigating a game like this would be a totally legitimate and interesting science fair project.)

From the first 2000 games I’ve recorded, I have discovered that the Roles in Pandemic and its expansion On the Brink are balanced better than I had expected.

(I wrote this in Pages, and then pasted it to LiveJournal - but LJ ate my tables. If you want to see the original version, ask me and I'll send you a PDF.)


Hypotheses

The science fair projects of Mrs. Chirdon’s students start by stating a hypothesis at the very beginning, before they even define their terms. But I’m going to set the stage a bit before stating my hypotheses.

Pandemic is a cooperative game. The players work together to try to discover Cures for four diseases that are threatening the world. On each turn, a player gets four actions they can use to move around the board, treat diseases, share knowledge, or discover Cures. They then draw cards from the player deck, then place new infection cubes on the board. Each player has a different Role with a different special ability.

There are three types of cards in the Player deck. Most player cards are city cards; a particularly important use of city cards is to discard five city cards of the same color to discover a Cure for the disease of that color. The player deck also includes Special Event cards that can be played out of turn for special benefits. The third type of player card is an Epidemic. These cards add disease cubes to a new city on the board, and the discarded cards from the Infection deck are then shuffled and played on top of the Infection deck, so that the same cities are likely to be infected again. In general, the more Epidemics there are in the deck, the harder it is to win.

When a city with three cubes is Infected, it has an Outbreak; instead of getting a fourth cube, all of its neighbors are infected with one cube. This may cause them to Outbreak as well in a chain reaction.

When a Cure has been discovered for a disease and all cubes of that disease have been removed from the board, that disease is Eradicated; no cubes of that color will appear for the rest of the game.

I assumed that changing the number of players or the number of Epidemic cards in the deck might significantly change the game; therefore, I’m going to use the shorthand XPYE to describe a game with X players and Y Epidemic cards in the deck.

I had a few hypotheses I was hoping to prove:


  • Pandemic is easier with fewer players. In particular, 3PYE is easier than 4PYE. We’d formed this conclusion early on in our Pandemic history, and had taken it for granted since – but it was worth validating.

  • The Epidemiologist Role is weaker than most Roles. In particular, it is weaker than the Scientist Role. The Scientist can discover a cure with four city cards of the same color, instead of the usual five. The Epidemiologist can take a card from another player in the same city once per turn as a free action. I thought that the limit on how often the Epidemiologist could take cards and the fact that the Epidemiologist still had the limit of seven cards in hand made the Epidemiologist clearly inferior to the Scientist. (The original version of the Epidemiologist had been even weaker, and we had boosted it with a house rule in the group I usually play with.)

  • The Troubleshooter Role is significantly weaker than most Roles. The Troubleshooter has two abilities; they can fly to a city by simply showing the card for that city (unlike normal players who have to discard the card to make a direct flight), and they can see upcoming Infection cards at the beginning of their turn. I thought that the second power was very weak, because it seemed rare that that would provide information that would change the actions you take. (Even knowing that a city would have an Outbreak only helps if you’re able to move to that city.) In our play group, we had boosted the Troubleshooter with a house rule to look at the upcoming cards on every player’s turn, not just their own.

  • The Field Operative Role is significantly weaker. The Field Operative’s special ability is that it can collect samples of disease cubes as it treats disease, and then it can Discover a Cure with three cubes of one color and three city cards of the same color. But it can only collect a sample once per turn, and that restriction seemed too strict; it seemed that if you made a bad choice about what color to collect early on, it would be hard to catch up.

Methods

I played a bunch of games of Pandemic on the iPad, and recorded which roles I used and whether I won or lost. For each game that I recorded, I chose the number of players and the number of Epidemics beforehand, but used randomly selected roles.

I started recording games on December 4, 2013. On December 13, 2013, the makers of the iPad game released a new version with the Roles and Special Events from the Pandemic: On the Brink expansion. I bought those Roles and Special Events, and I’ve been playing with them ever since.

One critique of my methods is that I have not recorded quite all my games. There are a few reasons that I have not recorded games:


  1. There have been games that I have decided not to record before beginning play. For example, when I was playing games sitting by Lori’s hospital bed, I decided beforehand that I was probably not playing at my best, and therefore chose not to record them. In early December 2015, I didn’t record several games, because I had declared to myself that at 2000 games, I would write up my results, and I didn’t have time to write then. So analyzing this data to draw conclusions about when I have played iPad games of Pandemic would be dubious, but these unrecorded games should not cast doubt on the conclusions about Pandemic itself.

  2. There have been some games that I have decided not to play or record, because the set of Roles has been very similar to that of a game that I’ve just played. For example, if I play one game with Scientist, Researcher, and Troubleshooter, and then my next game starts with Epidemiologist, Researcher, and Troubleshooter, I might decide that I don’t want to play such a group twice in a row and end the game before the first turn. This has happened about a dozen times. This might mean that the distribution of roles in my games is not quite random, but I think that this has happened rarely enough that there should be no strong effect.

  3. There have been a dozen or two games that I have quit in mid-game because I failed to execute my intended plan. The most common way this happens is that I plan to move the Epidemiologist to another player and take a card, do move the Epidemiologist into place, and then forget to take the card before I end the action phase of the Epidemiologist’s turn. If I were playing with my friends, we would fix that error: “You said beforehand that you were moving into position in order to take the card, so here you go.” But the iPad version won’t let you undo once you’ve drawn cards, so there’s no way to fix that error. I think that not recording those games gives data that more accurately represents the results I’d get with my friends, but I still have some qualms about those abandoned games.

Another potential critique is that I might have improved my Pandemic skills over the course of all these games. I started playing the iPad version after five years of playing Pandemic once a week or thereabouts, so I was well past the initial part of the learning curve. However, the Contingency Planner and Quarantine Specialist Roles only became available in 2013, so it is possible that I encountered a learning effect with those Roles. I also only started playing five-player games in September 2014, so it is possible that there was a learning effect for the particular complications of five-player games.

One more critique is that my Pandemic play might not be representative of other players. Perhaps there is some point of strategy that I consistently overlook, or I’m too frugal in my use of Special Events or something like that. I have no controls against such bias in this project, because involving other players would have made this more work.

I also note that I’ve discovered occasional errors in my spreadsheet of results. For example, I’ve noticed occasional games for which I’ve recorded both a victory and a type of failure, an obvious contradiction. I’ve fixed some of these errors when I felt certain of the correct answer, but there have been some errors that I did not have enough data to fix, and presumably some errors that I haven’t noticed at all.

Results

The first thing I learned from this project: it takes a lot more data points to gather meaningful statistics than I had expected. I had thought that with several hundred games, I’d be able to prove hypotheses about single Roles being weak, and start to be able to investigate theories about interactions with multiple Roles. But after 346 games of 4P5E, I had no measurable differences in the performance of any Roles.

So I changed directions and tried to confirm a hypothesis I considered obvious: Hypothesis: Increasing the number of Epidemics makes the game harder. It didn’t take many 4PxE games to confirm this:

(All confidence intervals are at p < 0.05.)

This is not a surprising result; I have a straightforward theoretical model of why increasing the number of Epidemics makes the game harder, and I have anecdotal data to support this. But it’s nice to show that it is possible to confirm the obvious.

(The other thing I learned with this exercise is that I prefer to play with a higher chance of winning. I usually play at a difficulty higher than 5 Epidemics when I play with my friends, but I didn’t enjoy playing 4P6E on the iPad. It may be that I’m more tolerant of failure when playing with my friends, or it may be that our house rules make a significant difference in our success rate. Our house rules: we deal each player two Roles and choose one of those Roles in collaboration with the other players, and we choose collaboratively which player goes first.)

From there, I turned to investigating the hypothesis that 3P5E is easier than 4P5E. I started off with a very strong streak that made it look very likely that the hypothesis would be confirmed. But I wanted to continue until it was confirmed at p < 0.05, and as I kept playing I started losing more often. After more than 500 games, my success percentages are nearly the same, at 84.3% ± 3.0% for 4P5E and 84.5% ± 3.0% for 3P5E.

But you should notice a lexical subterfuge there: I’ve stated a hypothesis about some games being easier, but the data I’ve presented is about what games are more winnable. It includes all the games that I barely win through extensive experience. Does that really measure what’s easier?

The statistics I’ve collected don’t record whether a game was easy, so I tried to define “easy” in terms of the statistics I recorded. I defined a proxy for an easy game by this logic:
A game is hard if I feel like I barely win, i.e., almost lose.
There are two ways to recognize that I almost lost a game:
1. The game is lost if the number of Outbreaks reaches 8. So a game that is threatening to hit that number of Outbreaks feels harrowing. I chose a threshold of 6 Outbreaks as marking a game that was nearly lost through too many Outbreaks.
2. The game is lost if a player needs to draw Player cards and there are not enough cards to draw. When the Player deck gets low, we start to worry and count out how many turns each player will get, in order to make sure that they will get enough turns to find the last Cures. I chose a threshold of 7 cards or fewer to identify a game that was nearly lost, because in a four-player game, that threshold means that the game is won on the last possible turn for some player.
Therefore, a game is easy if it is not nearly lost in either of those ways. (There is a third way to lose, by running out of cubes for one disease. The statistics I’ve recorded do not let me track how close I come to losing in that way. However, I know from experience that running low on cubes is strongly correlated with having many Outbreaks, so the first condition captures almost all of these cases.)
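The proxy above can be written down as a small function. This is only a sketch: the record fields (won, outbreaks, player_cards_left) are hypothetical names rather than the columns of my actual spreadsheet, and treating a lost game as "not easy" is an assumption that the definition above implies but does not state.

```python
from dataclasses import dataclass

# Thresholds taken from the definition above.
OUTBREAK_THRESHOLD = 6  # 6 or more Outbreaks counts as nearly lost (the game is lost at 8)
CARD_THRESHOLD = 7      # 7 or fewer Player cards left counts as nearly lost


@dataclass
class GameRecord:
    # Hypothetical field names, for illustration only.
    won: bool
    outbreaks: int
    player_cards_left: int


def is_easy(game: GameRecord) -> bool:
    """Return True if the game was won without coming close to either tracked loss."""
    if not game.won:
        # Assumption: a lost game is never counted as easy.
        return False
    nearly_lost_by_outbreaks = game.outbreaks >= OUTBREAK_THRESHOLD
    nearly_lost_by_cards = game.player_cards_left <= CARD_THRESHOLD
    return not (nearly_lost_by_outbreaks or nearly_lost_by_cards)
```

For example, is_easy(GameRecord(won=True, outbreaks=5, player_cards_left=8)) is True, while a win with exactly 7 cards left is counted as hard no matter how calm the rest of the game was.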

This proxy is inaccurate in both directions; there are easy games which this criterion would consider hard (such as a game with multiple Eradications, no Outbreaks, and a straightforward Cure of the last disease with 7 cards left), and there are hard games that this criterion would consider easy (such as a game that’s almost out of cubes of one color with five Outbreaks). But it is in the ballpark and it is objective.

Although my success ratios are nearly equivalent for 3P5E and 4P5E, I am significantly more likely to have an easy game with 3P5E. So perhaps there’s some merit to my early hypothesis about 3P5E being easier.

On the other hand, if this is a trend, it does not seem to extend to 2P5E. I didn’t play quite enough games of 2P5E to get a statistically significant result, but it certainly seems that I am less likely to win 2P5E than 3P5E.

In September 2014, it finally happened that all five of our usual Pandemic group were able to play on the same day. That turned my attention to five-player games. I had been timid about five-player games, because I thought that having five players would make it much harder to collect five cards of a color in a single player’s hand. However, I was wrong; it turns out that I am more likely to win 5P5E than 4P5E. I think there are two factors that explain this:
1. Though cards do get more distributed, the limit on the number of cards in a player’s hand is less of a burden.
2. The player deck includes two Special Event Cards per player, so 5P5E provides a lot of cards to mitigate problems.

At the beginning of my 5P5E play, I thought that the Contingency Planner was going to be exceptionally good in 5P5E play; the Contingency Planner’s special ability is to pick up Special Event cards that have been used in order to use them again, and with ten Special Event cards in the deck, there’s a lot of potential for that to work out well. The first 44 5P5E games with the Contingency Planner were victories, and I thought I was on track to demonstrate superiority there – but the Contingency Planner has now regressed to the mean. I have no significant evidence that any role is better than any other in 5P5E.

I also thought that the Field Operative would be particularly hampered in 5P5E. It takes three turns of sampling for the Field Operative to get a Cure, and if the Field Operative is not the first or second player, he only gets five turns total. But the Field Operative’s performance is in the middle of the pack, not distinguishable from everyone else. It seems that he either gets one Cure easily, or two Cures with a lot of help and support - and that’s enough to do his share.

Another interesting result with 5P5E: I get more Eradications with 5P5E than with fewer players. A lot more - 71% more Eradications than with 4P5E. (I will try to get Eradications even when it will not help win the game, but that’s true in games with fewer players as well.) I believe this is due to a combination of more players having greater mobility to get to the final cubes for Eradication, and having more Special Events to help bring off an Eradication.

An interesting but statistically suspect observation: the greater number of Eradications also shows up at the far end of the distribution, with games in which I manage to Eradicate all the diseases. Eradicating all the diseases is very tricky, because the iPad implementation ends the game immediately when a Cure is discovered for the last disease; in order to Eradicate the last disease, you have to eliminate all the cubes of that color with or before the action that discovers the Cure. Before I started playing 5-player games, I managed to get a complete Eradication only once in 1357 games; with 5P5E, I’ve managed to get complete Eradications 14 times in 643 games.



Conclusions and Further Work

With over 500 games each of 3P5E, 4P5E, and 5P5E, I have found no evidence that any Role outperforms any other. This is a great surprise to me, because I felt certain that some Roles were excellent and some were weak. I salute the designers for balancing the Roles so well.

For future work, I’d like to measure the effects of playing with a Role that was obviously weaker. Consider a “Civilian” Role with no special abilities. This Role would obviously be inferior to any Role that did have special abilities; how many games would it take to prove statistically that it was inferior? I have considered trying to simulate this by marking one player as a Civilian and never using any special powers. But the Medic, Containment Specialist, and Quarantine Specialist have powers enforced by the game, so it would be hard to play one of those as a Civilian.

But in the immediate future, I’m more likely to investigate another path: The creators have just added the Virulent Strain Challenge to the game. I assume that I’m less likely to win with the Virulent Strain Challenge than without – but how much more difficult is it? In particular, how does the difficulty of a five-Epidemic game with Virulent Strain compare to the difficulty of a six-Epidemic game?

One final conclusion: I am still really enjoying playing Pandemic. And I have some numbers that shed light on why. It comes down to the difference between victorious games and easy games. Even with my years of experience, over half of my games (with 5P5E) land in the zone where I win, but I feel I win only with cleverness and a bit of luck. That is my sweet spot for cooperative games, and Pandemic hits that sweet spot again and again.

Tags: pandemic
  • 11 comments
I am impressed with your enterprise, sir. And with the designers' intuition and playtesting. If I were judging your science fair project I would add credit for your digging into "games won" versus "hard games" (though I don't know if it fits the Official Scientific Method Rubric).

Have you estimated, if roles do have different performances by like 2% in success rate, how many games would it take to detect that? I'm not sure how to... let's see, if you always played from just two role sets, XYZA versus XYZB, then it would be distinguishing two binomial distributions. But comparing "everything with A" and "everything without A" is more complicated, because the 'everything' has (11 roles choose 3) possibilities so you have a blend of um 165 individual distributions with A.

Playing just XYZA versus XYZB (with A and B picked because you think one's powerful and the other's not) might get you more statistical power from a given number of games played? The idea being it isolates the hypothesized ±2% coming from A/B, rather than also adding in a possible ±2% for each of the other three slots. (The other stuff can be seen through eventually, but with 165 possibilities it seems like it could take many hundreds or even thousands of games.) Pity it would be so boring.
Oh hm, for the boredom problem, you could generate matched pairs. Pick three random roles xyz, play two games, xyzA and xyzB. That should control the 'other' roles between your A and B conditions, but give more variety. The games in the pair don't have to be played back-to-back, either.
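A rough sketch of how such matched pairs could be drawn, in Python. The function name matched_pair and the role list are illustrative assumptions (the list holds only Roles named in the post, not the full roster), and nothing here is tied to how the iPad app actually deals Roles.

```python
import random

# Illustrative list: only Roles named in the post, not the complete roster.
ROLES = [
    "Scientist", "Epidemiologist", "Researcher", "Troubleshooter",
    "Field Operative", "Contingency Planner", "Quarantine Specialist",
    "Medic", "Containment Specialist",
]


def matched_pair(roles, role_a, role_b, others=3, rng=random):
    """Draw `others` random Roles (excluding A and B) and return two line-ups
    that differ only in the last slot: one with A, one with B."""
    pool = [r for r in roles if r not in (role_a, role_b)]
    shared = rng.sample(pool, others)
    return shared + [role_a], shared + [role_b]


# Example: one matched pair of four-player line-ups.
game_a, game_b = matched_pair(ROLES, "Scientist", "Epidemiologist")
```

The two games in a pair share the same three other Roles, so any difference between them is down to A versus B (plus the luck of the shuffle), and each new pair gets a fresh random trio for variety.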
I can do some estimating. Let me take one example: In 4P5E, the overall success ratio is 84.1% +/- 3.0% after 578 games and the worst-performing Role is the Troubleshooter at 77.1% +/- 6.2% after 179 games. Let's assume those percentages don't change as I keep playing. How many games would it take to have a statistically significant difference at p < .05? (I recognize this is not exactly the question you asked.)

I'm pretty sure that the confidence interval is proportional to 1 / sqrt(N), though I don't remember that well enough to be certain. I want to shrink a gap of (3.0% + 6.2%) down to 7%. So total number of games is ((3.0% + 6.2%) / 7%)^2 * 578 = 998 games.

That might be tractable. I'm halfway there.

But if the Troubleshooter were only 2% worse than average (but the confidence intervals were the same), it'd take 12,230 games to prove it. That's more than I'm likely to play.

And I'm also pondering the question, "if it takes 1000 games to prove that the Troubleshooter is worse than average, does it matter that it's worse than average?" On the one hand, 1000 games seems to be in the "too many to care". But a 7% difference would mean that the Troubleshooter loses about 1 game in 14 that another team might have won, which seems it might be care-worthy.
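Here is that back-of-the-envelope estimate as a short Python sketch. It rests on the same assumption as above, that the interval width shrinks in proportion to 1 / sqrt(N) and that the measured percentages stay put; the function name games_needed is made up for illustration.

```python
def games_needed(current_games, current_gap, target_gap):
    """Scale the game count so a combined interval width of `current_gap`
    shrinks to `target_gap`, assuming width is proportional to 1 / sqrt(N)."""
    return current_games * (current_gap / target_gap) ** 2


# Combined half-widths from the 4P5E numbers above: 3.0% + 6.2%.
print(round(games_needed(578, 3.0 + 6.2, 7.0)))  # about 998 games for a 7% difference
print(round(games_needed(578, 3.0 + 6.2, 2.0)))  # about 12,230 games for a 2% difference
```

Because the count grows with the square of the ratio, shrinking the detectable difference from 7% to 2% multiplies the number of games by roughly twelve.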
sqrt(N) is the expected "distance from home" of an N-step +/-1 random walk, so that sounds like the right asymptotic...

Off on a side track, I'm puzzled about whether the whole "noise added by the other three slots" thing I was on about is a real thing. The argument against is, isn't "given Role A, do we win?" *some* Bernoulli process? So what if it's done by random choice of other roles and then depending on those -- it still boils down to success some P% of the time.

Let's see, concretely, it would be whether the confidence interval on e.g. the success is what you'd get for a p=0.841 binomial, or whether it's actually a wider CI? If success is a plain old binomial distribution and we've got 486 / 578 = 84.1%, I get the 95% interval is +/- 3.0%. Is that calculation the same way you got your +/- 3.0% too?

I think the "argument against" is right. But I'd have to expand an example to convince myself. And I still don't get what's wrong with the intuition for there being a real effect because of summing non-IID Bernoullis.
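For reference, the plain-binomial interval mentioned above can be sketched with the usual normal approximation. This is one common way to compute it, not necessarily the exact calculation used on either side of this thread; the function name wald_half_width is illustrative.

```python
import math


def wald_half_width(wins, games, z=1.96):
    """Half-width of a normal-approximation (Wald) confidence interval for a
    win rate; z = 1.96 corresponds to a 95% interval."""
    p = wins / games
    return z * math.sqrt(p * (1 - p) / games)


half_width = wald_half_width(486, 578)
print(f"{486 / 578:.1%} +/- {half_width:.1%}")  # 84.1% +/- 3.0%
```

The only inputs are the number of wins and the number of games, which is the sense in which a biased coin flip has no features beyond its probability.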

I am not completely following this comment, I fear.

I think that it is the way I got my +/- 3.0%. I used a spreadsheet formula CONFIDENCE(0.05,Overall Standard Error,Overall Games Played), so I haven't looked into the specifics of the implementation.

Once I sort out some glitches, I will share my spreadsheet.
I've thought about this some more, and I think I have some more perspective about the "noise added by the other three slots".

I think that if you want to answer the question "is A better than B", the best way is to play a series of XYZA and XYZB games. For one thing, this eliminates all the games that have both or neither of A and B, which do not add comparative power.

But to answer the question "is A better / worse than average", I think it's legitimate to compare all games using A against all games. But it might take longer for the confidence intervals to shrink than a more tightly-focused experiment.
That matches my intuition. XYZA / XYZB testing *ought* to be more effective and less 'noisy' than ???A / ???B testing.

But then I have a counter-argument I don't know how to poke a hole in: yes playing ???A involves first picking roles randomly and then playing a game, but if you put that whole process in a black box and wait for the success/failure light to blink, it's just a biased coin flip. So is ???B, and XYZA and XYZB. So that says the statistical testing and the confidence interval should work along the same lines for all.

I don't know which to believe. I know that I'm likely to err in making a statistical argument, but I'm likely to err in my statistical intuition too!
I have similar mistrust. I wish my friend with a PhD in statistics wasn't busy with more important things.

I think that your counter-argument says that the ???A / ???B testing should still give valid results (which I agree with), but I *think* that your counterargument does not speak to how fast those confidence intervals should converge.

Here's a current statistical quirk I'm considering:
Suppose I divide the sample population into two groups, A and not-A. (For simplicity, assume those groups are the same size.)
P(win|A) is measured at 93% +/- 3%.
P(win|not-A) is measured at 87% +/- 3%.

From this, we can conclude that P(win|A) > P(win|not-A).
But P(win) is measured at 90% +/- 2ish%.
So we can't conclude that P(win|A) > P(win).

But since P(win) = P(win|A or not-A), those two statements "P(win|A) > P(win|not-A)" and "P(win|A) > P(win)" must be both true or false.

I think that this means that I could strengthen my analysis by testing P(win|A) vs P(win|not-A) instead of P(win|A) vs P(win).
With checking P(win|A) vs P(win|not-A), I see one difference that is significant at p < .05: in 4P5E, P(win|Troubleshooter) = 77.9% +/- 6.0%, p(win | not-Troubleshooter) = 87.3% +/- 3.3%.

But I don't have a clear model of why that should be true with 4 players but not with 3 or 5. And with 13 Roles x 3 numbers of players, you'd expect a couple of things to appear significant at p < .05 that aren't actually true. So I remain suspicious that Troubleshooter is a little weak, but I don't think this is proven yet.
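Both halves of that reasoning can be sketched in a few lines of Python. Two assumptions to flag: checking that the intervals do not overlap is a conservative stand-in for a proper significance test (not necessarily the test behind the numbers above), and the false-positive count assumes every Role is truly average and the comparisons are independent. The function names are illustrative.

```python
def intervals_overlap(p1, half_width1, p2, half_width2):
    """True if [p1 +/- half_width1] and [p2 +/- half_width2] share any point."""
    return (p1 - half_width1) <= (p2 + half_width2) and \
           (p2 - half_width2) <= (p1 + half_width1)


def expected_false_positives(num_tests, alpha=0.05):
    """Expected number of comparisons that look significant by chance alone,
    assuming no real effects and independent tests."""
    return num_tests * alpha


# 4P5E: Troubleshooter at 77.9% +/- 6.0% vs. everyone else at 87.3% +/- 3.3%.
print(intervals_overlap(77.9, 6.0, 87.3, 3.3))  # False: the intervals just miss each other
print(expected_false_positives(13 * 3))         # about 1.95 spurious "hits" expected
```

With roughly two chance hits expected across 39 comparisons, one barely-separated pair of intervals is about what noise alone would produce.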
"I *think* that your counterargument does not speak to how fast those confidence intervals should converge."

I believe it should be identical, the CIs bounding a biased coin-flip IID process, whether the underlying "coin" is a ???A or a fixed XYZA -- insofar as their probability is the same, which we believe is close. I.e. I think a biased coin-flip is 'featureless' except for its probability parameter.


Your quirk seems reasonable the way you've laid it out there -- that a "one extreme / other extreme" experiment is clearer than a "one extreme / everything".
Also, good question! Thank you for making me think harder about this.