There’s a long-standing accusation that you can’t trust the ‘exclusive’ Day 1 reviews from professional (as opposed to unpaid amateur) gaming review sites. After all, those sites probably had to make a deal with someone to get the valuable exclusive – promising to give the game a high score in return, right? Or even if there isn’t outright bribery going on, the game critic reviewing a title at or just before launch isn’t really playing the game but rushing through it as quickly as possible to finish what they need for a front-page review and the potential eyeball boost it can bring.
This issue was highlighted recently with the debacle that was SimCity, where the early reviews – played on exclusive servers hosted for the very purpose of showing the game off to game critics – in no way reflected the play experience at release. Polygon came off particularly badly, revising its score three times from a high of 9.5 on launch day to a low of 4 in just three days (and later a fourth revision about a month later up to 6.5). Here’s a chart of SimCity’s review scores over time:
That event got me thinking: how often does the launch review differ from the reviews that are written later? Armed with a spreadsheet of over 8000 titles and their review scores listed on Metacritic that I’d previously played around with, it seemed like a good opportunity to find out.
Everyone Loves Data Analysis!
So, how to test it? I needed to know the dates that the reviews were released. Unfortunately not every review had that information recorded by Metacritic, so that particular requirement removed a lot of titles. However, 1830 titles DID have some kind of review dates listed, which is still a pretty healthy sample.
Next step was to compare the average of the first review scores (which I’ll call Day 1 reviews) against the average of the reviews that came later. This bit is dependent on Metacritic getting the review release dates right, but I’m assuming that they did. Also, I didn’t check whether any of these reviews were ‘exclusive’ to a games site or released ahead of an embargo, just that they were the first review(s) released by date for that title.
The third and final step was a test for statistical significance. Given the expected low number of reviews for some titles – I’d expected that many games may only have one Day 1 review – I knew that this kind of analysis wouldn’t be possible for a lot of games. For titles where a significance test could be applied, it meant differences too large to be explained by sampling error alone could be picked up. (For those who care about such things: since we are comparing two averages, I used a t-test set for one-tailed testing at the 95% confidence level and assuming unequal variance between the two review samples – i.e. Welch’s t-test.)
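As a rough sketch of what that test involves – with invented score lists, since the real per-title data isn’t reproduced here – a one-tailed Welch’s t-test can be computed by hand like this (in practice a statistics package would do it for you):

```python
import math
from statistics import mean, variance

def welch_t(day1, later):
    """Welch's t statistic and degrees of freedom for two score samples,
    assuming unequal variances (the test described above)."""
    n1, n2 = len(day1), len(later)
    v1, v2 = variance(day1), variance(later)   # sample variances
    se2 = v1 / n1 + v2 / n2                    # squared standard error of the difference
    t = (mean(day1) - mean(later)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = se2 ** 2 / ((v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1))
    return t, df

# Invented scores: three Day 1 reviews versus five later reviews
t, df = welch_t([90, 88, 92], [75, 78, 80, 77, 76])
# The Day 1 average is significantly higher when t exceeds the
# one-tailed 95% critical value for df degrees of freedom.
```

The split between the t statistic and the critical-value lookup is the only hand-waving here; a library call would return the p-value directly.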
Possibly Just Showing You What You Already Expect to See
So, are Day 1 review scores positively biased, returning higher scores than reviews that come later? It certainly looks that way.
Of the 1830 titles that could be analysed, Day 1 review score averages were higher than subsequent review average scores in 66% of titles (or for 1205 titles). In contrast, they were lower for 33% of games (or 594 titles) and equal for the remainder (2%, or 31 games).
This is a very broad test – after all, the Day 1 average review score might only be 0.01 points higher than the average of the rest of the reviews for it to be counted as ‘higher’ here – but a 2:1 ratio of higher versus lower Day 1 averages is indicative of a positive skew. If the split were closer to even, I’d conclude that there wasn’t a skew going on. Instead, we see that Day 1 review averages are twice as likely to be higher than the average of subsequent reviews as lower.
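For anyone wanting to replicate the tally, the counting itself is trivial; here’s a minimal sketch using invented per-title averages rather than the real Metacritic data:

```python
# Hypothetical (Day 1 average, subsequent average) pairs per title --
# illustrative numbers only, not the real dataset.
titles = {
    "Game A": (85.0, 78.0),
    "Game B": (70.0, 74.0),
    "Game C": (80.0, 80.0),
    "Game D": (91.0, 84.5),
}

higher = sum(1 for d1, rest in titles.values() if d1 > rest)
lower = sum(1 for d1, rest in titles.values() if d1 < rest)
equal = sum(1 for d1, rest in titles.values() if d1 == rest)
print(higher, lower, equal)  # -> 2 1 1
```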
To show that in a bit more depth, let’s look at the range of difference in 5 percentage point increments. To throw out some of the larger differences caused by games with only a few review scores but a wide variation between them, I’ve limited the following to games with ten or more reviews, leaving 1196 titles.
Although the largest single column here is for Day 1 review scores that are only zero to five points higher than the subsequent reviews that follow – not necessarily a big difference – it is clear that Day 1 review averages skew towards the positive side of the scale.
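Bucketing each title’s difference into those 5-point bands can be sketched like this (the difference values are invented for illustration):

```python
import math
from collections import Counter

# Invented Day 1 minus subsequent-average differences, in percentage points.
diffs = [2.0, 4.9, 7.3, -3.1, 12.0, 0.0, -8.5]

def band(d):
    """Return the 5-point band containing d, e.g. 7.3 -> (5, 10)."""
    lo = math.floor(d / 5) * 5
    return (lo, lo + 5)

# Counts per band; histogram[(0, 5)] counts differences in [0, 5), and so on.
histogram = Counter(band(d) for d in diffs)
```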
One question that came to mind at this point was the typical size of the gap between Day 1 review averages and the averages of all other reviews. In broad terms, and using titles with 10 reviews or more to cut out some of the larger difference scores, the Day 1 review average is typically 7 percentage points higher than the average of all subsequent reviews.
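A sketch of that calculation, using made-up records and the median as the measure of ‘typical’ (one reasonable choice; any central measure over the filtered titles works the same way):

```python
import statistics

# Made-up records: (title, number of reviews, Day 1 avg, subsequent avg).
records = [
    ("Game A", 25, 85.0, 78.0),
    ("Game B", 4, 70.0, 95.0),    # fewer than ten reviews: excluded below
    ("Game C", 12, 80.0, 74.0),
    ("Game D", 18, 91.0, 83.0),
]

# Keep titles with ten or more reviews to damp small-sample noise,
# then take the median Day 1 minus subsequent difference.
diffs = [d1 - rest for _, n, d1, rest in records if n >= 10]
typical_gap = statistics.median(diffs)  # -> 7.0
```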
But does significance testing back this up? Of the 1830 games being examined here, only 624 titles have two or more Day 1 reviews, so we can only significance test this group. Significance testing shows that 17% of this sample have significantly higher Day 1 review score averages (or 109 titles) compared to only 2% (or 14 titles) having significantly lower Day 1 review score averages. Again, this is solid evidence that Day 1 review scores are often higher than subsequent review scores. Among this sub-sample, we still see that overall 69% of titles (431 games) achieve higher Day 1 review averages (both significant and non-significant) compared to the average review scores they achieve later.
For those interested, the five titles with the most statistically significant differences were:
- Brink (Xbox 360) – Day 1 Average: 80%; All Subsequent Reviews Average: 68%
- WWE All Stars (Xbox 360) – Day 1 Average: 90%; All Subsequent Reviews Average: 75%
- Dead Space 3 (Xbox 360) – Day 1 Average: 89%; All Subsequent Reviews Average: 78%
- Splatterhouse (Xbox 360) – Day 1 Average: 75%; All Subsequent Reviews Average: 63%
- Tropico 4 (PC) – Day 1 Average: 88%; All Subsequent Reviews Average: 77%
The most significantly lower title for Day 1 reviews compared to subsequent reviews is Captain America: Super Soldier (Xbox 360), which received an average review score of 48% on Day 1 that rose to 62% for all subsequent reviews.
So what does this mean?
First off, it seems that Day 1 reviews are more likely to be positive than the reviews that come out later. Although some might see this as evidence of game critics taking secret envelopes from publishers / studios so they have Doritos and Mountain Dew money (or some other formalised corruption, like minimum review score requirements around exclusive reviews), it could be a number of other things.
Knowing the power of good early reviews, publishers / studios may be prioritising ‘easy’ review sites as the place to showcase their games first. I haven’t looked at this review information by site to confirm or refute this. (A flipside to this line of thought might be that the ‘softest’ reviewing sites simply get their reviews up first.) However, in most cases it is the ‘big’ sites that get the exclusive / Day 1 reviews, because publishers / studios want the most attention for their newly released title. Plus, reviews from a single site might be written by multiple reviewers, meaning that unless the same game critic is writing every Day 1 review, this positive skew is influencing a lot of people.
So perhaps there is a systemic bias at work here. There is certainly some commonality between the experiences behind Day 1 reviews. Sometimes these games will be played on pre-release code, meaning that when the reviewer hits a bug, the developer / studio can claim that, “This issue will be fixed at launch, so you can’t mark us down for that.” Even ‘gold version’ code can still receive a Day 1 patch. For online games, special dedicated servers will have been established that may be optimised for that title, or may simply avoid problems because server loads are low (such as when only the game critics are online).
If the game critics are sequestered somewhere to maintain review confidentiality, they may be offered a number of perks to improve the overall experience. Every game critic will say that gifts don’t affect review scores, despite decades’ worth of evidence from the medical profession (a field with years of training in making such judgements) that gifts do influence decision-making. And even if they aren’t being sent to hotels for PR events, game critics can receive a lot of game-related swag from publishers / studios.
But then there’s also the pressure of time. Game critics have a limited amount of time to write the Day 1 review. To meet the deadline, they need to speed through a title, which means they can end up glossing over a game’s flawed features. Maybe the inventory system is terrible, but the critic rarely goes into their inventory; or maybe the weapons aren’t balanced (especially in multiplayer), but the critic doesn’t do much playing around with weapon combos or even touch the multiplayer option. Or maybe they turn the difficulty down to easy to finish the game, spending significantly less time at a normal-or-harder difficulty where game play issues are more noticeable.
It is a completely different experience to play a 12-hour game in two six-hour sessions versus twelve one-hour sessions. A game critic pushing to get that Day 1 review up is forced to compress the game into marathon chunks that may hide game play issues, whereas a more casual player, spreading their play sessions out, is exposed to those issues more often.
(You could argue that the reverse is true – marathon sessions might highlight particular game play flaws – but Day 1 reviews being positively skewed would indicate this happens less often.)
Ignoring the Experts
So, what to do with this information? My personal takeaway is to start dropping all Day 1 reviews by about a ‘grade’ to adjust for this Day 1 positive skew. Sure, this may mean I’ll be under-valuing some titles based on their reviews, but I can always see what the later reviews are saying before making up my mind about a purchase, can’t I?
Self-identified flaws with this analysis include:
- Metacritic converts scores from a range of ranking mechanisms onto a 0 to 100 scale. Perhaps that conversion is positively biasing some of these results.
- I’m comparing averages that are sometimes based on a very small number of reviews (including averages based on one review). Averages from a small base can be distorted by outliers. However, given the nature of game reviewing, where only a small number of sites get pre-release reviewing opportunities, I don’t see a way of avoiding this.
- Although I’ve checked and rechecked my formulas, I may have made a mistake somewhere. I plan to release the data and if anyone picks something up, I’ll update the above to correct it.
- There’s a chance that this positive skew is something that exists because of the data set used and not something that would have been found if I’d used a different data set, or if more of the 8000 titles in the original dataset had review dates in place. That’s possible, but I think it’s a pretty robust dataset for this kind of analysis. It would be interesting for someone else to take another list of titles for a platform I didn’t cover and then see if they got the same results.
Obviously, the original data set was sourced from Metacritic and I make no claim to the authorship of that information. The aim of this research is to analyse Metacritic’s data as a partial examination of its composite metascore number. This should be considered fair use and fair dealing with the data set. I receive no monetary benefit from my analysis of this data.
A range of Xbox 360, PlayStation 3 and PC titles have been included in this analysis – all up, 8668 titles were collected for analysis, but only 4256 games in this list had a metascore created and could be analysed. Results for platforms such as the Wii, Wii U, Vita, DS etc. have not been examined.
UPDATE 8 June 2013: Fixed up a chart that was showing incorrect figures because I’d sorted the data and not realised the chart was affected. Although the numbers changed, the overall findings didn’t.