Some time ago I thought about bringing in the enormous userbase of Mechanical Turk to rate photos and videos here on 101questions. We could show them your photos and videos and they would ask their questions or simply move to the next if they were bored. As a result, we could find the most perplexing photos and videos more quickly.
As a test, I pulled ten photos from our database, one at each of the ten decile marks of the 101questions bank of photos and videos. In other words, this selection represents our full range, from very boring to very perplexing.
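For the curious, here is a minimal sketch of how a decile sample like this could be pulled. The `pick_decile_photos` helper, the midpoint-of-each-tenth rule, and the made-up ratings are all assumptions for illustration, not the actual query run against the 101questions database.

```python
def pick_decile_photos(ratings):
    """Pick ten photos spread evenly across the rating range.

    `ratings` maps a photo name to its rating. This hypothetical helper
    ranks photos from most to least perplexing and takes the photo at
    the midpoint of each tenth of that ranking (an assumed rule).
    """
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    n = len(ranked)
    if n < 10:
        raise ValueError("need at least ten photos to sample every decile")
    # Midpoint of the k-th tenth sits at fraction (2k + 1) / 20 of the list.
    return [ranked[(2 * k + 1) * n // 20] for k in range(10)]


# Made-up ratings, only here to exercise the function.
fake_ratings = {f"photo{i}": i for i in range(100)}
sample = pick_decile_photos(fake_ratings)
print(sample)
```

Any rule that takes one photo per tenth of the ranking would serve the same purpose: the sample should span the whole range rather than cluster around the favorites.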
Then I showed them to 100 Mechanical Turk users and paid them to answer with a question or a skip. Here are the results.
Photo | 101qs Rating | Turk Rating |
---|---|---|
Ticket Roll | 81 | 71 |
Dueling Discounts | 66 | 83 |
Dominos | 56 | 73 |
Rally | 52 | 87 |
Mural | 48 | 92 |
Sunflower | 43 | 69 |
War! | 39 | 74 |
Dash | 34 | 69 |
Shot put | 28 | 92 |
River | 23 | 90 |
Let me graph that for you.
The correlation between our ratings and theirs is basically non-existent, and if it exists, it’s negative (i.e., the more popular an image is on our site, the less popular it is with Turkers).
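To put a number on that, here is a small sketch that computes Pearson's correlation coefficient from the ten rows in the table. Choosing Pearson's r as the measure is my assumption; the `pearson` helper and the variable names are illustrative.

```python
from math import sqrt

# (photo, 101qs rating, Turk rating), copied from the table above.
RESULTS = [
    ("Ticket Roll", 81, 71),
    ("Dueling Discounts", 66, 83),
    ("Dominos", 56, 73),
    ("Rally", 52, 87),
    ("Mural", 48, 92),
    ("Sunflower", 43, 69),
    ("War!", 39, 74),
    ("Dash", 34, 69),
    ("Shot put", 28, 92),
    ("River", 23, 90),
]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from each mean.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


ours = [row[1] for row in RESULTS]
turk = [row[2] for row in RESULTS]
r = pearson(ours, turk)
print(f"Pearson r = {r:.2f}")
```

Run as written, this prints `Pearson r = -0.32`: a weak negative relationship, and with only ten photos, not one to lean on.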
More damning, here’s a distribution of our users and theirs.
Our users ask questions at every kind of rate. 8% of our users fall into each bucket: those who ask questions 10% of the time, 20% of the time, and so on, all the way to 100% of the time. 27% of our users ask questions less than 10% of the time and boredom-skip the rest.
Then you have Turk, where the distribution is almost flipped. 40% of Turkers ask questions all the time. 0% of Turkers ask questions less than 10% of the time, the way the left end of our distribution does. The modes are switched. The fact that the Turkers were paid, while our users spend their own currency (their time) to be here, may explain why our users are so much more discriminating. Whatever the reason, this small test has convinced me that Turkers aren’t a useful proxy for our own userbase.
2014 Mar 19. The data.