AndyGB
  • Threads: 18
  • Posts: 59
Joined: Feb 7, 2013
June 7th, 2013 at 11:34:17 AM
"Quizzes are fun, but gambling on them is more fun." -my wife, Mrs. AndyGB

Our daily newspaper here prints a quiz 5 days a week with a semi-random topic or theme. Some of the topics are standard trivia topics such as "Science," "Potpourri," "Works of Shakespeare," etc. Others are a bit less common such as "Poets that start with B," "Songs of 1937," and so on. And others seem unique such as "Literary works and the author you'd think would have written them."

Some of the categories repeat with some frequency. There seems to be at least one "Science" set each week, and one "Potpourri" every week or two. So say I want to make a bet each day on what topic the next day's quiz will be. I feel like in order to do this, I need to know how often the various repeating categories come up, compared to the rarer categories and to the one-off categories. Then I can estimate the odds of one category coming up, of a set of two categories coming up vs. another one, of three categories against the field of all possible categories, and so on. She's right, it does sound fun!

It seems obvious that recording the categories for a week won't give me a reliable indicator of how often various categories come up. How long do I need to record the categories before I feel, say 80% confident that I have a good handle on the set of all possible categories and how often various general categories repeat? What is throwing me off is the unknown number of possibilities. There is literally an unlimited number of categories in theory, but the reality is there are probably a handful that repeat frequently, a bigger handful that repeat but less frequently, and some that never repeat at all.

What's the easiest way to approach this? And how do I estimate my confidence given a certain size data set?

Thanks! AndyGB
pacomartin
  • Threads: 649
  • Posts: 7895
Joined: Jan 14, 2010
June 7th, 2013 at 11:39:20 AM
Quote: AndyGB

It seems obvious that recording the categories for a week won't give me a reliable indicator of how often various categories come up. How long do I need to record the categories before I feel, say 80% confident that I have a good handle on the set of all possible categories and how often various general categories repeat?



You have no way of knowing how the newspaper editor picks the categories. If they are not chosen by a random process, but only by the editor's choice, then no amount of data collection will tell you anything.

Even if they used dice and the results were not uniform (e.g. if I roll a sum of 7, then I always pick "Science"), it would take years of data collection to figure out the true odds.
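
To put a rough number on "how much data," here is a minimal sketch of a Wilson score interval for one category's frequency. It assumes each day's topic is an independent draw with fixed odds, which is exactly the assumption questioned above, so treat it as a best case. The function name `wilson_interval` and the counts are made up for illustration.

```python
import math

def wilson_interval(hits, days, z=1.2816):
    """Wilson score interval for a category's true frequency.

    ASSUMPTION: days are independent draws with fixed odds.
    z=1.2816 is the normal quantile for a two-sided 80% interval.
    """
    if days <= 0:
        raise ValueError("need at least one recorded day")
    p = hits / days
    denom = 1 + z * z / days
    center = (p + z * z / (2 * days)) / denom
    half = z * math.sqrt(p * (1 - p) / days + z * z / (4 * days * days)) / denom
    return center - half, center + half

# Hypothetical counts: "Science" seen once per 5-quiz week.
for weeks in (2, 10, 52):
    days = 5 * weeks
    low, high = wilson_interval(weeks, days)
    print(f"{weeks:>2} weeks: {low:.3f} to {high:.3f}")
```

The interval narrows roughly with the square root of the number of days recorded, so halving the uncertainty takes about four times the data.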
rdw4potus
  • Threads: 80
  • Posts: 7237
Joined: Mar 11, 2010
June 7th, 2013 at 11:57:32 AM
I'd be surprised if the options didn't fit a pattern: Monday is science day, Tuesday is literature day, Wednesday is entertainment day, Thursday is geography day, Friday is potpourri day, etc.

But even if that isn't the case, I'd probably shorten the list of items by creating some sort of "the field" bet that includes all of the infrequently repeated subjects. Even if the odds assigned aren't exactly accurate, that eliminates the big long shots and the non-repeated items.
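
A minimal sketch of the "field" idea, assuming you have a plain list of recorded topics. The helper name `field_odds`, the `min_count` cutoff, and the sample history are all hypothetical.

```python
from collections import Counter

def field_odds(history, min_count=2):
    """Relative frequency of each topic, with topics seen fewer than
    min_count times lumped together as 'the field'."""
    counts = Counter(history)
    total = len(history)
    lumped = Counter()
    for category, n in counts.items():
        key = category if n >= min_count else "the field"
        lumped[key] += n
    return {category: n / total for category, n in lumped.items()}

# Made-up two weeks of quiz topics, for illustration only.
history = [
    "Science", "Potpourri", "Songs of 1937", "Science", "Works of Shakespeare",
    "Science", "Poets that start with B", "Potpourri", "Science", "Geography",
]
odds = field_odds(history)
for category, share in sorted(odds.items(), key=lambda item: -item[1]):
    print(f"{category}: {share:.2f}")
```

These are just observed shares, not true odds; with only a couple of weeks of data they will be noisy.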
"So as the clock ticked and the day passed, opportunity met preparation, and luck happened." - Maurice Clarett
kubikulann
  • Threads: 27
  • Posts: 905
Joined: Jun 28, 2011
June 11th, 2013 at 8:53:09 AM
Try fitting a Pareto distribution. Nothing is guaranteed, but very often the relative frequency of ranked items follows something like it (or a variant).
Then, rather quickly, you'll have approximated the relative frequencies. How quickly? You never can tell, sorry.
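
One crude way to try this is a straight-line fit of log(count) against log(rank), i.e. a Zipf-style power law, which is the discrete cousin of the Pareto distribution. This is only a sketch: the function name `fit_zipf_exponent` is made up, and least squares on a log-log plot is a rough estimator, not a proper maximum-likelihood fit.

```python
import math
from collections import Counter

def fit_zipf_exponent(history):
    """Estimate s in count ~ C / rank**s by least squares on the
    log-log rank-frequency plot (a rough fit, not maximum likelihood)."""
    counts = sorted(Counter(history).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct categories")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    return -slope  # the slope is negative for a decaying power law

# Made-up history whose counts are exactly 12, 6, 4, 3 (that is, 12 / rank).
sample = ["Science"] * 12 + ["Potpourri"] * 6 + ["Geography"] * 4 + ["Poetry"] * 3
exponent = fit_zipf_exponent(sample)
print(f"estimated exponent: {exponent:.3f}")
```

One-off topics all tie at a count of 1, which flattens the tail of the plot, so the estimate is most meaningful for the repeating categories.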

Now for the pattern! As said, the editor can wreck any model you build by changing the process on a whim, so assume independence and stability (they have a process and it does not change). Then, ultimately, what you have is a Markov chain. The techniques of Markov processes and time series may be used to compute autocorrelation, for example. But, but, but... is it worth the effort?
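
For what it's worth, the first step of the Markov chain approach is cheap: count how often each topic follows each other topic. A minimal sketch, assuming a first-order chain (tomorrow depends only on today); `transition_probabilities` and the sample sequence are hypothetical.

```python
from collections import Counter, defaultdict

def transition_probabilities(history):
    """Estimate P(tomorrow's topic | today's topic) from consecutive
    pairs in the recorded sequence. ASSUMPTION: first-order Markov chain."""
    pair_counts = defaultdict(Counter)
    for today, tomorrow in zip(history, history[1:]):
        pair_counts[today][tomorrow] += 1
    return {
        today: {topic: n / sum(following.values()) for topic, n in following.items()}
        for today, following in pair_counts.items()
    }

# Made-up sequence of consecutive quiz topics, for illustration only.
sequence = ["Science", "Potpourri", "Science", "Literature", "Science", "Potpourri"]
probs = transition_probabilities(sequence)
for today, following in probs.items():
    print(today, "->", following)
```

With K categories there are up to K * K transitions to estimate, so this needs far more data than plain frequencies do, which is part of why the effort may not pay off.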
Reperiet qui quaesiverit