One for the stats geeks
Sep. 17th, 2009 01:55 pm
My stats and related classes were too long ago.
Given a set of numbers -- integers, in this case, but I don't think it matters -- I want to detect whether they're all clustered (180, 180, 183, 179, 181), clustered with a small number of outliers (180, 179, 360, 182, 184), or spread (180, 300, 200, 250, 275). Number of input points could be anywhere from 2 to 50, maybe more. Do I want to use population standard deviation, sample standard deviation, population variance, sample variance, or something else?
no subject
Date: 2009-09-17 09:09 pm (UTC)
If you want to describe the overall amount of scatter, so you can compare with a different sample, then use sample SD [ =STDEV() in Excel ], but that will only give you a single number - it won't tell you the *shape* of the data.
For more detail, I can help if I can see the data, and help find more sophisticated tools.
Sample variance is just the square of STDEV, so it's no big deal to switch between them, but it doesn't tell you anything more. You don't use the population standard deviation because it's not theoretically correct for a sample -- also because it doesn't account properly for the samples being different sizes.
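To make the "only a number, not the shape" point concrete, here is a minimal Python sketch that computes both kinds of SD for the three example sets from the question. The function name `describe` is made up for illustration; `statistics.stdev` and `statistics.pstdev` are the standard-library sample and population versions.

```python
import statistics

# The three example sets from the original question.
clustered = [180, 180, 183, 179, 181]
with_outlier = [180, 179, 360, 182, 184]
spread = [180, 300, 200, 250, 275]

def describe(values):
    """Return (sample SD, population SD) for a list of numbers."""
    return statistics.stdev(values), statistics.pstdev(values)

for name, values in [("clustered", clustered),
                     ("with outlier", with_outlier),
                     ("spread", spread)]:
    sample_sd, population_sd = describe(values)
    print(f"{name}: sample SD = {sample_sd:.1f}, population SD = {population_sd:.1f}")
```

Note that the set with one outlier comes out with a larger SD than the spread set, so the SD alone cannot separate "clustered with outliers" from "spread".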
no subject
Date: 2009-09-17 11:41 pm (UTC)
I have the statistical functions at the bottom of this page (http://www.postgresql.org/docs/8.3/interactive/functions-aggregate.html) readily available, and can do general arithmetic as well.
no subject
Date: 2009-09-17 11:49 pm (UTC)
(I assume you've already investigated various "audio fingerprint" methods to try to deal with the problem, presuming you have the audio content?)
Anyway. If you want it all to happen with a single SQL query, then you're probably stuck with taking the mean and the stdev and arbitrarily deciding what's "close enough".
A more iterative approach is to read each record in, and either add it to an existing cluster or create a new cluster (using some sort of tunable max distance).
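A rough Python sketch of that iterative approach, under assumptions of my own: "distance" is measured from the running mean of each cluster, and a record joins the first cluster that is close enough. The name `cluster_by_distance` and the parameter `max_distance` are illustrative, not from any library.

```python
def cluster_by_distance(values, max_distance):
    """Greedy one-pass clustering.

    Each value joins the first existing cluster whose current mean is
    within max_distance of it; otherwise it starts a new cluster.
    """
    clusters = []
    for value in values:
        for cluster in clusters:
            cluster_mean = sum(cluster) / len(cluster)
            if abs(value - cluster_mean) <= max_distance:
                cluster.append(value)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([value])
    return clusters

print(cluster_by_distance([180, 179, 360, 182, 184], max_distance=5))
```

One cluster holding nearly everything suggests "clustered with outliers"; many small clusters suggests "spread". The result can depend on input order, which is one reason this is only a sketch.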
Unfortunately, I suspect that none of these options will really give you everything you want, and you'll be stuck with either a 60% solution, or the need to write a UI to let humans enter information which then needs to be persisted. (MM ended up doing both -- and if you think the problem is fun with popular music, wait until you try it against classical music...)
Good luck!
no subject
Date: 2009-09-17 11:54 pm (UTC)
no subject
Date: 2009-09-18 12:06 am (UTC)
You can mix both methods:
1. create a set of all candidate matches ordered by length, and add it as the initial member of a queue of sets to be classified
2. pull the next set off the queue
3. find the mode of the set
4. create three sets
a. those shorter than mode-diff
b. those between mode-diff and mode+diff
c. those longer than mode+diff
(where "diff" is your "how close is close enough" metric -- 5 seconds, 2% of song length, whatever)
5. set 4b is now classified. add sets 4a and 4c to the queue (if they have any members)
6. if there are any sets left in the queue, return to step 2
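The steps above could be sketched in Python roughly as follows. This assumes "diff" is a fixed number (a percentage of song length would need a small change), and that ties for the mode go to the shortest of the most frequent values; `classify_by_mode` is a made-up name.

```python
from collections import Counter, deque

def classify_by_mode(lengths, diff):
    """Split lengths into groups of values within diff of a local mode."""
    queue = deque([sorted(lengths)])                       # step 1
    groups = []
    while queue:                                           # step 6
        current = queue.popleft()                          # step 2
        # Step 3: most frequent value; on ties, the first (shortest) wins.
        mode = Counter(current).most_common(1)[0][0]
        # Step 4: partition around the mode.
        shorter = [x for x in current if x < mode - diff]
        close = [x for x in current if mode - diff <= x <= mode + diff]
        longer = [x for x in current if x > mode + diff]
        groups.append(close)                               # step 5: 4b is classified
        if shorter:
            queue.append(shorter)
        if longer:
            queue.append(longer)
    return groups

print(classify_by_mode([180, 179, 360, 182, 184], diff=5))
```

The loop always terminates because the "close" group contains at least the mode itself, so every set put back on the queue is strictly smaller than the one it came from.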
no subject
Date: 2009-09-18 05:28 pm (UTC)
numbers vs. shapes
Date: 2009-09-17 09:22 pm (UTC)
1. You can look at the "moment" of the data (the second central moment, i.e. the variance), which expresses how narrowly the data cluster around the mean. A larger moment means that the data is more "spread out".
2. I don't know if there's an easy way for a machine to decide if a distribution is bimodal or not. It's usually pretty easy for a human to spot, but that's about it. You can try binning your data, interpolating, and then looking for maxima; if they're far enough apart (to whatever tolerance you care to use), then they probably represent distinct peaks.
Oh, it looks like you aren't really worried about bimodal after all (just re-read the description). Bimodal can confuse the issue, but if you look at the moment, it will at least show up as "spread" vs. "clustered". There are some good formulae here:
http://en.wikipedia.org/wiki/Moment_(mathematics)
Although, with only 2 points, I'm not sure how many of these words are even applicable. What does it mean for two points to have an outlier?
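As a rough illustration of the binning idea in point 2, here is a Python sketch that counts values in fixed-width bins and reports the bins that are local maxima. It skips the interpolation step entirely, and the bin width is an arbitrary tolerance you would have to pick; `histogram_peaks` is a hypothetical name.

```python
def histogram_peaks(values, bin_width):
    """Return the lower edge of every bin that is a local maximum."""
    low = min(values)
    n_bins = int((max(values) - low) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - low) // bin_width)] += 1

    peaks = []
    for i, count in enumerate(counts):
        left = counts[i - 1] if i > 0 else 0
        right = counts[i + 1] if i < n_bins - 1 else 0
        # Non-empty bin, at least as tall as its left neighbour and
        # strictly taller than its right one (so a plateau counts once).
        if count > 0 and count >= left and count > right:
            peaks.append(low + i * bin_width)
    return peaks

print(histogram_peaks([180, 179, 360, 182, 184], bin_width=10))
```

With small inputs nearly every isolated value becomes its own "peak", so you would still need to decide how far apart peaks must be, and how tall, before treating them as distinct.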
Re: numbers vs. shapes
Date: 2009-09-17 09:24 pm (UTC)
Edit to add: The differences between the sorted values, in case it wasn't obvious. Sorry about that.
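Assuming this means the gaps between consecutive values once the data is sorted, a small Python sketch might look like this (`sorted_gaps` is just an illustrative name):

```python
def sorted_gaps(values):
    """Differences between consecutive values after sorting."""
    ordered = sorted(values)
    return [b - a for a, b in zip(ordered, ordered[1:])]

print(sorted_gaps([180, 180, 183, 179, 181]))  # clustered
print(sorted_gaps([180, 179, 360, 182, 184]))  # one outlier
print(sorted_gaps([180, 300, 200, 250, 275]))  # spread
```

For the example sets, the clustered data gives only small gaps, the outlier case gives small gaps plus one very large one, and the spread data gives uniformly large gaps.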
Re: numbers vs. shapes
Date: 2009-09-17 11:43 pm (UTC)
Re: numbers vs. shapes
Date: 2009-09-17 11:44 pm (UTC)
Re: numbers vs. shapes
Date: 2009-09-17 11:46 pm (UTC)
no subject
Date: 2009-09-19 12:28 pm (UTC)
For unimodal distributions you often look to see whether they are normally distributed. Normal distributions can have arbitrary mean and standard deviation, which correspond to the first two moments; the third moment measures the skew and the fourth the kurtosis (whether the distribution is more clustered or more spread out than the normal distribution), so the kurtosis is a measure of clustering if the distribution has a single peak.
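For what it's worth, a minimal Python sketch of skew and kurtosis computed from central moments, using the simple population-style (divide by n) definitions rather than any bias-corrected estimator. The function names are made up, and the code will divide by zero if all values are equal.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

def skewness(values):
    """Third central moment scaled by the SD cubed (0 for symmetric data)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth central moment scaled by variance squared, minus 3
    so that a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

data = [180, 179, 360, 182, 184]
print(skewness(data), excess_kurtosis(data))
```

A single large outlier on the high side shows up as a clearly positive skew, though with as few as 5 points these higher moments are very noisy.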
However, you might be expecting the numbers to be uniformly distributed (e.g. route numbers on buses passing a stop), in which case the test statistics would be different.
For the time/linear-spatial cases it is more complex, but for a random process we would expect the differences between successive values to follow a negative exponential distribution. The neg-exp distribution is a special case of the gamma distribution, which has 2 parameters, the mean and alpha. For the neg-exp, alpha = 1. If alpha < 1 then the arrival times are clustered, and if alpha > 1 then the arrivals are more evenly spaced than random.
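One simple way to get a rough alpha, not necessarily the method intended here, is the method-of-moments estimate alpha = mean^2 / variance applied to the gaps between sorted values. A Python sketch, assuming at least three input values and gaps that are not all identical; `gamma_shape_from_gaps` is a hypothetical name:

```python
import statistics

def gamma_shape_from_gaps(values):
    """Method-of-moments estimate of the gamma shape parameter (alpha)
    for the gaps between consecutive sorted values."""
    ordered = sorted(values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    mean = statistics.mean(gaps)
    variance = statistics.variance(gaps)  # sample variance; needs >= 2 gaps
    return mean * mean / variance

# Nearly even spacing gives alpha well above 1 ...
print(gamma_shape_from_gaps([0, 10, 21, 30, 41]))
# ... while two tight bunches far apart give alpha below 1.
print(gamma_shape_from_gaps([0, 1, 2, 50, 51, 52]))
```

With only a handful of points the estimate is very unstable, so treat it as a rough indicator rather than a test.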