One for the stats geeks
Sep. 17th, 2009 01:55 pm
My stats and related classes were too long ago.
Given a set of numbers -- integers, in this case, but I don't think it matters -- I want to detect whether they're all clustered (180, 180, 183, 179, 181), clustered with a small number of outliers (180, 179, 360, 182, 184), or spread (180, 300, 200, 250, 275). Number of input points could be anywhere from 2 to 50, maybe more. Do I want to use population standard deviation, sample standard deviation, population variance, sample variance, or something else?
no subject
Date: 2009-09-17 11:49 pm (UTC)
(I assume you've already investigated various "audio fingerprint" methods to try to deal with the problem, presuming you have the audio content?)
Anyway. If you want it all to happen with a single SQL query, then you're probably stuck with taking the mean and the stdev and arbitrarily deciding what's "close enough".
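For illustration, here's a rough sketch of the mean-and-stdev check, in Python rather than SQL. The function name and the 5.0 cutoff are made up for the example; the cutoff is exactly the arbitrary "close enough" decision mentioned above.

```python
from statistics import mean, pstdev

def describe_spread(lengths, close_enough=5.0):
    """Return (mean, population stdev, is_tight) for a list of numbers.

    close_enough is a hypothetical, arbitrary cutoff; tune it to taste.
    """
    m = mean(lengths)
    sd = pstdev(lengths)
    # "Tight" here only means the stdev is under the chosen cutoff.
    return m, sd, sd <= close_enough

for sample in ([180, 180, 183, 179, 181],
               [180, 179, 360, 182, 184],
               [180, 300, 200, 250, 275]):
    print(sample, describe_spread(sample))
```

Note that the single 360 in the second sample pushes its stdev above that of the genuinely spread third sample, so this check alone can say "tight or not" but can't tell "clustered with outliers" apart from "spread".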
A more iterative approach is to read each record in, and either add it to an existing cluster or create a new cluster (using some sort of tunable max distance).
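A minimal sketch of that iterative approach, assuming "distance" means distance from the cluster's running mean (that choice, the function name, and the default of 5 are all assumptions for the example):

```python
def assign_clusters(values, max_distance=5):
    """Greedily place each value into the first cluster whose current
    mean is within max_distance, or start a new cluster.

    max_distance is the tunable knob; 5 is an arbitrary default.
    """
    clusters = []  # each cluster is a plain list of values
    for v in values:
        for cluster in clusters:
            center = sum(cluster) / len(cluster)
            if abs(v - center) <= max_distance:
                cluster.append(v)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([v])
    return clusters

print(assign_clusters([180, 179, 360, 182, 184]))
```

One cluster means "all clustered", one big cluster plus a few tiny ones means "outliers", and many small clusters means "spread". The result can depend on the order the records are read in.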
Unfortunately, I suspect that none of these options will really give you everything you want, and you'll be stuck with either a 60% solution, or the need to write a UI to let humans enter information which then needs to be persisted. (MM ended up doing both -- and if you think the problem is fun with popular music, wait until you try it against classical music...)
Good luck!
no subject
Date: 2009-09-17 11:54 pm (UTC)
no subject
Date: 2009-09-18 12:06 am (UTC)
You can mix both methods:
1. create a set of all candidate matches, ordered by length, and add it as the initial member of a queue of sets to be classified
2. pull the next set off the queue
3. find the mode of the set
4. create three sets
   a. those shorter than mode-diff
   b. those between mode-diff and mode+diff
   c. those longer than mode+diff
   (where "diff" is your "how close is close enough" metric -- 5 seconds, 2% of song length, whatever)
5. set 4b is now classified. add sets 4a and 4c to the queue (if they have any members)
6. if there are any sets left in the queue, return to step 2
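
The steps above could be sketched in Python roughly as follows. The function name and the default diff of 5 are placeholders, and one assumption is baked in: when several values tie for most frequent (including when every value is unique), the smallest of them is treated as the mode.

```python
from collections import Counter, deque

def classify_by_mode(lengths, diff=5):
    """Split lengths into groups of values within diff of a local mode."""
    if not lengths:
        return []
    queue = deque([sorted(lengths)])             # step 1
    classified = []
    while queue:                                 # step 6
        current = queue.popleft()                # step 2
        # Step 3: on ties, the smallest tied value wins because the
        # input is sorted (an assumption of this sketch).
        mode = Counter(current).most_common(1)[0][0]
        shorter = [x for x in current if x < mode - diff]            # 4a
        near = [x for x in current if abs(x - mode) <= diff]         # 4b
        longer = [x for x in current if x > mode + diff]             # 4c
        classified.append(near)                  # step 5: 4b is done
        for rest in (shorter, longer):           # step 5: requeue 4a, 4c
            if rest:
                queue.append(rest)
    return classified

print(classify_by_mode([180, 179, 360, 182, 184]))
```

Each pass classifies at least the mode itself, so the queue always drains. A percentage-based diff (the "2% of song length" option) would need diff computed from the mode inside the loop instead of passed in as a constant.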
no subject
Date: 2009-09-18 05:28 pm (UTC)