nolly: (Default)
[personal profile] nolly
My stats and related classes were too long ago.

Given a set of numbers -- integers,in this case, but I don't think it matters -- I want to detect whether they're all clustered (180, 180, 183, 179, 181), clustered with a small number of outliers (180, 179, 360, 182, 184), or spread (180, 300, 200, 250. 275). Number of input points could be anywhere from 2 to 50, maybe more. Do I want to use population standard deviation, sample standard deviation, population variance, sample variance, or something else?

Date: 2009-09-17 11:49 pm (UTC)
From: [identity profile] tkil.livejournal.com
Funny coincidence department: song de-duping was something I worked on when I was at MusicMatch.

(I assume you've already investigated various "audio fingerprint" methods to try to deal with the problem, presuming you have the audio content?)

Anyway. If you want it all to happen with a single SQL query, then you're probably stuck with taking the mean and the stdev and arbitrarily deciding what's "close enough".

A more iterative approach is to read each record in, and either add it to an existing cluster or create a new cluster (using some sort of tunable max distance).

Unfortunately, I suspect that none of these options will really give you everything you want, and you'll be stuck with either a 60% solution, or the need to write a UI to let humans enter information which then needs to be persisted. (MM ended up doing both -- and if you think the problem is fun with popular music, wait until you try it against classical music...)

Good luck!

Date: 2009-09-17 11:54 pm (UTC)
From: [identity profile] nolly.livejournal.com
For the purpose of the current project, audio fingerprinting would be overkill (and slow). Also, I have another set of data I'd like to do something similar with, where that would be totally irrelevant, thus the more generic phrasing initially. If we were on pg 8.4 instead of 8.3, some of this would be easier, I think, but we've not upgraded yet.

Date: 2009-09-18 12:06 am (UTC)
From: [identity profile] tkil.livejournal.com
If 8.4 adds the "analytic functions" (sliding windows and the like), then yeah, it might help. Those aren't that hard to open-code, though, presuming you have access to either stored procedures or DBI-type interfaces.

You can mix both methods:

1. create set of all candidate matches ordered by length, and add it as the initial member of a queue of sets to be classified

2. pull the next set off the queue

3. find the mode of the set

4. create three sets
  a. those shorter than mode-diff
  b. those between mode-diff and mode+diff
  c. those longer than mode+diff
(where "diff" is your "how close is close enough" metric -- 5 seconds, 2% of song length, whatever)

5. set 4b is now classified. add sets 4a and 4c to the queue (if they have any members)

6. if there are any sets left in the queue, return to step 2

Date: 2009-09-18 05:28 pm (UTC)
From: [identity profile] nolly.livejournal.com
84 adds windowing functions and common table expressions, and the result is Turing-complete SQL. After talking it over with my boss yesterday evening, I've got something that works for my purposes.

Profile

nolly: (Default)
nolly

December 2011

S M T W T F S
     123
45678910
11121314 151617
18192021 222324
25262728293031

Style Credit

Expand Cut Tags

No cut tags
Page generated Feb. 24th, 2026 04:05 pm
Powered by Dreamwidth Studios