[personal profile] nolly
My stats and related classes were too long ago.

Given a set of numbers -- integers, in this case, but I don't think it matters -- I want to detect whether they're all clustered (180, 180, 183, 179, 181), clustered with a small number of outliers (180, 179, 360, 182, 184), or spread (180, 300, 200, 250, 275). The number of input points could be anywhere from 2 to 50, maybe more. Do I want to use population standard deviation, sample standard deviation, population variance, sample variance, or something else?

Date: 2009-09-17 09:09 pm (UTC)
From: [identity profile] hobbitbabe.livejournal.com
The first thing you always want to do is plot them. This will show you the presence of clusters and help identify outliers.

If you want to describe the overall amount of scatter, so you can compare with a different sample, then use sample SD [ =STDEV ( ) on Excel ], but that will only tell you a number - it won't tell you the *shape* of the data.
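For concreteness, here's a minimal sketch of that comparison using Python's `statistics` module, with the example numbers from the original question (the variable names are just for illustration):

```python
import statistics

clustered = [180, 180, 183, 179, 181]
spread = [180, 300, 200, 250, 275]

# statistics.stdev uses the n-1 (sample) denominator,
# matching Excel's STDEV().
print(statistics.stdev(clustered))  # small: values are tightly grouped
print(statistics.stdev(spread))     # large: values are scattered
```

As noted, this gives you one number summarizing the scatter, but says nothing about the shape of the data.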

More detail -- I can help if I can see the data, and help find more sophisticated tools.

Sample variance is just the square of STDEV, so it's no big deal to switch between them, but it doesn't tell you any more. You don't use the population standard deviation because it's not theoretically correct for a sample -- and because it doesn't account properly for the sample size.

numbers vs. shapes

Date: 2009-09-17 09:22 pm (UTC)
From: [identity profile] tkil.livejournal.com
Your other commenter is correct, but a few more things to mention:

1. You can look at the "moment" of the data, which expresses how narrowly the data cluster around the mean. A larger moment means that the data are more "spread out".

2. I don't know if there's an easy way for a machine to decide whether a distribution is bimodal or not. It's usually pretty easy for a human to spot, but that's about it. You can try binning your data, interpolating, and then looking for maxima; if they're far enough apart (to whatever tolerance you care to use), then they probably represent distinct peaks.

Oh, it looks like you aren't really worried about bimodal after all (just re-read the description). Bimodal can confuse the issue, but if you look at the moment, it will at least show up as "spread" vs. "clustered". There are some good formulae here:

http://en.wikipedia.org/wiki/Moment_(mathematics)

Although, with only 2 points, I'm not sure how many of these words are even applicable. What does it mean for two points to have an outlier?
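The binning-and-maxima idea could be sketched like this (the bin width and the crude local-maximum test are arbitrary illustrative choices, not anything specific from the comment):

```python
from collections import Counter

def peak_bins(values, bin_width=10):
    """Bin the values, then return the bins that are local maxima."""
    bins = Counter(v // bin_width for v in values)
    peaks = []
    for b, count in bins.items():
        # A bin is a peak if it has at least as many members as
        # both of its neighbouring bins.
        if count >= bins.get(b - 1, 0) and count >= bins.get(b + 1, 0):
            peaks.append(b * bin_width)
    return sorted(peaks)

# Two widely separated peaks suggest a bimodal distribution.
print(peak_bins([180, 179, 181, 360, 361, 358]))  # [180, 360]
```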

Re: numbers vs. shapes

Date: 2009-09-17 09:24 pm (UTC)
From: [identity profile] tkil.livejournal.com
Oh, and something else you might want to investigate is the use of differences, rather than the raw numbers. That would also answer the intuitive problems with the "2 datapoints" situation.

Edit to add: The differences between the sorted values, in case it wasn't obvious. Sorry about that.
Edited Date: 2009-09-17 09:50 pm (UTC)
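The sorted-differences trick might look like this (the function name is just for illustration):

```python
def gaps(values):
    """Differences between successive sorted values."""
    s = sorted(values)
    return [b - a for a, b in zip(s, s[1:])]

# A cluster shows up as uniformly small gaps;
# an outlier shows up as one big gap.
print(gaps([180, 179, 360, 182, 184]))  # [1, 2, 2, 176]
```

This also gives the two-point case a natural meaning: there is exactly one gap, and it is either small or large.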

Date: 2009-09-17 11:41 pm (UTC)
From: [identity profile] nolly.livejournal.com
Visualization isn't really an option, or I wouldn't be asking. I have a lot of song records, containing some info about the songs, and I'm looking for duplicates. If the durations are close together, they're probably the same song, just with an extra second or two of fade/silence at the end; if they're far apart, they're probably meaningfully different edits, even if the name and artist are the same.

I have the statistical functions at the bottom of this page (http://www.postgresql.org/docs/8.3/interactive/functions-aggregate.html) readily available, and can do general arithmetic as well.
Edited Date: 2009-09-17 11:41 pm (UTC)

Re: numbers vs. shapes

Date: 2009-09-17 11:43 pm (UTC)
From: [identity profile] nolly.livejournal.com
With two points, I only care about "about the same" or "very different" -- what I'm looking for is records of duplicated songs. See my response above for more detail -- visualization isn't really an option; this is a SQL query.

Re: numbers vs. shapes

Date: 2009-09-17 11:44 pm (UTC)
From: [identity profile] tkil.livejournal.com
FWIW, I thought I was answering the whole time with a view to automatic / unattended answers, so...

Re: numbers vs. shapes

Date: 2009-09-17 11:46 pm (UTC)
From: [identity profile] nolly.livejournal.com
Hi, I just got out of a 2.5 hour meeting. *grin* I apparently read more into "Your other commenter is correct" than you intended -- that's the reason I mentioned visualization not being an option.

Date: 2009-09-17 11:49 pm (UTC)
From: [identity profile] tkil.livejournal.com
Funny coincidence department: song de-duping was something I worked on when I was at MusicMatch.

(I assume you've already investigated various "audio fingerprint" methods to try to deal with the problem, presuming you have the audio content?)

Anyway. If you want it all to happen with a single SQL query, then you're probably stuck with taking the mean and the stdev and arbitrarily deciding what's "close enough".
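In Python terms, that "close enough" decision might look like the sketch below (the 2-second threshold and the function name are arbitrary placeholders, not anything from the thread):

```python
import statistics

def probably_same_edit(durations, max_stdev=2.0):
    """Treat a group of durations as duplicates when their sample
    stdev falls under an arbitrary 'close enough' threshold."""
    if len(durations) < 2:
        return True
    return statistics.stdev(durations) <= max_stdev

print(probably_same_edit([180, 181, 180]))  # True
print(probably_same_edit([180, 300, 250]))  # False
```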

A more iterative approach is to read each record in, and either add it to an existing cluster or create a new cluster (using some sort of tunable max distance).
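A sketch of that iterative approach, assuming a simple running-mean distance check (the 5-second default is an arbitrary tunable):

```python
def cluster(values, max_distance=5):
    """Greedy one-pass clustering: each value joins the first cluster
    whose mean is within max_distance, else it starts a new cluster."""
    clusters = []
    for v in sorted(values):
        for c in clusters:
            if abs(v - sum(c) / len(c)) <= max_distance:
                c.append(v)
                break
        else:
            # No existing cluster is close enough.
            clusters.append([v])
    return clusters

print(cluster([180, 179, 360, 182, 184]))  # [[179, 180, 182, 184], [360]]
```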

Unfortunately, I suspect that none of these options will really give you everything you want, and you'll be stuck with either a 60% solution, or the need to write a UI to let humans enter information which then needs to be persisted. (MM ended up doing both -- and if you think the problem is fun with popular music, wait until you try it against classical music...)

Good luck!

Date: 2009-09-17 11:54 pm (UTC)
From: [identity profile] nolly.livejournal.com
For the purpose of the current project, audio fingerprinting would be overkill (and slow). Also, I have another set of data I'd like to do something similar with, where that would be totally irrelevant, thus the more generic phrasing initially. If we were on pg 8.4 instead of 8.3, some of this would be easier, I think, but we've not upgraded yet.

Date: 2009-09-18 12:06 am (UTC)
From: [identity profile] tkil.livejournal.com
If 8.4 adds the "analytic functions" (sliding windows and the like), then yeah, it might help. Those aren't that hard to open-code, though, presuming you have access to either stored procedures or DBI-type interfaces.

You can mix both methods:

1. create set of all candidate matches ordered by length, and add it as the initial member of a queue of sets to be classified

2. pull the next set off the queue

3. find the mode of the set

4. create three sets
  a. those shorter than mode-diff
  b. those between mode-diff and mode+diff
  c. those longer than mode+diff
(where "diff" is your "how close is close enough" metric -- 5 seconds, 2% of song length, whatever)

5. set 4b is now classified. add sets 4a and 4c to the queue (if they have any members)

6. if there are any sets left in the queue, return to step 2

Date: 2009-09-18 05:28 pm (UTC)
From: [identity profile] nolly.livejournal.com
8.4 adds windowing functions and common table expressions, and the result is Turing-complete SQL. After talking it over with my boss yesterday evening, I've got something that works for my purposes.

Date: 2009-09-19 12:28 pm (UTC)
From: [identity profile] a-musing-amazon.livejournal.com
I think you need to first think about what the generating process is. Are they straightforward values from a distribution? If so, what sort of distribution are you expecting (e.g. unimodal, multi-modal, or uniform on a range)? Or is it a time- or spatially-ordered process (such as locations of autos on a road, arrival times at a checkout, times to failure of a machine, or, in 2-d, trees in a forest)? You also need to know whether they are a sample or a census, i.e. the full population.

For unimodal distributions you often look to see if they are normally distributed. Normal distributions can have arbitrary mean and standard deviation -- the first two moments; the third moment measures the skew, and the fourth the kurtosis (whether the distribution is more clustered or more spread out than the normal), so kurtosis is a measure of clustering if the distribution has a single peak.
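The fourth-moment measure can be computed directly; here is a sketch using the population form with the usual "minus 3" normalization, so a normal distribution scores roughly zero (the function name is just for illustration):

```python
import statistics

def kurtosis(xs):
    """Excess kurtosis: the fourth standardized moment, minus 3.
    Higher values mean heavier tails / outliers; a normal sample
    comes out near zero."""
    m = statistics.mean(xs)
    n = len(xs)
    var = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / var ** 2 - 3

# The set with an outlier has visibly higher kurtosis than the
# tight cluster.
print(kurtosis([180, 179, 360, 182, 184]))
print(kurtosis([180, 180, 183, 179, 181]))
```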

However, you might be expecting the numbers to be uniformly distributed (e.g. route numbers of buses passing a stop), in which case the test statistics would be different.

For the time/linear-spatial cases it is more complex, but for a random process we would expect the differences between successive values to be negative-exponentially distributed. A neg-exp distribution is a special case of the gamma distribution; the gamma has two parameters, the mean and alpha, and for the neg-exp, alpha = 1. If alpha < 1 then the arrival times are clustered, and if alpha > 1 then the arrivals are more evenly spaced than random.
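One simple way to estimate alpha from the successive gaps is the method-of-moments fit for the gamma distribution, alpha ≈ mean² / variance (the estimator is standard; the function below is just an illustrative sketch):

```python
import statistics

def gamma_alpha(gaps):
    """Method-of-moments estimate of the gamma shape parameter:
    alpha = mean^2 / variance of the inter-arrival gaps.
    alpha ~ 1 looks random (exponential), alpha < 1 clustered,
    alpha > 1 more evenly spaced than random."""
    m = statistics.mean(gaps)
    v = statistics.variance(gaps)  # sample variance
    return m * m / v

# Nearly even spacing has tiny variance, so alpha is very large;
# one huge gap among small ones drives alpha below 1.
print(gamma_alpha([10, 10, 10, 10, 11]))
print(gamma_alpha([1, 1, 1, 1, 50]))
```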

