
 When to use automatic outlier removal

The problem with outliers

Nonlinear regression, like linear regression, assumes that the scatter of data around the ideal curve follows a Gaussian or normal distribution. This assumption leads to the familiar goal of regression: to minimize the sum of the squares of the vertical or Y-value distances between the points and the curve. However, experimental mistakes can lead to erroneous values – outliers. Even a single outlier can dominate the sum-of-the-squares calculation, and lead to misleading results.
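To see how strongly one point can dominate, here is a minimal sketch in Python. It uses a straight-line fit rather than a nonlinear one purely to keep the arithmetic short, and the data values, the helper names `fit_line` and `squared_residuals`, and the size of the "mistake" are all made up for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def squared_residuals(xs, ys, slope, intercept):
    """Squared vertical (Y) distance of each point from the line."""
    return [(y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)]


# Hypothetical data scattered closely around y = 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys_clean = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.1]

# Same data with one hypothetical experimental mistake in the last value.
ys_outlier = list(ys_clean)
ys_outlier[7] = 30.0

slope_clean, intercept_clean = fit_line(xs, ys_clean)
slope_out, intercept_out = fit_line(xs, ys_outlier)

# Measured against the line fitted to the clean data, how much of the
# total sum of squares comes from the single outlier?
sq = squared_residuals(xs, ys_outlier, slope_clean, intercept_clean)
outlier_share = sq[7] / sum(sq)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_out:.2f}")
print(f"outlier share of sum of squares: {outlier_share:.1%}")
```

Because the distances are squared, the one aberrant value accounts for nearly all of the sum of squares, and the fit shifts toward it to reduce that single term.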

Is it 'cheating' to remove outliers?

Some people feel that removing outliers is ‘cheating’. It can be viewed that way when outliers are removed in an ad hoc manner, especially when you remove only outliers that get in the way of obtaining results you like. But leaving outliers in the data you analyze is also ‘cheating’, as it can lead to invalid results.

Here is a Bayesian way to think about systematic approaches to removing outliers. When a value is flagged as an outlier, there are two possibilities.

A coincidence occurred, the kind of coincidence that happens in a few percent of experiments even if the entire scatter is Gaussian (depending on how aggressively you define an outlier).

A ‘bad’ point got included in your data.

Which possibility is more likely?

It depends on your experimental system.

If your experimental system generates a ‘bad’ point in a few percent of experiments, then it makes sense to eliminate the point as an outlier. It is more likely to be a ‘bad’ point than a ‘good’ point that just happened to be far from the curve.

If your system is very pure and controlled, so ‘bad’ points occur very rarely, then it is more likely that the point is far from the curve by chance (rather than because of a mistake) and you should leave it in. Alternatively, in that case, you could set Q (the setting that controls how aggressively outliers are defined) to a lower value, so that only points much further from the curve are detected as outliers.
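The comparison above can be made concrete with Bayes' rule. The sketch below is only an illustration: the function name `prob_bad_given_flagged`, the rates passed to it, and the simplifying assumption that every ‘bad’ point gets flagged are all hypothetical, not values taken from any particular outlier test.

```python
def prob_bad_given_flagged(p_bad, p_flag_given_good, p_flag_given_bad=1.0):
    """Probability that a flagged value is really a 'bad' point.

    p_bad             -- assumed prior chance that a value is 'bad'
    p_flag_given_good -- assumed chance that a 'good' (Gaussian) value
                         is flagged by coincidence
    p_flag_given_bad  -- assumed chance that a 'bad' value is flagged
                         (1.0 here purely for simplicity)
    """
    flagged_bad = p_bad * p_flag_given_bad
    flagged_good = (1.0 - p_bad) * p_flag_given_good
    return flagged_bad / (flagged_bad + flagged_good)


# Hypothetical system that produces a 'bad' value a few percent of the time.
messy = prob_bad_given_flagged(p_bad=0.03, p_flag_given_good=0.01)

# Hypothetical very pure and controlled system: 'bad' values are very rare.
pure = prob_bad_given_flagged(p_bad=0.0001, p_flag_given_good=0.01)

# Same pure system, but with a stricter definition of an outlier, so that
# 'good' values are flagged by coincidence ten times less often.
pure_strict = prob_bad_given_flagged(p_bad=0.0001, p_flag_given_good=0.001)

print(f"messy system:        {messy:.3f}")
print(f"pure system:         {pure:.3f}")
print(f"pure system, strict: {pure_strict:.3f}")
```

With these made-up rates, a flagged value in the messy system is more likely ‘bad’ than ‘good’, while in the pure system it is far more likely to be a coincidence. Defining outliers more strictly raises the probability that a flagged value is genuinely ‘bad’, which is the rationale for lowering Q when mistakes are rare.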