Understanding non-normal data - part 1

When you learn about Six Sigma, you will find that many of the calculations and graphs are based on normal data. The normal distribution is often considered the most important distribution in statistics. So what if your data is non-normal? Don't panic! The key is to understand what is behind the non-normal data.

In this series on understanding non-normal data we will look at the four different types of non-normal distributions and their potential causes:

  • Skewness
  • Multiple modes
  • Kurtosis
  • Granularity

In part 1 we go into more detail on what is meant by normal data and skewness. We also explain which potential causes can create skewness, so that you know what to look for.

Understanding normal data

First of all, we have to understand what normal data is. The characteristics of a normal distribution (Gaussian curve) are:

  • The Mean, Median, and Mode are at the same point, exactly in the center
    • The curve is symmetric, dividing the area into two equal halves
    • The total area under the curve is equal to 1.
  • It is a unimodal distribution (only one peak)
  • The curve runs indefinitely in both directions, approaching but never touching the horizontal axis.
  • A process that shows only pure random variation (white noise) will tend to exhibit a normal curve shape.

For the standard normal distribution (Z curve), the following extra characteristic applies:

  • The distribution has a Mean of 0 and a Standard Deviation of 1.

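This means that if one of the characteristics of the normal distribution is not met closely enough, we have non-normal data. The mean of 0 and standard deviation of 1 apply only to the standard normal distribution; normal data can have any mean and standard deviation.

Looking at the histogram of your data, or performing an Anderson-Darling test or a fat pencil test, can give you the necessary insight into whether your data is normal or not.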
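
As a rough illustration of such a check, the sketch below computes the Anderson-Darling statistic A^2 by hand with only the Python standard library. Both samples are simulated and hypothetical, and the function name anderson_darling is chosen for this example. The mean and standard deviation of the fitted normal curve are estimated from the data itself. A larger A^2 points to a worse fit to the normal curve; statistical packages also turn A^2 into a p-value, which this sketch does not attempt.

```python
import math
import random
import statistics


def anderson_darling(data):
    """Return the Anderson-Darling statistic A^2 against a fitted normal curve."""
    n = len(data)
    xs = sorted(data)
    # Fit a normal curve using the sample mean and sample standard deviation
    fitted = statistics.NormalDist(statistics.fmean(xs), statistics.stdev(xs))
    # Keep the probabilities away from 0 and 1 so the logarithms stay finite
    eps = 1e-12
    cdf = [min(max(fitted.cdf(x), eps), 1 - eps) for x in xs]
    total = sum(
        (2 * i + 1) * (math.log(cdf[i]) + math.log(1 - cdf[n - 1 - i]))
        for i in range(n)
    )
    return -n - total / n


random.seed(2)  # fixed seed so the sketch is repeatable

# Hypothetical data: one normal sample and one clearly skewed sample
normal_sample = [random.gauss(100, 5) for _ in range(500)]
skewed_sample = [random.expovariate(0.2) for _ in range(500)]

a2_normal = anderson_darling(normal_sample)
a2_skewed = anderson_darling(skewed_sample)
print(f"A^2 normal sample: {a2_normal:.2f}")
print(f"A^2 skewed sample: {a2_skewed:.2f}")
```

The skewed sample should give a much larger A^2 than the normal sample, which is the same signal you would read from a histogram or a probability plot.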

Types of non-normal distributions

Understanding non-normal data is the key to solving your problem. The first thing you have to do is to understand what type of non-normal data you have and what the potential causes for this distribution could be. Data may follow a non-normal distribution for a variety of reasons, or multiple sources of variation may cause data that would otherwise be normal to appear non-normal.

Skewness

We have seen that normal data is symmetrical. If one of the tails is different from the other one, we speak of skewness. Wikipedia defines skewness as follows: "skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean."

A negative skew means that the tail is longer on the left side of the distribution, while a positive skew means that the tail is longer on the right side. It is not always that simple, though. The skewness value indicates how the two sides balance out around the mean. If you have a long thin tail on one side and a short fat tail on the other side, the value could still be 0. Skewness is easy to read in unimodal distributions, but it becomes harder to interpret with other types of distribution. For some distributions the skewness is even undefined.
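
One common way to put a number on skewness is the moment coefficient: the third central moment divided by the standard deviation cubed. The sketch below is a minimal version of that calculation. The data sets are made up for illustration, and the function name skewness is an assumption of this example, not a reference to a specific tool.

```python
import statistics


def skewness(data):
    """Moment coefficient of skewness: third central moment / sd cubed."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    return m3 / m2 ** 1.5


# Made-up data sets for illustration
right_tail = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]   # long tail on the right
left_tail = [-x for x in right_tail]           # mirror image, tail on the left
symmetric = [1, 2, 3, 4, 5]
# Asymmetric shape whose two sides still balance out around the mean
balanced = [-3] + [-1] * 5 + [2] * 4

for name, data in [
    ("right tail", right_tail),
    ("left tail", left_tail),
    ("symmetric", symmetric),
    ("balanced", balanced),
]:
    print(f"{name}: {skewness(data):+.3f}")
```

The last data set is not symmetric, yet its skewness is 0: a single far-out value on the left is offset by a short, heavy group on the right.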

Causes of skewness: Natural and artificial limits

Natural limits occur when there is a physical limit to the variable you are measuring. This limit is also referred to as an Upper or Lower Bound. Some examples:

  • When drilling a hole in a stone wall, the hole cannot be smaller than the drill bit diameter (Lower Bound = drill bit diameter)
  • When measuring defects on a product, there cannot be fewer than 0 defects (Lower Bound = 0)
  • When measuring the purity of a sample, the purity cannot exceed 100% (Upper Bound = 100%)
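
The defect example can be simulated in a few lines. This is only a sketch: the number of defect opportunities per product and the defect rate are hypothetical values chosen for illustration, and the helper names are made up. Because the counts pile up against the lower bound of 0, the tail can only stretch to the right.

```python
import random
import statistics


def skewness(data):
    """Moment coefficient of skewness: third central moment / sd cubed."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


random.seed(3)  # fixed seed so the sketch is repeatable

OPPORTUNITIES = 50  # hypothetical number of places a defect can occur per product
DEFECT_RATE = 0.02  # hypothetical chance of a defect at each place


def count_defects():
    """Count the defects on one simulated product."""
    return sum(random.random() < DEFECT_RATE for _ in range(OPPORTUNITIES))


defects = [count_defects() for _ in range(5000)]

print(f"lowest defect count: {min(defects)}")
print(f"average defect count: {statistics.fmean(defects):.2f}")
print(f"skewness: {skewness(defects):+.2f}")
```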

The limit does not always have to be a hard one; sometimes the limits are blurrier.

• Sports results are a good example. Let's have a look at the men's 400-meter results at the Olympic Games in London 2012:

It is easier to be a few seconds slower than the main group than to be a few seconds ahead of the main group. In the histogram, it is clear that there is a positive skewness.

Another example is the age at which people die. There is no hard limit, and the average age goes up every year. But it is more difficult to stay healthy and live longer than the main group than it is to die at a younger age due to illness or an accident. This results in a curve with negative skewness.

Artificial limits will have the same effect as natural limits. Some examples:

  • A fisher who sorts out the fish that are too small to sell and throws them back into the sea
  • A car that is driven with the speed limiter switched on
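
The fisher example can be sketched as a cut-off applied to normal data. The fish lengths, the minimum length, and the variable names below are all hypothetical. The full catch is roughly symmetric, while the fish that are kept show positive skewness, because the left side of the curve has been removed.

```python
import random
import statistics


def skewness(data):
    """Moment coefficient of skewness: third central moment / sd cubed."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


random.seed(4)  # fixed seed so the sketch is repeatable

MIN_LENGTH_CM = 28  # hypothetical minimum length the fisher can sell

# Hypothetical catch: lengths are normal with mean 30 cm and standard deviation 5 cm
catch = [random.gauss(30, 5) for _ in range(5000)]

# Artificial limit: fish below the minimum length go back into the sea
kept = [length for length in catch if length >= MIN_LENGTH_CM]

print(f"skewness of the full catch: {skewness(catch):+.2f}")
print(f"skewness of the kept fish:  {skewness(kept):+.2f}")
```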

Causes of skewness: Mixtures & interaction

Another potential cause of skewness is a mixture of two or more normal curves. The figure below illustrates this. While the blue and orange curves both have a normal distribution, the yellow curve, which is the sum of the orange and blue curves, has positive skewness.

The mixtures could come from combining different machines, processes, operators, and so on into one graph.
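
A small simulation shows the effect. It assumes two hypothetical machines, A and B, that each produce normal data around a different mean, with machine A producing most of the parts. All names and numbers are made up for this sketch.

```python
import random
import statistics


def skewness(data):
    """Moment coefficient of skewness: third central moment / sd cubed."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


random.seed(5)  # fixed seed so the sketch is repeatable

# Hypothetical machines: both normal, but centered on different values
machine_a = [random.gauss(10.0, 1.0) for _ in range(8000)]  # 80% of the parts
machine_b = [random.gauss(13.0, 1.0) for _ in range(2000)]  # 20% of the parts

# Putting both machines into one graph creates the mixture
combined = machine_a + machine_b

print(f"machine A: {skewness(machine_a):+.2f}")
print(f"machine B: {skewness(machine_b):+.2f}")
print(f"combined:  {skewness(combined):+.2f}")
```

Each machine on its own stays close to 0, while the combined data shows a clear positive skewness. Splitting the data by machine, process, or operator is therefore a quick way to check for a mixture.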

Interaction has the same effect as seen with mixtures. Interactions are commonly found in the chemical and medical sectors. Interactions occur when two inputs interact with each other to have a larger or smaller impact on Y than either would have by itself.

Let's look at a fictitious example. It is well known in the medical sector that some painkillers cannot be combined because they counteract each other. Imagine that when we give test subjects a specific amount of painkiller "A" or "B", the pain relief they experience follows a normal distribution. People who take only "A" or only "B" fall in the green curve, while people who combine "A" and "B" experience less pain relief on average because of the counteraction. The two curves combined result in the yellow, negatively skewed curve.

Causes of skewness: Non-linear relationship

A non-linear relationship between parameters is also a potential cause of skewness. This is illustrated by the figure below:

While the input X has a normal distribution, the output Y has a non-normal distribution due to the non-linear relationship. This means that if you are measuring and monitoring Y instead of X, you will see a non-normal distribution. Some examples:

  • Radius of a sphere (X) and the volume of the sphere (Y)
  • Years you owned a car (X) and the value of the car (Y)
  • Number of people helping (X) and the time to finish the work (Y)
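
The sphere example is easy to try out. In the sketch below the radius X is drawn from a hypothetical normal distribution, and the volume Y follows from the formula V = 4/3 * pi * r^3. The numbers and variable names are assumptions made for illustration.

```python
import math
import random
import statistics


def skewness(data):
    """Moment coefficient of skewness: third central moment / sd cubed."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


random.seed(6)  # fixed seed so the sketch is repeatable

# Hypothetical input X: radius in mm, normal with mean 10 and standard deviation 1.5
radii = [random.gauss(10.0, 1.5) for _ in range(5000)]

# Output Y: volume of the sphere, a cubic (non-linear) function of the radius
volumes = [4 / 3 * math.pi * r ** 3 for r in radii]

print(f"skewness of the radius X: {skewness(radii):+.2f}")
print(f"skewness of the volume Y: {skewness(volumes):+.2f}")
```

Because the volume grows with the cube of the radius, the larger radii are stretched out more than the smaller ones, which gives the volume a longer tail on the right.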

Causes of skewness: Non-random patterns across time

The last cause of skewness is a non-random pattern across time in a normal distribution. The mechanism is similar to the mechanisms described above.

If a process has a normal distribution over a short time period, and it shows only pure random variation (white noise) over time, so that this variation is also normally distributed, then the total will also be normally distributed. This is shown on the left side of the picture below.

Imagine a manual process that takes longer to complete during weekends because of a lack of motivation and supervision. The process thus follows a non-random pattern. If you combine the normally distributed daily curves, you will get a non-normal total curve, just as we have seen with the mixture and interaction causes.
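
The weekend example can be sketched as follows. The completion times, the size of the weekend shift, and the number of measurements per day are hypothetical values picked for this illustration.

```python
import random
import statistics


def skewness(data):
    """Moment coefficient of skewness: third central moment / sd cubed."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    return m3 / m2 ** 1.5


random.seed(7)  # fixed seed so the sketch is repeatable

WEEKS = 52    # hypothetical observation period
PER_DAY = 20  # hypothetical number of measurements per day

weekday_times, weekend_times = [], []
for week in range(WEEKS):
    for day in range(7):  # 0-4 = Monday to Friday, 5-6 = weekend
        is_weekend = day >= 5
        # Assumed pattern: the process takes 6 minutes longer on weekend days
        mean_minutes = 36.0 if is_weekend else 30.0
        samples = [random.gauss(mean_minutes, 2.0) for _ in range(PER_DAY)]
        (weekend_times if is_weekend else weekday_times).extend(samples)

all_times = weekday_times + weekend_times

print(f"weekdays only: {skewness(weekday_times):+.2f}")
print(f"weekends only: {skewness(weekend_times):+.2f}")
print(f"whole year:    {skewness(all_times):+.2f}")
```

Looked at separately, weekdays and weekends are each close to normal. Only the total over time is skewed, so plotting the data in time order or splitting it by day type helps to reveal the pattern.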

Summary

You do not have to panic if your data is non-normal. In fact, it is sometimes easier to find the root cause of your problem if you have non-normal data. The key is to understand what your data is telling you. If one tail of your curve differs from the other, we speak of skewness. Potential causes are natural and artificial limits, mixtures and interactions, non-linear relationships, and non-random patterns across time.

In part 2 we will continue by examining multiple modes, kurtosis, and granularity.

This article was updated on November 30, 2022