Understanding Heteroscedasticity in Regression Analysis

In regression analysis, heteroscedasticity (sometimes spelled heteroskedasticity) refers to the unequal scatter of residuals or error terms. Specifically, it refers to the case where there is a systematic change in the spread of the residuals over the range of measured values.

Heteroscedasticity is a problem because ordinary least squares (OLS) regression assumes that the residuals come from a population with homoscedasticity, meaning they have constant variance.

When heteroscedasticity is present in a regression analysis, the results of the analysis become hard to trust. Specifically, heteroscedasticity increases the variance of the regression coefficient estimates, but the regression model doesn’t pick up on this.

This makes it much more likely for a regression model to declare that a term in the model is statistically significant when in fact it is not.
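This inflation of false positives can be illustrated with a small Monte Carlo simulation. The sketch below is my own illustration, not part of the original tutorial: the true slope is zero, but the error standard deviation grows with the predictor, and we count how often a naive OLS t-test at the 5% level wrongly calls the slope significant.

```python
import numpy as np

rng = np.random.default_rng(0)

# True slope is 0, but error std grows with x (here proportional to x^2),
# so the data are heteroscedastic.  A naive OLS t-test assumes constant
# variance and should reject at ~5% -- we check how often it really rejects.
n, reps = 50, 2000
x = np.linspace(1, 10, n)
X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)

false_positives = 0
for _ in range(reps):
    y = rng.normal(0, x**2)              # true slope 0; heteroscedastic noise
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - 2)     # naive constant-variance estimate
    se_slope = np.sqrt(sigma2 * XtX_inv[1, 1])
    t_stat = beta[1] / se_slope
    false_positives += abs(t_stat) > 2.01  # ~5% two-sided cutoff for 48 df

rate = false_positives / reps  # noticeably above the nominal 0.05
```

Under homoscedastic errors this rejection rate would hover near 5%; with the variance growing in x, the OLS standard error understates the true uncertainty and the rate climbs well above that.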

This tutorial will explain how to detect heteroscedasticity and what causes it. It also explains possible solutions to the problem.

How to Detect Heteroscedasticity

The simplest way to detect heteroscedasticity is with a fitted value vs. residual plot.

Once you have fit a regression line to a dataset, you can create a scatterplot that shows the model's fitted values on one axis and the corresponding residuals on the other.

The scatterplot below shows a typical fitted value vs. residual plot in which heteroscedasticity is present.

As the fitted values increase, the residuals get more dispersed. This “cone” shape is a telltale sign of heteroscedasticity.
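The same "cone" pattern can also be checked numerically. The following sketch (the simulated data and the half-split comparison are my own assumptions for illustration) fits an OLS line and compares the residual spread in the lower and upper halves of the fitted values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data whose noise grows with x -- a classic heteroscedastic setup.
n = 2000
x = rng.uniform(1, 10, n)
y = 2.0 * x + rng.normal(0, 0.5 * x)   # error std is proportional to x

# Fit a simple OLS line and compute residuals.
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted

# Compare residual spread in the lower vs. upper half of fitted values:
# a ratio well above 1 is the numeric analogue of the "cone" on the plot.
median_fit = np.median(fitted)
spread_low = residuals[fitted <= median_fit].std()
spread_high = residuals[fitted > median_fit].std()
```

For homoscedastic data the ratio `spread_high / spread_low` would sit near 1; here it is substantially larger, mirroring the widening cone.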

What Causes Heteroscedasticity?

Heteroscedasticity occurs naturally in datasets where there is a large range of observed data values. For example:

  • Consider a dataset that includes the annual income and expenses of 100,000 people across the United States. For individuals with lower incomes, there will be lower variability in the corresponding expenses since these individuals likely only have enough money to pay for the necessities. For individuals with higher incomes, there will be higher variability in the corresponding expenses since these individuals have more money to spend if they choose to. Some higher-income individuals will choose to spend most of their income, while some may choose to be frugal and only spend a portion of their income, which is why the variability in expenses among these higher-income individuals will inherently be higher.
  • Consider a dataset that includes the populations and the count of flower shops in 1,000 different cities across the United States. Cities with small populations may have only one or two flower shops. But in cities with larger populations, there will be much greater variability in the number of flower shops; these cities might have anywhere between 10 and 100 shops. This means that when we perform a regression analysis and use population to predict the number of flower shops, there will inherently be greater variability in the residuals for the cities with higher populations.

Datasets with a wide range of observed values, like these, are simply more prone to heteroscedasticity than others.

How to Fix Heteroscedasticity

There are three common ways to fix heteroscedasticity:

1. Transform the dependent variable

One way to fix heteroscedasticity is to transform the dependent variable in some way. The log of the dependent variable is one common transformation.

For example, if we are using population size (independent variable) to predict the number of flower shops in a city (dependent variable), we may instead try to use population size to predict the log of the number of flower shops in a city.
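When the noise in the data is multiplicative, taking the log of the dependent variable tends to stabilize the residual variance. The sketch below is my own illustration under that assumption (simulated data, not the article's flower-shop dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data with multiplicative noise: the raw residual spread
# grows with the predictor, but log(y) has roughly constant noise.
x = rng.uniform(1, 10, 2000)
y = 5.0 * x * rng.lognormal(0, 0.3, x.size)

def spread_ratio(x, y):
    """Residual std in the upper vs. lower half of fitted values."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = slope * x + intercept
    resid = y - fitted
    cut = np.median(fitted)
    return resid[fitted > cut].std() / resid[fitted <= cut].std()

raw_ratio = spread_ratio(x, y)          # well above 1: heteroscedastic
log_ratio = spread_ratio(x, np.log(y))  # much closer to 1 after the log
```

The ratio for the raw response is far from 1, while the log-transformed response gives a ratio near 1, which is exactly what the transformation is meant to achieve.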

Using the log of the dependent variable, rather than the original dependent variable, often reduces or eliminates the heteroskedasticity.

2. Redefine the dependent variable

Another way to fix heteroscedasticity is to redefine the dependent variable. A common approach is to use a rate instead of the raw value.

For example, instead of using the population size to predict the number of flower shops in a city, we may instead use population size to predict the number of flower shops per capita.

In most cases, this reduces the variability that naturally occurs among larger populations since we’re measuring the number of flower shops per person, rather than the raw count of flower shops.
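A quick simulation shows the effect. This is a hypothetical version of the flower-shop example that I am adding for illustration (the Poisson noise model, the 1-shop-per-1,000-people rate, and the half-split comparison are all my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical flower-shop data: counts are roughly Poisson around
# rate * population, so the count noise grows with city size.
n = 2000
population = rng.uniform(1_000, 100_000, n)
rate = 0.001                                 # assumed: ~1 shop per 1,000 people
shops = rng.poisson(rate * population).astype(float)

def spread_ratio(x, y):
    """Residual std among the larger-x half vs. the smaller-x half."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    cut = np.median(x)
    return resid[x > cut].std() / resid[x <= cut].std()

raw_ratio = spread_ratio(population, shops)                  # well above 1
capita_ratio = spread_ratio(population, shops / population)  # no longer growing
```

Predicting the raw count gives residuals that spread out for big cities, while predicting shops per capita removes that growth in spread.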

3. Use weighted regression

Another way to fix heteroscedasticity is to use weighted regression, which assigns a weight to each data point based on the variance of its error term.

In practice, this gives smaller weights to data points with higher error variances, which shrinks their influence on the fit. When the weights match the true error variances, weighted regression resolves the heteroscedasticity problem.
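Weighted least squares can be sketched in closed form. In the example below (my own illustration; the data and the assumption that the error variance is proportional to x squared are hypothetical), each point is weighted by the reciprocal of its assumed error variance:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: error std proportional to x, so Var(e_i) ∝ x_i^2.
x = rng.uniform(1, 10, 500)
y = 3.0 + 2.0 * x + rng.normal(0, 0.8 * x)

# Weighted least squares: weight each point by 1 / (its error variance).
# Here we assume the analyst knows (or has estimated) Var(e_i) ∝ x_i^2.
w = 1.0 / x**2

X = np.column_stack([np.ones_like(x), x])
W = np.diag(w)

# Closed-form WLS estimate: beta = (X' W X)^{-1} X' W y
beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
intercept, slope = beta  # should land near the true values 3.0 and 2.0
```

Down-weighting the noisy high-x points lets the estimator recover the true coefficients more precisely than plain OLS would under the same noise.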

Conclusion

Heteroscedasticity is a fairly common problem when it comes to regression analysis because so many datasets are inherently prone to non-constant variance.

It can be quite easy to spot heteroscedasticity using a fitted values vs. residuals plot.

The problem of heteroscedasticity is often eliminated by transforming the dependent variable, redefining it, or using weighted regression.
