Data Science

Linear Regression for Data Science

The term regression is used when trying to find relationships between variables.

Linear regression finds the equation that minimizes the difference between all observed values and their fitted values. More specifically, it finds the coefficients that give the smallest possible sum of squared residuals for the dataset.
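As a minimal sketch of this idea, the snippet below fits a straight line with NumPy's np.polyfit and checks that nudging the line in either direction only increases the sum of squared residuals. The data values and the helper name sum_sq are invented for illustration; they do not come from any real data set.

```python
import numpy as np

# Hypothetical observations, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Fit y = slope * x + intercept by least squares (degree-1 polynomial).
slope, intercept = np.polyfit(x, y, deg=1)

fitted = slope * x + intercept
residuals = y - fitted
ssr = float(np.sum(residuals ** 2))


def sum_sq(a, b):
    """Sum of squared residuals for the candidate line y = a * x + b."""
    return float(np.sum((y - (a * x + b)) ** 2))


# Any other line should give a larger sum of squared residuals.
print(ssr <= sum_sq(slope + 0.1, intercept))
print(ssr <= sum_sq(slope, intercept - 0.2))
```

With an intercept in the model, the residuals of the fitted line also sum to (numerically) zero, which is a quick sanity check on any least squares fit.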

Machine learning and statistical modeling use this relationship to predict the outcome of an event.
Least squares: Linear regression uses the least squares method. The idea is to draw the line that lies as close as possible to all plotted data points, that is, the line that minimizes the sum of the squared vertical distances between the points and the line.

Each of these distances is called a "residual" or "error".

DS Regression Table

The output of linear regression can be summarized in a regression table.

The contents of the table are:
Information about the model
Coefficients of the linear regression function
Regression statistics
Statistics of the coefficients from the linear regression function

Statisticians say that a regression model fits the data well when the differences between the observed and predicted values are small and unbiased. In this context, unbiased means that the fitted values are neither systematically too high nor too low anywhere in the observation space.

However, you should evaluate residual plots before evaluating numerical goodness-of-fit measures such as R-squared. A residual plot can reveal a biased model much more effectively than numerical output because it shows problematic patterns in the residuals. If the model is biased, the results are unreliable. If the residual plot looks good, go on to evaluate R-squared and other statistics.
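The kind of pattern a residual plot exposes can also be seen numerically. The sketch below is an illustration only: it fits a straight line to deliberately curved, made-up data and compares the average residual in the left, middle, and right thirds of the x range. A plot of residuals against x would show the same thing as a U shape.

```python
import numpy as np

# Hypothetical curved data: y depends on x squared, invented for illustration.
x = np.linspace(0.0, 10.0, 21)
y = x ** 2

# Fit a straight line even though the true relationship is curved.
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Average residual in the left, middle, and right third of the x range.
left, middle, right = np.array_split(residuals, 3)
print(left.mean(), middle.mean(), right.mean())

# The overall mean residual is still about zero, so it hides the bias.
print(residuals.mean())
```

The line is too low at both ends and too high in the middle, which is exactly the "systematically too high or too low" behavior that marks a biased model, even though the residuals average out to zero overall.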

DS Regression Info

Dep. Variable: Abbreviation for "dependent variable", the variable the model tries to predict.
Model: OLS stands for Ordinary Least Squares, the type of model that uses the least squares method.
Date and Time: The date and time at which the output was calculated in Python.

DS Regression Coefficients

Coef is an abbreviation for coefficient. The coefficients are the output of the linear regression function: the estimated intercept and slope values.

DS Regression P-Value

We want to test whether a coefficient of the linear regression function has a significant effect on the dependent variable.

There are four components that describe the coefficient statistics:
std err is the standard error of the coefficient.
t is the "t value" of the coefficient.
P>|t| is called the "P-value".
[0.025 0.975] is the confidence interval of the coefficient.

The P-value tests whether the true value of the coefficient is equal to zero (no effect). A statistical test for this is called a hypothesis test.
A low P-value (< 0.05) means that the coefficient is likely not equal to zero.
A high P-value (> 0.05) means that we cannot conclude that the explanatory variable influences the dependent variable.
A high P-value is also known as a non-significant P-value.
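To make the std err and t columns less abstract, here is a rough sketch of how they can be computed by hand for a model with one explanatory variable, using the textbook formulas for simple linear regression. The data are invented, and the sketch stops at the t value: turning it into a P-value requires the t distribution with n - 2 degrees of freedom, which a statistics library normally handles.

```python
import numpy as np

# Hypothetical observations, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])

n = len(x)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Residual variance with n - 2 degrees of freedom (two estimated parameters).
s2 = np.sum(residuals ** 2) / (n - 2)
sxx = np.sum((x - x.mean()) ** 2)

# Standard errors of the slope and the intercept.
se_slope = np.sqrt(s2 / sxx)
se_intercept = np.sqrt(s2 * (1.0 / n + x.mean() ** 2 / sxx))

# t value = estimated coefficient divided by its standard error.
t_slope = slope / se_slope
t_intercept = intercept / se_intercept
print(t_slope, t_intercept)
```

A t value far from zero corresponds to a low P-value; a t value close to zero corresponds to a high one.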

Hypothesis test

Hypothesis testing is a statistical technique for checking whether a result is statistically valid.
A hypothesis test has two statements: the null hypothesis and the alternative hypothesis.

The null hypothesis can be abbreviated and written as H0.
The alternative hypothesis can be abbreviated and written as HA.

Mathematically, the tests for a coefficient and for the intercept can be written as:

H0: coefficient = 0
HA: coefficient ≠ 0
H0: intercept = 0
HA: intercept ≠ 0

DS Regression R-Squared

R-squared and adjusted R-squared describe how well the linear regression model fits the data points.
R-Squared is a goodness-of-fit measure for linear regression models. This statistic shows the percentage of the variance of the dependent variable that the independent variables together explain. R-Squared measures the strength of the relationship between the model and the dependent variable on a convenient 0-100% scale.

After fitting a linear regression model, we need to determine how well the model fits the data. Does it explain the changes in the dependent variable well? There are several important fit statistics in regression analysis. Here we examine R-squared (R2), highlight some of its limitations, and point out some surprises. For example, small R-squared values are not always a problem, and high R-squared values are not always good.

R-Squared values are always between 0 and 1 (0% to 100%).

A high R-Squared value means that many data points are close to the line of the linear regression function. A low R-squared value means that the linear regression function line does not fit the data well. 
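As an illustration, R-squared can be computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below uses invented data and the standard adjusted R-squared formula with k explanatory variables; the variable names are chosen for this example only.

```python
import numpy as np

# Hypothetical observations, invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

slope, intercept = np.polyfit(x, y, deg=1)
fitted = slope * x + intercept

# Variation left unexplained by the model vs. total variation in y.
ss_res = np.sum((y - fitted) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# Adjusted R-squared penalizes extra explanatory variables (k = 1 here).
n, k = len(x), 1
adj_r_squared = 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)
print(r_squared, adj_r_squared)
```

For a model with a single explanatory variable, R-squared equals the squared correlation between x and y, which gives a simple way to cross-check the result.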

DS Linear Regression Case

To apply linear regression, follow these steps:

Import the statsmodels.formula.api module as smf. Statsmodels is a Python statistical library.
Create a model based on ordinary least squares using smf.ols(). Note that the formula must be written first inside the parentheses, with the dependent variable to the left of ~ and the explanatory variables to the right. Pass the full data set as the data argument.
Call .fit() on the model and store the returned object in a variable, conventionally named results.
Call summary() on the results to get a table with the linear regression results. It contains a lot of information about the regression model.
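The steps above can be sketched as follows. The data set here is synthetic and the column names duration and calories are hypothetical, chosen only so the example runs on its own; with real data you would load your own DataFrame instead.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data set: column names and values are invented for this sketch.
rng = np.random.default_rng(0)
duration = np.arange(30, 130, 5)  # 20 observations
calories = 5.0 * duration + 20.0 + rng.normal(0.0, 15.0, size=duration.size)
data = pd.DataFrame({"duration": duration, "calories": calories})

# Formula first: dependent variable left of "~", explanatory variable right.
model = smf.ols("calories ~ duration", data=data)
results = model.fit()

# Full regression table, followed by a few individual values from it.
print(results.summary())
print(results.params)    # coef column: Intercept and duration
print(results.pvalues)   # P>|t| column
print(results.rsquared)  # R-squared
```

The printed table contains the sections described earlier: model information at the top, the coefficient rows with std err, t, P>|t| and the confidence interval in the middle, and R-squared in the upper right.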
 
