Statistics for Data Science

Statistics is the branch of mathematics that uses quantified models, representations, and summaries to analyze real-world data from experiments and studies. It covers how data is collected, examined, and analyzed, and how conclusions are drawn from it. This tutorial covers percentiles, standard deviation, variance, and correlation. Descriptive statistics summarize important characteristics of a dataset, such as:

  • count
  • total
  • standard deviation
  • percentile
  • average

Computing these summaries is a good starting point for familiarizing yourself with the data.

In Python, you can summarize your data with the describe() method of a pandas DataFrame.
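
As a minimal sketch of the describe() call mentioned above, assuming the data lives in a pandas DataFrame: the column name and the values below are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration only.
df = pd.DataFrame({"duration": [30, 45, 45, 60, 60, 60, 75, 90]})

# describe() returns count, mean, std, min, the 25%/50%/75% percentiles, and max
# for each numeric column.
summary = df.describe()
print(summary)
```

Note that describe() reports the mean rather than the total; the total is available separately through df["duration"].sum().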

Stat Percentiles

In statistics, a percentile describes how a score compares to others within the same group. There is no universal definition of percentile, but it is commonly defined as the value below which a given percentage of the data values fall.

25th, 50th, and 75th percentiles

25th percentile – Also known as the first or lower quartile. It is the value where 25% of the responses are below this value and 75% are above it.

50th percentile – Also known as the median. The median bisects the dataset: half of the responses are below the median, and half are above it.

75th percentile – Also known as the third or upper quartile. It is the value where 75% of the responses are below this value and 25% are above it.

In general, the pth percentile is the value below which p% of the data values fall.
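
The three quartiles can be computed with NumPy's percentile() function. This is a sketch on hypothetical scores invented for illustration. NumPy interpolates linearly between data points by default, which is only one of several percentile definitions in use.

```python
import numpy as np

# Hypothetical scores, invented for illustration only.
scores = np.array([10, 20, 30, 40, 50, 60, 70, 80, 90])

# Request the 25th, 50th, and 75th percentiles in one call.
q1, median, q3 = np.percentile(scores, [25, 50, 75])

print(q1, median, q3)  # 30.0 50.0 70.0
```

Two of the nine scores lie below 30 and six lie above it, so the split is only approximately 25% and 75%. On small datasets the percentages rarely work out exactly.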

Stat Standard Deviation

The standard deviation is a number that describes how spread out the observations are.
The more "scattered" the observations are, the harder it is for a mathematical function to predict exact values. In that sense, the standard deviation is a measure of uncertainty.

The standard deviation is the square root of the variance. Because it is expressed in the same units as the data, it is easier to interpret than the variance: it shows the typical distance of the observations from the mean.

A low standard deviation means that most values are close to the mean (average).

A high standard deviation means that the values are spread out over a wider range.

You can use NumPy's std() function to find the standard deviation of a variable. The Greek letter sigma (σ) often represents the standard deviation.
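
A short sketch of the std() call, using hypothetical values invented for illustration. By default, NumPy computes the population standard deviation (dividing by n). Passing ddof=1 gives the sample standard deviation (dividing by n - 1).

```python
import numpy as np

# Hypothetical values, invented for illustration only. Their mean is 5.
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Population standard deviation (default, ddof=0).
population_std = np.std(values)

# Sample standard deviation (ddof=1), slightly larger for the same data.
sample_std = np.std(values, ddof=1)

print(population_std)  # 2.0
print(sample_std)
```

The std row of pandas' describe() uses the sample version (ddof=1), so it can differ slightly from np.std() on the same data.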

Stat Variance

Variance is another number that indicates how spread out the values are. In fact, taking the square root of the variance gives you the standard deviation. Conversely, multiplying the standard deviation by itself gives the variance.

Variance is the average of the squared deviations of a variable from its mean. It measures how far the data in a set are spread around the mean. A small variance indicates that the data are clustered close to the mean, and a large variance indicates that the data are far from the mean.
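
The relationship between variance and standard deviation can be checked in a few lines. This sketch uses hypothetical values invented for illustration and the population formulas (dividing by n), which are NumPy's defaults.

```python
import numpy as np

# Hypothetical values, invented for illustration only.
values = np.array([2, 4, 4, 4, 5, 5, 7, 9])

# Variance by hand: the mean of the squared deviations from the mean.
mean = values.mean()
manual_variance = np.mean((values - mean) ** 2)

# The same quantity using NumPy's built-in functions.
variance = np.var(values)
std = np.std(values)

print(manual_variance, variance)  # 4.0 4.0
print(std)                        # 2.0
```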

Stat Correlation

Correlation measures the relationship between two variables. Recall that the purpose of a function is to transform an input (x) into an output (f(x)) in order to predict a value. In other words, a function uses the relationship between two variables to make a prediction.

Correlation coefficient
A correlation coefficient measures the strength and direction of the linear relationship between two variables.

The correlation coefficient is never less than -1 or greater than 1.

1 = perfect positive linear relationship between variables
0 = no linear relationship between variables
-1 = perfect negative linear relationship between variables
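
The three reference values can be reproduced with NumPy's corrcoef() function, which computes the Pearson correlation coefficient. The arrays below are hypothetical and were constructed so that each case comes out exactly.

```python
import numpy as np

# Hypothetical data, constructed for illustration only.
x = np.array([1, 2, 3, 4, 5])
y_positive = 2 * x + 1               # rises in a straight line with x
y_negative = -3 * x + 10             # falls in a straight line with x
y_none = np.array([1, 2, 0, 2, 1])   # symmetric around the middle of x

# corrcoef() returns a 2x2 matrix; the off-diagonal entry is the coefficient.
r_positive = np.corrcoef(x, y_positive)[0, 1]
r_negative = np.corrcoef(x, y_negative)[0, 1]
r_none = np.corrcoef(x, y_none)[0, 1]

print(r_positive, r_negative, r_none)  # approximately 1, -1, and 0
```

A coefficient of 0 only rules out a linear relationship. y_none still follows a clear pattern in x, just not a straight line.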

Statistical correlation matrix

A matrix is an array of numbers arranged in rows and columns.
A correlation matrix is a simple table that shows the correlation coefficients between variables.
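
As a sketch, a correlation matrix can be produced with the corr() method of a pandas DataFrame, which computes Pearson coefficients by default. The column names and values below are hypothetical.

```python
import pandas as pd

# Hypothetical data, invented for illustration only.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "hours_of_tv": [5, 4, 4, 2, 1],
})

# Each cell holds the correlation coefficient between the row and column variables.
corr_matrix = df.corr()
print(corr_matrix)
```

The diagonal is always 1, because every variable is perfectly correlated with itself, and the table is symmetric around that diagonal.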

Statistical correlation and causation
Correlation measures the numerical relationship between two variables.
A high correlation coefficient (close to 1) does not mean that you can confidently conclude that one variable causes the other.
There is an important difference between correlation and causation.

Correlation is a number that measures how closely related the data are.
Causation is the conclusion that x causes y.

A typical example:

Ice cream sales at the beach go up in summer.
At the same time, drowning accidents also increase.
Does this mean that increased ice cream sales are directly responsible for the increase in drownings? In other words:
Can ice cream sales predict drowning accidents?

The answer is: probably not.

These two variables are probably correlated only incidentally: both rise in summer, but neither causes the other.
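
The ice cream example can be simulated. This sketch uses entirely synthetic data and assumes a hypothetical common driver, temperature, that raises both series. Neither series depends on the other, yet they are strongly correlated until the temperature effect is removed.

```python
import numpy as np

# Synthetic data, invented for illustration only. Temperature is an assumed
# common driver; neither series is computed from the other.
days = np.arange(50)
temperature = np.linspace(15, 35, 50)
ice_cream_sales = 10 * temperature + 5 * np.sin(days)
drownings = 0.3 * temperature + 0.5 * np.cos(1.7 * days)

# Raw correlation between the two series.
raw_r = np.corrcoef(ice_cream_sales, drownings)[0, 1]


def residuals(y, x):
    """Return what is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


# Correlation after removing the temperature effect from both series.
adjusted_r = np.corrcoef(
    residuals(ice_cream_sales, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(raw_r)       # close to 1
print(adjusted_r)  # much closer to 0
```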
 
