Introduction to Statistics – Mindofweb / Mickael Salabi

W1 Overview

Qualitative vs Quantitative data
- Quantitative is measured / Numerical
- Qualitative is only descriptive / Categorical
Population vs Sample (a sample is a subset of the population; sample size is the number of observations in it)
Mean ~ average
Median ~ middle number of the sorted series
- (if the count is even, the median is the mean/average of the 2 middle numbers)
Mode ~ most common number in the series
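Quick check of the three measures above, as a minimal Python sketch using the standard library `statistics` module. The sample values are made up for illustration and are not from the course.

```python
import statistics

# Hypothetical sample (illustrative only), with an even number of values.
values = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(values)      # sum / count = 30 / 6
median = statistics.median(values)  # even count: average of the two middle values (3 and 5)
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```

Note how the single large value (10) pulls the mean above the median.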

Range is the difference between the lowest and highest values in a distribution
Dispersion is important to get a full understanding of the data
Formulas to calculate standard deviation
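Empirical rule helps explain the spread in a normal distribution: about 68%, 95% and 99.7% of values fall within 1, 2 and 3 standard deviations of the mean
Variance is expressed in squared units, while standard deviation is expressed in the same units as the data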
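The dispersion notes above can be sketched in Python as follows. The data are hypothetical, the population formula divides by n and the sample formula by n - 1, and the empirical rule is checked against a standard normal curve via `statistics.NormalDist` (Python 3.8+).

```python
import math
from statistics import NormalDist

# Hypothetical measurements (illustrative only), e.g. in centimetres.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mean = sum(data) / n

# Sum of squared deviations from the mean.
ss = sum((x - mean) ** 2 for x in data)

population_variance = ss / n                    # units: cm squared
sample_variance = ss / (n - 1)                  # sample formula divides by n - 1
population_sd = math.sqrt(population_variance)  # units: cm again
sample_sd = math.sqrt(sample_variance)

# Empirical rule: share of a normal distribution within k standard deviations.
standard_normal = NormalDist(mu=0, sigma=1)
within = {k: standard_normal.cdf(k) - standard_normal.cdf(-k) for k in (1, 2, 3)}

print(population_variance, population_sd)
print(round(sample_variance, 3), round(sample_sd, 3))
print({k: round(p, 4) for k, p in within.items()})
```

The square root is what brings the standard deviation back to the original unit of measurement.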

Quartiles split the ordered distribution into four equal parts; Q1 and Q3 are the medians of the lower and upper halves
- Q1 is the 25th percentile, Q2 (the median) the 50th percentile, Q3 the 75th percentile
IQR or Interquartile Range is the range between the two quartiles around the median value: IQR = Q3 – Q1
Upper fence = Q3 + (1.5 * IQR)
Lower fence = Q1 – (1.5 * IQR)
- Also referred to as the lower/upper inner fences
- lower/upper outer fences use the same formulas with 3 instead of 1.5
Outliers lie beyond the inner fences; extreme values lie beyond the outer fences
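A small Python sketch of the quartile, IQR and fence calculations above. The data are hypothetical, and `method="inclusive"` is only one of several quartile conventions, so other tools may report slightly different cut points.

```python
import statistics

# Hypothetical data containing one mild and one extreme high value.
data = [1, 2, 3, 4, 5, 6, 7, 15, 50]

# "inclusive" treats the data as the whole population when interpolating.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Inner fences use 1.5 * IQR, outer fences use 3 * IQR.
lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr

outliers = [x for x in data if x < lower_inner or x > upper_inner]
extremes = [x for x in data if x < lower_outer or x > upper_outer]

print(q1, q2, q3, iqr)
print(outliers, extremes)
```

Here 15 falls outside the inner fence only, while 50 is beyond the outer fence as well.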

Pearson correlation is the measure of the strength and direction of the linear relationship between two variables (only meaningful when the relationship is linear)
0 indicates there is no linear correlation. The coefficient ranges between -1 and +1; the greater the absolute value, the stronger the linear relationship. A positive coefficient indicates the 2 variables move in the same direction (as one increases, the other increases as well), while a negative coefficient indicates opposite directions
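A minimal sketch of the Pearson coefficient computed from its definition (co-variation divided by the product of the individual variations). The function name `pearson_r` and the data are assumptions made for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of x and y, and the variation of each on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_flipped = pearson_r(x, [-v for v in y])  # same strength, opposite direction
print(round(r, 4), round(r_flipped, 4))
```

Flipping the sign of one variable changes the direction of the relationship but not its strength.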

Cause & Effect: causation indicates that one event is the result of the occurrence of the other event, i.e. there is a causal relationship between the two events.
Spurious relationships
Lurking and confounding variables can make it difficult to conclude that it was the explanatory variables alone that affected the observed changes in the response variable.

Two-dimensional graph (scatterplot) of Y vs X. The relationship is linear when the points fall approximately along a straight line. To be more precise, the line of best fit must be calculated.
- y = mx + b
- slope (m) and y-intercept(b) are the two values needed
  - b = mean(y) – m * mean(x)
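The line of best fit above can be sketched in Python. The slope formula used here (co-variation of x and y divided by the variation of x) is the usual least-squares result and is not spelled out in these notes; `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return slope m and y-intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # slope
    b = mean_y - m * mean_x  # intercept: b = mean(y) - m * mean(x)
    return m, b

# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

m, b = fit_line(x, y)
print(round(m, 3), round(b, 3))
```

Because of the intercept formula, the fitted line always passes through the point (mean(x), mean(y)).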

Least squares
Residual = Observed Value – Fitted Value
Linear regression is an approach to modeling the linear relationship between a target variable and one or more explanatory variables. It should produce residuals that have a mean of zero, have a constant variance, and are not correlated with themselves or other variables. If these assumptions are true, then the ordinary least squares regression procedure will create the best possible estimates.
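To illustrate the zero-mean property of residuals, here is a short sketch that fits a simple least-squares line to hypothetical data and inspects the residuals. It uses the common convention residual = observed minus fitted; the mean is zero under either sign convention.

```python
# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares for one explanatory variable.
m = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
b = mean_y - m * mean_x

fitted = [m * xi + b for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # observed minus fitted

mean_residual = sum(residuals) / n    # zero up to floating-point rounding
sse = sum(e ** 2 for e in residuals)  # the quantity least squares minimises

print([round(e, 2) for e in residuals])
print(round(mean_residual, 10), round(sse, 4))
```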
Simple linear regression has only one explanatory variable
The F-test indicates whether the model as a whole has statistically significant predictive capability.
When you “test the null hypothesis”, you assess how likely the observed data would be if there were no relationship between the explanatory variables and the target variable.

Check slides (open source R commands, null/alternative hypothesis, F-Test/F-statistic, significance of individual variables, R-squared)
- Y-Intercept – The value Y is predicted to have when all the explanatory variables are equal to zero
F-statistic
- It is the ratio of the mean regression sum of squares to the mean error sum of squares.
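A worked sketch of the F-statistic and R-squared for a simple linear regression on hypothetical data. It assumes the usual degrees of freedom: k for the regression (number of explanatory variables) and n - k - 1 for the error.

```python
# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
k = 1  # number of explanatory variables

mean_x = sum(x) / n
mean_y = sum(y) / n
m = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
    (xi - mean_x) ** 2 for xi in x
)
b = mean_y - m * mean_x
fitted = [m * xi + b for xi in x]

sst = sum((yi - mean_y) ** 2 for yi in y)               # total sum of squares
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # error sum of squares
ssr = sst - sse                                         # regression sum of squares

msr = ssr / k            # mean regression sum of squares
mse = sse / (n - k - 1)  # mean error sum of squares
f_statistic = msr / mse
r_squared = ssr / sst    # share of the variation in y explained by x

print(round(f_statistic, 3), round(r_squared, 3))
```

The F-statistic still has to be compared with an F distribution (k and n - k - 1 degrees of freedom) to obtain a p-value; that step is left to a statistics package.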
Assumption 1 Linear relationship
Assumption 2 No or low multicollinearity
- Multicollinearity occurs when the explanatory variables are highly correlated with each other.
Assumption 3 No autocorrelation
- Linear regression analysis requires that there is little or no “autocorrelation” in the residuals.
Assumption 4 Homoscedasticity
- the variance of the residuals is constant across the regression line
- You can check this assumption by plotting the residuals against the fitted values. Heteroscedasticity appears as a cone shape where the spread of the residuals increases in one direction.
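A rough numeric stand-in for the residual plot, assuming hypothetical fitted values and residuals: split the residuals into a low-fitted half and a high-fitted half and compare their variances. This is only a screening sketch, not a formal test.

```python
from statistics import pvariance

# Hypothetical fitted values and residuals whose spread grows with the fit.
fitted = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.1, 0.2, -0.2, 1.0, -1.2, 1.5, -1.8]

# Order the residuals by fitted value and split them into two halves.
pairs = sorted(zip(fitted, residuals))
half = len(pairs) // 2
low_half = [r for _, r in pairs[:half]]
high_half = [r for _, r in pairs[half:]]

# A ratio far from 1 hints at the cone shape (heteroscedasticity).
spread_ratio = pvariance(high_half) / pvariance(low_half)
print(round(spread_ratio, 1))
```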
Q&A – The control helps eliminate “confounding” factors.

P(B and A) = P(A) * P(B|A)
Example: probability of drawing two aces in a row from a pack of cards (without replacement), i.e. drawing an ace and then another ace given that an ace has already been picked out
- 4/52 * 3/51 = 1/221 ≈ 0.45%
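The arithmetic above can be checked exactly with Python's `fractions` module. The variable names are illustrative.

```python
from fractions import Fraction

p_first_ace = Fraction(4, 52)               # P(A): first card is an ace
p_second_ace_given_first = Fraction(3, 51)  # P(B|A): 3 aces left among 51 cards

# Multiplication rule: P(B and A) = P(A) * P(B|A)
p_two_aces = p_first_ace * p_second_ace_given_first

print(p_two_aces, f"{float(p_two_aces):.2%}")
```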

P(A|B) = P(B|A)*P(A) / P(B)
Bayes' theorem: calculate the updated probability of an event A once new evidence B is known
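A small worked example of the formula above, using a standard 52-card deck (an illustration chosen here, not taken from the course): A = "the card is a king", B = "the card is a face card (jack, queen or king)".

```python
from fractions import Fraction

def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)"""
    return p_b_given_a * p_a / p_b

p_king = Fraction(4, 52)       # P(A)
p_face = Fraction(12, 52)      # P(B): 4 jacks + 4 queens + 4 kings
p_face_given_king = Fraction(1)  # P(B|A): every king is a face card

p_king_given_face = bayes(p_face_given_king, p_king, p_face)
print(p_king_given_face)
```

Learning that the card is a face card raises the probability of a king from 1/13 to 1/3.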

Summarize, Compare, Forecast data
Summarize – 4 major types of descriptive statistics
- Measures of frequency, central tendency, dispersion or variation, and position
Compare – two means or two distributions
Forecast – use historical data to predict the future
- Linear regression allows you to understand how much of the change in one variable can be explained by the change in another.
Test claims and hypotheses
- Hypothesis testing allows you to test a claim about a population using sample data, with an assigned level of confidence.
Check probabilities – deal with uncertainty
- Describe the likelihood of each of the possible events

Percentages can be misleading
- Need to know the exact underlying numbers before accepting a percentage
Think about the other KPIs
Determine the research method (control vs variation)
3D pie charts are a poor way of displaying data because the perspective makes the front slices look larger
Bar charts can be confusing when the value axis does not start at zero
Bubble charts can be confusing when the radius sizes are disproportionate to the values
Changing the denominator between figures makes comparisons incorrect
Get used to saying “it depends” when verifying the veracity of claims
Many statistics have been calculated from different populations
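The bar chart and bubble chart points above can be made concrete with a little arithmetic. The numbers and helper names below are hypothetical.

```python
import math

def bar_ratio(low, high, baseline):
    """Ratio of drawn bar lengths when the axis starts at `baseline`."""
    return (high - baseline) / (low - baseline)

# Two values that differ by only 10%.
a, b = 50, 55
honest = bar_ratio(a, b, baseline=0)      # bars drawn from zero
truncated = bar_ratio(a, b, baseline=45)  # axis cut off at 45

def bubble_area(radius):
    return math.pi * radius ** 2

# If the radius is scaled by the value, doubling the value quadruples the area.
area_ratio = bubble_area(2) / bubble_area(1)

print(round(honest, 2), round(truncated, 2), round(area_ratio, 2))
```

With the truncated axis the second bar looks twice as long even though the value is only 10% higher.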

MS Excel, R, Python
SAP HANA, Predictive Analytics, Analytics Cloud, Data Intelligence
Python 2.0 in 2000, Python 3.0 in 2008, not backward compatible
SAC (SAP Analytics Cloud) is a one-stop solution for statistical development
- integrated analytics as a service
SAP HANA Predictive Analysis Library (PAL)
- Application function library (AFL) with functions that can be called from within SAP HANA SQLScript
PAL includes predictive analysis algorithms in the following data mining categories
- Clustering, Classification, Regression, Association, Time Series, Data Pre-processing, Statistics, Social Network Analysis
SAP Data Intelligence is a cloud solution focusing on
- Developing artificial intelligence projects
- Extracting value from distributed data sources
- Using open-source technology (R, Python, TensorFlow)

Distribution curves graph the frequency of variables such as height, weight, test scores, etc.
Distribution curves can be used to compare the means of two or more variables to check the significance of their differences
Linear Regression
- Numerically calculate how much of the change in one variable’s value is explained by the second variable
- Identify a statistical relationship between two variables, which may (or may not) represent a predictive relationship
- Build a scatterplot to visually evaluate whether a relationship between the variables exists

src: openSAP course / introduction to Statistics