# Introduction to Statistics

## W1 Overview

• Qualitative vs Quantitative data
• Quantitative is measured / Numerical
• Qualitative is only descriptive / Categorical
• Population ≠ Sample: the population is the whole group of interest, the sample is the subset that is actually measured
• Mean ~ average
• Median ~ middle number of the sorted series
• (if the series has an even number of values, the median is the mean/average of the 2 middle numbers)
• Mode ~ most common number in the series
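
As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The series below is made up for illustration and has an even number of values, so the median is the average of the two middle numbers.

```python
import statistics

# Hypothetical series with an even number of values (illustration only).
series = [2, 3, 3, 5, 7, 10]

mean = statistics.mean(series)      # sum / count = 30 / 6
median = statistics.median(series)  # even count: average of the 2 middle values (3 and 5)
mode = statistics.mode(series)      # most common value in the series

print(mean, median, mode)
```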

## W2U5 Measures of Dispersion

• Range is the difference between the lowest and highest values in a distribution
• Dispersion is important to get a full understanding of the data
• Formulas to calculate standard deviation
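• Empirical rule helps explain the spread in a normal distribution: about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean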
• Variance is expressed in squared units, while standard deviation is expressed in the original units of the data
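
A small sketch of these dispersion measures using Python's `statistics` module. The data is invented, and both the population and the sample versions of the formulas are shown since the notes do not say which one the course used.

```python
import statistics

# Hypothetical data (illustration only); its mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)           # highest minus lowest value

# Population formulas: divide the squared deviations by n.
pop_variance = statistics.pvariance(data)    # in squared units
pop_std = statistics.pstdev(data)            # square root of the variance

# Sample formulas: divide the squared deviations by n - 1.
sample_variance = statistics.variance(data)
sample_std = statistics.stdev(data)

print(data_range, pop_variance, pop_std, sample_variance, sample_std)
```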

## W2U6 Outliers

• Quartiles split the ordered data into four equal parts: Q1 is the median of the lower half and Q3 the median of the upper half
• Q1 is 25th Percentile, Q2 50th Percentile (the median), Q3 75th Percentile
• IQR or Interquartile Range is the range between Q1 and Q3 (IQR = Q3 - Q1), covering the middle 50% of the data around the median
• Upper fence = Q3 + (1.5 * IQR)
• Lower fence = Q1 – (1.5 * IQR)
• These are also referred to as the lower/upper inner fences
• The lower/upper outer fences use the same formulas with 3 instead of 1.5
• Outliers lie beyond the inner fences, extreme values lie beyond the outer fences
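
The fence formulas can be sketched as follows. The data is made up, and the choice of `method="inclusive"` for `statistics.quantiles` is an assumption: several quartile conventions exist and they can give slightly different values. Labelling points between the inner and outer fences as outliers and points beyond the outer fences as extreme values is a common convention, assumed here.

```python
import statistics

# Hypothetical data with one mild outlier (30) and one extreme value (100).
data = [1, 3, 5, 7, 9, 11, 13, 30, 100]

# Quartiles; the "inclusive" method is one of several conventions (assumption).
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Inner fences use 1.5 * IQR, outer fences use 3 * IQR.
lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
lower_outer, upper_outer = q1 - 3 * iqr, q3 + 3 * iqr

# Values beyond the outer fences are treated as extreme values.
extremes = [x for x in data if x < lower_outer or x > upper_outer]
# Values beyond the inner fences but within the outer fences are outliers.
outliers = [
    x for x in data
    if (x < lower_inner or x > upper_inner) and x not in extremes
]

print(q1, q3, iqr, outliers, extremes)
```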

## W3U1 Correlation as a Statistical Measure

• Pearson correlation is the measure of the strength and direction of the linear relationship between two variables (only meaningful when the relationship is linear)
• The coefficient ranges between -1 and +1, and 0 indicates there is no linear correlation. The greater the absolute value, the stronger the linear relationship. A positive coefficient indicates that the 2 variables move in the same direction (as one increases, the other increases as well), while a negative coefficient indicates opposite directions
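
A minimal sketch of the Pearson coefficient in plain Python. The function name `pearson_r` and the data are illustrative. The last example shows why the coefficient is only meaningful for linear relationships: y = x² is a perfect relationship, yet the coefficient is 0.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations from the means.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of the summed squared deviations of each variable.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return co_deviation / (spread_x * spread_y)

# Hypothetical data (illustration only).
r_same = pearson_r([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])      # same direction
r_opposite = pearson_r([1, 2, 3, 4, 5], [10, 8, 6, 4, 2])  # opposite directions
r_curve = pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])    # y = x**2, not linear

print(r_same, r_opposite, r_curve)
```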

## W3U2 Correlation Vs Causation

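• Cause & effect: causation indicates that one event is the result of the occurrence of the other event, i.e. there is a causal relationship between the two events.
• Spurious relationships: two variables are correlated without one causing the other, for example by coincidence or because of a third variable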
• Lurking and confounding variables can make it difficult to conclude that it was the explanatory variables alone that affected the observed changes in the response variable.

## W3U3 Scatter Plots and Line of Best Fit

• 2-dimensional graph of Y vs X. The relationship is linear when the points fall approximately along a straight line. To be more precise, the line of best fit must be calculated.
• y = mx + b
• slope (m) and y-intercept(b) are the two values needed
• b = mean(y) - m * mean(x)
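
A short sketch of calculating the line of best fit. The notes only give the intercept formula; the slope formula used here is the standard least-squares one (sum of products of deviations divided by the sum of squared x deviations). The function name `fit_line` and the data are illustrative.

```python
def fit_line(xs, ys):
    """Return slope m and y-intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Standard least-squares slope (not stated in the notes).
    m = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    # Intercept as given in the notes: b = mean(y) - m * mean(x).
    b = mean_y - m * mean_x
    return m, b

# Hypothetical data (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

m, b = fit_line(xs, ys)
print(f"y = {m:.1f}x + {b:.1f}")
```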

## W3U4 Linear Regression

• Least squares
• Residual = Observed Value - Fitted Value
• Linear regression is an approach to modeling the linear relationship between a target variable and one or more explanatory variables. It should produce residuals that have a mean of zero, have a constant variance, and are not correlated with themselves or other variables. If these assumptions are true, then the ordinary least squares regression procedure will create the best possible estimates.
• Simple linear regression has only one explanatory variable
• The F-test indicates whether the model as a whole has statistically significant predictive capability.
• When you “test the null hypothesis”, you assess how likely the observed data would be if there were no relationship between the explanatory variables and the target variable.
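
To illustrate residuals, here is a sketch of a simple least-squares fit on made-up data, using the common convention residual = observed value minus fitted value. It checks that the residuals average to zero and computes the sum of squared residuals, which is the quantity least squares minimizes.

```python
# Hypothetical data (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least-squares slope and intercept for one explanatory variable.
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x

fitted = [slope * x + intercept for x in xs]
# Convention assumed here: residual = observed - fitted.
residuals = [y - f for y, f in zip(ys, fitted)]

mean_residual = sum(residuals) / n                 # close to 0 for a least-squares fit
sum_squared_residuals = sum(r ** 2 for r in residuals)

print(mean_residual, sum_squared_residuals)
```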

## W3U5 Interpreting Results

• Check slides (open source R commands, null/alternative hypothesis, F-Test/F-statistic, significance of individual variables, R-squared)
• Y-Intercept – The value Y is predicted to have when all the explanatory variables are equal to zero
• F-statistic
• It is the ratio of the mean regression sum of squares to the mean error sum of squares.
• Assumption 1 Linear relationship
• Assumption 2 No or low multicollinearity
• Multicollinearity occurs when the explanatory variables are highly correlated with each other.
• Assumption 3 No autocorrelation
• Linear regression analysis requires that there is little or no “autocorrelation” in the residuals.
• Assumption 4 Homoscedasticity
• the variance of the residuals is constant across the regression line
• You can check this assumption by plotting the residuals against the fitted values. Heteroscedasticity appears as a cone shape where the spread of the residuals increases in one direction.
• Q&A – The control helps eliminate “confounding” factors.
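
The F-statistic and R-squared mentioned in this unit can be sketched by hand for a simple regression. The data is invented, and the degrees of freedom follow the usual convention (k for the regression, n - k - 1 for the error, with k explanatory variables); the course slides may present this differently.

```python
# Hypothetical data (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)  # number of observations
k = 1        # number of explanatory variables (simple linear regression)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
slope = (
    sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    / sum((x - mean_x) ** 2 for x in xs)
)
intercept = mean_y - slope * mean_x
fitted = [slope * x + intercept for x in xs]

total_ss = sum((y - mean_y) ** 2 for y in ys)             # total sum of squares
error_ss = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error (residual) sum of squares
regression_ss = total_ss - error_ss                       # regression sum of squares

# R-squared: share of the variation in y explained by the model.
r_squared = regression_ss / total_ss

# F-statistic: mean regression sum of squares / mean error sum of squares.
f_statistic = (regression_ss / k) / (error_ss / (n - k - 1))

print(round(r_squared, 3), round(f_statistic, 3))
```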

## W4U1 Introduction to Probability

• probability = number of outcomes in the event / total number of outcomes in the “sample space” (when all outcomes are equally likely)
• P(A) – probability of event A
• P(A ∩ B) – probability of event A and B
• P(A ∪ B) – probability of event A or B
• P(A|B) – probability of A given prior event B
• Example: probability of throwing a total of 7 with 2 dice = 6/36 = 1/6
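
The two-dice example can be checked by enumerating the sample space. This sketch assumes two fair six-sided dice, so all 36 outcomes are equally likely.

```python
from fractions import Fraction
from itertools import product

# Sample space: every (first die, second die) pair, 6 * 6 = 36 outcomes.
sample_space = list(product(range(1, 7), repeat=2))

# Event: the two dice add up to 7.
favourable = [roll for roll in sample_space if sum(roll) == 7]

p_seven = Fraction(len(favourable), len(sample_space))
print(len(favourable), len(sample_space), p_seven)
```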

## W4U2 Conditional probability

• P(B and A) = P(A) * P(B|A)
• Example: probability of pulling two aces in a row from a pack of cards without replacement (the second draw is conditional on having already picked out an ace)
• 4/52 * 3/51 = 1/221 ≈ 0.45%
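
The arithmetic of the ace example, sketched with exact fractions. It assumes a standard 52-card pack with 4 aces and no replacement between draws.

```python
from fractions import Fraction

# P(A): the first card is an ace (4 aces among 52 cards).
p_first_ace = Fraction(4, 52)

# P(B|A): the second card is an ace given the first was (3 aces among 51 cards).
p_second_ace_given_first = Fraction(3, 51)

# P(B and A) = P(A) * P(B|A)
p_two_aces = p_first_ace * p_second_ace_given_first

print(p_two_aces, f"{float(p_two_aces):.2%}")
```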

## W4U3 Bayes Theorem

• P(A|B) = P(B|A)*P(A) / P(B)
• Used to update the probability of an event when new relevant information becomes available
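
A hedged worked example of the formula. The scenario and all numbers are invented: a condition affecting 1% of people, a test that is positive for 95% of those who have it and for 5% of those who do not. P(B) is obtained with the law of total probability, which the notes do not mention.

```python
from fractions import Fraction

# Hypothetical inputs (illustration only).
p_a = Fraction(1, 100)               # P(A): person has the condition
p_b_given_a = Fraction(95, 100)      # P(B|A): positive test given the condition
p_b_given_not_a = Fraction(5, 100)   # P(B|not A): positive test without the condition

# P(B) via the law of total probability.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(p_a_given_b, f"{float(p_a_given_b):.1%}")
```

Even with a fairly accurate test, the probability of having the condition after a positive result stays well below 50% in this made-up scenario, because the condition itself is rare.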

## W6U1 Real World

• Summarize, Compare, Forecast data
• Summarize – 4 major types of descriptive statistics
• Measures of frequency, central tendency, dispersion or variation, and position
• Compare – two means or two distributions
• Forecast – use historical data to predict the future
• Linear regression lets you understand how much of the change in one variable can be explained by the change in another.
• Test claims and hypotheses
• Hypothesis testing allows you to test whether a claim about a sample or population is true with an assigned level of confidence.
• Check probabilities – deal with uncertainty
• Describe the likelihood of each of the possible events

## W6U2 Critically Evaluating Reports

• Need to know all the exact numbers before accepting a claim
• Think about the other KPIs
• Determine the research method (control vs variation)
• 3D pie charts are a poor way of displaying data because the perspective distorts the slices at the front
• Bar charts can be misleading when the axis does not start at zero
• Bubble charts can be confusing because of disproportionate radius sizes
• Changing the denominator between figures makes comparisons incorrect
• Get used to saying “it depends” when checking the veracity of claims
• Many statistics have been calculated from different populations, so they may not be directly comparable

## W6U4 Tools and References

• MS Excel, R, Python
• SAP HANA, Predictive Analytics, Analytics Cloud, Data Intelligence
• Python 2.0 in 2000, Python 3.0 in 2008, which is not backward compatible
• SAC (SAP Analytics Cloud): one-stop solution for statistical development
• integrated analytics as a service
• SAP HANA Predictive Analysis Library (PAL)
• Application function library (AFL) with functions that can be called from within SAP HANA SQLScript
• PAL includes predictive analysis algorithms in the following data mining categories
• Clustering, Classification, Regression, Association, Time Series, Data Pre-processing, Statistics, Social Network Analysis
• SAP Data Intelligence is a cloud solution focusing on
• Developing artificial intelligence projects
• Extracting value from distributed data sources
• Using open-source technology (R, Python, TensorFlow)

## W6 Q&A

• Distribution curves graph the frequency of variables such as height, weight, test scores, etc.
• Distribution curves can be used to compare the means of two or more variables to check the significance of their differences
• Linear Regression
• Numerically calculate how much of the change in one variable’s value is explained by the second variable
• Identify a statistical relationship between two variables, which may (or may not) represent a predictive relationship
• Build a scatterplot to visually evaluate whether a relationship between the variables exists

Source: openSAP course, Introduction to Statistics

January 3, 2020