Introduction to Statistics

W1 Overview

  • Qualitative vs Quantitative data
    • Quantitative is measured / Numerical
    • Qualitative is only descriptive / Categorical
  • Population ≠ sample (a sample is a subset of the population; sample size is the number of observations in it)
  • Mean ~ average
  • Median ~ middle number of the sorted series
    • (if the series has an even number of values, the median is the mean/average of the 2 middle numbers)
  • Mode ~ most common number in the series
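
The three measures above can be sketched with Python's standard `statistics` module. This is only an illustration and not part of the course material; the sample values are made up.

```python
import statistics

values = [2, 3, 3, 5, 7, 10]  # made-up sample with an even number of values

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # even count: average of the 2 middle values (3 and 5)
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```

With this sample the mean is 5, the median is 4 and the mode is 3.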

W2U5 Measures of Dispersion

  • Range is the difference between the highest and lowest values in a distribution
  • Dispersion is important to get a full understanding of the data
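  • Formulas to calculate standard deviation (the square root of the variance)
  • Empirical rule helps explain the spread in a distribution (for a normal distribution, about 68%, 95% and 99.7% of values fall within 1, 2 and 3 standard deviations of the mean)
  • Variance is expressed in squared units, while standard deviation is expressed in the same units as the data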
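
A small sketch of these dispersion measures, assuming Python's standard `statistics` module. The data is made up, and whether the course used the population or the sample formula is not stated in these notes, so both are shown.

```python
import statistics
from statistics import NormalDist

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data with mean 5

value_range = max(values) - min(values)

# Population formulas divide by n, sample formulas divide by n - 1
pop_variance = statistics.pvariance(values)    # in squared units
pop_std = statistics.pstdev(values)            # same units as the data
sample_variance = statistics.variance(values)
sample_std = statistics.stdev(values)

# Empirical rule: share of a normal distribution within k standard deviations
std_normal = NormalDist()
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

print(value_range, pop_variance, pop_std, sample_variance, sample_std)
print(within)
```

For this data the range is 7, the population variance is 4 and the population standard deviation is 2. The `within` values come out close to 0.68, 0.95 and 0.997.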


W2U6 Outliers

  • Quartiles split the sorted distribution into four equal parts; Q1 and Q3 sit midway between the median and the lowest and highest values
    • Q1 is the 25th percentile, Q2 (the median) the 50th percentile, Q3 the 75th percentile
  • IQR or Interquartile Range is the range between the 2 quartiles around the median value: IQR = Q3 - Q1
  • Upper fence = Q3 + (1.5 * IQR)
  • Lower fence = Q1 - (1.5 * IQR)
    • Also referred to as lower/upper inner fences
    • Lower/upper outer fences use the same formulas with 3 instead of 1.5
  • Outliers lie beyond the inner fences; extreme values lie beyond the outer fences
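
The fence formulas can be tried out with a short Python sketch. The data is made up, and the quartile method is an assumption: several conventions exist and they can give slightly different cut points. Treating values beyond the inner fences as outliers and values beyond the outer fences as extreme values is the usual convention, assumed here.

```python
import statistics

data = [1, 3, 4, 5, 6, 7, 8, 9, 30]  # made-up sample with one suspicious value

# "inclusive" is one of several quartile conventions (an assumption for this sketch)
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr   # inner fences
upper_fence = q3 + 1.5 * iqr
lower_outer = q1 - 3 * iqr     # outer fences
upper_outer = q3 + 3 * iqr

outliers = [v for v in data if v < lower_fence or v > upper_fence]
extremes = [v for v in data if v < lower_outer or v > upper_outer]

print(q1, q3, iqr, (lower_fence, upper_fence), outliers, extremes)
```

Here Q1 = 4, Q3 = 8 and IQR = 4, so the inner fences are -2 and 14 and the outer fences are -8 and 20. The value 30 falls beyond both upper fences.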


W3U1 Correlation as a Statistical Measure

  • Pearson correlation is the measure of the strength and direction of the linear relationship between two variables (only meaningful when the relationship is linear)
  • The coefficient ranges between -1 and +1, and 0 indicates there is no linear correlation. The greater the absolute value, the stronger the linear relationship. A positive coefficient indicates that the 2 variables move in the same direction (as one increases, the other increases as well), while a negative coefficient indicates opposite directions
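
A hand-rolled sketch of the Pearson coefficient, written out so the formula is visible. The helper name `pearson_r` and the data are made up for illustration. The third data set follows a clear curved pattern yet gives a coefficient of 0, which shows why the measure only makes sense for linear relationships.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y scaled by their spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(spread_x * spread_y)

x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 6, 8, 10])  # y rises with x
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # y falls as x rises
r_curved = pearson_r(x, [4, 1, 0, 1, 4])     # y = (x - 3)^2, not linear

print(r_positive, r_negative, r_curved)
```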


W3U2 Correlation Vs Causation

  • Cause & Effect: causation indicates that one event is the result of the occurrence of the other event. There is a causal relationship between the two events.
  • Spurious relationships: two variables are correlated without one causing the other
  • Lurking and confounding variables can make it difficult to conclude that it was the explanatory variables alone that affected the observed changes in the response variable.


W3U3 Scatter Plots and Line of Best Fit

  • Two-dimensional graph of Y vs X. The relationship is linear when the points fall approximately along a straight line. To be more precise, the line of best fit must be calculated.
    • y = mx + b
    • slope (m) and y-intercept (b) are the two values needed
      • b = mean(y) - m * mean(x)
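
The notes give the intercept formula but not the slope. The sketch below assumes the standard least-squares slope, m = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2), and then applies the intercept formula from above. The function name `best_fit_line` and the data points are made up.

```python
def best_fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x  # intercept formula from the notes
    return m, b

x = [1, 2, 3, 4, 5]  # made-up points
y = [2, 4, 5, 4, 5]

m, b = best_fit_line(x, y)
predicted_at_6 = m * 6 + b  # use the line to estimate y for a new x

print(m, b, predicted_at_6)
```

For these points the slope is 0.6 and the intercept is 2.2, up to floating-point rounding.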


W3U4 Linear Regression

  • Least squares
  • Residual = Observed Value - Fitted Value
  • Linear regression is an approach to modeling the linear relationship between a target variable and one or more explanatory variables. It should produce residuals that have a mean of zero, have a constant variance, and are not correlated with themselves or other variables. If these assumptions are true, then the ordinary least squares regression procedure will create the best possible estimates.
  • Simple linear regression has only one explanatory variable
  • The F-test indicates whether the model as a whole has statistically significant predictive capability.
  • When you “test the null hypothesis” it means you are assessing how likely the observed data would be if there were no relationship between the explanatory variables and the target variable.
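
A minimal sketch of residuals for a simple linear regression, using made-up data and the common sign convention of observed minus fitted. It illustrates the claim that a least-squares fit with an intercept produces residuals whose mean is zero. The mean is zero under either sign convention.

```python
x = [1, 2, 3, 4, 5]  # made-up explanatory values
y = [2, 4, 5, 4, 5]  # made-up observed target values

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares estimates for y = m*x + b
m = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
     / sum((xi - mean_x) ** 2 for xi in x))
b = mean_y - m * mean_x

fitted = [m * xi + b for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # observed minus fitted
mean_residual = sum(residuals) / n

print(residuals, mean_residual)
```

The individual residuals are not zero, but they cancel out, so `mean_residual` is zero up to floating-point rounding.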


W3U5 Interpreting Results

  • Check slides (open source R commands, null/alternative hypothesis, F-Test/F-statistic, significance of individual variables, R-squared)
    • Y-Intercept – The value Y is predicted to have when all the explanatory variables are equal to zero
  • F-statistic
    • It is the mean regression sum of squares divided by the mean error sum of squares.
  • Assumption 1 Linear relationship
  • Assumption 2 No or low multicollinearity
    • Multicollinearity occurs when the explanatory variables are highly correlated with each other.
  • Assumption 3 No autocorrelation
    • Linear regression analysis requires that there is little or no “autocorrelation” in the residuals.
  • Assumption 4 Homoscedasticity
    • The variance of the residuals is constant along the regression line
    • You can check this assumption by plotting the residuals against the fitted values. Heteroscedasticity appears as a cone shape where the spread of the residuals increases in one direction.
  • Q&A – A control group helps eliminate “confounding” factors.
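
The F-statistic ratio and R-squared mentioned above can be worked through by hand for a simple regression. This sketch uses made-up data and takes the fitted line as given. The course itself used R commands, so this Python version is only an illustration, and it stops short of turning F into a p-value.

```python
x = [1, 2, 3, 4, 5]  # made-up explanatory values
y = [2, 4, 5, 4, 5]  # made-up observed target values
m, b = 0.6, 2.2      # least-squares slope and intercept for this data
k = 1                # number of explanatory variables
n = len(y)

mean_y = sum(y) / n
fitted = [m * xi + b for xi in x]

ss_regression = sum((fi - mean_y) ** 2 for fi in fitted)     # explained variation
ss_error = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained variation
ss_total = sum((yi - mean_y) ** 2 for yi in y)

# Mean regression sum of squares divided by mean error sum of squares
f_statistic = (ss_regression / k) / (ss_error / (n - k - 1))
r_squared = ss_regression / ss_total  # share of variation explained by the model

print(f_statistic, r_squared)
```

For this data F is about 4.5 and R-squared is about 0.6.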


W4U1 Introduction to Probability

  • probability = number of favorable outcomes / total number of outcomes (in the “sample space”), assuming all outcomes are equally likely
  • P(A) – probability of event A
  • P(A ∩ B) – probability of event A and B
  • P(A ∪ B) – probability of A or B
  • P(A|B) – probability of A given prior event B
  • Example: probability of throwing a 7 with 2 dice = 6/36 = 1/6
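
The dice example can be checked by listing the whole sample space. A short sketch, assuming two fair six-sided dice:

```python
from fractions import Fraction
from itertools import product

# Sample space: every (first die, second die) pair, 36 equally likely outcomes
outcomes = list(product(range(1, 7), repeat=2))
favorable = [pair for pair in outcomes if sum(pair) == 7]

p_seven = Fraction(len(favorable), len(outcomes))
print(favorable)
print(p_seven)
```

Six of the 36 pairs add up to 7, so the probability is 1/6.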

W4U2 Conditional probability

  • P(B and A) = P(A) * P(B|A)
  • probability of pulling two aces in a row from a pack of cards (the second draw is conditional on the first ace already being picked out)
    • 4/52 * 3/51 = 1/221 ≈ 0.45%
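
The same calculation with exact fractions, assuming a standard 52-card pack and no replacement between the two draws:

```python
from fractions import Fraction

p_first_ace = Fraction(4, 52)               # P(A): 4 aces among 52 cards
p_second_ace_given_first = Fraction(3, 51)  # P(B|A): 3 aces left among 51 cards

# P(B and A) = P(A) * P(B|A)
p_both_aces = p_first_ace * p_second_ace_given_first

print(p_both_aces, f"{float(p_both_aces):.2%}")
```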

W4U3 Bayes Theorem

  • P(A|B) = P(B|A)*P(A) / P(B)
  • Calculate the probability of an event in response to a change in relevant factors
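
A worked sketch of Bayes' theorem with entirely hypothetical numbers, not taken from the course: a condition affects 1% of people, a test detects it 95% of the time, and it gives a false positive 5% of the time. A is "has the condition" and B is "tests positive".

```python
from fractions import Fraction

# Hypothetical inputs, for illustration only
p_a = Fraction(1, 100)              # P(A): prior probability of the condition
p_b_given_a = Fraction(95, 100)     # P(B|A): positive test when the condition is present
p_b_given_not_a = Fraction(5, 100)  # P(B|not A): false positive rate

# Total probability of a positive test
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b

print(p_a_given_b, float(p_a_given_b))
```

With these inputs a positive test raises the probability of the condition from 1% to 19/118, roughly 16%.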


W6U1 Real World

  • Summarize, Compare, Forecast data
  • Summarize – 4 major types of descriptive statistics
    • Measures of frequency, central tendency, dispersion or variation, and position
  • Compare – two means or two distributions
  • Forecast – use historical data to predict the future
    • Linear regression allows you to understand how much of the change in one variable can be explained by the change in another.
  • Test claims and hypotheses
    • Hypothesis testing allows you to test whether a claim about a sample or population is true with an assigned level of confidence.
  • Check probabilities – deal with uncertainty
    • Describe the likelihood of each of the possible events


W6U2 Critically Evaluating Reports

  • Percentages can be misleading
    • Need to know all the exact numbers before accepting a claim
  • Think about the other KPIs
  • Determine the research method (control vs variation)
  • 3D pie charts are a poor way of displaying data because the perspective distorts the apparent size of the slices
  • Bar charts can be misleading when the axis does not start at zero
  • Bubble charts can be confusing because of disproportionate radius sizes
  • Changing the denominator between figures makes comparisons incorrect
  • Get used to saying “it depends” when verifying the veracity of claims
  • Many statistics have been calculated from different populations, so they may not be directly comparable


W6U4 Tools and References

  • MS Excel, R, Python
  • SAP HANA, Predictive Analytics, Analytics Cloud, Data Intelligence
  • Python 2.0 released in 2000, Python 3.0 in 2008 without backward compatibility
  • SAP Analytics Cloud (SAC): one-stop solution for statistical development
    • integrated analytics as a service
  • SAP HANA Predictive Analysis Library (PAL)
    • Application function library (AFL) with functions that can be called from within SAP HANA SQLScript
  • PAL includes predictive analysis algorithms in the following data mining categories
    • Clustering, Classification, Regression, Association, Time Series, Data Pre-processing, Statistics, Social Network Analysis
  • SAP Data Intelligence is a cloud solution focusing on
    • Developing artificial intelligence projects
    • Extracting value from distributed data sources
    • Using open-source technology (R, Python, TensorFlow)


W6 Q&A

  • Distribution curves graph the frequency of variables such as height, weight, test scores, etc.
  • Distribution curves can be used to compare the means of two or more variables to check the significance of their differences
  • Linear Regression
    • Numerically calculate how much of the change in one variable’s value is explained by the second variable
    • Identify a statistical relationship between two variables, which may (or may not) represent a predictive relationship
    • Build a scatterplot to visually evaluate whether a relationship between the variables exists

src: openSAP course / Introduction to Statistics

January 3, 2020

Mickael