Learning Notes: Practical Statistics for Data Scientists



Before You Start

Why must you read this book if you want to be a data scientist?

Two goals of this book:

  • To lay out, in digestible, navigable, and easily referenced form, key concepts from statistics that are relevant to data science
  • To explain which concepts are important and useful from a data science perspective, which are less so, and why.

With this book, you’ll learn:

  • Why exploratory data analysis is a key preliminary step in data science
  • How random sampling can reduce bias and yield a higher quality dataset, even with big data
  • How the principles of experimental design yield definitive answers to questions
  • How to use regression to estimate outcomes and detect anomalies
  • Key classification techniques for predicting which categories a record belongs to
  • Statistical machine learning methods that “learn” from data
  • Unsupervised learning methods for extracting meaning from unlabeled data

What new statistical perspectives did I learn by reading this book?

  • AIC and penalized regression are two different ways to overcome overfitting, but they work in different ways; see the summary of Chapter 4.
  • Interpreting a multiple linear regression requires considering the relationships among all the other variables, which can occasionally affect the interpretation of individual variables.
  • Hat-values and Cook's distance can be used to determine which points are influential; those influential values may need to be removed.
  • Bootstrap: a sampling method that can be used to estimate a statistic and to estimate a model's prediction error; it differs from k-fold CV. Permutation tests and the bootstrap are two main types of resampling procedures.
  • The bootstrap is used to assess the reliability of an estimate; permutation tests are used to test hypotheses.

Chapter 1 Exploratory Data Analysis


In this book, data are grouped into rectangular and nonrectangular data structures. Rectangular data, which is the typical frame of reference for data analysis, looks like a spreadsheet or a database table; in Python the basic rectangular data structure is the DataFrame. Nonrectangular data structures include graph and spatial data.

Next is the most important part of this chapter: the four kinds of metrics used for exploring and getting to know data. They are estimates of location, variability, skewness, and kurtosis.

1.1 Estimate of location

Estimates of location describe the central tendency of the data, i.e., where most of the data are located. Means are commonly used here, including the mean, the trimmed mean, and the weighted mean. The three different means serve different purposes.

  • Mean
    $$\text{Mean} = \frac{\sum_{i=1}^{n}x_i}{n}$$
  • Trimmed mean: the p smallest and p largest values are dropped, which eliminates the influence of extreme values
    $$\text{Trimmed mean} = \frac{\sum_{i=p+1}^{n-p}x_{(i)}}{n-2p}$$
  • Weighted mean:

    $$\text{Weighted mean} = \frac{\sum_{i=1}^{n}w_ix_i}{\sum_{i=1}^{n}w_i}$$

    Two main motivations to use a weighted mean:

    1. Some values are intrinsically more variable than others, and highly variable observations are given a lower weight.
    2. The data collected does not equally represent the different groups that we are interested in measuring.

  • Median and weighted median

    For n distinct ordered elements \(x_1,x_2,...,x_n\) with positive weights \(w_1,w_2,...,w_n\) such that \(\sum_{i=1}^{n} w_i = 1\), the weighted median is the element \(x_k\) satisfying \(\sum_{i=1}^{k-1}w_i \le 1/2\) and \(\sum_{i=k+1}^{n}w_i \le 1/2\).
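
As a minimal sketch of this definition (assuming NumPy; the values and weights are made up, and ties are ignored for simplicity):

```python
import numpy as np

def weighted_median(x, w):
    """Weighted median: the element x_k where the cumulative weight first reaches 1/2."""
    order = np.argsort(x)
    x, w = np.asarray(x)[order], np.asarray(w)[order]
    w = w / w.sum()                        # normalize so the weights sum to 1
    cum = np.cumsum(w)
    return x[np.searchsorted(cum, 0.5)]    # first index where cumulative weight >= 1/2

print(weighted_median([3, 1, 4, 2], [0.1, 0.2, 0.3, 0.4]))   # -> 2
```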


1.2 Estimate of variability

  • Mean absolute deviation:

    $$\text{Mean absolute deviation} = \frac{\sum_{i=1}^{n}|x_i-\bar{x}|}{n}$$

  • Variance

    $$\text{Variance} = s^2 = \frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}$$

  • Standard deviation

    $$\text{Standard deviation} = s = \sqrt{\text{Variance}}$$

The most common metrics for diagnosing variability are the variance and the standard deviation. One point that needs to be explained is that the denominator of the variance is n − 1. The book explains it this way: if you use the intuitive denominator of n in the variance formula, you will underestimate the true value of the variance and the standard deviation in the population. This is referred to as a biased estimate. However, if you divide by n − 1 instead of n, the standard deviation becomes an unbiased estimate.

Fully explaining why using n leads to a biased estimate involves the notion of degrees of freedom, which takes into account the number of constraints in computing an estimate. In this case, there are n − 1 degrees of freedom, since there is one constraint: the standard deviation depends on calculating the sample mean.

Among the estimates of location, the trimmed mean and the median are robust to outliers and extreme values. But among the estimates of variability, the variance, the standard deviation, and the mean absolute deviation are not robust to extreme values. A truly robust estimate of variability is the median absolute deviation from the median:

$$\text{Median absolute deviation} = \text{Median}(|x_1-m|,|x_2-m|,...,|x_N-m|)$$

where m is the median. (I don't quite understand why this can represent variability.)
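
To make the robust/non-robust contrast concrete, here is a minimal sketch (assuming NumPy and SciPy; the data array with one extreme value is made up for illustration):

```python
import numpy as np
from scipy import stats

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])      # one extreme value at the end

print(np.mean(x), np.median(x))                      # mean is pulled up by the outlier, median is not
print(stats.trim_mean(x, 0.1))                       # trimmed mean: drop 10% at each end
print(np.std(x, ddof=1))                             # standard deviation (n-1 denominator), not robust
mad = np.median(np.abs(x - np.median(x)))            # median absolute deviation from the median
print(mad)                                           # robust estimate of variability
```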

1.3 Skewness and kurtosis

This part is intentionally blank

1.4 Data distribution

Visualization is critical in the data exploration process; the following are some plot types that can be used in our daily work.

  • Boxplot
  • Frequency table and histogram
  • Density plot
  • Bar plot
  • Pie plot

1.5 Correlation

The standard way to visualize the relationship between two measured data variables is with a scatterplot. The numerical way to quantify the relationship is Pearson's correlation coefficient, given by:

$$r=\frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{(N-1)s_xs_y}$$

This correlation coefficient is sensitive to outliers in the data, but software packages offer robust alternatives to the classical correlation coefficient. When it comes to exploring two or more variables, some other tools can be used:

  • For two numeric variables: hexagonal binning and contour plots
  • For two categorical variables: a cross table
  • For categorical and numeric data: boxplot or violin plot
  • For multiple variables: faceting
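
A minimal sketch of a few of these (assuming pandas, NumPy, and matplotlib; the DataFrame and its columns are invented for illustration):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=1000)})
df['y'] = 2 * df['x'] + rng.normal(size=1000)
df['cat'] = rng.choice(['a', 'b'], size=1000)

print(np.corrcoef(df['x'], df['y'])[0, 1])        # Pearson's correlation coefficient
print(df['x'].corr(df['y'], method='spearman'))   # rank-based alternative, less sensitive to outliers

plt.hexbin(df['x'], df['y'], gridsize=30)         # hexagonal binning for two numeric variables
plt.show()

print(pd.crosstab(df['cat'], df['y'] > 0))        # cross table for two categorical variables
```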

Summary Chapter 1
This chapter covers the definitions of the mean, trimmed mean, weighted mean, median, and weighted median. When we talk about estimates of location, such as the mean, the median, or some variation of them, we must remind ourselves when to use each one and why we would not use another; it is better still if we can state the conditions. The median is not the only robust estimate of location; in fact, a trimmed mean is widely used to avoid the influence of outliers.
Variability: variance and standard deviation are the most widely used in reports, but they are not robust to outliers; other metrics such as the median absolute deviation and percentiles are robust.
Data distribution: the data distribution displays the skewness and kurtosis of the data; together with location and variability, these form the four key moments of a distribution.
Correlation: use Pearson's correlation coefficient and/or a scatterplot.
Exploring two or more variables:
  • two numeric variables: heat map, hexagonal binning, contours
  • numeric versus categorical: boxplot or violin plot
  • two categorical variables: cross table
  • multiple variables: faceting

Chapter 2. Data and sampling distributions


2.1 Resampling——Bootstrap

what is bootstrap? why this should be introduced?

The bootstrap method is a resampling technique used to estimate statistics on a population by sampling a dataset with replacement.

The algorithm for a bootstrap resampling of the mean is as follows, for a sample of size n:

1. Draw a sample value, record it, and replace it
2. Repeat n times
3. Record the mean of the n resampled values (this yields one bootstrap statistic)
4. Repeat steps 1-3 R times (this yields R bootstrap statistics)
5. Use the R statistics to:

    1. calculate the standard deviation
    2. produce a histogram or boxplot
    3. find a confidence interval
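
Here is a minimal sketch of the algorithm above (assuming NumPy; the sample itself is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=3, size=50)        # hypothetical sample of size n = 50

R = 5000
boot_means = np.empty(R)
for r in range(R):
    resample = rng.choice(data, size=len(data), replace=True)  # steps 1-2: sample with replacement
    boot_means[r] = resample.mean()                             # step 3: record the mean

print(boot_means.std(ddof=1))                   # bootstrap standard error of the mean
print(np.percentile(boot_means, [2.5, 97.5]))   # 95% bootstrap confidence interval
```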

Advantages of bootstrap

  • does not necessarily involve any assumptions about the data or the sample statistic being normally distributed.
  • with replacement

Key ideas

  • The bootstrap (sampling with replacement from a data set) is a powerful tool for assessing the variability of a sample statistic

Application scenarios

  • I have a dataset of only 20 numbers, and I want to know the confidence interval of the mean (or the median, etc.) of the data.
  • Estimating the skill of machine learning models on unseen data.

K-fold cross-validation versus bootstrap
See this link for examples of how to use bootstrap methods to evaluate the metrics of machine learning models.

In practice, there is often not much of a difference between k-fold cross-validation and the bootstrap.
This link explains the difference between cross-validation and bootstrapping for estimating model prediction error: CV tends to be less biased, but k-fold CV has fairly large variance. On the other hand, bootstrapping tends to drastically reduce the variance but gives more biased results (they tend to be pessimistic).
For large sample sizes, the variance issues become less important and the computation becomes more of an issue. I would still stick with repeated CV for both small and large sample sizes.
An example of using the bootstrap to estimate model performance:

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]   # the original observations

train = [0.2, 0.1, 0.2, 0.6]        # a bootstrap sample of size 4 drawn with replacement from the observations
test = [0.3, 0.4, 0.5]              # the out-of-bag sample; train and test have now been produced by the bootstrap
model = fit(train)                  # fit the model on the bootstrap sample
statistic = evaluate(model, test)   # evaluate the statistic on the out-of-bag sample

That is one iteration. You can repeat it, say, 30 times, collect the statistics, and take their mean.
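
And here is a minimal runnable sketch of that loop (assuming scikit-learn and NumPy; the dataset, the linear model, and the MSE metric are placeholders chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                                  # hypothetical predictors
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(size=100)      # hypothetical outcome

scores = []
for _ in range(30):                                            # 30 bootstrap iterations
    idx = rng.choice(len(X), size=len(X), replace=True)        # in-bag (bootstrap) indices
    oob = np.setdiff1d(np.arange(len(X)), idx)                 # out-of-bag indices
    model = LinearRegression().fit(X[idx], y[idx])             # fit on the bootstrap sample
    scores.append(mean_squared_error(y[oob], model.predict(X[oob])))  # evaluate on OOB data

print(np.mean(scores))   # average out-of-bag error across iterations
```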

Extension: an introduction to resampling methods
The bootstrap is one kind of resampling method. Other resampling methods include k-fold cross-validation, jackknifing, and permutation tests.

What is the purpose of resampling? (from Wikipedia)

  • Estimating the precision of sample statistics (medians, variances, percentiles) by using subsets of available data (jackknifing) or drawing randomly with replacement from a set of data points (bootstrapping): this is the basic function
  • Exchanging labels on data points when performing significance tests (permutation tests, also called exact tests, randomization tests, or re-randomization tests)
  • Validating models by using random subsets (bootstrapping, cross-validation): this point was explained in the previous paragraph

2.2 Resampling——Permutation Test

One sentence to describe permutation testing
To judge whether an observed difference between samples might occur by chance

Difference between the bootstrap and the permutation test
The bootstrap is used to assess the reliability of an estimate, typically involving one group;
permutation tests are used to test hypotheses, typically involving two or more groups.

Permutation test definition:
In a permutation procedure, two or more samples are involved, typically the groups in an A/B or other hypothesis test. Permute means to change the order of a set of values. The first step in a permutation test of a hypothesis is to combine the results from groups A and B (and, if used, C, D, …) together. We then test that hypothesis by randomly drawing groups from this combined set, and seeing how much they differ from one another. The permutation procedure is as follows:

  1. Combine the results from the different groups in a single data set
  2. Shuffle the combined data, then randomly draw (without replacing) a resample of the same size as group A
  3. From the remaining data, randomly draw (without replacing) a resample of the same size as group B
  4. Do the same for groups C, D, and so on.
  5. Whatever statistic or estimate was calculated for the original samples (e.g., difference in group proportions), calculate it now for the resamples, and record; this constitutes one permutation iteration.
  6. Repeat the previous steps R times to yield a permutation distribution of the test statistic.
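
Here is a minimal sketch of this procedure for two groups (assuming NumPy; the two samples are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=1.0, size=30)      # hypothetical group A
group_b = rng.normal(loc=0.5, scale=1.0, size=40)      # hypothetical group B

observed = group_b.mean() - group_a.mean()             # statistic from the original samples
combined = np.concatenate([group_a, group_b])          # step 1: combine the groups

R = 5000
perm_diffs = np.empty(R)
for r in range(R):
    shuffled = rng.permutation(combined)               # step 2: shuffle, draw without replacement
    resample_a = shuffled[:len(group_a)]               # a resample the same size as group A
    resample_b = shuffled[len(group_a):]               # step 3: the rest plays the role of group B
    perm_diffs[r] = resample_b.mean() - resample_a.mean()   # step 5: record the statistic

p_value = np.mean(np.abs(perm_diffs) >= abs(observed))      # fraction at least as extreme as observed
print(p_value)
```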

In the book, an example of permutation testing on web stickiness is given; if you are still confused about this, please read the book at page 118.

Two types of permutation test
The difference between the two types of permutation test lies in the random shuffling procedure described above.

  • Exhaustive permutation test: in this test, instead of just randomly shuffling and dividing the data, we actually figure out all the possible ways it could be divided. This is practical only for relatively small sample sizes.

    Here is a question: how does this permutation method divide the data? This has been discussed here. The conclusion is that even though it is called a permutation method, the Python libraries actually use combinations to divide the data, and the result is the same whether combinations or permutations are used.

  • Bootstrap permutation test: the draws outlined in steps 2 and 3 of the random permutation test are made with replacement instead of without replacement.

Resampling and sampling: sampling draws a part of a population and uses the properties of that sample to estimate the population, so it can be used to estimate population parameters. But how do we judge the variability and uncertainty of such an estimate? That is where resampling comes in: we repeatedly re-estimate the population parameter from the sample itself. link

2.3 Statistical Significance and P-Values

Statistical significance is how statisticians measure whether an experiment (or even a study of existing data) yields a result more extreme than what chance might produce. If the result is beyond the realm of chance variation, it is said to be statistically significant.

Chapter 4. Regression and Prediction


4.1 Simple Linear Regression

The equation is:

$$Y = b_0 + b_{1} X $$

The fitted values, also referred to as the predicted values, are denoted by \(\hat{Y}_{i}\) and are given by:

$$\hat{Y}_{i} = \hat{b}_0+\hat{b}_1X_{i}$$

Least squares fitting minimizes the following RSS (residual sum of squares) and returns \(\hat{b}_0\) and \(\hat{b}_1\):

$$RSS = \sum_{i=1}^{n}(Y_i-\hat{Y}_{i})^2$$
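
As a minimal sketch (assuming NumPy; the data are invented), the coefficients that minimize RSS can be obtained with an ordinary least squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)                 # hypothetical predictor
Y = 2.0 + 1.5 * X + rng.normal(size=100)         # hypothetical response

b1, b0 = np.polyfit(X, Y, deg=1)                 # least squares fit: slope b1, intercept b0
Y_hat = b0 + b1 * X                              # fitted values
rss = np.sum((Y - Y_hat) ** 2)                   # the residual sum of squares being minimized
print(b0, b1, rss)
```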

4.2 Multiple Linear Regression

The basic fitted values are given by:

$$\hat{Y_i} = \hat{b_0} + \hat{b_1}X_{1,i} + \hat{b_2}X_{2,i} +...+ \hat{b_p}X_{p,i}$$

Next, three metrics to assess the model: RMSE, RSE, and \(R^2\).

The most important performance metric from a data science perspective is the root mean squared error, or RMSE. This measures the overall accuracy of the model and is a basis for comparing it to other models.

$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n}}$$

Another metric is the residual standard error, RSE, which is given by:

$$\text{RSE} = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{n-p-1}}$$

The only difference is that the denominator is the degrees of freedom instead of the number of records. In practice, for linear regression, the difference between RMSE and RSE is very small, particularly for big data applications.
Another useful metric that you will see in software output is the coefficient of determination, also called \(R^2\), which is given by:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i-\hat{y}_i)^2}{\sum_{i=1}^{n}(y_i-\bar{y})^2}$$
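
A minimal sketch of these three metrics (assuming NumPy; y and y_hat stand for hypothetical arrays of observed and fitted values, and p for the number of predictors):

```python
import numpy as np

def regression_metrics(y, y_hat, p):
    """Compute RMSE, RSE, and R-squared for a fitted regression."""
    resid = y - y_hat
    n = len(y)
    rmse = np.sqrt(np.sum(resid ** 2) / n)                         # root mean squared error
    rse = np.sqrt(np.sum(resid ** 2) / (n - p - 1))                # residual standard error
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - np.mean(y)) ** 2)    # coefficient of determination
    return rmse, rse, r2
```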

Warning: In addition to the t-statistic, R and other packages will often report a p-value (Pr(>|t|) in the R output) and an F-statistic. Data scientists do not generally get too involved with the interpretation of these statistics, nor with the issue of statistical significance. Data scientists primarily focus on the t-statistic as a useful guide for whether to include a predictor in a model or not. High t-statistics (which go with p-values near 0) indicate a predictor should be retained in a model, while very low t-statistics indicate a predictor could be dropped. See “P-Value” for more discussion.

Model Selection --- AIC and penalized regression

AIC can be used to choose among models and to determine which predictors should be dropped. AIC has the form:

$$AIC = 2P + n\log(RSS/n)$$

where P is the number of variables and n is the number of records. The goal is to find the model that minimizes AIC.
One way to minimize AIC is stepwise regression, which successively adds and drops predictors to find the model that lowers AIC. Stepwise regression includes forward selection and backward selection.
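
As a minimal sketch of this AIC formula (assuming NumPy and scikit-learn; the data and the two candidate predictor sets are hypothetical), two models can be compared by AIC:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def aic(y, y_hat, p):
    """AIC in the form used here: 2P + n*log(RSS/n)."""
    rss = np.sum((y - y_hat) ** 2)
    n = len(y)
    return 2 * p + n * np.log(rss / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] * 2 + rng.normal(size=200)              # only the first predictor matters

for cols in ([0], [0, 1, 2]):                        # two candidate predictor sets
    model = LinearRegression().fit(X[:, cols], y)
    print(cols, aic(y, model.predict(X[:, cols]), p=len(cols)))   # prefer the lower AIC
```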

Penalized regression is similar in spirit to AIC; instead of explicitly searching through a discrete set of models, the model-fitting equation incorporates a constraint that penalizes the model for having too many variables. Rather than eliminating predictor variables entirely, penalized regression applies the penalty by shrinking coefficients, in some cases to near zero. Common penalized regression methods are ridge regression and lasso regression.
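
A minimal sketch with scikit-learn (the data and the penalty strength alpha are illustrative placeholders, not recommendations):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] * 2 - X[:, 1] + rng.normal(size=200)    # only two predictors matter

print(LinearRegression().fit(X, y).coef_)   # ordinary least squares coefficients
print(Ridge(alpha=1.0).fit(X, y).coef_)     # ridge shrinks all coefficients toward zero
print(Lasso(alpha=0.1).fit(X, y).coef_)     # lasso can drive some coefficients to exactly zero
```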

Key ideas

  • The most important metrics to evaluate a model are root mean squared error (RMSE) and R-squared (\(R^2\))
  • The standard error of the coefficients can be used to measure the reliability of a variable’s contribution to a model.
  • Stepwise regression is a way to automatically determine which variables should be included in the model
  • Weighted regression is used to give certain records more or less weight in fitting the equation

4.3 Prediction Using Regression

I am a little bit confused about this part; it generally introduces the concepts of confidence intervals (for the regression coefficients) and prediction intervals (for individual predictions).

4.4 Factor Variables in Regression

When we use regression for prediction, it is inevitable that some variables are factor (categorical) variables, such as phone brand or a person's location. The problem is that regression requires numerical inputs. The solution varies with the type of factor variable.

  1. Factor variables with few levels: this is the most common situation, and the approach is to convert the variable into a set of binary dummy variables. A feature with k levels can be converted into k new columns using pd.get_dummies. In the regression setting, a factor variable with P distinct levels is usually represented by a matrix with only P – 1 columns. (Why?) The book explains it as follows:
    This is because a regression model typically includes an intercept term. With an intercept, once you have defined the values for P – 1 binaries, the value for the Pth is known and could be considered redundant. (See the sketch after this list.)
  2. Factor variables with many levels: sometimes a feature produces a huge number of binary dummies, like zip codes in the US. In this situation we need to explore the data, look at the relationship between the predictor and the outcome, and determine whether useful information is contained in the categories. If yes, two approaches are recommended: the first is to group the zip codes according to another variable such as sale price (for example, group the zip codes into five groups based on house prices from high to low). The second (which is even better) is to form the zip code groups using the residuals from an initial model. (How to implement this?)
  3. Ordered factor variables: this is easy, just convert the ordered factors to a numeric variable; that way the information contained in the ordering is preserved.
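
A minimal sketch of the dummy-variable encoding mentioned in item 1 (assuming pandas; the brand column and its values are made up for illustration). Using drop_first=True produces the P – 1 columns typically used in regression:

```python
import pandas as pd

df = pd.DataFrame({'brand': ['apple', 'samsung', 'xiaomi', 'apple', 'samsung']})

print(pd.get_dummies(df['brand']))                    # k = 3 levels -> 3 binary columns
print(pd.get_dummies(df['brand'], drop_first=True))   # P - 1 columns: the dropped level is the reference
```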

Also, apart from the approaches mentioned above, there are other ways to encode factor variables:
Different factor codings: There are several different ways to encode factor variables, known as contrast coding systems. For example, deviation coding, also known as sum contrasts, compares each level against the overall mean. Another contrast is polynomial coding, which is appropriate for ordered factors; see the section “Ordered Factor Variables”. With the exception of ordered factors, data scientists will generally not encounter any type of coding besides reference coding or one-hot encoding.

4.4.1 Interpreting the regression equation

Generally speaking, the most important use of regression is to predict some dependent variable. But in some cases, gaining insight from the equation itself, to understand the nature of the relationship between the predictors and the outcome, can be of value.
Three issues are mentioned:

  • Correlated predictors can make it difficult to interpret the relationship between a predictor and the outcome. One example in this book is that the coefficient for bedrooms in the house-price model is negative, meaning the price goes down as the number of bedrooms increases, which seems illogical. The reason is that the number of bedrooms is correlated with house size, which has a strong effect on the house price. An extreme case of correlated variables produces multicollinearity; perfect multicollinearity occurs when one predictor variable can be expressed as a linear combination of others. This problem must be addressed until the multicollinearity is gone.
    Note: multicollinearity is not such a problem for non-regression methods like trees, clustering, and nearest neighbors, and in such methods it may be advisable to retain P dummies (instead of P – 1). That said, even in those methods, nonredundancy in predictor variables is still a virtue.

  • Confounding variables: with correlated variables, the problem is caused by different variables having a similar relationship with the outcome. With confounding variables, the problem is that an important variable is not included in the regression equation at all; for example, a regression equation to predict house price that does not include the location of the house.

  • Interactions and main effects:
    I do not understand this whole part, so it is intentionally left blank.

Key ideas:

  • Because of correlation between predictors, care must be taken in the interpretation of the coefficients in multiple linear regression.
  • Multicollinearity can cause numerical instability in fitting the regression equation.
  • A confounding variable is an important predictor that is omitted from a model and can lead to a regression equation with spurious relationships.
  • An interaction term between two variables is needed if the relationship between the variables and the response is interdependent

4.5 Testing the Assumptions: Regression Diagnostics

This part talks about the assumptions of linear regression and gives some advice on how to diagnose them. This link is also a good place to learn about the linear regression assumptions and their diagnostics (four principal assumptions are mentioned on that page). Check it if you want to go deeper (really recommended).

  • Outliers: there is no statistical theory that separates outliers from non-outliers, but there are some rules of thumb for determining how distant a value is from the bulk of the data. For example, with a boxplot, outliers are the points below or above the box boundaries. Standardized residuals can also be used to identify outliers.
  • Influential values: a value whose absence would significantly change the regression equation is termed an influential observation. But how do we know which record is an influential observation? Fortunately, statisticians have developed several metrics to determine the influence of a single record on a regression. Two metrics are introduced in this book: one is the hat-value, the other is Cook's distance. A common measure of leverage is the hat-value; values above \(2(P+1)/n\) indicate a high-leverage data value. Cook's distance defines influence as a combination of leverage and residual size; an observation has high influence if Cook's distance exceeds \(4/(n-P-1)\). (What do hat-values and Cook's distance really mean?)

Oh no, standardized residuals, hat-values, and Cook's distance: so many concepts to understand. Is there any way to combine them in a single plot or equation? The answer is yes: an influence plot, or bubble plot, can combine them in a single plot. On this plot, the hat-values are on the x-axis, the standardized residuals are on the y-axis, and the size of the points is related to the value of Cook's distance.
Actually, whether identifying influential observations is necessary depends on the purpose. For the purpose of fitting a regression that reliably predicts future data, identifying influential observations is only useful in small datasets; with many records, it is unlikely that any one observation can exert extreme influence on the fitted regression. But for the purpose of anomaly detection, identifying influential observations can be very useful.
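
A minimal sketch with statsmodels (the data are invented; influence_plot draws leverage against studentized residuals, with bubble size tied to Cook's distance):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, 2.0]) + rng.normal(size=50)
y[0] += 10                                      # plant one suspicious observation

results = sm.OLS(y, sm.add_constant(X)).fit()

influence = results.get_influence()
hat_values = influence.hat_matrix_diag          # leverage of each observation
cooks_d = influence.cooks_distance[0]           # Cook's distance for each observation
print(np.where(hat_values > 2 * (2 + 1) / len(y))[0])   # high-leverage points (rule of thumb 2(P+1)/n)
print(np.where(cooks_d > 4 / (len(y) - 2 - 1))[0])       # high-influence points (rule of thumb 4/(n-P-1))

sm.graphics.influence_plot(results)             # bubble plot combining all three measures
```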

4.5.1 Heteroskedasticity, Non-Normality and Correlated Errors

This part is intentionally blank.
In general, however, the distribution of residuals is not critical in data science.

4.5.2 Partial Residual Plots and Nonlinearity

Partial residual plots are a way to visualize how well the estimated fit explains the relationship between a predictor and the outcome. Along with detection of outliers, this is probably the most important diagnostic for data scientists.

$$\text{Partial residual} = \text{Residual} + \hat{b}_iX_i$$

Note why it is called "partial": the ordinary residual sums up the contributions of all predictors, while the partial residual focuses on one particular predictor. This is why the partial residual can be used to explain the relationship between a predictor and the outcome. If you still wonder how it explains the relationship, the answer is the partial residual plot, with \(X_i\) on the x-axis and the partial residuals on the y-axis.
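
A minimal sketch with statsmodels (the DataFrame below is an invented stand-in for the book's house-sales data, with column names like SqFtTotLiving and AdjSalePrice assumed for illustration; plot_ccpr draws a component-plus-residual plot, which is one form of partial residual plot):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical stand-in for the house-sales data
rng = np.random.default_rng(0)
df = pd.DataFrame({'SqFtTotLiving': rng.uniform(500, 4000, size=200),
                   'Bathrooms': rng.integers(1, 4, size=200)})
df['AdjSalePrice'] = 200 * df['SqFtTotLiving'] + 5e4 * df['Bathrooms'] + rng.normal(0, 5e4, size=200)

results = smf.ols('AdjSalePrice ~ SqFtTotLiving + Bathrooms', data=df).fit()

# Partial residual for SqFtTotLiving: residual + b_i * X_i
partial_resid = results.resid + results.params['SqFtTotLiving'] * df['SqFtTotLiving']

# statsmodels draws the same idea directly as a component-plus-residual (CCPR) plot
sm.graphics.plot_ccpr(results, 'SqFtTotLiving')
```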

From this picture, the relationship between SqFtTotLiving and the sale price is evidently nonlinear (which violates the linearity assumption mentioned in this link). This suggests that, instead of a simple linear term for SqFtTotLiving, a nonlinear term should be considered (perhaps polynomial or spline regression).

key ideas:

  • The partial residuals plot can be used to qualitatively assess the fit for each regression term, possibly leading to alternative model specification.
  • Single records (including regression outliers) can have a big influence on a regression equation with small data, but this effect washes out in big data
  • While outliers can cause problems for small data sets, the primary interest with outliers is to identify problems with the data, or locate anomalies.

4.6 Polynomial and Spline Regression

The relationship between the response and a predictor variable is not necessarily linear. Two kinds of non-linear regression are discussed in this chapter: polynomial regression and splines.

Polynomial

Nothing much to note

Splines

Nothing much to note here, but I think splines are far superior to polynomials.

Generalized additive models

Generalized additive models (GAM) automate the process of specifying the knots in splines.
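
A minimal sketch of polynomial and spline terms with the statsmodels formula interface (the data are an invented stand-in for the house-sales data; bs() is the B-spline basis provided by patsy, and the degree and df choices are arbitrary):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical, deliberately nonlinear stand-in for the house-sales data
rng = np.random.default_rng(0)
df = pd.DataFrame({'SqFtTotLiving': rng.uniform(500, 4000, size=200)})
df['AdjSalePrice'] = (100 * df['SqFtTotLiving'] + 0.05 * df['SqFtTotLiving']**2
                      + rng.normal(0, 5e4, size=200))

# Polynomial regression: add a squared term
poly = smf.ols('AdjSalePrice ~ SqFtTotLiving + I(SqFtTotLiving**2)', data=df).fit()

# Spline regression: a cubic B-spline basis with 6 degrees of freedom
spline = smf.ols('AdjSalePrice ~ bs(SqFtTotLiving, df=6, degree=3)', data=df).fit()

print(poly.aic, spline.aic)   # compare the two fits, e.g., by AIC
```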

Summary Chapter 4:
In this chapter, linear and non-linear regression were discussed. For linear regression, most of the content focused on multiple linear regression, while for non-linear regression, polynomial and spline functions were introduced.
Before running a regression, our input must be numerical, so factor variables must be converted to dummies or grouped by other variables. When all data are numerical, we can find the regression coefficients by least squares. After producing the coefficients, how do we assess our models? RMSE, RSE, and \(R^2\) are the answer. Once we have the coefficients and have assessed the model, we naturally want to check whether the coefficients are interpretable; sometimes, due to correlated variables and confounding variables, the regression results are not actually interpretable. Another thing we must consider is the linear regression assumptions; we need to test them, and ways to diagnose them were mentioned in the corresponding paragraphs.
Finally, about splines and polynomials: I did not pay much attention to them, even though I think non-linear regression is much more common than linear regression.

New tips and tricks I learned by reading this:

  • Whether a predictor has a linear relationship with the outcome can be checked with partial residuals.
  • We can test the assumptions to check whether the data meets the requirements of linear regression.
  • Influential values can be identified with an influence plot.

Keep going and learning!!

Chapter 5. Classification

In this chapter, three algorithms for classification are discussed: naive Bayes, discriminant analysis, and logistic regression. After that, metrics to assess classification models are covered.

5.1 Naive Bayes

I read it, but I feel I could not absorb much. Bayes has always been a sore spot for me; it seems I need to find a way to overcome this difficulty.

5.2 Discriminant Analysis

I read it, but I feel it is not very practical.

5.3 Logistic Regression

Summary:

It feels like logistic regression is the main focus of this chapter.
