
The following is a collection of chart types I have used, together with the Python code to generate them.

Each chart has its favourite data type.

Differentiate between univariate, bivariate and multivariate analysis.

These are descriptive statistical analysis techniques that are differentiated by the number of variables involved at a given point in time. For example, a pie chart of sales by territory involves only one variable and is therefore univariate analysis.

If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot, it is referred to as bivariate analysis. For example, analysing sales volume against spending is bivariate analysis.

Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.

1. What is cross-validation?

It’s a model validation technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a model will perform in practice. The goal of cross-validation is to set aside a data set on which to test the model during the training phase (i.e. a validation data set), in order to limit problems like overfitting and to get insight into how the model will generalize to an independent data set.

Examples: leave-one-out cross validation, K-fold cross validation
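
As a rough illustration of how K-fold and leave-one-out splits relate, here is a minimal sketch in plain Python. The helper name `k_fold_indices` is made up for this example, and the folds are contiguous; in practice one would usually shuffle the indices first or use a library routine.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds.

    Hypothetical helper for illustration only; folds are contiguous blocks.
    """
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 5))   # 5-fold CV on 10 observations
loocv = list(k_fold_indices(4, 4))    # k = n is leave-one-out CV
```

Each observation lands in exactly one validation fold, and setting k equal to the number of observations reproduces leave-one-out.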

2. How to do it right?

The training and validation data sets have to be drawn from the same population
Example: when predicting stock prices with a model trained on a certain 5-year period, it’s unrealistic to treat the subsequent 5-year period as a draw from the same population
Common mistake: leaving model-selection steps outside the cross-validation loop. For instance, the step of choosing the kernel parameters of an SVM should be cross-validated as well

Bias-variance trade-off for k-fold cross validation:

Leave-one-out cross-validation: gives approximately unbiased estimates of the test error since each training set contains almost the entire data set ($n-1$ observations).

But: we average the outputs of n fitted models, each of which is trained on an almost identical set of observations, hence the outputs are highly correlated. Since the variance of a mean of quantities increases when the correlation between these quantities increases, the test error estimate from LOOCV has higher variance than the one obtained with k-fold cross-validation.

Typically, we choose $k=5$ or $k=10$, as these values have been shown empirically to yield test error estimates that suffer neither from excessively high bias nor high variance.

Is it better to design robust or accurate algorithms?
The ultimate goal is to design systems with good generalization capacity, that is, systems that correctly identify patterns in data instances not seen before
The generalization performance of a learning system strongly depends on the complexity of the model assumed
If the model is too simple, the system can only capture the actual data regularities in a rough manner. In this case, the system has poor generalization properties and is said to suffer from underfitting
By contrast, when the model is too complex, the system can identify accidental patterns in the training data that need not be present in the test set. These spurious patterns can be the result of random fluctuations or of measurement errors during the data collection process. In this case, the generalization capacity of the learning system is also poor. The learning system is said to be affected by overfitting
Spurious patterns, which are only present by accident in the data, tend to have complex forms. This is the idea behind the principle of Occam’s razor for avoiding overfitting: simpler models are preferred if more complex models do not significantly improve the quality of the description for the observations
Quick response: Occam’s Razor. It depends on the learning task. Choose the right balance
Ensemble learning can help balance bias and variance (several weak learners together = strong learner)
How to define/select metrics?
Type of task: regression? Classification?
Business goal?
What is the distribution of the target variable?
What metric do we optimize for?
Regression: RMSE (root mean squared error), MAE (mean absolute error), WMAE (weighted mean absolute error), RMSLE (root mean squared logarithmic error)…
Classification: recall, AUC, accuracy, misclassification error, Cohen’s Kappa…

Common metrics in regression:

Mean Squared Error vs. Mean Absolute Error
RMSE gives a relatively high weight to large errors. The RMSE is most useful when large errors are particularly undesirable.
The MAE is a linear score: all the individual differences are weighted equally in the average. MAE is more robust to outliers than MSE.
$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$
$MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$
Root Mean Squared Logarithmic Error
RMSLE penalizes an under-predicted estimate more heavily than an over-predicted one (unlike RMSE, which penalizes both directions equally)
$RMSLE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\log(p_i + 1) - \log(a_i + 1))^2}$
Where $p_i$ is the $i$th prediction, $a_i$ the $i$th actual response, and $\log(b)$ the natural logarithm of $b$.
Weighted Mean Absolute Error
The weighted average of absolute errors. MAE and RMSE assume that each prediction provides equally precise information about the error variation, i.e. the standard deviation of the error term is constant over all the predictions. WMAE relaxes this by giving each observation its own weight $w_i$. Examples: recommender systems (differences between past and recent products)
$WMAE = \frac{1}{\sum_{i=1}^{n} w_i}\sum_{i=1}^{n} w_i|y_i - \hat{y}_i|$
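
The four regression metrics above can be sketched in a few lines of plain Python. The function names and the toy data are assumptions made for this example, not taken from any library.

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmsle(actual, predicted):
    # Natural logarithm of (value + 1), as in the RMSLE definition.
    return math.sqrt(
        sum((math.log(p + 1) - math.log(a + 1)) ** 2 for a, p in zip(actual, predicted))
        / len(actual)
    )


def wmae(y_true, y_pred, weights):
    weighted = sum(w * abs(y - p) for y, p, w in zip(y_true, y_pred, weights))
    return weighted / sum(weights)


# Toy data (made up): errors are 1, 0, -2, 0.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 7.0]

print(rmse(y_true, y_pred), mae(y_true, y_pred), wmae(y_true, y_pred, [1, 1, 2, 1]))
# Under-prediction by 50 costs more than over-prediction by 50 under RMSLE.
print(rmsle([100], [50]), rmsle([100], [150]))
```

On this toy data the RMSE exceeds the MAE because the single error of size 2 is squared before averaging.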

Common metrics in classification:

Recall / Sensitivity / True positive rate:
High when FN low. Sensitive to unbalanced classes.
$Sensitivity = \frac{TP}{TP + FN}$
Precision / Positive Predictive Value
High when FP low. Sensitive to unbalanced classes.
$Precision = \frac{TP}{TP + FP}$
Specificity / True Negative Rate
High when FP low. Sensitive to unbalanced classes.
$Specificity = \frac{TN}{TN + FP}$
Accuracy
High when FP and FN are low. Sensitive to unbalanced classes (see “Accuracy paradox”)
$Accuracy = \frac{TP + TN}{TN + TP + FP + FN}$
ROC / AUC
ROC is a graphical plot that illustrates the performance of a binary classifier as its decision threshold varies (sensitivity vs. $1-$specificity, or sensitivity vs. specificity). ROC and AUC are not sensitive to unbalanced classes.
AUC is the area under the ROC curve. Perfect classifier: AUC=1, and the ROC curve passes through the point (0,1): 100% sensitivity (no FN) and 100% specificity (no FP)
Logarithmic loss
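Heavily penalizes confident predictions that turn out to be wrong: the penalty grows without bound as the predicted probability of the true class approaches 0. It’s better to be somewhat wrong than emphatically wrong!
$logloss = -\frac{1}{n}\sum_{i=1}^{n}\left(y_i\log(p_i) + (1-y_i)\log(1-p_i)\right)$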
Misclassification Rate
$Misclassification = \frac{1}{n}\sum_{i} I(y_i \neq \hat{y}_i)$
F1-Score
Used when the target variable is unbalanced.
$F1Score = 2\,\frac{Precision \times Recall}{Precision + Recall}$
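
To make the confusion-matrix formulas concrete, here is a small sketch in plain Python. The function names, the clipping constant `eps`, and the toy labels are assumptions for illustration; the code also assumes binary 0/1 labels with at least one predicted and one actual positive, so no denominator is zero.

```python
import math


def classification_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary 0/1 labels (illustrative helper)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "misclassification": (fp + fn) / (tp + tn + fp + fn),
        "f1": 2 * precision * recall / (precision + recall),
    }


def log_loss(y_true, probs, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Toy labels (made up): TP=2, FN=1, FP=1, TN=4.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```

Comparing `log_loss([1], [0.01])` with `log_loss([1], [0.4])` shows why an emphatically wrong prediction costs far more than a hesitant one.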

11. Explain what a false positive and a false negative are. Why is it important to distinguish these from each other? Provide examples of when false positives are more important than false negatives, when false negatives are more important than false positives, and when these two types of errors are equally important

False positive
Improperly reporting the presence of a condition when it is not present in reality. Example: HIV positive test when the patient is actually HIV negative
False negative
Improperly reporting the absence of a condition when in reality it is present. Example: not detecting a disease when the patient has this disease.

When false positives are more important than false negatives:

In a non-contagious disease, where treatment delay doesn’t have any long-term consequences but the treatment itself is grueling
HIV test: psychological impact

When false negatives are more important than false positives:

If early treatment is important for good outcomes
In quality control: a defective item passes through the cracks!
Software testing: a test to catch a virus has failed
Explain what regularization is and why it is useful. What are the benefits and drawbacks of specific methods, such as ridge regression and lasso?
Used to prevent overfitting: improve the generalization of a model
Decreases complexity of a model
Introducing a regularization term to a general loss function: adding a term to the minimization problem
Impose Occam’s Razor in the solution

Ridge regression:

We use an L2 penalty when fitting the model using least squares
We add to the minimization problem an expression (shrinkage penalty) of the form $\lambda \sum_{j} \beta_j^2$
$\lambda$: tuning parameter; controls the bias-variance tradeoff; chosen with cross-validation
A bit faster than the lasso

The Lasso:

We use an L1 penalty when fitting the model using least squares
Can force regression coefficients to be exactly 0: feature selection method by itself
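
The difference between the two penalties is easiest to see in the simplest possible case. The sketch below assumes a single predictor, no intercept, and the objectives RSS + λβ² (ridge) and RSS + λ|β| (lasso); under those assumptions both have closed-form solutions. Function names and data are made up for the example.

```python
import math


def ridge_coef(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * b^2 for one predictor."""
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    return sxy / (sxx + lam)


def lasso_coef(x, y, lam):
    """Minimizer of sum((y - b*x)^2) + lam * |b| for one predictor.

    The solution is a soft-thresholded least-squares coefficient.
    """
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    return math.copysign(shrunk, sxy) / sxx


x = [1.0, 2.0, 3.0]
y = [0.5, 1.0, 1.5]   # least-squares slope is 0.5

for lam in (0, 4, 14, 1000):
    print(lam, ridge_coef(x, y, lam), lasso_coef(x, y, lam))
```

Ridge shrinks the coefficient towards zero without ever reaching it, whereas the lasso sets it to exactly zero once λ is large enough, which is why the lasso doubles as a feature selection method.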

5. Explain what a local optimum is and why it is important in a specific context, such as K-means clustering. What are specific ways of determining if you have a local optimum problem? What can be done to avoid local optima?

A solution that is optimal within a neighboring set of candidate solutions
In contrast with global optimum: the optimal solution among all others
K-means clustering context:
It’s proven that the objective cost function will always decrease until a local optimum is reached.
Results will depend on the initial random cluster assignment
Determining if you have a local optimum problem:
Tendency of premature convergence
Different initialization induces different optima
Avoid local optima in a K-means context: repeat K-means and take the solution that has the lowest cost
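
A minimal sketch of the restart strategy, using one-dimensional points and plain Python. The function names, the toy data, and the choice of 20 restarts are assumptions for illustration only.

```python
import random


def kmeans_1d(points, k, rng, n_iter=100):
    """One run of K-means on 1-D data from a random initialization."""
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        new_centers = [sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)]
        if new_centers == centers:
            break
        centers = new_centers
    cost = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, cost


def best_of_restarts(points, k, n_restarts=20, seed=0):
    """Repeat K-means and keep the run with the lowest cost."""
    rng = random.Random(seed)
    runs = [kmeans_1d(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[1]), runs


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
(best_centers, best_cost), runs = best_of_restarts(points, k=3)
print(sorted(set(round(cost, 3) for _, cost in runs)))  # distinct costs reached
```

If the printed list contains more than one value, different initializations converged to different optima, which is the symptom described above.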

6. Assume you need to generate a predictive model using multiple regression. Explain how you intend to validate this model

Validation using $R^2$:

% of variance explained by the model
Issue: $R^2$ never decreases when variables are added

Analysis of residuals:

Heteroskedasticity (relation between the variance of the model errors and the size of an independent variable’s observations)
Scatter plots of residuals vs. predictors
Normality of errors
Etc. : diagnostic plots

Out-of-sample evaluation: with cross-validation

8. What is latent semantic indexing? What is it used for? What are the specific limitations of the method?

Indexing and retrieval method that uses singular value decomposition to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text
Based on the principle that words that are used in the same contexts tend to have similar meanings
“Latent”: semantic associations between words are present not explicitly but only latently
For example: two synonyms may never occur in the same passage but should nonetheless have highly associated representations

Used for:

Learning correct word meanings
Subject matter comprehension
Information retrieval
Sentiment analysis (social network analysis)

Here’s a great tutorial on it.

9. Explain what resampling methods are and why they are useful

repeatedly drawing samples from a training set and refitting a model of interest on each sample in order to obtain additional information about the fitted model
example: repeatedly draw different samples from training data, fit a linear regression to each new sample, and then examine the extent to which the resulting fits differ
most common are: cross-validation and the bootstrap
cross-validation: random sampling without replacement
bootstrap: random sampling with replacement
cross-validation: evaluating model performance, model selection (select the appropriate level of flexibility)
bootstrap: mostly used to quantify the uncertainty associated with a given estimator or statistical learning method
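
As an illustration of the bootstrap side, the sketch below estimates the standard error of a sample mean by resampling with replacement. The function name `bootstrap_se`, the sample values, and the number of resamples are assumptions made for this example.

```python
import random
import statistics


def bootstrap_se(sample, estimator, n_boot=1000, seed=0):
    """Bootstrap estimate of the standard error of `estimator` on `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = []
    for _ in range(n_boot):
        # Draw n observations WITH replacement from the original sample.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)


sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9, 3.1, 4.4]
se_mean = bootstrap_se(sample, statistics.mean)
se_median = bootstrap_se(sample, statistics.median)

# For the mean, a textbook formula exists to compare against.
analytic_se = statistics.stdev(sample) / len(sample) ** 0.5
print(se_mean, analytic_se, se_median)
```

The same function works for estimators such as the median, where no simple closed-form standard error is available, which is where the bootstrap is most useful.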

10. What is principal component analysis? Explain the sort of problems you would use PCA for. Also explain its limitations as a method

Statistical method that uses an orthogonal transformation to convert a set of observations of correlated variables into a set of values of linearly uncorrelated variables called principal components.

Reduce the data from $n$ to $k$ dimensions: find the $k$ vectors onto which to project the data so as to minimize the projection error.

Algorithm:
1) Preprocessing (standardization): PCA is sensitive to the relative scaling of the original variables
2) Compute the covariance matrix $\Sigma$
3) Compute the eigenvectors of $\Sigma$
4) Choose $k$ principal components so as to retain $x$% of the variance (typically $x=99$)
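
The steps above can be sketched without any numerical library. This example only recovers the first principal component, and it swaps a full eigendecomposition for power iteration to stay dependency-free; further components would need deflation or a linear algebra routine. All function names and the toy data are assumptions.

```python
import math
import random


def standardize(rows):
    """Step 1: centre each column and scale it to unit sample variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)) for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance_matrix(centred_rows):
    """Step 2: sample covariance matrix of already-centred data."""
    n, d = len(centred_rows), len(centred_rows[0])
    return [[sum(r[a] * r[b] for r in centred_rows) / (n - 1) for b in range(d)]
            for a in range(d)]


def leading_eigenpair(matrix, n_iter=200, seed=0):
    """Step 3 (first component only): power iteration on a symmetric matrix."""
    d = len(matrix)
    rng = random.Random(seed)
    v = [rng.uniform(0.5, 1.0) for _ in range(d)]  # random start vector
    for _ in range(n_iter):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d))
    return eigenvalue, v


# Two strongly correlated variables (made-up data).
rows = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.2], [5.0, 9.8]]
z = standardize(rows)
cov = covariance_matrix(z)
eigenvalue, pc1 = leading_eigenpair(cov)

# Step 4: share of total variance retained by the first component.
explained = eigenvalue / sum(cov[j][j] for j in range(len(cov)))
scores = [sum(x * w for x, w in zip(row, pc1)) for row in z]  # projection onto PC1
```

Because the two variables are almost perfectly correlated, a single component retains nearly all of the variance, so the data can be reduced from two dimensions to one with little loss.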

Applications:
1) Compression

Reduce disk/memory needed to store data
Speed up learning algorithm. Warning: mapping should be defined only on training set and then applied to test set
Visualization: 2 or 3 principal components, so as to summarize data

Limitations:

PCA is not scale invariant
The directions with largest variance are assumed to be of most interest
Only considers orthogonal transformations (rotations) of the original variables
PCA is only based on the mean vector and covariance matrix. Some distributions (multivariate normal) are characterized by this but some are not
If the variables are correlated, PCA can achieve dimension reduction. If not, PCA just orders them according to their variances

12. What is the difference between supervised learning and unsupervised learning? Give concrete examples

Supervised learning: inferring a function from labeled training data
Supervised learning: predictor measurements associated with a response measurement; we wish to fit a model that relates both for better understanding the relation between them (inference) or with the aim to accurately predicting the response for future observations (prediction)
Supervised learning: support vector machines, neural networks, linear regression, logistic regression, extreme gradient boosting
Supervised learning examples: predict the price of a house based on its area, size, etc.; churn prediction; predict the relevance of search engine results.
Unsupervised learning: inferring a function to describe hidden structure of unlabeled data
Unsupervised learning: we lack a response variable that can supervise our analysis
Unsupervised learning: clustering, principal component analysis, singular value decomposition; identify group of customers
Unsupervised learning examples: find customer segments; image segmentation; classify US senators by their voting.

14. What are feature vectors?

n-dimensional vector of numerical features that represent some object
term occurrences frequencies, pixels of an image etc.
Feature space: vector space associated with these vectors
When would you use random forests Vs SVM and why?
In a case of a multi-class classification problem: SVM will require one-against-all method (memory intensive)
If one needs to know the variable importance (random forests can perform it as well)
If one needs to get a model fast (SVM is long to tune, need to choose the appropriate kernel and its parameters, for instance sigma and epsilon)
In a semi-supervised learning context (random forest and dissimilarity measure): SVM can work only in a supervised learning mode

16. How do you take millions of users with 100’s transactions each, amongst 10k’s of products and group the users together in meaningful segments?

Some exploratory data analysis (get a first insight)
Transactions by date
Count of customers Vs number of items bought
Total items Vs total basket per customer
Total items Vs total basket per area
Create new features (per customer):

Counts:

Total baskets (unique days)
Total items
Total spent
Unique product id

Distributions:

Items per basket
Spent per basket
Product id per basket
Duration between visits
Product preferences: proportion of items per product cat per basket
Too many features, dimension-reduction? PCA?
Clustering:
PCA
Interpreting model fit
View the clustering by pairs of principal component axes (e.g. PC1 vs. PC2).
Interpret each principal component regarding the linear combination it’s obtained from; example: PC1=spendy axis (proportion of baskets containing spendy items, raw counts of items and visits)
How do you know if one algorithm is better than other?
In terms of performance on a given data set?
In terms of performance on several data sets?
In terms of efficiency?

In terms of performance on several data sets:

“Does learning algorithm A have a higher chance of producing a better predictor than learning algorithm B in the given context?”
“Bayesian Comparison of Machine Learning Algorithms on Single and Multiple Datasets”, A. Lacoste and F. Laviolette
“Statistical Comparisons of Classifiers over Multiple Data Sets”, Janez Demsar

In terms of performance on a given data set:

One wants to choose between two learning algorithms
Need to compare their performances and assess the statistical significance

One approach (Not preferred in the literature):

Multiple k-fold cross validation: run CV multiple times and take the mean and sd
You have: algorithm A (mean and sd) and algorithm B (mean and sd)
Is the difference meaningful? (Paired t-test)

Sign-test (classification context):
Simply counts the number of times A has a better metric than B and assumes this comes from a binomial distribution. Then we can obtain a p-value of the $H_0$ test: A and B are equal in terms of performance.

Wilcoxon signed rank test (classification context):
Like the sign-test, but the wins (A is better than B) are weighted and assumed to come from a symmetric distribution around a common median. Then, we obtain a p-value of the $H_0$ test.
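
A minimal sketch of the sign test described above, in plain Python. It assumes paired scores where higher is better, drops ties, and computes a two-sided p-value under the null hypothesis that each algorithm wins with probability 0.5. The function name and the accuracy figures are made up for the example.

```python
from math import comb


def sign_test_p_value(scores_a, scores_b):
    """Two-sided sign test on paired scores; ties are discarded."""
    wins_a = sum(1 for a, b in zip(scores_a, scores_b) if a > b)
    wins_b = sum(1 for a, b in zip(scores_a, scores_b) if a < b)
    n = wins_a + wins_b
    if n == 0:
        return 1.0
    k = min(wins_a, wins_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical accuracies of algorithms A and B on ten data sets.
acc_a = [0.81, 0.79, 0.85, 0.90, 0.77, 0.83, 0.88, 0.76, 0.92, 0.80]
acc_b = [0.78, 0.75, 0.84, 0.91, 0.74, 0.80, 0.85, 0.70, 0.90, 0.79]

p_value = sign_test_p_value(acc_a, acc_b)  # A wins 9 of 10 comparisons
```

The test ignores how large each win is; the Wilcoxon signed rank test uses that extra information by ranking the differences.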

Other (without hypothesis testing):

AUC
F-Score
See question 3

18. How do you test whether a new credit risk scoring model works?

Test on a holdout set
Kolmogorov-Smirnov test

Kolmogorov-Smirnov test:

Non-parametric test
Compare a sample with a reference probability distribution or compare two samples
Quantifies a distance between the empirical distribution function of the sample and the cumulative distribution function of the reference distribution
Or between the empirical distribution functions of two samples
Null hypothesis (two-samples test): samples are drawn from the same distribution
Can be modified as a goodness of fit test
In our case: cumulative percentages of good, cumulative percentages of bad
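
For the credit scoring case, the two-sample KS statistic is the largest gap between the cumulative score distributions of good and bad accounts. The sketch below computes only the statistic, not a p-value; the function name and the scores are invented for illustration.

```python
def ks_statistic(sample_a, sample_b):
    """Largest absolute gap between the two empirical distribution functions."""
    def ecdf(sample, x):
        return sum(1 for s in sample if s <= x) / len(sample)

    # The gap can only change at observed values, so checking those is enough.
    values = sorted(set(sample_a) | set(sample_b))
    return max(abs(ecdf(sample_a, x) - ecdf(sample_b, x)) for x in values)


# Hypothetical model scores for accounts that turned out good or bad.
good_scores = [620, 680, 700, 710, 740, 760, 790, 800]
bad_scores = [480, 520, 560, 600, 630, 650, 690, 720]

ks = ks_statistic(good_scores, bad_scores)
```

A value near 0 means the model scores goods and bads alike, while a value near 1 means the two groups are almost completely separated by score.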

19. What is: collaborative filtering, n-grams, cosine distance?

Collaborative filtering:

Technique used by some recommender systems
Filtering for information or patterns using techniques involving collaboration of multiple agents: viewpoints, data sources.
A user expresses his/her preferences by rating items (movies, CDs, etc.)
The system matches this user’s ratings against other users’ and finds people with most similar tastes
With similar users, the system recommends items that the similar users have rated highly but that this user has not yet rated
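
The matching and recommending steps can be sketched as a tiny user-based recommender. Everything here is a simplifying assumption: the ratings are invented, similarity is the cosine over co-rated items, and only the single most similar user is consulted.

```python
import math

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"Alien": 5, "Brazil": 4, "Casablanca": 1},
    "bob": {"Alien": 5, "Brazil": 5, "Casablanca": 1, "Dune": 4},
    "carol": {"Alien": 1, "Brazil": 2, "Casablanca": 5, "Eraserhead": 5},
}


def similarity(u, v):
    """Cosine similarity of two users' ratings over the items both have rated."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in shared))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def recommend(user, min_rating=4):
    """Items the most similar user rated highly that `user` has not rated yet."""
    others = [u for u in ratings if u != user]
    neighbour = max(others, key=lambda u: similarity(user, u))
    return sorted(
        item for item, r in ratings[neighbour].items()
        if r >= min_rating and item not in ratings[user]
    )


print(recommend("alice"))
```

Real systems typically average over several neighbours and correct for each user's rating scale, which this sketch leaves out.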

n-grams:

Contiguous sequence of n items from a given sequence of text or speech
“Andrew is a talented data scientist”
Bi-gram: “Andrew is”, “is a”, “a talented”.
Tri-grams: “Andrew is a”, “is a talented”, “a talented data”.
An n-gram model models sequences using statistical properties of n-grams; see: Shannon Game
More concisely, n-gram model: $P(X_i \mid X_{i-(n-1)}, \dots, X_{i-1})$: Markov model
N-gram model: each word depends only on the $n-1$ previous words
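
A short sketch of n-gram extraction and of a maximum-likelihood bigram probability, using the example sentence above. The function names are made up, and tokenization is a plain whitespace split.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous runs of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def bigram_probability(tokens, prev_word, word):
    """Maximum-likelihood estimate of P(word | prev_word) from raw counts."""
    bigram_counts = Counter(ngrams(tokens, 2))
    context_counts = Counter(tokens[:-1])  # words that have a successor
    if context_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / context_counts[prev_word]


tokens = "Andrew is a talented data scientist".split()

bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)
print(bigram_probability(tokens, "is", "a"), bigram_probability(tokens, "is", "data"))
```

With raw counts, any word pair that never appeared in the text gets probability zero, however plausible it is.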

Issues:

when facing infrequent n-grams
solution: smooth the probability distributions by assigning non-zero probabilities to unseen words or n-grams
Methods: Good-Turing, Backoff, Kneser-Ney smoothing

Cosine distance:

How similar are two documents?
Perfect similarity/agreement: 1
No agreement : 0 (orthogonality)
Measures the orientation, not magnitude

Given two vectors A and B representing word frequencies:
$\text{cosine-similarity}(A,B) = \frac{\langle A,B\rangle}{\|A\| \cdot \|B\|}$
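
The formula translates directly into a few lines of Python. The function name and the word-frequency vectors below are illustrative assumptions; both vectors must be non-zero and share the same vocabulary order.

```python
import math


def cosine_similarity(a, b):
    """Inner product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Word counts over a shared three-word vocabulary (made-up documents).
doc_short = [1, 2, 0]
doc_long = [2, 4, 0]      # same proportions, twice the length
doc_other = [0, 0, 3]     # no words in common with the others

print(cosine_similarity(doc_short, doc_long))
print(cosine_similarity(doc_short, doc_other))
```

The first pair scores 1 (up to rounding) even though one document is twice as long, which shows that the measure depends on orientation and not on magnitude.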

Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
Naïve: the features are assumed to be independent of each other given the class
This assumption is unrealistic in many cases
Improvement: decorrelate features (covariance matrix into identity matrix)
