Top 15 Questions to Test a Data Science Enthusiast

  1. What is logistic regression? Explain it with a real world example.

Logistic regression, also referred to as the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables. An apt example is predicting whether a fraud event will occur or not (1 = occurs, 0 = does not occur). The predictor variables here could be age group, monthly salary, work experience etc.
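A minimal sketch of the idea, assuming made-up coefficients (the weights, bias and feature values below are hypothetical, not fitted to any data): the linear combination of predictors is passed through the sigmoid function to give a probability, which is then thresholded into a 0/1 outcome.

```python
import math

def predict_fraud_probability(features, weights, bias):
    """Return P(fraud = 1) as the sigmoid of a linear combination of predictors."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for (age group, monthly salary in thousands, years of experience).
weights = [0.4, -0.05, -0.1]
bias = -1.0

p = predict_fraud_probability([2, 30, 5], weights, bias)
label = 1 if p >= 0.5 else 0  # 1 = fraud occurs, 0 = it does not
print(round(p, 3), label)
```

In practice the coefficients would be estimated from labelled data rather than written by hand.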

  2. What are recommender systems?

Recommender systems help to predict the preference or rating that a user would give to a product. They are widely used for movies, news, research articles, products etc. Retail giants like Walmart, Target, Amazon etc. use them to understand customer purchase preferences.

  3. What is the difference between univariate, bivariate and multivariate analysis?

These are descriptive statistical analysis techniques that are differentiated by the number of variables involved. For example, a bar graph of the most sold products involves only one variable (number of sales), so it can be referred to as univariate analysis.

Establishing a relationship between the number of sales and customers’ age groups involves two variables, so it is an example of bivariate analysis.

Analysis that deals with more than two variables, to understand the effect of the variables on the responses, is referred to as multivariate analysis. Examples include Principal Component Analysis, Factor Analysis etc.

  4. Which model evaluation technique can be used when evaluating a linear regression model with a continuous output variable?

The output of a linear regression model is always a continuous value, so in this scenario we use the mean squared error metric to evaluate model performance.

  5. What do you understand by the term Normal Distribution?

Data can be distributed in many ways: skewed to the left, skewed to the right, or scattered with no clear pattern. When the values instead cluster around a central value without any skewness and form a symmetrical bell-shaped curve, we say that the data is normally distributed.

  6. How to treat outlier values?

Outlier values can be identified with univariate or other graphical analysis techniques. A common rule defines an outlier as any value below Q1 - 1.5 x IQR or above Q3 + 1.5 x IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1. If there are only a few outliers, they can be assessed individually. If there are many, the values can be substituted with the 1st percentile value (for low extremes) or the 99th percentile value (for high extremes). Not all extreme values are outliers. Common ways to treat outlier values are:

  1. Remove the values
  2. Transform or bin the values
  3. Impute the values using mean, median or mode imputation
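As a rough illustration of the detection step, here is a sketch that assumes the usual convention of fences at Q1 - 1.5 x IQR and Q3 + 1.5 x IQR. The sample data and the helper names `iqr_fences` and `cap_outliers` are hypothetical, and the capping shown clips values to the fences rather than to the 1st/99th percentiles, which is one possible variant of substitution.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences at Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def cap_outliers(values, k=1.5):
    """Clip every value to the fences instead of removing it."""
    low, high = iqr_fences(values, k)
    return [min(max(v, low), high) for v in values]

data = [12, 14, 15, 15, 16, 17, 18, 19, 95]  # hypothetical sample
low, high = iqr_fences(data)
outliers = [v for v in data if v < low or v > high]
capped = cap_outliers(data)
print(low, high, outliers)
```

Whether to remove, cap, transform or impute depends on whether the flagged values are errors or genuine extremes.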

  7. What is feature engineering?

Feature engineering can be defined as the science and art of extracting more information from existing data without adding new data.

  8. What are some common methods for Variable Transformation?

There are various methods that can be used to transform variables. Some of the most common are:

Logarithm: This is one of the most common variable transformation methods used to change the shape of the distribution of a variable. It is used to reduce the right skewness of variables and cannot be applied to zero or negative values.

Binning: It is used to categorize variables. It can be performed on original values, percentiles or frequencies. This can be deployed when defining categories like age groups, claim amounts etc.
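Both transformations can be sketched in a few lines. The claim amounts and the age cut points below are hypothetical and chosen only for illustration.

```python
import math

def log_transform(values):
    """Natural-log transform; rejects zero or negative inputs."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log(v) for v in values]

def bin_age(age):
    """Map an age to a category using hypothetical cut points."""
    if age < 18:
        return "under 18"
    if age < 35:
        return "18-34"
    if age < 60:
        return "35-59"
    return "60+"

claims = [100, 1_000, 10_000, 100_000]  # right-skewed amounts
logged = log_transform(claims)          # evenly spaced after the transform
groups = [bin_age(a) for a in [12, 18, 34, 47, 60]]
print(logged, groups)
```

The log turns multiplicative gaps into additive ones, which is why it pulls in a long right tail.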

  9. What are the most common model evaluation techniques?

There are various ways to evaluate the model performance. However, the most common techniques are:

  1. Confusion Matrix
  2. AUC – ROC
  3. Gain and Lift Chart
  4. RMSE (Root Mean Squared Error)

  10. What is Box Cox Transformation?

Box Cox transformations are used to bring non-normal data closer to a normal distribution with constant variance. The statisticians George Box and David Cox developed a procedure to identify an appropriate exponent (lambda) to use to transform data into a normal shape. A lambda of 1 leaves the shape of the data unchanged.
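For a fixed exponent, the standard one-parameter form of the transformation is (x^lambda - 1) / lambda when lambda is not 0, and log(x) when lambda is 0. The sketch below applies that formula for a lambda chosen by hand; it does not implement the search for the best lambda, and the function name `box_cox` is just an illustrative choice.

```python
import math

def box_cox(values, lam):
    """Apply the one-parameter Box Cox transform for a given lambda."""
    if any(v <= 0 for v in values):
        raise ValueError("Box Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

data = [1, 4, 9]
print(box_cox(data, 0.5))  # square-root-like transform
print(box_cox(data, 1))    # lambda = 1 only shifts the data by 1
```

Libraries such as SciPy provide `scipy.stats.boxcox`, which can also estimate lambda from the data.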

  11. How can you deal with different types of seasonality in time series modeling?

Seasonality occurs when a time series shows a repeated pattern over time, e.g. absenteeism increasing during the holiday season, or the number of claims filed increasing at the end of the year.

Seasonality makes the time series non-stationary because the average value of the variable differs across time periods. Differencing a time series is generally regarded as one of the best methods of removing seasonality. A seasonal difference is the numerical difference between a particular value and the value one periodic lag earlier (e.g. a lag of 12 for monthly data with a yearly pattern).
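The periodic-lag difference described above can be sketched as follows. The two years of monthly counts are hypothetical: the second year repeats the first with a constant increase of 5.

```python
def seasonal_difference(series, lag=12):
    """Subtract from each value the value observed `lag` steps earlier."""
    if lag <= 0 or lag >= len(series):
        raise ValueError("lag must be positive and shorter than the series")
    return [series[i] - series[i - lag] for i in range(lag, len(series))]

# Hypothetical monthly claim counts with a year-end peak.
year_one = [10, 12, 11, 13, 15, 18, 20, 19, 16, 14, 13, 25]
year_two = [v + 5 for v in year_one]
series = year_one + year_two

diffed = seasonal_difference(series, lag=12)
print(diffed)  # the repeating pattern is gone; only the year-on-year change is left
```

Note that the differenced series is shorter than the original by one seasonal period.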

  12. What is dimensionality reduction?

It is the process of reducing the number of random variables under consideration, using techniques like PCA (Principal Component Analysis) and Factor Analysis. The Variance Inflation Factor is often mentioned alongside them, though it is not a technique but a statistic that points towards multicollinearity amongst predictor variables.

  13. Why is Naive Bayes so “naïve”?

Naïve Bayes is a classification technique based on Bayes' Theorem with an assumption of independence among predictors. In other words, a Naïve Bayes classifier assumes that, within a class, the presence of a particular feature is independent of any other feature. This assumption is what makes the technique “naïve”.
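The independence assumption means the likelihood of several features together is simply the product of their individual likelihoods within a class. A small sketch, using invented class priors and per-feature probabilities (the spam/ham setting and every number below are hypothetical):

```python
def naive_bayes_posteriors(priors, likelihoods, observed):
    """Score each class as prior * product of per-feature likelihoods, then normalize."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature in observed:
            # Independence assumption: multiply each feature's likelihood separately.
            score *= likelihoods[cls][feature]
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}

# Hypothetical probabilities, not estimated from real data.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.7, "meeting": 0.1},
    "ham": {"offer": 0.2, "meeting": 0.5},
}

post = naive_bayes_posteriors(priors, likelihoods, ["offer", "meeting"])
print(post)
```

Real features are rarely fully independent, yet the classifier often works well enough despite that.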

  14. What is anomaly detection in time series?

Anomaly detection is the process of finding patterns in data that do not conform to a model of “normal behavior”. The most common approaches for detecting such changes use either manually calculated thresholds, or the mean and standard deviation, to determine when data deviates significantly from the mean.
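The mean-and-standard-deviation approach can be sketched like this, assuming a cut-off of 3 standard deviations (the threshold, the readings and the function name `find_anomalies` are illustrative assumptions):

```python
import statistics

def find_anomalies(series, threshold=3.0):
    """Return (index, value) pairs lying more than `threshold` std devs from the mean."""
    mean = statistics.mean(series)
    std = statistics.pstdev(series)
    if std == 0:
        return []  # a constant series has no deviations to flag
    return [(i, v) for i, v in enumerate(series) if abs(v - mean) / std > threshold]

readings = [20, 21, 19, 20, 22, 21, 20, 19, 21, 20, 60, 20]  # hypothetical values
anomalies = find_anomalies(readings)
print(anomalies)
```

Because the anomaly itself inflates the mean and standard deviation, this simple global version can miss points in short or noisy series; rolling windows are a common refinement.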

  15. What is combinatorics in data science?

It is a branch of mathematics concerning the study of finite or countable discrete structures. Combinatorics is used frequently in computer science to obtain formulas and estimates while analyzing algorithms. A mathematician who studies combinatorics is called a combinatorialist or a combinatorist.

  16. What is the difference between Mean Absolute Error and Mean Squared Error?

MAE (Mean Absolute Error) and MSE (Mean Squared Error) are both used to evaluate predictive models. MAE is more robust to outliers than MSE: MAE weights each error in proportion to its size, whereas MSE squares the errors and so emphasizes the extreme values.
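A quick comparison on hypothetical numbers shows the effect: one large miss raises MAE moderately but raises MSE sharply.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average of the squared errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

actual = [10, 12, 11, 13]
steady = [11, 11, 12, 12]        # every prediction is off by 1
with_outlier = [11, 11, 12, 23]  # same, except the last one is off by 10

print(mean_absolute_error(actual, steady), mean_squared_error(actual, steady))
print(mean_absolute_error(actual, with_outlier), mean_squared_error(actual, with_outlier))
```

With the single outlier, MAE rises from 1 to 3.25 while MSE rises from 1 to 25.75.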

