
Data Scientist Interview Question: What is Regularization and why is it useful?


In machine learning, the task is very often to fit a model to a set of training data and use the fitted model to make predictions or to classify new (out-of-sample) data points. Sometimes a model fits the training data very well but performs poorly when predicting out-of-sample data points. A model may be too complex and overfit, or too simple and underfit; either way it gives poor predictions.
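The trade-off is easy to see with a quick experiment. Below is a minimal sketch, assuming scikit-learn is available; the sine data, noise level and polynomial degrees are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy samples of a sine curve
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 1, 30)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 30)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Degree 1 underfits; degree 15 overfits, scoring almost perfectly
# on the training data but much worse on the held-out data.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree, model.score(X_train, y_train), model.score(X_test, y_test))
```

The gap between the training score and the test score of the high-degree model is exactly the overfitting problem that regularization addresses.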

What is Regularization?

Regularization is a way to avoid overfitting by penalizing large regression coefficients. It can be seen as a way to control the trade-off between bias and variance in favor of increased generalization. In simple terms, it shrinks parameters and simplifies the model, or selects the preferred level of model complexity, so the model is better at predicting (generalizing).

To apply regularization two things are required:

  • A way of quantifying how good a model is, e.g. cross-validation
  • A tuning parameter which enables changing the complexity of the model
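The two ingredients fit together as follows: cross-validation scores each candidate model, while the tuning parameter sweeps over levels of complexity. A minimal sketch using Ridge regression on scikit-learn's diabetes dataset; the alpha grid is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)

# Quantify model quality with 5-fold cross-validation while the
# tuning parameter alpha changes the strength of the penalty.
best_alpha, best_score = None, -np.inf
for alpha in (0.01, 0.1, 1.0, 10.0, 100.0):
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    if score > best_score:
        best_alpha, best_score = alpha, score

print('best alpha:', best_alpha, 'cross-validated R^2:', best_score)
```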

How does Regularization work?

In order to find the best model, the common approach in machine learning is to define a loss function that describes how well the model fits the data; the ultimate goal is to minimize this loss. Regularization works by adding a penalty term to this objective, most often a constant multiple of the norm of the weight vector, where the constant is the tuning parameter. The fitted model then minimizes the sum of the loss on the training data and the penalty.
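Written out, the regularized objective is just the original loss plus a penalty term. A hand-rolled sketch for the L2 (ridge) case; the toy data and the function name are illustrative:

```python
import numpy as np

def ridge_loss(w, X, y, alpha):
    """Mean squared error plus an L2 penalty on the weights."""
    residuals = X @ w - y
    return np.mean(residuals ** 2) + alpha * np.sum(w ** 2)

# Toy data: y is roughly 2x plus a little noise
rng = np.random.RandomState(0)
X = rng.normal(size=(50, 1))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=50)

w = np.array([2.0])
print(ridge_loss(w, X, y, alpha=0.0))  # plain data-fit loss
print(ridge_loss(w, X, y, alpha=1.0))  # penalized loss is larger
```

With alpha = 0 the objective reduces to the ordinary loss; increasing alpha makes large weights more expensive, which is what shrinks the coefficients.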

The most often used regularization methods:

  • Ridge Regression (L2)
  • Lasso (L1) – “Least Absolute Shrinkage and Selection Operator”
  • ElasticNet (a combination of L1 and L2)
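The practical difference between them shows up in the fitted coefficients: the L1 penalty (Lasso) can drive some coefficients exactly to zero, effectively performing feature selection, while the L2 penalty (Ridge) only shrinks them. A quick comparison on the diabetes dataset; alpha=1.0 is an arbitrary choice:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

for name, model in [('Ridge', Ridge(alpha=1.0)),
                    ('Lasso', Lasso(alpha=1.0)),
                    ('ElasticNet', ElasticNet(alpha=1.0))]:
    model.fit(X, y)
    # Count how many of the 10 coefficients the penalty zeroed out
    print(name, 'coefficients set to zero:', int(np.sum(model.coef_ == 0)))
```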

 

Example code for L1 regularization using Python:

 

from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.model_selection import train_test_split
import numpy as np

data = datasets.load_iris()
X = data['data']
y = data['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)

# C is the inverse of the regularization strength: smaller C means a
# stronger L1 penalty. The liblinear solver supports the L1 penalty.
for C in np.arange(0.1, 1.0, 0.1):
    clf = LogisticRegression(penalty='l1', C=C, solver='liblinear')
    clf.fit(X_train, y_train)
    print('C:', C)
    print('Accuracy:', clf.score(X_test, y_test))
    print('')

Was the above useful? Please share with others on social media.

If you want to look for more information, check some free online courses available at coursera.org, edx.org or udemy.com.

Recommended reading list:

 

Data Science Interviews Exposed

Data Science Interviews Exposed offers data science career advice and REAL interview questions to help you get the six-figure salary jobs! A data science job is extremely rewarding. It empowers you to make a real impact in the world! And besides, it offers competitive salaries and develops your creative as well as quantitative skills. No wonder the data science job is rated as one of the sexiest jobs of the 21st century.
The Data Science Handbook: Advice and Insights from 25 Amazing Data Scientists

The Data Science Handbook contains interviews with 25 of the world's best data scientists. We sat down with them, had in-depth conversations about their careers, personal stories, perspectives on data science and life advice. In The Data Science Handbook, you will find war stories from DJ Patil, US Chief Data Officer and one of the founders of the field. You'll learn from industry veterans such as Kevin Novak and Riley Newman, who head the data science teams at Uber and Airbnb respectively.
Getting a Big Data Job For Dummies

Hone your analytic talents and become part of the next big thing
Getting a Big Data Job For Dummies is the ultimate guide to landing a position in one of the fastest-growing fields in the modern economy. Learn exactly what "big data" means, why it's so important across all industries, and how you can obtain one of the most sought-after skill sets of the decade. This book walks you through the process of identifying your ideal big data job, shaping the perfect resume, and nailing the interview, all in one easy-to-read guide.
A collection of Data Science Interview Questions Solved in Python and Spark: Hands-on Big Data and Machine Learning (A Collection of Programming Interview Questions) (Volume 6)

Developing Analytic Talent: Becoming a Data Scientist

Harvard Business Review calls it the sexiest tech job of the 21st century. Data scientists are in demand, and this unique book shows you exactly what employers want and the skill set that separates the quality data scientist from other talented IT professionals. Data science involves extracting, creating, and processing data to turn it into business value.
Practical Statistics for Data Scientists: 50 Essential Concepts

Statistical methods are a key part of data science, yet very few data scientists have any formal statistics training. Courses and books on basic statistics rarely cover the topic from a data science perspective. This practical guide explains how to apply various statistical methods to data science, tells you how to avoid their misuse, and gives you advice on what's important and what's not.
Python Data Science Handbook: Essential Tools for Working with Data

For many researchers, Python is a first-class tool mainly because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the Python Data Science Handbook do you get them all—IPython, NumPy, Pandas, Matplotlib, Scikit-Learn, and other related tools.
Doing Data Science: Straight Talk from the Frontline

Now that people are aware that data can make the difference in an election or a business model, data science as an occupation is gaining ground. But how can you get started working in a wide-ranging, interdisciplinary field that’s so clouded in hype? This insightful book, based on Columbia University’s Introduction to Data Science class, tells you what you need to know.