Top 9 Data Science Projects for a Beginner in 2020 (Shared Article from The Medium)

Written by Rashi Desai. For the original article, click here.

The ultimate project list to learn new skills and strengthen your portfolio.

With countries gradually opening up in baby steps and with a few more weeks to be in the “quarantine”, take this time in isolation to learn new skills, read books, and improve yourself.

While the intellectuals keep saying “it’s not a race to be productive”, for those interested in data analytics, data science or anything related to data, I thought let’s make a list of top 9 data science projects to do during your spare time, in no particular order!

1. Credit Card Fraud Detection

The number of credit card owners is projected to reach close to 1.2 billion by 2022. To ensure the security of credit card transactions, it is essential to monitor fraudulent activity. Credit card companies should be able to recognize fraudulent credit card transactions so that customers are not charged for items they did not purchase.

A credit card dataset contains a mix of fraud as well as non-fraudulent transactions and the target is to predict if a given test transaction is fraudulent or not.

Algorithms to be used:

Since the target variable is categorical, the problem can be solved with a range of machine learning algorithms, such as:

  1. Logistic regression
  2. Decision trees
  3. Neural networks
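As a quick sketch of the first option, here's what a logistic regression baseline could look like. The data below is entirely synthetic (the two features and the "fraud" labeling rule are made up for illustration); in a real project you'd work with an actual transactions dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
# Two synthetic features standing in for, say, scaled amount and time gap
X = rng.normal(size=(n, 2))
# Roughly 5% of transactions labeled fraudulent via a simple made-up rule
y = (X[:, 0] + X[:, 1] > 2.3).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
# class_weight="balanced" counters the heavy class imbalance typical of fraud data
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```

With real fraud data you'd also look at precision/recall rather than plain accuracy, since fraud is rare.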

2. Customer Segmentation

Customer Segmentation is the process of splitting a customer base into groups of individuals who are similar in the ways a product is, or can be, marketed to them: gender, age, interests, demographics, economic status, geography, behavioral patterns, spending habits, and much more.

Customer Segmentation is one of the most important applications of unsupervised learning. Using clustering techniques, companies can identify segments of customers with similar behavior, allowing them to target a potential user base.

Algorithms to be used:

K-means clustering and hierarchical clustering are the top clustering methods. Some of the other clustering approaches are:

  1. Partitioning method
  2. Fuzzy clustering
  3. Density-based clustering
  4. Model-based clustering

Furthermore, once the data is collected, companies can gain a deeper understanding of customer preferences and requirements for discovering valuable segments that would reap them maximum profit. This way, they can strategize their marketing techniques more efficiently and minimize the possibility of risk to their investment.
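Here's a minimal K-means sketch on made-up customer data (the spend and visit-frequency numbers below are invented to form three rough groups; a real project would use an actual customer dataset):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic customers: annual spend ($) and visit frequency for three rough groups
spend = np.concatenate([rng.normal(200, 30, 50),
                        rng.normal(800, 60, 50),
                        rng.normal(1500, 80, 50)])
visits = np.concatenate([rng.normal(2, 1, 50),
                         rng.normal(10, 2, 50),
                         rng.normal(25, 3, 50)])
X = np.column_stack([spend, visits])

# Fit K-means with k=3; in practice you'd pick k via the elbow method or silhouette score
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centres:\n", km.cluster_centers_)
```

Each cluster centre then describes a segment (e.g. low-spend occasional visitors vs. high-spend regulars) that marketing can target separately.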

3. Sentiment Analysis

Sentiment, defined as a view of or attitude toward a situation or event, is a vital topic in the field of Data Science. It has become by far one of the hottest topics in the field, given its relevance in today's age of social media and the number of business problems it can solve.

With the help of sentiment analysis, you can find out the nature of opinion reflected in documents, websites, social media timelines, and so on. Human sentiment spans a wide range: happy, sad, angry, positive, negative, depressed, hateful, loving, and more.

Today, any data-driven organization would want to draw on the outcomes of a sentiment analysis model to determine the attitude of its consumers and target customers toward its products or services.

Twitter sentiment analysis is a model worth running continuously, and even some intelligence agencies perform sentiment analysis.

Algorithms to be used:

  1. Naive Bayes
  2. Decision trees
  3. The tidytext package (in R)
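A Naive Bayes sentiment classifier can be sketched in a few lines. The tiny labeled corpus below is invented purely for illustration; a real project would train on something like a Twitter or product-review dataset:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training texts with positive/negative labels
texts = ["loved the product great quality", "terrible service very disappointed",
         "happy with my purchase", "awful experience never again",
         "great support team", "disappointed and angry"]
labels = ["pos", "neg", "pos", "neg", "pos", "neg"]

# Bag-of-words features feeding a multinomial Naive Bayes classifier
vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vec.transform(["great quality and happy"])))
```

The same pipeline scales to thousands of tweets; you'd mainly swap in better preprocessing (stop words, TF-IDF) and a real dataset.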

4. Speech Emotion Recognition

Much of what humans do is governed by speech and the emotions attached to a scene, a product, or an experience.

SER, an acronym for Speech Emotion Recognition, can be a compelling Data Science project to do this summer. It attempts to perceive human emotions from speech (voice samples). Different sound files are used as the dataset for detecting human emotion. SER essentially focuses on feature extraction to recover emotion from audio recordings.

While working on the project in Python, you would also pick up the Librosa package, used for analyzing music and audio.

Vox Celebrity Dataset can be a good starting point to perform Speech Emotion Recognition.

Algorithms to be used:

  1. Convolutional Neural Network (CNN)
  2. Recurrent neural networks (RNN)
  3. Neural Network (NN)
  4. Gaussian mixture model (GMM)
  5. Support Vector Machine (SVM)
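As a sketch of the SVM route: in a real project you'd extract MFCC features from audio with Librosa (e.g. `librosa.feature.mfcc`) and classify them. To keep this self-contained, the 13-dimensional "MFCC" vectors below are synthetic stand-ins, and the two emotion classes and their separation are invented for illustration:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Stand-ins for per-clip MFCC feature vectors (13 coefficients is a common choice);
# two made-up "emotion" classes with shifted feature means
calm = rng.normal(0.0, 1.0, size=(40, 13))
angry = rng.normal(3.0, 1.0, size=(40, 13))
X = np.vstack([calm, angry])
y = ["calm"] * 40 + ["angry"] * 40

# RBF-kernel SVM trained on the feature vectors
clf = SVC(kernel="rbf").fit(X, y)

# Classify a new clip's (synthetic) features
sample = rng.normal(3.0, 1.0, size=(1, 13))
print(clf.predict(sample))
```

With real audio, the only change is that `X` comes from averaging MFCC frames per clip instead of a random generator.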

5. Predictive Analytics

The purpose of predictive analytics is to make predictions about unknown events of the future.

It encompasses a variety of statistical techniques, from predictive modeling and machine learning to data mining, which analyze current and historical facts to identify risks and opportunities.


  1. Loan Prediction Data: Predict if a loan will get approved or not
  2. Forecasting HVAC needs: Combine weather forecast with building system
  3. Customer Relationship Management
  4. Clinical decision support systems
  5. Customer and Employee Retention: churn rates
  6. Project Risk Management
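Taking the first idea as an example, a loan-approval predictor can start as simply as a decision tree. The applicant records below (income, loan amount, credit history) are entirely hypothetical; a real project would use a dataset like the Loan Prediction data mentioned above:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical applicant records: [income (k$), loan amount (k$), credit history (1 = good)]
X = [[60, 10, 1], [25, 30, 0], [80, 20, 1], [20, 25, 0],
     [45, 15, 1], [30, 40, 0], [90, 35, 1], [15, 20, 0]]
y = ["approved", "rejected", "approved", "rejected",
     "approved", "rejected", "approved", "rejected"]

# A shallow tree keeps the decision rules human-readable
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Score a new (hypothetical) applicant
print(clf.predict([[70, 12, 1]]))
```

On real data you'd add categorical encoding, a train/test split, and cross-validation before trusting the tree's rules.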

6. Time series Analysis and Modeling

Time series is a series of data points indexed, listed or graphed in time order.

Time Series is one of the most commonly used techniques in data science, with a wide range of applications: weather forecasting, predicting sales, analyzing yearly trends, predicting transactions, website traffic, competitive position, and more.

Businesses, time and again, work on time series data to analyze the numbers of the future.

From time series analysis, we can look into ads watched per hour, in-game currency spend per day, change in product trends, etc.
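A simple place to start is smoothing a series with a moving average before attempting forecasting models. The daily-traffic numbers below are synthetic (a made-up upward trend plus noise), just to show the mechanics:

```python
import numpy as np

rng = np.random.default_rng(7)
# Hypothetical daily website traffic: upward trend plus noise
traffic = 100 + np.arange(60) * 2 + rng.normal(0, 5, 60)

window = 7
# Trailing 7-day moving average smooths out daily noise to expose the trend
smoothed = np.convolve(traffic, np.ones(window) / window, mode="valid")

# Naive forecast: carry the last smoothed level forward
print(f"last observed: {traffic[-1]:.1f}, 7-day average forecast: {smoothed[-1]:.1f}")
```

From here, proper time series models (ARIMA, exponential smoothing, Prophet) handle trend and seasonality explicitly instead of just averaging them away.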

7. Regression Analysis

The purpose of regression analysis is to predict an outcome based on historical data.

Regression analysis is a robust statistical test that allows examination of the relationship between two or more variables of interest. While there are many types of regression analysis, at the core, all examine the influence of one or more independent variables on a target (dependent) variable.


  1. Walmart sales data: Predict the sales of a store
  2. Boston housing data: Predict the median value of owner-occupied homes
  3. Wine Quality prediction: Predict the quality of the wine
  4. Black Friday Sales prediction: Predict purchase amount for a household

Algorithms to be used:

The choice depends on the nature of the target variable, numeric or categorical:

  1. CART (factor target)
  2. Decision Trees (factor target)
  3. Linear Regression (numeric target)
  4. Logistic Regression (factor target)
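For a numeric target, linear regression is the workhorse. The housing-style data below is synthetic (floor area vs. price with a made-up slope of $1000 per square metre plus noise), echoing the Boston housing idea above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
# Hypothetical housing data: floor area (m^2) vs. price ($), with noise
area = rng.uniform(50, 200, size=(100, 1))
price = 1000 * area[:, 0] + 20000 + rng.normal(0, 5000, 100)

# Fit price = coef * area + intercept
model = LinearRegression().fit(area, price)
print(f"price per m^2: {model.coef_[0]:.0f}, intercept: {model.intercept_:.0f}")
```

The fitted coefficient should recover something close to the $1000/m^2 baked into the synthetic data; on a real dataset you'd inspect residuals and R^2 rather than knowing the truth in advance.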

8. Recommender Systems

A recommendation system is a platform that uses a filtering process to provide its users with content based on their preferences and likes.

A recommendation system takes information about the user as input and returns recommendations from an evaluation of parameters by a Machine Learning model. Recommendation systems are all around you, from Amazon to Zappos, and are a quintessential machine learning technique for data scientists to know.

For example, Netflix recommends movies or shows similar to your viewing history, or ones watched in the past by other users whose viewing history is similar to yours.

There are two types of recommendation systems:

  1. Content-Based Recommendation System: Provides recommendations with respect to the data that a user provides. Based on that data, a user profile is generated, which is then used to make suggestions to the user. As the user provides more inputs or acts on the recommendations, the engine becomes more and more accurate.
  2. Collaborative Filtering Recommendation: Provides recommendations with respect to other users who have a similar viewing history or preferences.
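The collaborative filtering idea can be sketched with plain cosine similarity on a user-item rating matrix. The ratings below are made up (four users, four items, 0 meaning unrated):

```python
import numpy as np

# Hypothetical user-item rating matrix (rows: users, cols: items; 0 = unrated)
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = 0  # recommend for user 0
sims = np.array([cosine(R[target], R[u]) for u in range(len(R))])
sims[target] = 0.0  # exclude the user's own row

# Predict scores as similarity-weighted averages of the other users' ratings
scores = sims @ R / sims.sum()
unrated = np.where(R[target] == 0)[0]
best = unrated[np.argmax(scores[unrated])]
print(f"recommend item {best} for user 0")
```

Real systems add mean-centering, sparsity-aware storage, and matrix factorization, but the nearest-neighbour intuition is exactly this.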

9. Exploratory Data Analysis

Exploratory Data Analysis (EDA) is the first step in a data analysis process. Here, you make sense of the data you have, figure out what questions you want to ask and how to frame them, and work out how best to manipulate the data to get the answers you need.

EDA offers a broad look at patterns, trends, outliers, unexpected results, and so on in existing data, using visual and quantitative methods. There are tons of projects that can be done with Exploratory Data Analysis. Here are a few listed for reference or as a good starting point:


  1. Global Suicide Rates (dataset)
  2. Summer Olympic Models (dataset)
  3. World Happiness Report (dataset)
  4. Nutrition Facts for McDonald’s Menu (dataset)
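Taking the last dataset idea as a flavour of what EDA looks like in pandas: the five menu items below are a tiny invented slice (not the real McDonald's data), just to show the typical first commands:

```python
import pandas as pd

# Tiny hypothetical slice of a nutrition-facts dataset
df = pd.DataFrame({
    "item": ["Burger", "Salad", "Fries", "Shake", "Wrap"],
    "category": ["Main", "Side", "Side", "Drink", "Main"],
    "calories": [550, 150, 340, 560, 430],
})

print(df.describe())                               # summary stats for numeric columns
print(df.groupby("category")["calories"].mean())   # average calories per category
print(df.loc[df["calories"].idxmax(), "item"])     # highest-calorie item
```

On the real dataset you'd follow these with histograms, box plots by category, and correlation checks to surface the patterns and outliers EDA is after.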
Shared by CJ Sanchez (He/Him), Career Coach