# 10 datasets for beginners

As a beginner, learning Machine Learning and Data Science can be a mountain of a task. Thankfully, there are a few datasets that can help you build confidence and hone your skills!

Here are 10 datasets that I think are suited for beginners –

## 1. Beginner’s Classification Dataset

It’s as the name suggests. This dataset is for beginners and deals with a classification problem.

This beginner-friendly binary classification dataset contains a .csv file with pre-cleaned data – ideal for beginners who want to test out new algorithmic approaches to classification problems. The dataset also comes with a notebook that can help you visualize the decision boundary between the two classes.

Check the dataset out for yourself.
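If you want a feel for what "visualizing the decision boundary" involves, here is a minimal sketch. It does not use the dataset's own .csv or notebook; instead it assumes a synthetic two-feature, two-class sample generated with scikit-learn, and logistic regression is just one possible classifier to try.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a pre-cleaned two-class CSV (NOT the actual dataset).
X, y = make_blobs(
    n_samples=200,
    centers=[[-2.0, -2.0], [2.0, 2.0]],
    cluster_std=1.0,
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = LogisticRegression()
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

# For a linear model the decision boundary is the line
# w0 * x0 + w1 * x1 + b = 0, which we can solve for x1.
w0, w1 = clf.coef_[0]
b = clf.intercept_[0]


def boundary_x1(x0):
    """Return the x1 coordinate of the decision boundary at a given x0."""
    return -(w0 * x0 + b) / w1


print(f"test accuracy: {accuracy:.2f}")
print(f"boundary passes through (0, {boundary_x1(0.0):.2f})")
```

Plotting the points together with the line from `boundary_x1` would give you the same kind of picture a decision-boundary notebook produces.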

## 2. Car Price Prediction

This dataset provides practice for Multiple Linear Regression, data correction, feature encoding, data visualization, and feature selection.

Using multiple feature variables, you are to understand which factors significantly affect a car’s price and use these features to predict a car’s price.

Check the dataset out for yourself.
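As a rough sketch of the workflow described above, the snippet below one-hot encodes a categorical feature and fits a multiple linear regression. The table is made up for illustration: the column names (`horsepower`, `curb_weight`, `fuel_type`, `price`) and all values are assumptions, not taken from the actual dataset.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny made-up table; names and numbers are illustrative only.
cars = pd.DataFrame({
    "horsepower": [70, 90, 110, 130, 150, 170, 95, 140],
    "curb_weight": [1900, 2100, 2300, 2500, 2800, 3000, 2200, 2600],
    "fuel_type": ["gas", "gas", "diesel", "gas", "diesel", "gas", "diesel", "gas"],
    "price": [16400, 19700, 23900, 25600, 30300, 32200, 22100, 26900],
})
features = cars[["horsepower", "curb_weight", "fuel_type"]]
target = cars["price"]

# Feature encoding: turn the text column into 0/1 indicator columns,
# and pass the numeric columns through unchanged.
preprocess = ColumnTransformer(
    [("fuel", OneHotEncoder(handle_unknown="ignore"), ["fuel_type"])],
    remainder="passthrough",
)

model = make_pipeline(preprocess, LinearRegression())
model.fit(features, target)

# R^2 on the training rows; on real data, score a held-out split instead.
r2 = model.score(features, target)
predicted = model.predict(features)
print(f"training R^2: {r2:.3f}")
```

With only eight rows this says nothing about real car prices; it is only meant to show how the encoding and regression steps fit together.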

## 3. Iris Dataset

The Iris flowers dataset is one of the most popular datasets out there, so popular that sklearn ships with it built in.

The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor), for 150 samples in total. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. This dataset became a typical test case for many statistical classification techniques in machine learning, such as support vector machines.

To import this dataset from sklearn just check this link out.
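For reference, loading the built-in copy takes only a few lines. This is a minimal sketch using scikit-learn's `load_iris`; the variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target  # feature matrix and integer class labels

print(X.shape)             # (150, 4)
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # the three species

# Number of samples per species.
samples_per_class = np.bincount(y)
print(samples_per_class)
```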

## 4. Kepler Exoplanet Search Results

The Kepler Space Observatory is a NASA-built satellite that was launched in 2009. The telescope was dedicated to searching for exoplanets in star systems beyond our own, with the ultimate goal of possibly finding other habitable planets. The original mission ended in 2013 due to mechanical failures, but the telescope resumed observations in 2014 on a “K2” extended mission, which ran until the spacecraft was retired in 2018.

This dataset is a cumulative record of all observed Kepler “objects of interest”: basically, all of the approximately 10,000 exoplanet candidates Kepler has observed. The dataset is excellent practice for working with classification models like Random Forest and XGBoost.

Check the dataset out for yourself.
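To give an idea of what the Random Forest practice might look like, here is a small sketch. It assumes synthetic data from scikit-learn's `make_classification` as a stand-in, since the Kepler columns are not described here; on the real table you would swap in the actual features and label.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in (NOT Kepler data): 500 rows, 10 numeric features,
# binary label such as "candidate confirmed or not".
X, y = make_classification(
    n_samples=500,
    n_features=10,
    n_informative=5,
    class_sep=1.5,
    random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

test_accuracy = forest.score(X_test, y_test)
# Impurity-based importances: one value per feature, summing to 1.
importances = forest.feature_importances_

print(f"test accuracy: {test_accuracy:.2f}")
print("most important feature index:", importances.argmax())
```

Feature importances are a handy first look at which columns the forest relies on, which is useful on a wide table like this one.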

## 5. Heart Failure Prediction Dataset

Cardiovascular diseases (CVDs) are the number one cause of death globally, taking an estimated 17.9 million lives each year and accounting for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

This dataset is another type of classification problem. Check the dataset out for yourself.

## 6. Fake news and real news

With the advancements in technology, news has become much more accessible than it ever was. This has also caused a huge increase in the amount of fake news.

Training a model to identify fake news would be a very powerful tool to have. And this dataset allows you to take a step towards this.
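One common starting point for this kind of text classification is TF-IDF features fed into a linear classifier. The sketch below assumes that approach and uses six made-up headlines rather than the actual dataset, so treat it as an outline of the pipeline and not as a working fake news detector.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up headlines for illustration only (NOT from the dataset).
texts = [
    "Shocking miracle cure doctors do not want you to know",
    "Aliens secretly control the world government, insider claims",
    "You will not believe this one weird trick to get rich",
    "City council approves budget for new public library",
    "Central bank keeps interest rates unchanged this quarter",
    "Local team wins regional championship after close final",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = fake, 0 = real (assumed encoding)

# TF-IDF turns each text into a weighted word-count vector;
# logistic regression then learns which words point to which class.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

predictions = model.predict(["Miracle trick cures everything, insider claims"])
print(predictions)
```

On the real data you would hold out a test split and look at precision and recall, since a handful of training sentences cannot generalize.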

## 7. Netflix Movies and TV Shows

Netflix is one of the most popular media and video streaming platforms. It has over 8,000 movies and TV shows available on its platform and, as of mid-2021, over 200M subscribers globally.

This tabular dataset consists of listings of all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, and duration.

This dataset can be helpful in practicing Data Visualization. So try using Plotly and Tableau for this! Click here to view the dataset.
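Before reaching for a plotting tool, it usually helps to aggregate the listings into something chart-ready. Here is a small pandas sketch; the five rows and the column names (`title`, `type`, `release_year`) are assumptions for illustration, not the real listings.

```python
import pandas as pd

# Made-up rows; column names are assumed, not confirmed from the dataset.
titles = pd.DataFrame({
    "title": ["A", "B", "C", "D", "E"],
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show"],
    "release_year": [2019, 2020, 2020, 2021, 2021],
})

# Count titles per release year and type, one column per type.
counts = (
    titles.groupby(["release_year", "type"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```

A table shaped like `counts` (years as rows, one column per type) maps directly onto a grouped bar chart in most plotting tools.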

## 8. Students’ Performance in Exams

As the name suggests, this dataset consists of the marks students scored in various subjects. This is another dataset that’s suited for Data Visualization.

## 9. The MNIST database of handwritten digits

The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.
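If you would like to try the idea before downloading MNIST itself, the sketch below uses scikit-learn's bundled `load_digits` set as a stand-in. Note that this is a different, much smaller dataset (1,797 images of 8x8 pixels), and k-nearest neighbors is simply one easy classifier to start with.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Small 8x8 digits set bundled with scikit-learn (NOT the MNIST database).
digits = load_digits()
X, y = digits.data, digits.target  # each row is a flattened 8x8 image

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Classify each test image by majority vote among its 3 closest
# training images (Euclidean distance on raw pixel values).
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

accuracy = knn.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

The same fit-and-score pattern carries over to the full MNIST images once you have them loaded as flattened pixel arrays.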