10 datasets for beginners

As a beginner, learning Machine Learning and Data Science can be a mountain of a task. Thankfully, a few datasets can help you build confidence and hone your skills!

Here are 10 datasets that I think are suited for beginners:

1. Beginner’s Classification Dataset

It’s as the name suggests. This dataset is for beginners and deals with a classification problem.

This beginner-friendly binary classification dataset contains a .csv file with pre-cleaned data – ideal for beginners who want to test out new algorithmic approaches to classification problems. The dataset also comes with a notebook that can help you visualize the decision boundary between the two classes.

Check the dataset out for yourself.

2. Car Price Prediction

This dataset provides practice for Multiple Linear Regression, data correction, feature encoding, data visualization, and feature selection.

Using multiple feature variables, the goal is to work out which factors significantly affect a car’s price and then use those features to predict it.

Check the dataset out for yourself.
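To give a feel for what feature encoding plus Multiple Linear Regression looks like in practice, here is a minimal sketch using pandas and scikit-learn. The rows and column names (`horsepower`, `curb_weight`, `fuel_type`, `price`) are made up for illustration and are not taken from the actual dataset, so swap in the real columns once you have downloaded the .csv.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy rows for illustration only; NOT values from the real dataset.
cars = pd.DataFrame({
    "horsepower": [70, 95, 110, 150, 200, 88],
    "curb_weight": [1900, 2200, 2500, 2900, 3300, 2100],
    "fuel_type": ["gas", "gas", "diesel", "gas", "diesel", "gas"],
    "price": [7400, 9100, 12100, 13200, 18000, 8700],
})

# Feature encoding: turn the categorical column into 0/1 indicator columns.
# drop_first=True drops one category ("diesel") to avoid redundant columns.
X = pd.get_dummies(
    cars.drop(columns="price"),
    columns=["fuel_type"],
    drop_first=True,
    dtype=float,
)
y = cars["price"]

# Multiple Linear Regression: one coefficient per feature column.
model = LinearRegression().fit(X, y)
coefficients = dict(zip(X.columns, model.coef_))

for name, value in coefficients.items():
    print(f"{name}: {value:.2f}")
print("R^2 on the training rows:", round(model.score(X, y), 3))
```

On the real data you would also hold out a test set and compare different feature subsets, which is where the feature selection practice comes in.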

3. Iris Dataset

The Iris flowers dataset is one of the most popular datasets out there. It is so popular that sklearn ships with it built in.

The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor), for 150 samples in total. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. This dataset became a typical test case for many statistical classification techniques in machine learning, such as support vector machines.

To import this dataset from sklearn just check this link out.
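As a rough sketch of the import, the snippet below loads Iris with scikit-learn’s `load_iris` and fits a support vector machine on it. The 70/30 split, the `random_state`, and the RBF kernel are arbitrary choices made for this example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load the copy of Iris bundled with scikit-learn (no download needed).
iris = load_iris()
X, y = iris.data, iris.target  # X has 150 rows and 4 feature columns

# Hold out 30% of the samples for testing, keeping the species balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit a support vector machine and measure accuracy on the held-out samples.
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)

print("Shape:", X.shape)
print("Species:", list(iris.target_names))
print("Features:", iris.feature_names)
print("Test accuracy:", round(accuracy, 3))
```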

4. Kepler Exoplanet Search Results

The Kepler Space Observatory is a NASA-built space telescope that was launched in 2009. It was dedicated to searching for exoplanets in star systems other than our own, with the ultimate goal of finding other habitable planets. The original mission ended in 2013 due to mechanical failures, but the telescope returned to service in 2014 on a “K2” extended mission, which ran until the spacecraft was retired in 2018.

This dataset is a cumulative record of all observed Kepler “objects of interest”: basically, all of the approximately 10,000 exoplanet candidates that Kepler has observed. It is excellent practice for working with classification models like Random Forest and XGBoost.

Check the dataset out for yourself.

5. Heart Failure Prediction Dataset

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year and accounting for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one-third of these deaths occur prematurely in people under 70 years of age. Heart failure is a common event caused by CVDs, and this dataset contains 11 features that can be used to predict possible heart disease.

This dataset is another type of classification problem. Check the dataset out for yourself.

6. Fake news and real news

With the advancements in technology, news has become much more accessible than it ever was. This has also caused a huge increase in the amount of fake news.

A model trained to identify fake news would be a very powerful tool to have, and this dataset lets you take a step towards building one.

Click here to view the dataset.

7. Netflix Movies and TV Shows

Netflix is one of the most popular media and video streaming platforms. It has over 8000 movies and TV shows available on its platform and, as of mid-2021, over 200M subscribers globally.

This tabular dataset consists of listings of all the movies and TV shows available on Netflix, along with details such as cast, directors, ratings, release year, duration, etc.

This dataset can be helpful in practicing Data Visualization. So try using Plotly and Tableau for this! Click here to view the dataset.

8. Students’ Performance in Exams

As the name suggests, this data set consists of the marks secured by the students in various subjects. This is another dataset that’s suited for Data Visualization.

Click here to check it out.

9. The MNIST database of handwritten digits

The MNIST database of handwritten digits, available from this page, has a training set of 60,000 examples, and a test set of 10,000 examples. It is a subset of a larger set available from NIST. The digits have been size-normalized and centered in a fixed-size image.

It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting.

Click here to check the dataset out.

10. CelebFaces Attributes

A popular area of computer vision and deep learning revolves around identifying faces for various applications, from logging into your phone with your face to searching through surveillance images for a particular suspect.

This dataset is great for training and testing models for face detection, and particularly for recognizing facial attributes, such as finding people who have brown hair, are smiling, or are wearing glasses. Images cover large pose variations, background clutter, and diverse people, supported by a large number of images and rich annotations. The dataset has 202,599 face images of various celebrities and 10,177 unique identities, but the names of identities are not given.

Click here to view the dataset.
