Prepare the Data
Since ML models are trained on data, the quality of that data directly affects the quality and usefulness of the resulting models. Many open source datasets let you jump quickly into exploratory data analysis and feature selection. However, training data may first have to be collected, cleansed, structured, transformed, enriched, and validated.
Practice Goals:
For this practice session, the training data is available, but random noise in the form of bad samples has been added to it. Inspect the dataset and clean it up. You are also tasked with gathering new samples, or splitting out existing ones, for testing the trained model, and with keeping the files and folders well organized.
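A first inspection pass can be partly automated. The sketch below is one possible approach, not the course's prescribed method: it assumes the samples are JPEG, PNG, or GIF files and flags anything that is empty or does not start with one of those file signatures. The function name `find_suspect_files` and the signature list are illustrative. Samples that are valid images of the wrong subject still need a visual check.

```python
from pathlib import Path

# Leading bytes of common image formats.
# Assumption: the dataset contains only JPEG, PNG, or GIF files.
IMAGE_SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n", b"GIF87a", b"GIF89a")


def find_suspect_files(root):
    """Return files under root that are empty or lack a known image signature."""
    suspects = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Read only the first few bytes; enough to compare against the signatures.
        with path.open("rb") as handle:
            header = handle.read(8)
        if not header.startswith(IMAGE_SIGNATURES):
            suspects.append(path)
    return suspects
```

Calling `find_suspect_files("train")` returns a list of paths to review by hand before deleting anything.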
Hands-On Activities:
- Download the Weather dataset.
- Create the top-level folder structure: [train] [val]
- Within the [train] folder, create one sub-folder per class: [cloudy] [rain] [shine] [sunrise]
- Place all training samples in their respective sub-folders.
- Clean the data by removing bad training samples.
- Discover the shape of the training data.
- Gather new samples, or split off existing ones, for testing and place them in [val].
- Conduct a final review of the prepared data.
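The folder setup, the per-class counts, and the split can be sketched with the standard library alone. This is a minimal sketch under stated assumptions: that [val] mirrors the class sub-folders of [train], that 20% of each class is held out, and that the helper names (`make_folders`, `class_counts`, `split_validation`) are illustrative rather than part of the exercise. The counts cover only one aspect of the data's shape; reading image dimensions would need an image library.

```python
import random
import shutil
from pathlib import Path

CLASSES = ("cloudy", "rain", "shine", "sunrise")


def make_folders(root):
    """Create train/ and val/, each with one sub-folder per class.

    Assumption: val/ mirrors the class layout of train/.
    """
    for split in ("train", "val"):
        for name in CLASSES:
            (Path(root) / split / name).mkdir(parents=True, exist_ok=True)


def class_counts(split_dir):
    """Return the number of files in each class sub-folder of split_dir."""
    counts = {}
    for name in CLASSES:
        folder = Path(split_dir) / name
        counts[name] = sum(1 for p in folder.iterdir() if p.is_file())
    return counts


def split_validation(root, fraction=0.2, seed=0):
    """Move a random fraction of each training class into val/.

    Assumption: a 20% hold-out by default; the seed keeps the split repeatable.
    """
    rng = random.Random(seed)
    for name in CLASSES:
        source = Path(root) / "train" / name
        files = sorted(p for p in source.iterdir() if p.is_file())
        rng.shuffle(files)
        n_val = int(len(files) * fraction)
        for path in files[:n_val]:
            shutil.move(str(path), str(Path(root) / "val" / name / path.name))
```

A typical sequence is `make_folders(".")`, then sorting the downloaded samples into `train/`, cleaning them, and finally calling `split_validation(".")` and comparing `class_counts("train")` with `class_counts("val")` as part of the final review.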