Prepare the Data

Since ML models are trained on data, the quality of that data directly affects their quality and usefulness. Many open source datasets let you jump quickly into exploratory data analysis and feature selection. In practice, however, training data may first have to be collected, cleansed, structured, transformed, enriched and validated.


Practice Goals:

For this practice session, the data for training the model is provided, but random noise in the form of bad samples has been added. Inspect the dataset and clean it up. You are also tasked with gathering new samples, or splitting out existing ones, for testing the trained model, and with keeping the files and folders well organized.


Hands-On Activities:

  • Download the Weather dataset.
  • Create top-level folder structure: [train] [val]
  • Within [train] folder create structure: [cloudy] [rain] [shine] [sunrise]
  • Place all training samples in their respective sub-folders.
  • Clean the data by removing bad training samples.
  • Discover the shape of the training data.
  • Gather new or split existing samples for testing and place them in [val].
  • Conduct a final review of the prepared data.
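
The folder, cleaning, counting and splitting steps above can be sketched in Python using only the standard library. This is a minimal illustration under stated assumptions, not a prescribed solution: the helper names (`make_structure`, `is_bad_sample`, `clean_training_data`, `class_counts`, `split_validation`), the `rejected` quarantine folder, the 20% split fraction, and the class sub-folders inside [val] are all hypothetical choices. The sketch also assumes a "bad" sample is any file that does not start with a JPEG or PNG signature; the actual noise in the dataset may need a different check, such as visual inspection. Sorting the downloaded files into their class folders is left out because it depends on how the download is organized.

```python
import random
import shutil
from pathlib import Path

# Class folders named in the activity list.
CLASSES = ["cloudy", "rain", "shine", "sunrise"]

# Assumption: valid samples are JPEG or PNG files, identified by their leading bytes.
SIGNATURES = (b"\xff\xd8\xff", b"\x89PNG\r\n\x1a\n")


def make_structure(root):
    """Create [train] and [val], each with one sub-folder per class."""
    root = Path(root)
    for split in ("train", "val"):
        for cls in CLASSES:
            (root / split / cls).mkdir(parents=True, exist_ok=True)
    return root


def is_bad_sample(path):
    """Return True if the file is empty or lacks a JPEG/PNG signature."""
    with open(path, "rb") as f:
        header = f.read(8)
    return not header.startswith(SIGNATURES)


def clean_training_data(root, quarantine="rejected"):
    """Move bad training samples to a quarantine folder rather than deleting them."""
    root = Path(root)
    moved = []
    for cls in CLASSES:
        for path in sorted((root / "train" / cls).iterdir()):
            if path.is_file() and is_bad_sample(path):
                target = root / quarantine / cls
                target.mkdir(parents=True, exist_ok=True)
                shutil.move(str(path), str(target / path.name))
                moved.append(path.name)
    return moved


def class_counts(root, split="train"):
    """Count the files in each class folder of the given split."""
    base = Path(root) / split
    return {
        cls: sum(1 for p in (base / cls).iterdir() if p.is_file())
        for cls in CLASSES
    }


def split_validation(root, fraction=0.2, seed=0):
    """Move a random fraction of each training class into [val]."""
    root = Path(root)
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    for cls in CLASSES:
        files = sorted(p for p in (root / "train" / cls).iterdir() if p.is_file())
        for path in rng.sample(files, int(len(files) * fraction)):
            shutil.move(str(path), str(root / "val" / cls / path.name))
```

A typical order of calls would be `make_structure(root)`, then placing the samples, then `clean_training_data(root)`, `class_counts(root)` to see how many samples each class has, and finally `split_validation(root)`. Running `class_counts(root, "train")` and `class_counts(root, "val")` afterwards is one way to support the final review. Moving rejected files to a separate folder keeps the cleanup reversible if a sample was flagged by mistake.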