Experiment 2 – Data Exploration

Why Data Exploration is Necessary

Machine learning as a data science tool is just that – a tool. A common misconception is that a data set can simply be thrown at a machine learning algorithm to produce results. This misconception is especially costly with real industry data, which, by its very nature, is imperfect and noisy. Unlike the prototypical data sets used in machine learning tutorials and demonstrations (e.g., the Iris data set or other “toy” data sets), industry data requires a substantial amount of exploration, cleaning and preprocessing before it can be used to train a machine learning algorithm for segmentation, classification or prediction purposes. A complete and thorough understanding of the data at hand is necessary to make informed decisions about which techniques to use and what adjustments need to be made throughout the project. Even then, the best solutions are often subject to change and continue to evolve.

Following a process of due diligence, while time-consuming, makes the rest of the problem more manageable and the solution more meaningful. The basic outline of a comprehensive data exploration process includes:

  1. Understanding the data
  2. Univariable study
  3. Multivariate study
  4. Basic cleaning
  5. Test assumptions

Understanding the Data

Upon further consideration of the initial feature set used in the trial run, it was concluded that many of the features could be eliminated or aggregated. The first pass through the decision tree classifier, with strong training classification accuracy but poor testing performance, demonstrated that the model had overfit the training set. While this is a red flag cautioning the need for more samples overall, it also serves as a warning that the number of predictor features is large relative to the number of training samples. The decision tree is thus “memorizing” the many features of the training data rather than generalizing to the test set. One immediate remedy is to reduce the number of features, even before formal feature selection is performed. This is where an understanding of the features helps tremendously. It was decided that the concentric circles around each store should not be divided into quadrants. This makes sense from a practical standpoint as well: only by extreme coincidence would the success or failure of every retail store depend on the data from one specific quadrant. Rather than allowing the model to use meaningless attributes as predictors, the quadrants can simply be represented by the sum, average and standard deviation of each concentric circle.
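
As a sketch of this aggregation, assume the quadrant features are stored in columns with hypothetical names of the form ‘prop_value_3mi_NW’ (feature, ring, quadrant); the actual schema may differ. The four quadrant columns for each feature/ring pair can then be collapsed into ring-level sum, mean and standard deviation:

    import pandas as pd

    def collapse_quadrants(df, base, ring):
        """Replace the four quadrant columns of one feature/ring pair
        (e.g., prop_value_3mi_NW ... prop_value_3mi_SE) with their
        sum, mean and standard deviation over the whole circle."""
        quad_cols = [f"{base}_{ring}_{q}" for q in ("NW", "NE", "SW", "SE")]
        df[f"{base}_{ring}_sum"] = df[quad_cols].sum(axis=1)
        df[f"{base}_{ring}_mean"] = df[quad_cols].mean(axis=1)
        df[f"{base}_{ring}_std"] = df[quad_cols].std(axis=1)
        return df.drop(columns=quad_cols)

    # Example: collapse the hypothetical property-value feature for the
    # 3-mile concentric circle. File and column names are placeholders.
    stores = pd.read_csv("stores.csv")
    stores = collapse_quadrants(stores, "prop_value", "3mi")

This cuts four quadrant columns per feature/ring down to three ring-level summaries, removing the arbitrary quadrant distinction the model had been memorizing.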

Some questions to ask when considering the data:

  • Do I think about this variable when thinking about the problem? (e.g., when I choose which retail location to shop at, am I concerned with the property values between 3 and 5 miles of the store, but only in the northwest quadrant?)
  • How important do I expect this variable to be in predicting/classifying? (e.g., what would be the expected impact of including or excluding this feature?)
  • Is there a logical reason why it should be included? Does it make sense? (e.g., what is the reasoning behind including the total number of businesses in the area?)
  • Is this information already described by any other variable? (e.g., if ‘Beer Sales’, ‘Wine Sales’, and ‘Liquor Sales’ sum to ‘Total Sales’, do we really also need ‘Total Sales’ as a feature? A quick check, sketched below, can confirm such redundancy.)
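
The last question can often be answered programmatically. A minimal redundancy check, assuming the hypothetical column and file names used in the example above:

    import numpy as np
    import pandas as pd

    df = pd.read_csv("sales.csv")  # placeholder file name
    components = df[["Beer Sales", "Wine Sales", "Liquor Sales"]].sum(axis=1)

    # If the component columns reproduce the total (within floating-point
    # tolerance), 'Total Sales' carries no new information and can be dropped.
    if np.allclose(components, df["Total Sales"]):
        df = df.drop(columns=["Total Sales"])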

Univariable Study
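
A univariable study examines each feature in isolation: its distribution, central tendency, spread and skew. A minimal sketch using pandas and seaborn, assuming the feature table has been loaded into a DataFrame (the file and column names are placeholders):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("stores.csv")  # placeholder file name

    # Summary statistics for every numeric feature.
    print(df.describe())

    # Distribution of one hypothetical feature of interest.
    sns.histplot(df["prop_value_3mi_mean"], kde=True)
    plt.show()

    # Skewness and kurtosis flag heavily non-normal features that may
    # need transformation during cleaning.
    print("Skewness:", df["prop_value_3mi_mean"].skew())
    print("Kurtosis:", df["prop_value_3mi_mean"].kurt())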

Multivariate Study

[Figure: correlation matrix (corrmat) heatmap of the feature set]
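
A correlation heatmap like the one above can be produced directly from the feature table; a minimal sketch with pandas and seaborn (file name is a placeholder):

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("stores.csv")  # placeholder file name

    # Pairwise Pearson correlations between all numeric features.
    corrmat = df.select_dtypes(include="number").corr()

    # Highly correlated feature pairs are candidates for elimination or
    # aggregation, echoing the redundancy question asked above.
    fig, ax = plt.subplots(figsize=(12, 9))
    sns.heatmap(corrmat, vmax=0.8, square=True, ax=ax)
    plt.show()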

Bivariable Study
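
A bivariable study pairs an individual feature with the target. For a success/failure target, side-by-side boxplots are a natural view; a minimal sketch, assuming hypothetical column names ‘outcome’ and ‘prop_value_3mi_mean’:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    df = pd.read_csv("stores.csv")  # placeholder file name

    # Distribution of one hypothetical predictor, split by the
    # hypothetical success/failure target column.
    sns.boxplot(data=df, x="outcome", y="prop_value_3mi_mean")
    plt.show()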