Exploratory Analysis with Decision Trees


As a trial exploration of the collected data, a pilot experiment was performed: a decision tree model was fit to the raw data from the retail store analysis, classification accuracy and feature importance were assessed, and segmentation of the target classes was visualized.

(UPDATE): As an addendum to the pilot analysis, further studies were conducted to assess and compare the performance of decision tree and random forest models after manipulation of the original dataset. Additional studies include:

  • Removing wholesale location samples, leaving only retail-only stores
  • Reducing number of features
  • Increasing number of samples
  • Reducing number of classes
  • Adding census data to feature set

Part 2 can be found here.

Data Collection

The predictors used included:

  • Property Value Data – Sum of Property Values; Average Property Value
  • Bar/Restaurant Data – Sum, Average, and Standard Deviation of: Total Alcohol Sales, Wine Sales, Beer Sales, Liquor Sales, and Cover Charge
  • Liquor Store Data – Number of Liquor Stores
  • Sales Tax Data – Number of Businesses
  • Geoanalysis Data – Water, Freeway, Road, Major Road, Parks, Urban, Downtown, Civic Building

Concentric circles were drawn around each target retailer at 5, 10, and 15 mi radii and subdivided into four quadrants (NE, SE, SW, NW); each predictor was calculated for each resulting subsection. For example, average property value was calculated for 12 distinct subsections (3 distances x 4 quadrants). The sum and average over each full concentric circle were also included (e.g., sum of property values within 5 miles, 10 miles, and 15 miles). This brought the feature set to a total of 135 predictors.
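To make the construction concrete, the sketch below shows one way such ring-and-quadrant features could be assembled. The coordinate offsets, column names, and helper function are hypothetical illustrations, not the actual pipeline used.

```python
import numpy as np
import pandas as pd

def ring_quadrant(dx_mi, dy_mi):
    """Return (radius band in miles, quadrant label) for an offset from a target store."""
    dist = np.hypot(dx_mi, dy_mi)
    band = next((r for r in (5, 10, 15) if dist <= r), None)  # None = outside 15 mi
    if dx_mi >= 0 and dy_mi >= 0:
        quad = "NE"
    elif dx_mi >= 0:
        quad = "SE"
    elif dy_mi < 0:
        quad = "SW"
    else:
        quad = "NW"
    return band, quad

# Hypothetical parcels around one target store, with offsets already in miles.
parcels = pd.DataFrame({
    "dx_mi": [1.2, -3.4, 7.9, -12.0],
    "dy_mi": [0.8, 2.1, -6.5, 4.4],
    "property_value": [250_000, 310_000, 180_000, 420_000],
})
parcels[["band_mi", "quadrant"]] = parcels.apply(
    lambda r: pd.Series(ring_quadrant(r.dx_mi, r.dy_mi)), axis=1
)

# One predictor family: average property value per (distance band, quadrant) subsection.
avg_by_bucket = (parcels.dropna(subset=["band_mi"])
                        .groupby(["band_mi", "quadrant"])["property_value"]
                        .mean())
print(avg_by_bucket)
```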

Data Preparation

The initial feature set included 62 target stores from Collin County, Dallas County, Denton County, and Harris County. The targets included stores that were retail only as well as stores that were both wholesale and retail. Classes were determined by total delinquency amount for a single delinquency period covering 12/15/2017 – 12/31/2017. The class boundaries by delinquency amount were as follows: Class 1: >$300k, Class 2: $200k–$300k, Class 3: $100k–$200k, Class 4: <$100k. This division, while seemingly arbitrary, was chosen for ease of interpretation. However, it did not create optimally equal class sizes, which can increase classification error.
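For reference, the class assignment itself is a simple binning step. The sketch below assumes a hypothetical delinquency_amount column in dollars; the values shown are made up.

```python
import pandas as pd

# Hypothetical sketch: bin total delinquency for 12/15/2017 - 12/31/2017 into four classes.
df = pd.DataFrame({"delinquency_amount": [450_000, 250_000, 150_000, 40_000]})

bins = [0, 100_000, 200_000, 300_000, float("inf")]
labels = [4, 3, 2, 1]  # Class 4: <$100k ... Class 1: >$300k
df["class"] = pd.cut(df["delinquency_amount"], bins=bins, labels=labels)
print(df)
```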

The 62 x 135 feature set was imported from SQL Server into a Jupyter notebook using pyodbc and converted into a pandas DataFrame. All null values in the raw data were investigated and confirmed to be true zero values rather than errors or missing data. No further pre-processing was performed on the dataset for the initial trial run.
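The import step might look roughly like the following sketch; the driver, server, database, and table names are placeholders rather than the actual connection details.

```python
import pandas as pd
import pyodbc

# Hypothetical connection string; server, database, and table names are placeholders.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=my_server;DATABASE=my_database;Trusted_Connection=yes;"
)

# Pull the 62 x 135 feature set into a pandas DataFrame.
features = pd.read_sql("SELECT * FROM retail_feature_set", conn)
conn.close()

# Inspect nulls before replacing them, to confirm they represent true zeros.
print(features.isnull().sum().sort_values(ascending=False).head())
features = features.fillna(0)
```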

Data Visualization

Scatterplots are commonly used to explore how a feature set segments into separate classes. In order to visualize segmentation of high-dimensional data in a 2D or 3D subspace, some form of dimensionality reduction or feature selection is required.

Because the separability of the features is not known prior to an initial exploration, creating 3D scatterplots of the raw dataset would provide minimal insight at best; it is improbable that any meaningful segmentation would appear, as shown below.

Three principal component analysis variants were used to reduce the data's dimensionality: standard PCA, and kernel PCA with sigmoid and RBF kernels.

  1. PCA – standard PCA is a linear dimensionality reduction that uses the singular value decomposition of the dataset to project it onto a lower-dimensional space. For the projection to reveal class structure, the classes must be roughly linearly separable in the original feature space. PCA computes the actual principal components of the feature set, which can then be visualized. Here, 3 components were found and mapped into a 3D subspace.
  2. kPCA – kernel PCA first maps the data into a higher-dimensional feature space through a kernel function (here, sigmoid and RBF kernels) and then performs linear PCA implicitly in that space; the eigendecomposition is carried out on the kernel matrix rather than on an explicit covariance matrix, which allows non-linearly separable structure to be captured. Unlike PCA, kPCA does not return the principal components themselves, but rather the projections of the original dataset onto those lower-dimensional components.
[Figure: PCA projection]
[Figure: kPCA – sigmoid kernel]
[Figure: kPCA – RBF kernel]
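The projections above could be produced along the lines of the following sketch, which uses scikit-learn's PCA and KernelPCA on randomly generated stand-ins for the 62 x 135 feature matrix and class labels (the real data is not reproduced here).

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3D projection)
from sklearn.decomposition import PCA, KernelPCA
from sklearn.preprocessing import StandardScaler

# Stand-ins for the real 62 x 135 feature matrix and the four delinquency classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(62, 135))
y = rng.integers(1, 5, size=62)

X_std = StandardScaler().fit_transform(X)

projections = {
    "PCA": PCA(n_components=3).fit_transform(X_std),
    "kPCA (sigmoid)": KernelPCA(n_components=3, kernel="sigmoid").fit_transform(X_std),
    "kPCA (RBF)": KernelPCA(n_components=3, kernel="rbf").fit_transform(X_std),
}

fig = plt.figure(figsize=(15, 5))
for i, (name, Z) in enumerate(projections.items(), start=1):
    ax = fig.add_subplot(1, 3, i, projection="3d")
    ax.scatter(Z[:, 0], Z[:, 1], Z[:, 2], c=y)
    ax.set_title(name)
plt.show()
```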

A common pitfall in data analytics is the assumption that statistical tools or machine learning techniques, such as principal component analysis, will make up for a lack of samples, an overabundance of features, or unclean data. As demonstrated by this trial run of a decision tree and the dimensionality reduction techniques above, more data is needed, and the data used must be thoroughly examined and cleaned.

Once the dataset has been properly prepared for manipulation and analysis, more accurate decisions can be made with respect to model selection, feature selection methods, and so on. The methodology can continue to evolve as adjustments are made to the approach; such is the nature of problem solving with machine learning.

Models

Decision Tree Model

Prior to fitting the model, the data was split into train and test subsets using a random 80/20 split, resulting in 49 training examples and 13 test examples.

A single decision tree classifier was fit to the training data using scikit-learn's DecisionTreeClassifier. Rather than Gini impurity (the default split criterion), entropy was chosen for its usefulness in exploratory analysis. The model's hyperparameters were tuned for maximum accuracy, as discussed in the section titled 'Model Tuning.'
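A minimal sketch of the split and fit, again using random stand-ins for the real feature matrix and labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for the real 62 x 135 feature matrix and class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(62, 135))
y = rng.integers(1, 5, size=62)

# Random 80/20 split: 49 training examples, 13 test examples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Entropy (information gain) as the split criterion instead of the default Gini impurity.
tree = DecisionTreeClassifier(criterion="entropy", random_state=42)
tree.fit(X_train, y_train)

print("train accuracy:", tree.score(X_train, y_train))
print("test accuracy:", tree.score(X_test, y_test))
```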

Node Impurity Measures

A visualization of the decision trees for each method of splitting (entropy vs Gini impurity) further demonstrates the necessity of data pre-processing, model selection and increased sample size.

Gini impurity is the probability of a randomly chosen sample being classified incorrectly if it is labeled according to the class distribution in that node. The Gini index for a node q is defined as:

Gini(q) = 1 - \sum_{i=1}^{k} p_i^2

where k is the number of classes (in this case, k = 4) and p_i is the fraction of samples in node q belonging to class i. The index reaches its maximum value (1 - 1/k = 0.75 for k = 4) when the samples are equally distributed among all classes, and equals zero when all samples belong to one class. Ideally, the Gini index would reach 0.0 at every terminal leaf in the tree.

The Gini index was calculated for each node in the first-pass decision tree (see below). The Gini impurity reached 0.0 at only two terminal leaves, indicating that most node splits remained impure and that the rules created by the decision tree were unstable, not easily generalized, and most likely overfit to the training data.

[Figure: Decision tree using Gini impurity]

Entropy, the impurity measure behind information gain, measures the impurity of a given node through a logarithmic calculation:

H(q) = -\sum_{i=1}^{k} p_i \log_2 p_i

where k is the number of classes and p_i is the fraction of samples in node q belonging to class i. If all the samples belong to one class, the entropy is zero. The maximum entropy for k = 4 classes is 2 (log2(4) = 2). Here, the largest entropy value observed is 1.916, and entropy largely remains above 1.0 throughout the tree, meaning the classes were not properly separated and the rule-based logic of the decision tree was unstable even for the training set.
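As a quick numeric check of the two impurity measures, the standalone sketch below computes both for made-up class counts, confirming the worst-case values quoted above for k = 4.

```python
import numpy as np

def gini(counts):
    """Gini index for a node given its per-class sample counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

def entropy(counts):
    """Entropy (base 2) for a node given its per-class sample counts."""
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    p = p[p > 0]  # 0 * log(0) is treated as 0
    return -np.sum(p * np.log2(p))

print(gini([10, 10, 10, 10]), entropy([10, 10, 10, 10]))  # 0.75, 2.0 (worst case, k = 4)
print(gini([40, 0, 0, 0]), entropy([40, 0, 0, 0]))        # 0.0, 0.0 (pure node)
```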

[Figure: Decision tree using entropy information gain]

Feature Importance

The decision tree model found 5-6 features to be the most discriminative, as shown below, the most important being the average property value at distance = 10 mi, quadrant 1. Both split criteria also used aD5Q1 and a form of total mixed beverage sales at distance 10, quadrant 4 as delineating features. While there is some stability in the features used, they seem fairly arbitrary: why would the average property value be most important only in quadrant 1 at distance 10? Further investigation and due diligence would be necessary to deem these results meaningful. Otherwise, they suggest the decision tree classifier is simply using whatever it can to create rules, whether those rules are logical or not.

[Figure: Feature importance using entropy criterion]
[Figure: Feature importance using Gini impurity criterion]
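The rankings shown in the plots can be read directly off the fitted model. The brief sketch below continues from the decision tree example above and uses placeholder feature names rather than the real column labels.

```python
import numpy as np

# Placeholder names standing in for the 135 real feature columns.
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

importances = tree.feature_importances_          # from the fitted DecisionTreeClassifier
top = np.argsort(importances)[::-1][:6]          # six most discriminative features
for idx in top:
    print(f"{feature_names[idx]:<15s} {importances[idx]:.3f}")
```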

Model Tuning

The model was tuned for maximum accuracy by adjusting the maximum depth of the tree, the minimum number of samples per leaf, and the minimum number of samples per split. Model tuning should have the following effects:

  • Maximum depth – limits the number of sequential splits (and thus delineating features) along any path from the root to a leaf. Increasing the maximum depth should increase accuracy on the training set but can result in overfitting and poor generalization to test data
  • Minimum samples per leaf – a split is only allowed if it leaves at least this many samples in each child node. Decreasing the minimum allows more precise decisions but is prone to overfitting and poor generalization
  • Minimum samples per split – a node is only split further if it contains at least this many samples. Decreasing the minimum (with few samples and many predictors) is likewise prone to overfitting
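One common way to carry out this tuning is an exhaustive grid search with cross-validation. The sketch below continues from the earlier decision tree example; the parameter ranges are illustrative, not necessarily those actually searched.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Illustrative parameter ranges; the values actually searched may have differed.
param_grid = {
    "max_depth": [2, 3, 4, 5, None],
    "min_samples_leaf": [1, 2, 4],
    "min_samples_split": [2, 4, 8],
}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=42),
    param_grid,
    scoring="accuracy",
    cv=3,  # small number of folds because of the small sample size
)
search.fit(X_train, y_train)  # X_train, y_train from the earlier split
print(search.best_params_)
print(search.best_score_)
```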

Accuracy

Prediction accuracy was, as expected, low for a first pass on raw data with a small sample size. It peaked at 93.9% for the training data and 46% for the testing data.

[Figure: decision tree prediction accuracy]

Bias vs. Variance

When tuning a machine learning algorithm, the "bias–variance decomposition" often comes up. Most algorithms exhibit a tradeoff between bias and variance, and controlling this tradeoff is key to optimal tuning.

Bias refers to the error introduced by approximating a real-life problem with a much simpler model. It reflects the model's ability to approximate the data, so high bias is associated with underfitting: no matter how many samples are introduced, the model cannot produce accurate predictions.

Variance refers to the amount by which the estimate would change if it were estimated using a different training set. It reflects the stability of the model in response to new training examples. High variance is associated with overfitting, so increasing the number of training samples is a common remedy.

[Figure: decision tree learning curve]

The figure above shows a learning curve for the decision tree model: training and validation accuracy are plotted as the number of training examples varies. The model exhibits high variance and an extreme degree of overfitting, as demonstrated by the >50% accuracy gap between the training and validation sets. A logical solution would be simply to collect more data and increase the number of samples. Fortunately, the model shows relatively low bias, meaning it is able to approximate the data. After more data is collected and explored, selecting a better-suited model may provide more accurate results.
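A learning curve like the one above can be generated with scikit-learn's learning_curve utility. The sketch below again uses random stand-ins for the real data, so the resulting curve is only illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Stand-ins for the real feature matrix and class labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(62, 135))
y = rng.integers(1, 5, size=62)

train_sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(criterion="entropy", random_state=42),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=3,
    scoring="accuracy",
)

plt.plot(train_sizes, train_scores.mean(axis=1), "o-", label="training accuracy")
plt.plot(train_sizes, val_scores.mean(axis=1), "o-", label="validation accuracy")
plt.xlabel("number of training examples")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```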

Next Steps

  • Sample size needs to be increased to properly train the decision tree. With 62 instances and 135 features, it is improbable that any amount of data cleaning would allow successful segmentation or classification of the dataset.
  • Thorough data exploration is needed, including univariate and bivariate analysis, missing value treatment, outlier detection, and proper identification of correlations. Before any type of principal component analysis is run, it needs to be determined whether there is any linear separability among the features.
  • Until the dataset reaches 500+ samples, decision tree-based models should be set aside in favor of machine learning algorithms that remain robust with a smaller number of observations.