Project Comments

Project presentations looked good and there are some very interesting results coming out. But there is also room for generic improvements, as listed below. There were also quite a few "surprising" or contradictory results across projects. Remember that "surprising" results should, by definition, be rare. More often than not they are a sign of a problem somewhere in the data pipeline. Machine learning algorithms are complex procedures that may or may not work well "out of the box" on a given dataset. Please look critically at your results to make sure they make sense. If something looks surprising, treat it as a potential problem and try to figure out why it's happening. Some generic improvements:

  1. Reproducibility: Please give enough information about your approach that someone else could reproduce your results. List all the methods you applied and all the choices you made.
  2. Parameters: ML and NLP algorithms have many parameters that can be adjusted. Sometimes the default values will work on a given dataset, but often they will produce very poor results. In particular, be careful about:
    1. Trees and forests have many settable parameters. Forests have: number of trees, number of feature trials per node, tree depth bounds, samples-per-leaf bounds, impurity type, minimum impurity gain, and bootstrapping on/off. Make sure you get a feel for how these affect performance, try to optimize over them, and report the values you used.
    2. Even logistic and linear regression are sensitive to their regularization weights. Correctly tuned regularization provides small benefits, but mistuned regularization (weights too large) can cause large degradation.
    3. Kernel methods: kernel functions have tunable parameters in addition to those of the base learning algorithm they are used with. List them.
    4. Naive Bayes is really a family of models, depending on the types of the feature and target variables and their distributions. Make clear which variant you are using.
    5. TF-IDF is a family of scaling transformations with different functional forms, especially in how they compute the IDF term. Make clear which form you used.
    6. k-NN is very sensitive to k, to the distance metric, and to any input scaling, so list all of those.
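A concrete way to handle several of these points is to search over a parameter grid and report the chosen values. Here is a minimal sketch using scikit-learn; the dataset, grids, and values below are placeholders, not recommendations:

```python
# Minimal sketch: tune forest and regularization parameters, then report them.
# The dataset and parameter grids are placeholders; substitute your own.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Forest parameters: number of trees, features tried per split,
# depth bound, and samples-per-leaf bound.
rf_grid = {
    "n_estimators": [50, 100],
    "max_features": ["sqrt", 0.5],
    "max_depth": [None, 8],
    "min_samples_leaf": [1, 5],
}
rf_search = GridSearchCV(RandomForestClassifier(random_state=0), rf_grid, cv=3)
rf_search.fit(X, y)
print("best forest params:", rf_search.best_params_)  # report these

# Regularization weight for logistic regression: in scikit-learn, C is the
# INVERSE regularization strength, so a small C means heavy regularization.
lr_grid = {"C": [0.01, 0.1, 1.0, 10.0]}
lr_search = GridSearchCV(LogisticRegression(max_iter=1000), lr_grid, cv=3)
lr_search.fit(X, y)
print("best C:", lr_search.best_params_["C"])
```

The printed `best_params_` dictionaries are exactly the kind of thing to include in your write-up.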
  3. Pre-Processing: Some input transformations can have a large effect on some models and no effect on others. TF-IDF, PCA, and even feature hashing are linear transformations of the input features. Regression algorithms (logistic and linear regression) learn linear functions of their input and should be minimally affected by linear transformations (unless there is drastic dimension reduction). The main way a linear transformation can influence a regression model is through the regularization term, but those effects should be very small if regularization is applied correctly. So a large effect on such a model from an input transformation is a sign that something is not right.
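This invariance is easy to check directly. A minimal sketch, on synthetic placeholder data, with full-rank PCA standing in for a generic invertible linear transformation:

```python
# Sketch: a regression model should be nearly unaffected by an invertible
# linear transformation of its inputs. Full-rank PCA (all components kept)
# is a centering plus an orthogonal rotation, so L2-regularized logistic
# regression should give essentially the same predictions either way.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

p_raw = (LogisticRegression(max_iter=2000, tol=1e-8)
         .fit(X, y).predict_proba(X)[:, 1])

# Keep ALL components: no dimension reduction, just a rotation.
X_rot = PCA(n_components=10).fit_transform(X)
p_rot = (LogisticRegression(max_iter=2000, tol=1e-8)
         .fit(X_rot, y).predict_proba(X_rot)[:, 1])

# The predicted probabilities should agree closely; a large gap would
# signal a problem (e.g. drastic dimension reduction or mistuned regularization).
gap = np.max(np.abs(p_raw - p_rot))
print("max probability gap:", gap)
```

If a preprocessing step changes your regression results dramatically, a check like this helps isolate whether the transformation or something else in the pipeline is responsible.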
    Random forests and trees are blind to feature-wise scaling (e.g. TF-IDF); in fact, they are insensitive to any monotonic (non-decreasing) function applied to each input feature.
    On the other hand, TF-IDF and PCA have a big effect on distance-based models like k-Means, k-NN, and DBSCAN. Normally they give small improvements or small losses; large effects are a sign of problems somewhere.
    Be careful how you use TF-IDF and make sure that your "document" is a meaningful unit for capturing the "rareness" of features. Also, TF-IDF can produce exaggerated magnitudes for some rare features, so it's usually best to use cosine distance, not euclidean distance, to compare documents. Equivalently, you can scale each document to euclidean length 1 before applying standard k-NN, k-Means, or DBSCAN.
    Remember that trees and forests sample uniformly from their input features, and so have a very low probability of "hitting" the relevant input features of sparse datasets (text and other power-law data). Try reducing input feature sparsity with feature hashing or PCA before using random forests.
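The last two points can be sketched with scikit-learn. The toy corpus and component counts below are placeholders:

```python
# Sketch: (a) TF-IDF with unit-length documents so cosine comparisons are
# meaningful, and (b) dimension reduction before a forest on sparse text.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

docs = ["the cat sat", "the dog ran", "cats and dogs", "a dog sat"]

# TfidfVectorizer L2-normalizes each row by default (norm="l2"), i.e. each
# document is scaled to euclidean length 1, which makes euclidean and cosine
# neighbor rankings agree.
X = TfidfVectorizer(norm="l2").fit_transform(docs)
row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())

nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)
dist, idx = nn.kneighbors(X)  # each document's nearest neighbor is itself

# For forests on sparse text, reduce sparsity first. Truncated SVD is a
# PCA-like linear projection that accepts sparse input (feature hashing is
# another option for very large vocabularies).
X_dense = TruncatedSVD(n_components=3, random_state=0).fit_transform(X)
rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_dense, [0, 1, 1, 1])
```

On a real corpus you would choose the number of SVD components (or hash bins) by validation, like any other parameter.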
  4. Scaling Up: With one exception, it looks like all projects should be completable on a single EC2 instance. We have additional resources if you think you need to scale up, but the instance you have is quite powerful, while the overhead of moving from one node to a cluster is substantial. Don't naively assume that you will gain in either speed or scale from a cluster solution. At least run an optimized single-machine test (that's probably BIDMach for most algorithms) before spending substantially more resources.