Essentials – Machine Learning

The following are my personal notes from preparing for the AWS Certified Machine Learning – Specialty (MLS-C01) exam.

Terminology and Process

  • Artificial Intelligence (AI) is any system that is able to ingest human-level knowledge to automate and accelerate tasks performable by humans through natural intelligence. AI has two categories: narrow, where an AI imitates human intelligence in a single context, and general, where an AI learns and behaves with intelligence across multiple contexts.
  • Machine learning (ML) is the process of training computers, using math and statistical processes, to find and recognize patterns in data. After patterns are found, ML generates and updates training models to make increasingly accurate predictions and inferences about future outcomes based on historical and new data.
  • Deep Learning (DL) enables the machine to define the features it needs to look for by itself, based on the data it is given, rather than being told which features to look for. Popular for facial recognition.
  • Deep Learning is a subset of Machine Learning, which is a subset of Artificial Intelligence.
  • What can ML do?
    • Make predictions.
    • Optimize utility functions. Drive efficiencies.
    • Extract hidden data structures. Accelerate decision making.
    • Classify data.
    • Enable automation.
  • ML Use cases:
    • Fraud detection (suspicious reviews, email spam, etc.)
    • Content personalization.
    • Targeted marketing (cross-selling and up-selling).
    • Categorization to extract structure from unstructured data.
    • Customer service to provide predictive routing.
  • ML Categories:
    • Supervised. Labels are known. Predictive solutions:
      • Classification. Output variable is a category, one of two or one of many. Behavioral patterns solutions.
      • Regression. Output is a number or a value. Prioritization solutions (rankings and scores).
        • How many days until a user stops using the app?
        • Predict home value.
    • Unsupervised. Labels are not known.
    • Reinforcement. Model learns by interacting with an environment. Detects emerging properties. For example, in autonomous driving.
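
To make the difference between the two supervised categories concrete, here is a minimal sketch in plain Python. The toy data, the function names (`train_centroids`, `classify`, `fit_line`) and the choice of a nearest-centroid classifier and a least-squares line are all assumptions made for illustration; they are not taken from the exam material.

```python
from statistics import mean

# Hypothetical labeled data: (sessions_per_week, label).
labeled_users = [(1, "churn"), (2, "churn"), (8, "stay"), (9, "stay")]

def train_centroids(rows):
    """Classification: learn the mean feature value for each label."""
    by_label = {}
    for x, label in rows:
        by_label.setdefault(label, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}

def classify(centroids, x):
    """Predict the label whose learned mean is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

# Hypothetical labeled data: (square_meters, price_in_thousands).
homes = [(50, 150.0), (80, 240.0), (100, 300.0)]

def fit_line(rows):
    """Regression: ordinary least squares for y = a * x + b."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    a = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    b = y_bar - a * x_bar
    return a, b

centroids = train_centroids(labeled_users)
predicted_label = classify(centroids, 3)   # output is a category

a, b = fit_line(homes)
predicted_price = a * 120 + b              # output is a number
```

In both cases the labels are known during training; the only difference is whether the predicted output is a category or a value.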

Effective AI Strategy

The flywheel of data (positive feedback loop): more data means better analytics -> better analytics results in better products -> better products means more users -> that generates more data.

AI on AWS:

  • Services
    • Amazon Polly: text to life-like speech.
    • Amazon Lex: for building conversational interfaces, speech to text, detect intent, natural language understanding.
    • Amazon Rekognition: adds image analysis to an application, detects inappropriate content.
    • Amazon Kendra: Intelligent search.
    • Amazon Comprehend: document analysis.
    • Amazon Textract: data and text extraction.
    • Amazon Lookout for Metrics and Amazon Forecast: business metrics analysis.
  • Platforms (Amazon ML, Spark, Kinesis Batch, ECS)
    • Amazon EMR: easily run and scale Apache Spark, Hive, Presto, and other big data workloads.
  • Engines (MXNet, TensorFlow, Caffe, Theano, PyTorch, CNTK).
  • Infrastructure (GPU, CPU, IoT, Mobile).
    • EC2 P2 instances with powerful GPUs.

AI Impact

  • Automate manual, effort-intensive tasks.
  • Engage audiences, customers and employees.
  • Optimize product quality and customer experiences.

ML Process

Building ML applications is an iterative process that includes the following general steps:

  • Formulate a problem. First ask: does this business problem require a machine learning solution? Not all problems are machine learning problems! Then:
    • Prepare your data.
    • Train the model.
    • Test the model.
    • Deploy your model.
  • A model is the output of an ML algorithm trained on a data set.
  • Training is the act of creating the model from past data.
  • Testing is measuring the performance of a model on test data.
  • Deployment is the process of integrating a model into a production pipeline.
  • Problem definition consists of:
    • Observations
    • Labels
    • Features
      • Numeric
      • Categorical
        • Binary
        • Multiclass
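
The prepare, train and test steps above can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the toy observations, the field names (`age`, `plan`, `label`), the fixed every-fourth-row test split, and the nearest-centroid model. A real project would more likely use a library and a randomized split.

```python
# Hypothetical observations: one numeric feature, one categorical
# feature, and a binary label.
observations = [
    {"age": 22, "plan": "free", "label": 0},
    {"age": 25, "plan": "free", "label": 0},
    {"age": 31, "plan": "free", "label": 0},
    {"age": 28, "plan": "pro", "label": 0},
    {"age": 45, "plan": "pro", "label": 1},
    {"age": 52, "plan": "pro", "label": 1},
    {"age": 48, "plan": "pro", "label": 1},
    {"age": 39, "plan": "pro", "label": 1},
]
PLANS = ["free", "pro"]

def to_features(row, plans):
    """Prepare: scale the numeric feature, one-hot encode the categorical one."""
    one_hot = [1.0 if row["plan"] == p else 0.0 for p in plans]
    return [row["age"] / 100.0] + one_hot

def split(rows, test_every=4):
    """Hold out every `test_every`-th row as test data (deterministic)."""
    train_rows = [r for i, r in enumerate(rows) if i % test_every != 0]
    test_rows = [r for i, r in enumerate(rows) if i % test_every == 0]
    return train_rows, test_rows

def train(rows, plans):
    """Train: the model is the mean feature vector (centroid) per label."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["label"], []).append(to_features(row, plans))
    return {
        label: [sum(col) / len(vectors) for col in zip(*vectors)]
        for label, vectors in grouped.items()
    }

def predict(model, features):
    """Return the label whose centroid is nearest to the feature vector."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda label: sq_dist(model[label]))

def accuracy(model, rows, plans):
    """Test: fraction of held-out rows predicted correctly."""
    hits = sum(predict(model, to_features(r, plans)) == r["label"] for r in rows)
    return hits / len(rows)

train_rows, test_rows = split(observations)
model = train(train_rows, PLANS)
test_accuracy = accuracy(model, test_rows, PLANS)
```

The important design point is that the test rows are never seen during training, so `test_accuracy` measures performance on data the model has not memorized.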

Limitations

Poor-quality data can make modeling problematic for learning algorithms. For example, the data may not include enough samples to represent a sufficiently broad scope of relevant variables. Reasons not to use ML:

  • Can be solved with traditional algorithms.
  • Does not require adapting to new data.
  • Requires 100% accuracy.
  • Requires full interpretability.

Accuracy vs Explainability

Accurate models are harder to explain as they are usually more complex. Less accurate models are easier to explain as they are usually simpler.

An explainability problem arises when a model's decisions can't be effectively described in human terms.

Uncertainty describes an imperfect outcome: models attempt to fit training data, which may itself be imperfect. The “best” data may also be unknowable.

Simple Model vs Complex Model

The output of a simple ML model may be explainable and produce faster results, but the results may be inaccurate. The output of a complex ML model may be accurate, but the results may be difficult to communicate.

ML Data Readiness

Not all data is ready for ML solutions. Ask questions!

  • Can my data be used?
    • Is it available?
    • Is it accessible?
  • Should data be used?
    • Does my data respect user privacy?
    • Does my ML project have enough security?
  • Is my data high quality?
    • Relevant
    • Fresh
    • Representative. Does the data contain all available products?
    • Unbiased. Does the data tend to favor one area of a segment when building my machine learning model?
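
Some of the quality questions above can be turned into quick checks before any modeling starts. The sketch below is only an illustration: the record layout (`product`, `region`, `updated`), the sample rows and the helper names are hypothetical, and what counts as "fresh enough" or "balanced enough" depends on the project.

```python
from collections import Counter
from datetime import date

# Hypothetical records with a product, a customer segment and a timestamp.
rows = [
    {"product": "A", "region": "EU", "updated": date(2024, 1, 10)},
    {"product": "A", "region": "EU", "updated": date(2024, 1, 12)},
    {"product": "B", "region": "EU", "updated": date(2024, 1, 15)},
    {"product": "A", "region": "US", "updated": date(2024, 1, 5)},
]

def freshness_days(records, today):
    """Fresh: age in days of the most recent record."""
    return (today - max(r["updated"] for r in records)).days

def missing_values(records, key, expected):
    """Representative: expected values that never appear in the data."""
    return set(expected) - {r[key] for r in records}

def segment_shares(records, key):
    """Unbiased: share of records per segment, to spot over-represented ones."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {segment: n / total for segment, n in counts.items()}

age = freshness_days(rows, today=date(2024, 1, 20))
missing = missing_values(rows, "product", expected={"A", "B", "C"})
shares = segment_shares(rows, "region")
```

Here product "C" never appears and three quarters of the records come from one region, both of which would be worth investigating before training.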

ML Project Impact

Expect deploying production models to take weeks or even months. The lifecycle consists of:

  1. Problem definition (1 week).
  2. Data exploration (1 week).
  3. Data preparation (2 weeks).
  4. Model exploration (3 weeks).
    1. Model training.
    2. Model testing.
  5. Evaluation (1 week).
  6. Production deployment (4 weeks).
  7. Model update. Will it change? Continuous monitoring of incoming data can help retrain your model on newer data if the data distribution has deviated significantly from the original training data distribution.
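
One simple way to monitor incoming data for the distribution shift mentioned in the model update step is to compare its mean with the training mean. This is a sketch under stated assumptions, not a recommended production method: the function names, the sample values and the threshold of 2.0 training standard deviations are all hypothetical, and real monitoring would usually look at more than one statistic.

```python
from statistics import mean, stdev

def drift_score(training_values, incoming_values):
    """Shift of the incoming mean, in units of the training standard deviation."""
    shift = abs(mean(incoming_values) - mean(training_values))
    return shift / stdev(training_values)

def needs_retraining(training_values, incoming_values, threshold=2.0):
    """Flag the model for retraining when the shift exceeds the threshold.

    The default threshold is an arbitrary illustrative value.
    """
    return drift_score(training_values, incoming_values) > threshold

# Hypothetical values of one feature at training time and in production.
training = [10, 12, 11, 13, 9, 11]
stable_batch = [11, 12, 10, 11]
shifted_batch = [18, 20, 19, 21]

stable_flag = needs_retraining(training, stable_batch)
shifted_flag = needs_retraining(training, shifted_batch)
```

With these values the stable batch has the same mean as the training data and is not flagged, while the shifted batch is about six training standard deviations away and is.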

Good questions to ask:

  • Have similar problems been solved in the past?
  • Has the data been explored and have faults been found?
  • Is the performance of the model meeting business requirements?

Think about the following questions early:

  • What is the likely computational cost of generating predictions with my model?
  • How quickly does my data change?
  • How significant are the changes needed to deploy?
  • Does the model’s performance meet the business need?

Organization Readiness for ML

  1. Find the right problem. Choosing worthy problems, such as those that aren’t solvable by traditional means, require rich data, or demand large amounts of labor, can lead to early wins and gains in organizational momentum.
  2. Fail forward. Failing forward with ML means using failure as an iterative opportunity to become fault-tolerant and find a successful direction in subsequent attempts.
  3. Scale beyond proofs of concept (POC). ML POCs in development can be used to solve ML scaling challenges across the business before making large production investments.

Data Evaluation Strategy

  1. Acceptable. Data is raw, unlabeled and requires work before it can be used with ML.
  2. Good. Data is labeled, lives in separate sources and is accessible to some teams/users.
  3. Best. Data is labeled, lives in a source of truth and is accessible to all teams/users.

Use a data lake to combine data from multiple sources: data, databases, clickstream, IoT and sensor data, logs, backups, multimedia, and social media activity.

Culture of Learning and Collaboration

  • What is a data scientist? Data scientists design and build models from data, create and work on algorithms, and train models to predict and achieve business goals.
  • Ask this early: how will you use machine learning? Asking it reinforces the importance of ML in each team's area of ownership and encourages people to work together to come up with interesting opportunities.
  • Support roles are needed: engineering, science and business.

Start ML Journey

Common mistakes:

  • Viewing AI as a plug-and-play technology with immediate returns.
  • Thinking too narrowly about AI applications.

AI has the biggest impact when it’s developed by cross-functional teams with a mix of skills and perspectives.