CO3722 Data Science
CO3722 Lecture 10 - Case Study - Loan Prediction - P2


Lecture Documents

CO3722 Lecture 11.pdf


Next Steps

  • Build a Data Science Model
  • Algorithms include:
    • Linear/Logistic Regression
    • Classification
    • Predictive in nature; could also be designed to seek new insights
  • Training, test and validation datasets are required
  • Evaluate outputs and build stories

  • Remember the stakeholders
  • What questions need to be answered?
  • Time is limited, so use it well. In particular, use a single user group to weigh up what is feasible and of most value within the timescale.
  • If presented, how would it be communicated?
  • How would it be explained?

Keep in mind users and the business.


Why Machine Learning?

  • With machine learning the following can be done much faster:
  • Determine whether images contain human faces - image recognition
  • Predict whether an ad is appealing or personal enough for a user to click on it - predictions
  • Create accurate YouTube video captions - speech recognition, speech-to-text translation
  • Detect whether a transaction is fraudulent
  • Detect whether an email is spam

Data Science

  • Asking intelligent questions
  • Finding patterns
  • Deriving insight
  • Decision making
  • Machine learning

  • Justify the machine learning model.
  • Potentially try two or more different approaches to back up the reasoning.

Predictive Model Design Steps

Data Analysis:
- Data review and data cleaning, including attending to missing data
- Convert inputs (categorical variables)
Data Science:
- Import required modules - scikit-learn
- Classification using chosen model
- Cross validation and accuracy scores
- Metrics
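
The steps above can be sketched end to end as follows. This is a minimal illustration, not the lecture's own code: the toy data, the column names (`income`, `loan_amount`, `property_area`) and the choice of logistic regression are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for a loan dataset (all names are assumptions).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "income": rng.uniform(1500, 9000, n),
    "loan_amount": rng.uniform(50, 300, n),
    "property_area": rng.choice(["Urban", "Semiurban", "Rural"], n),
})
# Blank out a few incomes to mimic missing data.
df.loc[rng.choice(n, 10, replace=False), "income"] = np.nan

# Synthetic target: approval loosely tied to income relative to loan amount.
ratio = df["income"].fillna(df["income"].median()) / df["loan_amount"]
y = (ratio + rng.normal(0, 5, n) > ratio.median()).astype(int)

# Data analysis steps: impute missing numbers, scale them, one-hot encode categories.
numeric = ["income", "loan_amount"]
categorical = ["property_area"]
preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

# Data science steps: chosen classifier, then 5-fold cross-validated accuracy.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
scores = cross_val_score(model, df, y, cv=5, scoring="accuracy")
print("Mean accuracy:", scores.mean())
```

Keeping the preprocessing inside the pipeline means it is refitted on each training fold, so the validation fold does not influence the imputation or scaling.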


  • Reminders about quality control
  • Domain knowledge from the first assignment
  • Any trends?
  • Is the data of good quality? Does it conform to rules, prove or disprove a working theory, or bring any new insights to light? If not, is the quality of the data source the issue?
  • Reminders about hypotheses
  • Revisit feature selection depending on the focus of the dataset
  • Large datasets tend to have many features; a data scientist may try multiple combinations to see which gives the best result.
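
Trying multiple feature combinations can be sketched roughly as below. The synthetic dataset, the feature names `f0` to `f4`, the limit of three features per subset and the use of logistic regression are assumptions for illustration only.

```python
from itertools import combinations

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with hypothetical feature names f0..f4.
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"f{i}" for i in range(X.shape[1])]

results = []
for k in range(1, 4):  # subsets of 1, 2 and 3 features
    for subset in combinations(range(X.shape[1]), k):
        score = cross_val_score(
            LogisticRegression(max_iter=1000), X[:, list(subset)], y, cv=5
        ).mean()
        results.append((score, [feature_names[i] for i in subset]))

# Highest mean cross-validated accuracy first.
results.sort(key=lambda r: r[0], reverse=True)
best_score, best_features = results[0]
print(best_features, best_score)
```

The number of subsets grows quickly with the number of features, so an exhaustive search like this is only practical for small feature sets.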

In Short...

  • Domain knowledge is just as important as data analysis skills
  • Asking the right questions is more important than elaborate algorithms
  • It's about the 'right' data

Supervised and Unsupervised Learning

graph LR
    A1[Machine Learning]
    A2[Supervised Learning - Develop predictive model based on both input and output data]
    A3[Unsupervised Learning - Group and interpret data based only on input data]
    A4[Classification]
    A5[Regression]
    A6[Clustering]

    A1-->A2
    A1-->A3

    A2-->A4
    A2-->A5
    A3-->A6

  • Regression Example in Lab Activity
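
The difference between the two branches can be seen in what each model is given. A minimal sketch, assuming synthetic two-group data and using logistic regression and k-means as stand-ins for "classification" and "clustering":

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data: X holds the inputs, y holds the known outputs (labels).
X, y = make_blobs(n_samples=150, centers=2, random_state=0)

# Supervised: the model is fitted on both inputs and outputs.
clf = LogisticRegression().fit(X, y)
predicted_labels = clf.predict(X)

# Unsupervised: the model only ever sees the inputs and groups them itself.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = km.labels_
```

The cluster ids are arbitrary group numbers, so cluster 0 need not correspond to label 0.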

Machine Learning Algorithms

Question

  • Linear Regression
  • Logistic Regression (Classification)
  • Decision Tree (Random Forest)
  • Unsupervised Learning (Clustering)

Linear Regression - Supervised Learning

  • 2 variables

Equation of a line:
$y = Mx + C$
M = Gradient
C = Intercept

Line of 'best' fit.
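
One common meaning of 'best' is least squares: choose M and C so that the sum of squared vertical distances between the points and the line is as small as possible. That interpretation, and the five data points below, are assumptions for illustration. For two variables the least-squares gradient and intercept have a closed form:

```python
import numpy as np

# Hypothetical observations.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Gradient: covariance of x and y divided by the variance of x.
M = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: the line passes through the point (mean of x, mean of y).
C = y.mean() - M * x.mean()

print(f"y = {M:.2f}x + {C:.2f}")
```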

Linear Regression Example

import numpy as np
from sklearn.linear_model import LinearRegression
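
The example in the notes stops at the imports. A minimal sketch of how it might continue, assuming scikit-learn's `LinearRegression` and made-up data lying exactly on the line y = 2x + 1:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data on the line y = 2x + 1; scikit-learn expects a 2-D input array.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0]).reshape(-1, 1)
y = np.array([1.0, 3.0, 5.0, 7.0, 9.0])

model = LinearRegression().fit(x, y)
M = model.coef_[0]      # gradient
C = model.intercept_    # intercept

# Predict y for an unseen x value.
prediction = model.predict(np.array([[5.0]]))[0]
print(M, C, prediction)
```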

Assignment 2 - Next Steps

  • Does the data make sense?
  • Is the data consistent?
  • What do you make of the data's distribution? Does it change over time? Is this expected?
  • Use visualisations here to help - Does the data need to be normalised first?
  • Is the data complete? Missing values or anomalies?
  • Do you understand the features? Any data transformations required? (Consider ordinal and nominal data)
  • Is the target balanced or unbalanced? If the classes are far from a 50/50 split, consider undersampling or oversampling.
  • Is there any additional data that might be beneficial?
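
Several of these checks take only a few lines in pandas. The sketch below is illustrative: the tiny table, its column names (`risk_band`, `property_area`, `approved`) and the ordering of the risk bands are all hypothetical.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical table: one ordinal feature, one nominal feature, one binary target.
df = pd.DataFrame({
    "risk_band": ["low", "medium", "high", "low", None, "medium", "high", "low"],
    "property_area": ["Urban", "Rural", "Semiurban", "Urban",
                      "Rural", "Urban", "Semiurban", "Rural"],
    "approved": [1, 1, 1, 0, 1, 1, 0, 1],
})

# Is the data complete?
missing_per_column = df.isna().sum()

# Balanced or unbalanced?
class_counts = df["approved"].value_counts()

# Ordinal data has an order, so map it to ordered integers.
order = {"low": 0, "medium": 1, "high": 2}
df["risk_band_code"] = df["risk_band"].map(order)

# Nominal data has no order, so use one column per category.
dummies = pd.get_dummies(df["property_area"], prefix="area")

# Oversampling: resample the minority class with replacement up to the majority size.
majority = df[df["approved"] == 1]
minority = df[df["approved"] == 0]
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)
balanced = pd.concat([majority, minority_upsampled])
```

In a real workflow, resampling is normally applied to the training set only, after the split, so that duplicated rows do not leak into the test set.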

  • Metric for evaluation - what is meant by a 'good' model?

  • Splitting data - training and testing sets (import train_test_split).
  • What percentage of the dataset would be appropriate for training?
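
A minimal sketch of splitting and scoring, assuming synthetic unbalanced data, an 80/20 split (a common starting point rather than a rule) and logistic regression as the model:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 80% of rows in class 0 and 20% in class 1.
X, y = make_classification(
    n_samples=500, n_features=6, weights=[0.8, 0.2], random_state=0
)

# Hold out 20% for testing; stratify keeps the class ratio the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: actual class, columns: predicted class
print(accuracy, f1)
print(cm)
```

On unbalanced data, accuracy alone can look good even when the minority class is predicted poorly, which is why the confusion matrix and F1 score are reported alongside it.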

Important

Assignment 2 Brief Released!


[[CO3722 Lecture 12 - ]]