
CO3722 Data Science


Lecture Documents

CO3722 Lecture 3.pdf


Written Notes

CO3722 Lecture 3 - Note 1.png


Learning Objectives

  • Discuss Data Analysis Design
  • Evaluate Tools for Data Science

Data Analysis

  • Terms used interchangeably
  • Asking intelligent questions
  • Finding patterns
  • Deriving insight
  • Decision making
  • Machine learning

Data Science - The Workflow Model

flowchart LR
    A[Asking Questions to Identify Problem]
    B[Obtain Data & Process it]
    C[Visualise Processed Data]
    D[Choose Appropriate Algorithm for Intended Result]
    E[Evaluate the Output]
    F[Attempt Different Approaches]
    G["Compare Results (Chosen Approach vs. Alternative Approaches)"]

    A-->B-->C-->D-->E-->F-->G-->A

The data science workflow model

Stage 1 of Workflow

flowchart LR
    A[Collect Information from Various Sources]
    B[Store the Collected Data]
    C["Identify the Targetable Information (From Raw Data)"]

    A-->B-->C

Stage 2 of Workflow

flowchart LR
    A[Classification]
    B[Regression]
    C[Clustering]

    A-->B-->C

Classification

Classification assigns each data point to a category (label), pinpointing the key data. Once the labels correctly describe the problem space for the business case, the problem can be assessed.

Algorithms: Decision Trees, Random Forests, Support Vector Machines, Neural Networks.
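As a minimal sketch of classification with one of the algorithms listed above, the snippet below trains a decision tree with scikit-learn. The toy data and the "flat"/"house" labels are hypothetical and exist only for illustration; they are not part of the lecture material.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: each row is [size in m^2, bedrooms].
X = [[45, 1], [60, 2], [80, 3], [100, 4], [120, 5], [150, 6]]
# Hypothetical category (label) for each row.
y = ["flat", "flat", "house", "house", "house", "house"]

# Fit a decision tree that learns to map features to labels.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X, y)

# Predict the category of two unseen properties.
predicted = clf.predict([[50, 1], [130, 5]])
print(list(predicted))
```

The output is a category for each new row rather than a number, which is what distinguishes classification from regression.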

Regression

Regression predicts a continuous numerical value rather than a category. It finds a relationship between one or more features of the dataset and uses it to predict a numerical target.

Example

Size of House (m^2)   Bedrooms   Price (£)
60                    2          150,000
80                    3          180,000
100                   4          220,000

In this example, regression fits the line that best matches these data points; the fitted line can then be used to predict prices for new houses.
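The fit can be sketched in plain Python using ordinary least squares on a single feature (house size). This is an illustrative sketch only: it assumes the last price in the table is 220,000, it ignores the bedrooms column (which rises in step with size in this table, so it adds no independent information), and the helper names `fit_line` and `predict_price` are hypothetical.

```python
# Toy data taken from the example table (last price assumed to be 220,000).
sizes = [60, 80, 100]                  # house size in m^2
prices = [150_000, 180_000, 220_000]   # price in GBP


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sizes, prices)


def predict_price(size_m2):
    """Predict a price for a new house size using the fitted line."""
    return slope * size_m2 + intercept


print(round(slope, 2), round(intercept, 2))
print(round(predict_price(90), 2))
```

The fitted line approximates the three points rather than passing through each one exactly: it minimises the total squared error between predicted and actual prices.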

Clustering


Data Science Libraries

  • NumPy
  • Matplotlib
  • SciPy
  • scikit-learn
  • Statsmodels
  • Seaborn
  • Blaze
  • Scrapy
  • SymPy
  • Bokeh
  • PANDAS

Data Science using PANDAS

PANDAS has numerous benefits for Data Science:

  • Data Analysis & Manipulation
  • Extensive Means for Data Analysis
  • Methods for Data Filtering
  • Fast, Flexible & User-friendly

Data Structures

DataFrame & Series form the basic data model.

DataFrame

Similar to an Excel worksheet: a two-dimensional table with named columns and an index of row labels (row numbers by default).

Series

A one-dimensional indexed array, allowing easy access to elements by label or position. Each column of a DataFrame is a Series.
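The two structures can be sketched as follows. The column names, the row labels "a" to "c" and the values are hypothetical, chosen only to show how a Series and a DataFrame relate.

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels.
prices = pd.Series([150_000, 180_000, 220_000], index=["a", "b", "c"], name="price")

# A DataFrame: a table whose columns share a single row index.
houses = pd.DataFrame(
    {"size_m2": [60, 80, 100], "bedrooms": [2, 3, 4]},
    index=["a", "b", "c"],
)

# Adding a Series as a new column aligns values on the index labels.
houses["price"] = prices

print(prices["b"])                  # access a Series element by label
print(houses.loc["c", "size_m2"])   # access a DataFrame cell by row and column label
print(type(houses["bedrooms"]))     # selecting one column returns a Series
```

Selecting a single column with `houses["bedrooms"]` gives back a Series, which is why the two structures are usually described together.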

Importing Dataset & Libraries

import pandas as pd
df = pd.read_csv("dataset.csv") # Reads dataset into DataFrame ready for viewing

Data Exploration

df.head(15) # Displays the first 15 rows
df.describe() # Summary statistics for numeric columns: count, mean, standard deviation, min, quartiles and max
df['Column_Name'].value_counts() # Counts how often each unique value occurs in a particular column
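To try these calls without a file on disk, the sketch below reads a small CSV from an in-memory string. The CSV contents and the column names (`size_m2`, `bedrooms`, `type`) are hypothetical stand-ins for `dataset.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for dataset.csv.
csv_text = """size_m2,bedrooms,type
60,2,flat
80,3,house
100,4,house
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

first_rows = df.head(2)                  # first 2 rows
summary = df.describe()                  # statistics for the numeric columns only
type_counts = df["type"].value_counts()  # frequency of each value in one column

print(first_rows)
print(summary)
print(type_counts)
```

By default `describe()` leaves out the text column `type`, so `value_counts()` is the more useful summary for categorical columns.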
