CO3722 Data Science
CO3722 Lecture 2 - Ethics
Lecture Documents
Written Notes

Learning Objectives
- Discuss Data Analysis Design
- Evaluate Tools for Data Science
Data Analysis
- Terms used interchangeably
- Asking intelligent questions
- Finding patterns
- Deriving insight
- Decision making
- Machine learning
Data Science - The Workflow Model
flowchart LR
A[Asking Questions to Identify Problem]
B[Obtain Data & Process it]
C[Visualise Processed Data]
D[Choose Appropriate Algorithm for Intended Result]
E[Evaluate the Output]
F[Attempt Different Approaches]
G["Compare Results (Chosen Approach vs Alternative Approaches)"]
A-->B-->C-->D-->E-->F-->G-->A
The data science workflow model
Stage 1 of Workflow
flowchart LR
A[Collect Information from Various Sources]
B[Store the Collected Data]
C["Identify the Targetable Information (From Raw Data)"]
A-->B-->C
Stage 2 of Workflow
flowchart LR
A[Classification]
B[Regression]
C[Clustering]
A-->B-->C
Classification
Classification assigns data points to categories (labels), pinpointing the key data. Once the labels correctly reflect the problem space for the business case, the problem set can be assessed.
Algorithms: Decision Trees, Random Forests, Support Vector Machines, Neural Networks.
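As a minimal sketch of classification, the example below trains a decision tree (one of the algorithms listed above) with scikit-learn. The tiny dataset, the meaning of the features and the label names are all invented for illustration and are not from the lecture.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data: [hours_studied, classes_attended] per student
X_train = [[1, 2], [2, 1], [3, 3], [8, 9], [9, 8], [10, 10]]
# Hypothetical category (label) for each row above
y_train = ["fail", "fail", "fail", "pass", "pass", "pass"]

# Fit a decision tree; random_state makes the result repeatable
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

# Predict the category of two unseen students
predictions = model.predict([[2, 2], [9, 9]])
print(list(predictions))
```

The output of a classifier is always one of the known labels, which is what separates it from regression.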
Regression
Regression predicts a continuous numerical value, NOT a category. It finds a relationship between one or more features of the dataset and uses that relationship to predict a numerical target.
Example
| Size of House (m^2) | Bedrooms | Price (£) |
|---|---|---|
| 60 | 2 | 150,000 |
| 80 | 3 | 180,000 |
| 100 | 4 | 220,000 |
For this example, regression fits a line as closely as possible to these data points. The fitted line can then be used to predict the price of a house that is not in the table.
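The fit described above can be sketched with NumPy's least-squares `polyfit`. This is one possible approach rather than the lecture's own method. It assumes the last price in the table is 220,000, and it uses house size as the only feature, because in this table the bedroom count rises in lockstep with size and three rows cannot separate the two effects. The column names and the 90 m^2 house are hypothetical.

```python
import numpy as np
import pandas as pd

# The example table, assuming the last price is 220,000
houses = pd.DataFrame({
    "size_m2": [60, 80, 100],
    "bedrooms": [2, 3, 4],
    "price_gbp": [150_000, 180_000, 220_000],
})

# Least-squares straight line: price = slope * size + intercept
slope, intercept = np.polyfit(houses["size_m2"], houses["price_gbp"], deg=1)

# Predict the price of a hypothetical 90 m^2 house
predicted_price = slope * 90 + intercept
print(f"slope={slope:.0f}, intercept={intercept:.0f}, predicted={predicted_price:.0f}")
```

The fitted slope is about £1,750 per extra square metre. The line does not pass exactly through all three points; it minimises the squared error across them.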
Clustering
Data Science Libraries
- NumPy
- Matplotlib
- SciPy
- Scikit-learn
- Statsmodels
- Seaborn
- Blaze
- Scrapy
- SymPy
- Bokeh
- PANDAS
Data Science using PANDAS
PANDAS has numerous benefits for Data Science:
- Data Analysis & Manipulation
- Extensive Means for Data Analysis
- Methods for Data Filtering
- Fast, Flexible & User-friendly
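The filtering and manipulation points above can be illustrated with a short sketch. The city names, prices and column names are invented for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration
df = pd.DataFrame({
    "city": ["Leicester", "Derby", "Leicester", "Nottingham"],
    "price": [150_000, 180_000, 220_000, 200_000],
})

# Filtering with a boolean mask: keep rows where price is above 170,000
expensive = df[df["price"] > 170_000]

# Combine conditions with & (and) or | (or); each condition needs parentheses
leicester_expensive = df[(df["city"] == "Leicester") & (df["price"] > 170_000)]

# Manipulation: add a derived column, then sort by it
df["price_k"] = df["price"] / 1000
print(df.sort_values("price_k", ascending=False))
```

Each filter returns a new DataFrame, so the original `df` is left unchanged by the two filtering lines.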
Data Structures
DataFrame & Series form the basic data model.
DataFrame
Similar to a worksheet in an Excel workbook: a two-dimensional table with labelled columns and an index of row labels (row numbers by default).
Series
A one-dimensional indexed array, allowing easy access to elements by label or by position.
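A small sketch of the two structures follows. The values, index labels and column names are hypothetical.

```python
import pandas as pd

# A Series: one-dimensional values with an index (labels are hypothetical)
prices = pd.Series([150_000, 180_000, 220_000], index=["a", "b", "c"])
print(prices["b"])     # access by label
print(prices.iloc[0])  # access by position

# A DataFrame: a table whose columns share one row index
df = pd.DataFrame({"size_m2": [60, 80, 100], "bedrooms": [2, 3, 4]})
print(df.columns.tolist())  # column labels
print(df.index.tolist())    # default row labels: 0, 1, 2
print(type(df["size_m2"]))  # selecting a single column returns a Series
```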
Importing Dataset & Libraries
import pandas as pd
df = pd.read_csv("dataset.csv") # Reads dataset into DataFrame ready for viewing
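Since `dataset.csv` is not included with these notes, the sketch below reads the same kind of data from an in-memory string instead. The CSV content and column names are hypothetical. `read_csv` accepts a file-like object as well as a path, so the call is otherwise the same.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """size_m2,bedrooms,price
60,2,150000
80,3,180000
100,4,220000
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.dtypes)  # data type inferred for each column
print(df.head())  # first rows (up to 5 by default)
```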
Data Exploration
df.describe() # Summary statistics for each numeric column: count, mean, standard deviation, min, quartiles & max
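To show what `describe()` returns, here is a minimal sketch on a small, invented DataFrame. The column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical numeric data, invented for illustration
df = pd.DataFrame({"size_m2": [60, 80, 100], "bedrooms": [2, 3, 4]})

summary = df.describe()
print(summary)

# The result is itself a DataFrame, so single statistics can be looked up
print(summary.loc["mean", "size_m2"])
print(summary.loc["50%", "bedrooms"])  # the 50% quartile is the median
```

By default only numeric columns are summarised, and the standard deviation reported is the sample standard deviation.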