Web Scraping and end-to-end data pipelines

Régis Amichia & Thomas Pical

Please note: to take this course, you must first have completed Advanced Machine Learning & Programming in Python.

This course is an introduction to web scraping. It covers basic scraping techniques, reviews common Python libraries and frameworks for web scraping, shows how to extract data from HTML and XML, and discusses more advanced techniques. The course also covers the basics of data engineering, including data ingestion, cleaning, and transformation, as well as data storage and retrieval.

This module can be taken as part of a PG Certificate, PG Diploma, or full Masters programme.

Key Skills

By the end of this course, participants should have the following knowledge and abilities:

Data Engineering Basics: Fundamental understanding of data engineering principles.

Data Pipeline Setup: Ability to set up data pipelines for efficient data processing.

Web Scraping Techniques: Skills in collecting data from the web using scraping methods.

Tool Proficiency: Familiarity with tools necessary for data engineering tasks.

Data Processing Skills: Handling and processing data efficiently within pipelines.

Understanding Tools for Data Engineering: Knowledge of tools essential for data engineering tasks.

Data Collection from the Web: Competence in extracting data from web sources.

Practical Application: Applying data engineering skills to real-world scenarios.

Desired Skills

By the end of this course, students should be able to:

Understand the basics of web scraping and its applications.

Extract data from HTML and XML using web scraping techniques.

Use common libraries and frameworks for web scraping in Python, such as BeautifulSoup and others.

Handle advanced web scraping challenges, such as dealing with dynamic websites and avoiding detection.

Communicate effectively about web scraping techniques and their applications.

Understand the basics of data engineering and the role of end-to-end data pipelines in data analytics.

Design and implement data pipelines for data ingestion, cleaning, and transformation.

Use common tools and technologies for building data pipelines, such as Apache Airflow.

Structure

Web Scraping and end-to-end data pipelines is an elective 10-credit course, so students are expected to devote approximately 100 hours of study to it.

There are 15 contact hours in total, which leaves 85 hours for private study.

Lectures

This module consists of a 2-hour lecture per day for 5 days, plus a 1-hour tutorial per day.

There will be optional clinics on the last day of the course.

Lecture dates are confirmed closer to the start of each term. If you have any questions about dates, please contact edu@timberlake.co.uk.