2 Lessons | 80 minutes

Data Transformation Challenge

Convert a school schedule from PDF format to structured CSV data

Two 40-minute lessons

Real-world skill: PDF data extraction is a common task in data analysis, administrative work, and automation projects.

Download both files (Schedule.pdf and Template.csv) before starting the challenge.

Lesson Plan

Two-Lesson Structure

Lesson 1: Analysis & Extraction

Focus: Understanding the data and planning the extraction

  • 10 min - Introduction to PDF data extraction
  • 15 min - Analyze Schedule.pdf structure
  • 10 min - Choose tools and methods
  • 5 min - Begin data extraction

Lesson 2: Transformation & Validation

Focus: Cleaning data and creating the final CSV

  • 10 min - Review and clean extracted data
  • 15 min - Transform to CSV format
  • 10 min - Validate against template
  • 5 min - Discussion and reflection

Pro tip: Take notes during Lesson 1 about the PDF structure. This will save time in Lesson 2.

Challenge Overview

The Challenge

Your Mission:

Transform unstructured schedule data from a PDF into a structured CSV file.

  1. Extract - Get data out of the PDF file using Python libraries or tools
  2. Clean - Organize the messy, unstructured text into logical groups
  3. Transform - Convert the data to match the CSV template format
  4. Validate - Check that your CSV matches the expected structure
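
The Validate step can be sketched as a structural comparison between your CSV and the template. This is a minimal sketch, not a required method: the function name `same_structure` and the two-slot sample text are hypothetical, and the check covers only the header row, day labels, and column counts, not the cell contents.

```python
import csv
import io


def same_structure(generated_text, template_text):
    """Return a list of structural problems (an empty list means the shapes match)."""
    generated = list(csv.reader(io.StringIO(generated_text)))
    template = list(csv.reader(io.StringIO(template_text)))
    if not generated or not template:
        return ["one of the files is empty"]

    problems = []
    if generated[0] != template[0]:
        problems.append("header row differs")
    if len(generated) != len(template):
        problems.append(f"row count {len(generated)} != {len(template)}")

    # Compare the data rows pairwise: same day label, same number of columns
    for row_g, row_t in zip(generated[1:], template[1:]):
        label = row_t[0] if row_t else "(blank row)"
        if row_g[:1] != row_t[:1]:
            problems.append(f"day label {row_g[:1]} != {row_t[:1]}")
        if len(row_g) != len(row_t):
            problems.append(f"{label}: {len(row_g)} columns, expected {len(row_t)}")
    return problems


# Hypothetical two-slot sample, not the real Template.csv
template = "Day,1 (9:00-9:40),2 (10:00-10:40)\nMonday,,x\nTuesday,y,\n"
print(same_structure(template, template))  # []
```

In practice you would read both files from disk and pass their text to the function; an empty list means the structure matches.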

Input File

Input: Schedule.pdf

This PDF contains unstructured school schedule data with:

  • Days of week in Russian (Пн, Вт, Ср, Чт, Пт)
  • Time slots (1-13 with specific times)
  • Class information (subject, class, room)
  • Teacher name at the bottom

PDF Content Preview:

01.09.2025 aSc Расписание 6A/6B ICT B24 Ict1 2А/2В/2С Maths B24 E5 7C/7D ICT B24 Ict1 ... Пн Вт Ср Чт Пт 1 9:00 - 9:40 2 10:00 - 10:40 ... Учитель Bob Santos

Challenge: The data is unstructured - you'll need to find patterns to extract it correctly.
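
One pattern that is visible in the preview is the time-slot sequence ("1 9:00 - 9:40 2 10:00 - 10:40"). The sketch below shows how a regular expression might pick it out and turn it into template-style column headers. The pattern and the function name `slot_headers` are assumptions based only on the preview text; the real PDF may need a different pattern.

```python
import re

# slot number, start time, "-", end time, e.g. "1 9:00 - 9:40"
SLOT_PATTERN = re.compile(r"\b(\d{1,2})\s+(\d{1,2}:\d{2})\s*-\s*(\d{1,2}:\d{2})")


def slot_headers(text):
    """Build headers such as '1 (9:00-9:40)' from raw extracted text."""
    return [f"{number} ({start}-{end})"
            for number, start, end in SLOT_PATTERN.findall(text)]


# Fragment copied from the PDF content preview
sample = "Пн Вт Ср Чт Пт 1 9:00 - 9:40 2 10:00 - 10:40"
print(slot_headers(sample))  # ['1 (9:00-9:40)', '2 (10:00-10:40)']
```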

Output File

Output: Template.csv

Your goal is to create a CSV file matching this structure:

  • First row: Column headers (Day, time slots)
  • Each following row: One day of the week (Monday-Friday)
  • Cells: Class information or empty if no class
  • Multi-line cells for detailed class info

CSV Structure Preview:

Day,1 (9:00-9:40),2 (10:00-10:40),3 (11:00-11:40)...
Monday,,"Subject: Maths Class: 2А/2В/2С E5 Room: B24",,,"Subject: ICT Class: 6A/6B Room: B24"...
Tuesday,"Subject: Технотрек Class: 7A/7B/7C/7D/7E Room: B24, B02"...

Note: Class information is formatted as "Subject: ... Class: ... Room: ..."
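
The cell format and the multi-line cells can be sketched with Python's built-in csv module, which quotes any cell containing a line break automatically. This sketch assumes each field sits on its own line inside the cell (the preview above is flattened, so check Template.csv for the real line breaks). The helper names `format_cell` and `rows_to_csv` are hypothetical, and the sample row is illustrative, not the real schedule.

```python
import csv
import io


def format_cell(subject, class_name, room):
    """Build one cell; assumes one field per line inside the cell."""
    return f"Subject: {subject}\nClass: {class_name}\nRoom: {room}"


def rows_to_csv(header, rows):
    """Serialize rows to CSV text; csv.writer adds quotes where needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


header = ["Day", "1 (9:00-9:40)", "2 (10:00-10:40)"]
# Illustrative placement only: empty slot 1, one class in slot 2
rows = [["Monday", "", format_cell("ICT", "6A/6B", "B24")]]
csv_text = rows_to_csv(header, rows)
print(csv_text)
```

When writing to a real file, open it with newline="" and encoding="utf-8" so line breaks inside cells and Cyrillic text are stored correctly.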

Tools

Recommended Tools

Choose from these options for the data extraction:

  • PyPDF2 - Basic PDF text extraction
  • pdfplumber - Advanced table extraction
  • tabula-py - Extract tables from PDF
  • pandas - Data cleaning & CSV export
  • Tabula (GUI) - Visual table extraction tool
  • Manual - Copy-paste & clean in spreadsheet

Suggestion: Start with pdfplumber for Python or Tabula GUI if you're new to PDF extraction.
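
If you pick pdfplumber, a first extraction attempt might look like the sketch below. It uses pdfplumber.open, pdf.pages, and page.extract_text from the library; the function names `extract_pdf_text` and `join_pages` are hypothetical, and the result will still be unstructured text that needs cleaning.

```python
def join_pages(page_texts):
    """Join page texts, skipping pages where no text was found (None or "")."""
    return "\n".join(text for text in page_texts if text)


def extract_pdf_text(pdf_path):
    """Return the text of every page in the PDF as one string."""
    import pdfplumber  # third-party package: pip install pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        page_texts = [page.extract_text() for page in pdf.pages]
    return join_pages(page_texts)


# Usage, once Schedule.pdf is in the working directory:
#   raw_text = extract_pdf_text("Schedule.pdf")
#   print(raw_text[:500])
```

Printing the first few hundred characters is a quick way to see how the library orders the text before you plan the cleaning step.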

Tips

Key Considerations

Important Details to Notice:

  1. Russian to English - Convert Пн, Вт, Ср, Чт, Пт to Monday, Tuesday, Wednesday, Thursday, Friday
  2. Time Slots - Match class information to the correct time slots (1-13 with specific times)
  3. Formatting - Follow the exact format: "Subject: ... Class: ... Room: ..." in CSV cells
  4. Empty Cells - Leave cells empty for time slots with no classes
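
The day conversion and the empty-cell rule can be sketched together: a lookup table translates the abbreviations, and every slot without a class becomes an empty string. The names `DAY_NAMES` and `build_row` are hypothetical, and the lesson placed in slot 2 is only an illustration, not taken from the real schedule.

```python
DAY_NAMES = {
    "Пн": "Monday",
    "Вт": "Tuesday",
    "Ср": "Wednesday",
    "Чт": "Thursday",
    "Пт": "Friday",
}
SLOT_COUNT = 13  # the schedule lists time slots 1-13


def build_row(day_abbrev, lessons, slot_count=SLOT_COUNT):
    """Build one CSV row; lessons maps a 1-based slot number to cell text."""
    cells = [lessons.get(slot, "") for slot in range(1, slot_count + 1)]
    return [DAY_NAMES[day_abbrev]] + cells


# Illustrative placement only
row = build_row("Пн", {2: "Subject: ICT\nClass: 6A/6B\nRoom: B24"})
print(row[0], len(row))  # Monday 14
```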

Common Challenges:

  • Handling multi-room assignments (e.g., "B24,B02")
  • Dealing with split classes (e.g., "6A/6B")
  • Identifying which classes belong to which time slots
  • Managing multi-line cells in the CSV
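
For the first two challenges, a small helper can split combined values and rejoin rooms in one consistent style. This is a sketch under the assumption that rooms are separated by commas and split classes by slashes, as in the examples above; the names `split_values` and `join_rooms` are hypothetical.

```python
import re


def split_values(value):
    """Split 'B24,B02' or '6A/6B' into a list of trimmed parts."""
    return [part.strip() for part in re.split(r"[,/]", value) if part.strip()]


def join_rooms(rooms):
    """Join rooms with ', ' as shown in the template preview ('B24, B02')."""
    return ", ".join(rooms)


print(split_values("6A/6B"))                # ['6A', '6B']
print(join_rooms(split_values("B24,B02")))  # B24, B02
```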

Reflection

Discussion Questions

After completing the challenge, consider these questions:

  • What was the most challenging part of extracting data from the PDF?
  • How did you handle the Russian day abbreviations?
  • What pattern recognition strategies worked best?
  • How would you validate that all data was extracted correctly?
  • If the PDF format changed next semester, how could you make your solution more flexible?
  • What real-world applications can you think of for PDF data extraction skills?

Learning outcome: This challenge develops problem-solving, pattern recognition, and data transformation skills applicable to many real-world scenarios.