diff --git a/.DS_Store b/.DS_Store
index b1f811b..21c159c 100644
Binary files a/.DS_Store and b/.DS_Store differ
diff --git a/PDF to CSV Challenge_ School Schedule Transformation.html b/PDF to CSV Challenge_ School Schedule Transformation.html
new file mode 100644
index 0000000..dda94d5
--- /dev/null
+++ b/PDF to CSV Challenge_ School Schedule Transformation.html
@@ -0,0 +1,936 @@
+
+ PDF to CSV Challenge: School Schedule Transformation
+
+ +
+
+
2 Lessons | 80 minutes
+
1
+ +

Data Transformation Challenge

+

Convert a school schedule from PDF format to structured CSV data

+ +
+ Two 40-minute lessons +
+ +
+

Real-world skill: PDF data extraction is a common task in data analysis, administrative work, and automation projects.

+
+ + + +

+ Download both files before starting the challenge +

+
+
+ + +
+
+
Lesson Plan
+
2
+

Two-Lesson Structure

+ +
+

Lesson 1: Analysis & Extraction

+

Focus: Understanding the data and planning the extraction

+
    +
  • 10 min - Introduction to PDF data extraction
  • 15 min - Analyze Schedule.pdf structure
  • 10 min - Choose tools and methods
  • 5 min - Begin data extraction
+
+ +
+

Lesson 2: Transformation & Validation

+

Focus: Cleaning data and creating the final CSV

+
    +
  • 10 min - Review and clean extracted data
  • 15 min - Transform to CSV format
  • 10 min - Validate against template
  • 5 min - Discussion and reflection
+
+ +
+

Pro tip: Take notes during Lesson 1 about the PDF structure. This will save time in Lesson 2.

+
+
+
+ + +
+
+
Challenge Overview
+
3
+

The Challenge

+ +

Your Mission:

+

Transform unstructured schedule data from a PDF into a structured CSV file.

+ +
+
+
1
+
+

Extract

+

Get data out of the PDF file using Python libraries or tools

+
+
+ +
+
2
+
+

Clean

+

Organize the messy, unstructured text into logical groups

+
+
+ +
+
3
+
+

Transform

+

Convert the data to match the CSV template format

+
+
+ +
+
4
+
+

Validate

+

Check that your CSV matches the expected structure
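The validation step can be sketched with the standard `csv` module. This is a minimal example, assuming the template and your output are available as CSV text; the function names `read_rows` and `structure_problems` and the three checks (header row, day column, cell count per record) are illustrative choices, not requirements stated by the challenge.

```python
import csv
import io


def read_rows(csv_text):
    # csv.reader handles quoted multi-line cells; blank records are skipped.
    return [row for row in csv.reader(io.StringIO(csv_text, newline="")) if row]


def structure_problems(template_text, output_text):
    """Return a list of structural differences (empty list means none found)."""
    template = read_rows(template_text)
    output = read_rows(output_text)
    if not template or not output:
        return ["one of the files is empty"]
    problems = []
    if output[0] != template[0]:
        problems.append("header row differs from the template")
    if [row[0] for row in output[1:]] != [row[0] for row in template[1:]]:
        problems.append("day column differs from the template")
    width = len(template[0])
    # Record numbers count CSV records, not physical lines in the file.
    for record_no, row in enumerate(output[1:], start=2):
        if len(row) != width:
            problems.append(f"record {record_no} has {len(row)} cells, expected {width}")
    return problems


template = 'Day,1 (9:00-9:40),2 (10:00-10:40)\nMonday,,"Subject: Maths\nRoom: B24"\nTuesday,,\n'
print(structure_problems(template, template))
```

In practice the two strings would come from reading Template.csv and your generated file. This check only compares structure; it does not confirm that each class landed in the right cell.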

+
+
+
+ + +
+
+ + +
+
+
Input File
+
4
+

Input: Schedule.pdf

+

This PDF contains unstructured school schedule data with:

+ +
    +
  • Days of week in Russian (Пн, Вт, Ср, Чт, Пт)
  • Time slots (1-13 with specific times)
  • Class information (subject, class, room)
  • Teacher name at the bottom
+ +

PDF Content Preview:

+
+01.09.2025
+aSc Расписание
+6A/6B ICT B24 Ict1
+2А/2В/2С Maths B24 E5
+7C/7D ICT B24 Ict1
+...
+Пн Вт Ср Чт Пт
+1 9:00 - 9:40
+2 10:00 - 10:40
+...
+Учитель Bob Santos
+
+ +
+

Challenge: The data is unstructured - you'll need to find patterns to extract it correctly.
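One pattern that is visible in the preview is the time-slot line, such as `1 9:00 - 9:40`. The sketch below uses a regular expression to pick those lines out of already extracted text and turn them into column headers in the style `1 (9:00-9:40)`. The names `SLOT_RE` and `slot_headers` are hypothetical, and the pattern assumes the real PDF text looks like the preview.

```python
import re

# Matches lines such as "1 9:00 - 9:40": slot number, start time, end time.
SLOT_RE = re.compile(r"^(\d{1,2})\s+(\d{1,2}:\d{2})\s*-\s*(\d{1,2}:\d{2})$")


def slot_headers(lines):
    """Return headers like '1 (9:00-9:40)' for every time-slot line found."""
    headers = []
    for line in lines:
        match = SLOT_RE.match(line.strip())
        if match:
            number, start, end = match.groups()
            headers.append(f"{number} ({start}-{end})")
    return headers


sample = [
    "01.09.2025",
    "6A/6B ICT B24 Ict1",
    "Пн Вт Ср Чт Пт",
    "1 9:00 - 9:40",
    "2 10:00 - 10:40",
    "Учитель Bob Santos",
]
print(slot_headers(sample))
```

Lines that do not match, such as the date or the class entries, are ignored, so the same function can be run over the whole extracted text.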

+
+
+
+ + +
+
+
Output File
+
5
+

Output: Template.csv

+

Your goal is to create a CSV file matching this structure:

+ +
    +
  • First row: Column headers (Day, time slots)
  • Each row: A day of the week (Monday-Friday)
  • Cells: Class information or empty if no class
  • Multi-line cells for detailed class info
+ +

CSV Structure Preview:

+
+Day,1 (9:00-9:40),2 (10:00-10:40),3 (11:00-11:40)...
+Monday,,"Subject: Maths Class: 2А/2В/2С E5
+Room: B24",,,"Subject: ICT Class: 6A/6B Room: B24"...
+Tuesday,"Subject: Технотрек Class: 7A/7B/7C/7D/7E Room: B24, B02"...
+
+ +
+

Note: Class information in each cell follows the format "Subject: ... Class: ... Room: ..."
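Multi-line cells are easiest to get right by letting Python's `csv` module do the quoting. The sketch below writes a header and two rows to an in-memory buffer and reads them back. The function name `write_schedule`, the two-slot header, and the newline between fields are assumptions made for illustration; check Template.csv for the exact layout.

```python
import csv
import io


def write_schedule(header, rows, out):
    # The writer quotes any cell that contains a comma, a quote or a newline.
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


header = ["Day", "1 (9:00-9:40)", "2 (10:00-10:40)"]
rows = [
    ["Monday", "", "Subject: Maths\nClass: 2А/2В/2С E5\nRoom: B24"],
    ["Tuesday", "", ""],  # empty strings become empty cells
]

buffer = io.StringIO()
write_schedule(header, rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back returns the same cells, line breaks included.
round_trip = list(csv.reader(io.StringIO(csv_text, newline="")))
```

When writing to a real file, open it with `newline=""` and an explicit encoding such as `utf-8` so the Cyrillic text and the embedded line breaks are preserved.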

+
+
+
+ + +
+
+
Tools
+
6
+

Recommended Tools

+

Choose from these options for the data extraction:

+ +
+
+ +

PyPDF2

+

Basic PDF text extraction

+
+ +
+ +

pdfplumber

+

Advanced table extraction

+
+ +
+ +

tabula-py

+

Extract tables from PDF

+
+
+ +
+
+ +

pandas

+

Data cleaning & CSV export

+
+ +
+ +

Tabula (GUI)

+

Visual table extraction tool

+
+ +
+ +

Manual

+

Copy-paste & clean in spreadsheet

+
+
+ +
+

Suggestion: Start with pdfplumber for Python or Tabula GUI if you're new to PDF extraction.

+
+
+
+ + +
+
+
Tips
+
7
+

Key Considerations

+ +

Important Details to Notice:

+
+
+
1
+
+

Russian to English

+

Convert Пн, Вт, Ср, Чт, Пт to Monday, Tuesday, Wednesday, Thursday, Friday

+
+
+ +
+
2
+
+

Time Slots

+

Match class information to the correct time slots (1-13 with specific times)

+
+
+ +
+
3
+
+

Formatting

+

Follow the exact format: "Subject: ... Class: ... Room: ..." in CSV cells

+
+
+ +
+
4
+
+

Empty Cells

+

Leave cells empty for time slots with no classes
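The four considerations above can be combined in one small sketch. It assumes the extraction step has already produced `(day_abbreviation, slot_number, subject, class_name, room)` tuples. That tuple shape, the names `DAY_NAMES`, `format_cell` and `build_rows`, and the newline separator between fields are illustrative assumptions rather than part of the challenge files.

```python
# Russian day abbreviations from Schedule.pdf mapped to the English day names.
DAY_NAMES = {
    "Пн": "Monday",
    "Вт": "Tuesday",
    "Ср": "Wednesday",
    "Чт": "Thursday",
    "Пт": "Friday",
}
SLOT_COUNT = 13  # the schedule lists time slots 1-13


def format_cell(subject, class_name, room, sep="\n"):
    """Build one cell in the 'Subject: ... Class: ... Room: ...' format.

    The separator between fields is an assumption; compare with Template.csv.
    """
    return sep.join(
        [f"Subject: {subject}", f"Class: {class_name}", f"Room: {room}"]
    )


def build_rows(entries):
    """Place each entry in its day row and slot column; other cells stay empty."""
    grid = {day: [""] * SLOT_COUNT for day in DAY_NAMES.values()}
    for day_abbr, slot, subject, class_name, room in entries:
        day = DAY_NAMES[day_abbr]
        grid[day][slot - 1] = format_cell(subject, class_name, room)
    return [[day] + cells for day, cells in grid.items()]


entries = [
    ("Пн", 2, "Maths", "2А/2В/2С E5", "B24"),
    ("Вт", 1, "Технотрек", "7A/7B/7C/7D/7E", "B24, B02"),
]
rows = build_rows(entries)
print(rows[0][:3])
```

Starting from a grid of empty strings means that any slot without a class stays empty automatically, and the rows come out in Monday to Friday order because the dictionary keeps its insertion order.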

+
+
+
+ +

Common Challenges:

+
    +
  • Handling multi-room assignments (e.g., "B24,B02")
  • Dealing with split classes (e.g., "6A/6B")
  • Identifying which classes belong to which time slots
  • Managing multi-line cells in the CSV
+
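For the first two challenges, plain string splitting is often enough. The helpers below are a hedged sketch: the names `split_rooms`, `split_classes` and `normalize_rooms` are hypothetical, and they assume rooms are separated by commas and class labels by slashes, as in the examples above.

```python
def split_rooms(raw):
    """Split a room field such as 'B24,B02' into ['B24', 'B02']."""
    return [part.strip() for part in raw.split(",") if part.strip()]


def split_classes(raw):
    """Split a class label such as '6A/6B' into ['6A', '6B']."""
    return [part.strip() for part in raw.split("/") if part.strip()]


def normalize_rooms(raw):
    """Rewrite a room field with a consistent ', ' separator."""
    return ", ".join(split_rooms(raw))


print(split_rooms("B24,B02"))
print(split_classes("6A/6B"))
print(normalize_rooms("B24,B02"))
```

Normalizing the separator before writing the CSV keeps room lists consistent regardless of how the PDF text spaced them.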
+
+ + +
+
+
Reflection
+
8
+

Discussion Questions

+ +

After completing the challenge, consider these questions:

+ +
+
  • What was the most challenging part of extracting data from the PDF?
  • How did you handle the Russian day abbreviations?
  • What pattern recognition strategies worked best?
  • How would you validate that all data was extracted correctly?
  • If the PDF format changed next semester, how could you make your solution more flexible?
  • What real-world applications can you think of for PDF data extraction skills?
    + +
    +

    Learning outcome: This challenge develops problem-solving, pattern recognition, and data transformation skills applicable to many real-world scenarios.

    +
    + + +
    +
    + + + +
    +
\ No newline at end of file
diff --git a/Schedule.pdf b/Schedule.pdf
new file mode 100644
index 0000000..4305ca7
Binary files /dev/null and b/Schedule.pdf differ
diff --git a/repo_chat_bots/.DS_Store b/repo_chat_bots/.DS_Store
index a0da7cf..25d1364 100644
Binary files a/repo_chat_bots/.DS_Store and b/repo_chat_bots/.DS_Store differ
diff --git a/repo_chat_bots/yandex-chat-app/.DS_Store b/repo_chat_bots/yandex-chat-app/.DS_Store
new file mode 100644
index 0000000..00fb8d1
Binary files /dev/null and b/repo_chat_bots/yandex-chat-app/.DS_Store differ
diff --git a/training_data/Schedule.png b/training_data/Schedule.png
new file mode 100644
index 0000000..6a91b09
Binary files /dev/null and b/training_data/Schedule.png differ