Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import festival syllabus #44

Closed
blinklet opened this issue Aug 22, 2024 · 7 comments · Fixed by #49
Closed

Import festival syllabus #44

blinklet opened this issue Aug 22, 2024 · 7 comments · Fixed by #49

Comments

@blinklet
Copy link
Owner

Need to import the local festival syllabus, which contains information from the provincial syllabus plus fees:

  • class number
  • description (which is an aggregation of class information)
  • fee

The document to look at are:

  • Music Festival Reference Material/03 Syllabus/Full-syllabus-FINAL-2.pdf
  • Music Festival Reference Material/03 Syllabus/QCMF-fees-syllablist-final.pdf
@blinklet
Copy link
Owner Author

blinklet commented Aug 23, 2024

To implement this, I am considering changing the schema of the classes table. I need the following fields:

  • Discipline (Piano, Strings, etc)
  • Instrument (Piano, Violin, etc)
  • Type (Solo, Duet, Large Group, etc)
  • Category (None, Masterwork Classes, Graded Solo Classes, etc)
  • Category Type (Sonata, "Graded Study Class", "Unaccompanied Bach", etc)
  • Category Details ("One Movement", "Complete", , etc)
  • Age Category (Junior, Senior, None, etc)
  • Age Detail ("10 and under", etc)
  • Level ("Grade 8 to 10", "First Year of Study", "Minimum RCM Grade 3", etc)

I should probably create tables for "Discipline", "Instrument", "Category", and "Age Category" so I can capture the "Information applying to all classes" information for the Instrument and "Additional information" for the Category. In addition, capture the standard age ranges for the Age Categories (which can be overridden by the Age Detail).

For PEI Queens County Music Festival, the Syllabus descriptions consist of the following fields, in order, separated by dashes:

  • Instrument (Piano, Violin, etc)
  • Type (Solo, Duet, Large Group, etc)
  • Category Type (Sonata, "Graded Study Class", "Unaccompanied Bach", etc)
  • Category Details ("One Movement", "Complete", etc)
  • Age Category (Junior, Senior, None, etc)
  • Age Detail ("10 and under", etc)
  • Level ("Grade 8 to 10", "First Year of Study", "Minimum RCM Grade 3", etc)
  • Repertoire code ("List A", "List B", etc)

Where the value of a field is "None", nothing is printed and there is still only one dash before the next field.
So, for example:

  • Piano Solo - Sonata - One movement - Intermediate - Grade 5 to Grade 8
  • Violin Solo - Unaccompanied Bach - Junior
  • Violin Solo - Graded Classes - Grade 3 - List B

In some cases, the syllabus breaks this rule so I will have to "enforce" a change in the syllabus and ask if the client is OK with that.

@blinklet
Copy link
Owner Author

Next, how to ingest the existing Syllabus. Could I use NLP?

@blinklet
Copy link
Owner Author

For now, just use the file, Music Festival Reference Material/03 Syllabus/QCMF-fees-syllablist-final.pdf. It is all one simple table with course number, description (which is usable as a title), and fee. This code will be "throw-away" code that is useful only during the development phase where I am trying to quickly build a test data set.

Procedure will be:

  1. import the syllabus with fees
  2. import the Google form results

In the future, before the first release, create a new schema for the classes and use the setup pages to manually add each class (maybe find way to add multiple classes in the same page).

@blinklet
Copy link
Owner Author

blinklet commented Aug 23, 2024

I installed Excalibur, the web-interface for Camelot.

$ apt install ghostscript python3-tk
$ python3 -m venv .venv2
$ . .venv2/bin/activate
(.venv2) $ pip install excalibur-py
(.venv2) $ excalibur initdb
(.venv2) $ excalibur webserver

When I ran the excalibur initdb command, I saw the Python error: ImportError: cannot import name 'MutableMapping' from 'collections' (/usr/lib/python3.12/collections/__init__.py). It seems that the Excalibur project is using an extremely old version of Flask and that relies on functionality that was deprecated in Python 3.10. I could probably hack around this but I want something that works "as is".

Cancelling the Excalibur effort. Next, will try Camelot command line interface.

@blinklet
Copy link
Owner Author

Installed Camelot:

$ apt install ghostscript python3-tk
$ python3 -m venv .venv2
$ . .venv2/bin/activate
(.venv2) $ pip install tk
(.venv2) $ pip install "camelot-py[base]"

Test the install by running Camelot command:

(.venv2) $ camelot --version

This raises an import error. It seems that the install documentation left out a dependency. I need to install OpenCV:

$ (.venv) pip install opencv-python-headless

Now, the camelot --version command works.

Then, I ran the command:

(.venv2) $ camelot -o test.csv -f csv stream QCMF-fees-syllablist-final.pdf

I got a deprecation error: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.

Worked around this with the following commands:

(.venv2) $ pip uninstall 'camelot-py[base]'
(.venv2) $ pip uninstall PyPDF2
(.venv2) $ pip install PyPDF2==2.12.1
(.venv2) $ pip install 'camelot-py[base]'

Now the Camelot command seems to work. Camelot seems to use the filename in the"-o" option as a basis for output filenames and adds page and table information to the filename. Opened the test-page-1-table-1.csv file and saw it only processed the first page. Looks like that's the default. I specified the "-p" option and tried again:

(.venv2) $ camelot -o test.csv -f csv -p 1-end stream QCMF-fees-syllablist-final.pdf

This time, Camelot processed all the pages, but created a separate CSV file for each page. There seems to be no option in the Camelot CLI to output a single large CSV. So, I solved the problem with the following Bash script:

for file in *.csv; do
    sed -i '1,4d' "$file"
done
cat *.csv > combined.csv

Now, I can edit the file in a spreadsheet program.

@blinklet
Copy link
Owner Author

blinklet commented Aug 23, 2024

I decided to not use the Camelot CLI and, instead, incorporate the Camelot library into my program and directly read the PDF file. This lets me include headers, split suffixes from the class numbers, add a header row, etc. I wrote a quick script to test the process and it works reliably. The script is shown below:

import camelot
import csv
import pandas as pd

input_pdf = 'QCMF-fees-syllablist-final.pdf'
output_csv = 'processed.csv'

tables = camelot.read_pdf(input_pdf, pages='all', flavor='stream', strip_text='\n')

combined_df = pd.DataFrame()

for table in tables:
    df = table.df  # turn table into a pandas dataframe
    
    # Drop the first 4 rows of each table
    df = df.drop(df.index[:4])  
    
    df.columns = ["Number","Description", "Fee", "Festival"]
    df["Suffix"] = pd.Series(dtype="string")
    
    df.Suffix = df.Number.str[4:]
    df.Number = df.Number.str[0:4]
    
    df = df[["Number", "Suffix", "Description", "Fee", "Festival"]]
    
    combined_df = pd.concat([combined_df, df], ignore_index=True)

combined_df.to_csv(output_csv, index=False, quoting=csv.QUOTE_ALL)
print(f"Processed file saved as {output_csv}")

I also found a reliable way to install Camelot: use the "cv" option instead of "base". This avoids the opencv dependency issue. I still also need to pin the version of PyPDF:

(.venv) $ pip install "camelot-py[cv]"
(.venv) $ pip install PyPDF2==2.12.1

Camelot works really well but I am concerned it is not well maintained. I will use it, for now, but need to keep an eye out for a better solution. It looks like the only other easy-to-use Python library for reading tables in PDFs is tabula-py...

@blinklet
Copy link
Owner Author

In a Flask web-app, the Camelot library does not support "file-like-objects" in memory. So, I switched to using tabula-py.

See PR for the details.

@blinklet blinklet linked a pull request Aug 26, 2024 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant