Fine-tuned NER models for banking and regulation texts, trained on eCFR Title 12 using manual and few-shot (GPT 3.5 v3) annotations.

ericphann/eCFR-title12_NER

⚖️ NER Models - eCFR Title 12 🏦

Please see the executive write-up for metrics and process details.

Team

  • Eric Phann (data, programming, modeling)
  • Kristen Zhang (annotation, reporting, documentation)
  • Yaxin Zhao (annotation, research, procedure)
  • Sydney Kelly (annotation, future considerations)
  • Jake Stallard (annotation, future considerations)

Contents

  • corpuses folder (configs, .spacy files, etc. for each pipeline)
  • data folder (few-shot, manual, and unlabeled data)
  • models folder (best/last model for each type)
  • milestones 2 & 3 folder (prior deliverables)
  • spacy-llm folder (materials for generating few-shot annotations)
  • ecfr_ner_models.ipynb (step-by-step Colab notebook)
  • write-up.pdf (executive summary; conclusions)
  • requirements.txt (for reproducibility)

Dataset

Processing

  • Generate Entity Labels, Definitions, and Few-shot Data
  • Train/Test a Model Using ecfr-few-shot.jsonl
  • Compile Metrics and Review
  • Label 100 examples from ecfr-unlabeled.jsonl
  • Review Labels and Refine Annotation Guidelines
  • Create a Final Test Dataset (ecfr-manual.jsonl)
  • Model Development
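
The "Compile Metrics and Review" step can be approximated with exact-match, entity-level scoring. The following is a minimal sketch, not the notebook's actual code: it assumes each JSONL record has a `text` field and a `spans` list of `{"start", "end", "label"}` objects (the real schema of ecfr-few-shot.jsonl and ecfr-manual.jsonl may differ), and the function names `load_spans` and `entity_prf` are hypothetical.

```python
import json
from collections import Counter


def load_spans(lines):
    """Parse JSONL lines into {text: set of (start, end, label)}.

    Assumed record shape (not confirmed by the repository):
    {"text": "...", "spans": [{"start": 0, "end": 4, "label": "ORG"}]}
    """
    records = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        records[rec["text"]] = {
            (s["start"], s["end"], s["label"]) for s in rec.get("spans", [])
        }
    return records


def entity_prf(gold, pred):
    """Exact-match precision, recall, and F1 per label.

    Only texts present in `gold` are scored; predictions for other
    texts are ignored.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for text, gold_spans in gold.items():
        pred_spans = pred.get(text, set())
        for span in pred_spans & gold_spans:
            tp[span[2]] += 1  # correct boundaries and label
        for span in pred_spans - gold_spans:
            fp[span[2]] += 1  # predicted but not in gold
        for span in gold_spans - pred_spans:
            fn[span[2]] += 1  # in gold but missed
    scores = {}
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        scores[label] = {"precision": p, "recall": r, "f1": f}
    return scores
```

To use it, open the gold and predicted files, pass each file object to `load_spans`, and hand the two results to `entity_prf`. Exact matching is strict: a span with the right label but boundaries off by one character counts as both a false positive and a false negative.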

Models

  • few-shot-model
  • manual-model
  • mixed-model

Future Work

  • Refine Annotation Guidelines
  • Expand Dataset
  • Fine-tuning with Prodigy and spaCy
  • Chunking Data
  • Data Privacy and Security
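
As one possible reading of "Chunking Data", long regulation sections could be split into overlapping word windows before annotation or inference. This is a hedged sketch only: the function name `chunk_words` and the default sizes are assumptions, and the write-up may have a different chunking strategy in mind.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into chunks of at most `max_words` whitespace-separated
    words, with consecutive chunks sharing `overlap` words.

    Whitespace is normalized to single spaces, so character offsets of
    existing annotations would need to be remapped separately.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("require max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks
```

The overlap keeps an entity that straddles a window boundary fully visible in at least one chunk, as long as the entity is no longer than `overlap` words.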
