Skip to content

Commit

Permalink
inital commit with workflow
Browse files Browse the repository at this point in the history
  • Loading branch information
woodthom2 committed Aug 1, 2023
1 parent 080cc20 commit 8e22d65
Show file tree
Hide file tree
Showing 34 changed files with 272,408 additions and 0 deletions.
27 changes: 27 additions & 0 deletions .github/workflows/release.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
name: Release Pypi package

on:
release:
types: [created]
workflow_dispatch:

jobs:
deploy:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Set up Python
uses: actions/setup-python@v2
with:
python-version: "3.x"
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install setuptools wheel twine
- name: Build and publish
env:
TWINE_USERNAME: ${{ secrets.TWINE_USERNAME }}
TWINE_PASSWORD: ${{ secrets.TWINE_PASSWORD }}
run: |
python setup.py sdist bdist_wheel
twine upload --repository pypi dist/*
32 changes: 32 additions & 0 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
name: Test Pypi package

on:
push:
branches:
- main
paths-ignore:
- README.md
pull_request:
branches:
- main
paths-ignore:
- README.md

jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
python: [3.10.11]

steps:
- uses: actions/checkout@v2
- name: Setup Python
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python }}
- name: Install Tox and any other packages
run: pip install tox
- name: Run Tox
# Run tox using the version of Python in `PATH`
run: tox -e py
579 changes: 579 additions & 0 deletions Burrows Delta Walkthrough.ipynb

Large diffs are not rendered by default.

25 changes: 25 additions & 0 deletions LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
MIT License

Copyright (c) 2023 Fast Data Science Ltd (https://fastdatascience.com)

Maintainer: Thomas Wood

Tutorial at https://fastdatascience.com/faststylometry-python-library/

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
4 changes: 4 additions & 0 deletions MANIFEST.in
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
include pyproject.toml
include *.md
include LICENSE
recursive-include tests test*.py
186 changes: 186 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,186 @@
# Fast Stylometry Python library

<!-- badges: start -->
![my badge](https://badgen.net/badge/Status/In%20Development/orange)

[![PyPI package](https://img.shields.io/badge/pip%20install-faststylometry-brightgreen)](https://pypi.org/project/faststylometry/) [![version number](https://img.shields.io/pypi/v/faststylometry?color=green&label=version)](https://github.com/fastdatascience/faststylometry/releases) [![License](https://img.shields.io/github/license/fastdatascience/faststylometry)](https://github.com/fastdatascience/faststylometry/blob/main/LICENSE)

<!-- badges: end -->

# Fast Stylometry - Burrows Delta

Developed by Fast Data Science, https://fastdatascience.com

Source code at https://github.com/fastdatascience/faststylometry

Tutorial at https://fastdatascience.com/fast-stylometry-python-library/

This is a lightweight Python library for finding drug names in a string, otherwise known as named entity recognition (NER) and named entity linking.

Python library for calculating the Burrows Delta.

Burrows' Delta is an algorithm for comparing the similarity of the writing styles of documents, known as [forensic stylometry](https://fastdatascience.com/how-you-can-identify-the-author-of-a-document/).

* [A useful explanation of the maths and thinking behind Burrows' Delta and how it works](https://programminghistorian.org/en/lessons/introduction-to-stylometry-with-python#third-stylometric-test-john-burrows-delta-method-advanced)


# Installing Fast Stylometry Python package

You can install from [PyPI](https://pypi.org/project/faststylometry).

```
pip install faststylometry
```



# Usage examples

Demonstration of Burrows' Delta on a small corpus downloaded from Project Gutenberg.

We will test the Burrows' Delta code on two "unknown" texts: Sense and Sensibility by Jane Austen, and Villette by Charlotte Bronte. Both authors are in our training corpus.

You can get the training corpus by cloning https://github.com/woodthom2/faststylometry, the data is in faststylometry/data.

## Create a corpus

To create a corpus and add books, the pattern is as follows:

```
corpus = Corpus()
corpus.add_book("Jane Austen", "Pride and Prejudice", [whole book text])
```

Here is the pattern for creating a corpus and adding books from a directory on your system. You can also use the method ```util.load_corpus_from_folder(folder, pattern)```.

```
import os
import re
from faststylometry.corpus import Corpus
corpus = Corpus()
for root, _, files in os.walk(folder):
for filename in files:
if filename.endswith(".txt") and "_" in filename:
with open(os.path.join(root, filename), "r", encoding="utf-8") as f:
text = f.read()
author, book = re.split("_-_", re.sub(r'\.txt', '', filename))
corpus.add_book(author, book, text)
```

## Example 1

Load a corpus and calculate Burrows' Delta

```
from faststylometry.util import load_corpus_from_folder
from faststylometry.en import tokenise_remove_pronouns_en
from faststylometry.burrows_delta import calculate_burrows_delta
train_corpus = load_corpus_from_folder("faststylometry/data/train")
train_corpus.tokenise(tokenise_remove_pronouns_en)
test_corpus_sense_and_sensibility = load_corpus_from_folder("faststylometry/data/test", pattern="sense")
test_corpus_sense_and_sensibility.tokenise(tokenise_remove_pronouns_en)
calculate_burrows_delta(train_corpus, test_corpus_sense_and_sensibility)
```

returns a Pandas dataframe of Burrows' Delta scores

## Example 2

Using the probability calibration functionality, you can calculate the probability of two books being by the same author.

```
from faststylometry.probability import predict_proba, calibrate
calibrate(train_corpus)
predict_proba(train_corpus, test_corpus_sense_and_sensibility)
```

outputs a Pandas dataframe of probabilities.

# Who to contact

Thomas Wood at [Fast Data Science](https://fastdatascience.com)

## Contributing to the project

If you'd like to contribute to this project, you can contact us at https://fastdatascience.com/ or make a pull request on our [Github repository](https://github.com/fastdatascience/faststylometry). You can also [raise an issue](https://github.com/fastdatascience/faststylometry/issues).

## Developing the library

### Automated tests

Test code is in **tests/** folder using [unittest](https://docs.python.org/3/library/unittest.html).

The testing tool `tox` is used in the automation with GitHub Actions CI/CD.

### Use tox locally

Install tox and run it:

```
pip install tox
tox
```

In our configuration, tox runs a check of source distribution using [check-manifest](https://pypi.org/project/check-manifest/) (which requires your repo to be git-initialized (`git init`) and added (`git add .`) at least), setuptools's check, and unit tests using pytest. You don't need to install check-manifest and pytest though, tox will install them in a separate environment.

The automated tests are run against several Python versions, but on your machine, you might be using only one version of Python, if that is Python 3.9, then run:

```
tox -e py39
```

Thanks to GitHub Actions' automated process, you don't need to generate distribution files locally. But if you insist, click to read the "Generate distribution files" section.

### Continuous integration/deployment to PyPI

This package is based on the template https://pypi.org/project/example-pypi-package/

This package

- uses GitHub Actions for both testing and publishing
- is tested when pushing `master` or `main` branch, and is published when create a release
- includes test files in the source distribution
- uses **setup.cfg** for [version single-sourcing](https://packaging.python.org/guides/single-sourcing-package-version/) (setuptools 46.4.0+)

## Re-releasing the package manually

The code to re-release Harmony on PyPI is as follows:

```
source activate py311
pip install twine
rm -rf dist
python setup.py sdist
twine upload dist/*
```

## Who worked on the Fast Stylometry library?

The tool was developed:

* Thomas Wood ([Fast Data Science](https://fastdatascience.com))

## License of Fast Stylometry library

MIT License. Copyright (c) 2023 [Fast Data Science](https://fastdatascience.com)

## Citing the Fast Stylometry library

Wood, T.A., Fast Stylometry [Computer software], Version 1.0.0, accessed at [https://fastdatascience.com/fast-stylometry-python-library](https://fastdatascience.com/fast-stylometry-python-library), Fast Data Science Ltd (2023)

```
@unpublished{faststylometry,
AUTHOR = {Wood, T.A.},
TITLE = {Fast Stylometry (Computer software), Version 1.0.0},
YEAR = {2023},
Note = {To appear},
}
```
Loading

0 comments on commit 8e22d65

Please sign in to comment.