update common pre-commit configs #12516

Merged
2 changes: 1 addition & 1 deletion .gitignore
@@ -31,4 +31,4 @@ paddleocr.egg-info/
/deploy/android_demo/app/.cxx/
/deploy/android_demo/app/cache/
test_tipc/web/models/
test_tipc/web/node_modules/
test_tipc/web/node_modules/
21 changes: 8 additions & 13 deletions .pre-commit-config.yaml
@@ -1,26 +1,22 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: a11d9314b22d8f8c7556443875b731ef05965464
rev: v4.6.0
hooks:
- id: check-added-large-files
args: ['--maxkb=512']
Collaborator:

I'm concerned that 512 KB won't be enough for our use, since some previously added files are larger than 512 KB.

Collaborator (author):

This check applies only to locally staged files, as documented here: https://github.com/pre-commit/pre-commit-hooks?tab=readme-ov-file#check-added-large-files. That means the current .github/workflows/codestyle.yml will not catch large files being added.

Another option: raise the limit to 1024 KB, check all files on each pull_request with --enforce-all, and exclude the existing large files. There are 28 files larger than 1 MB in the repo.
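
The second option depends on knowing which existing files exceed the limit, so that they can be listed under `exclude`. A minimal sketch for finding them, assuming a plain directory walk is close enough to what the hook would see; the helper name `find_large_files` and its parameters are hypothetical and not part of pre-commit-hooks:

```python
import os
import re


def find_large_files(root, max_kb=1024, exclude=None):
    """Return sorted (relative_path, size_kb) pairs for files larger than max_kb.

    `exclude` is an optional regex; relative paths it matches are skipped,
    which mimics listing the existing large files under an `exclude:` key.
    """
    exclude_re = re.compile(exclude) if exclude else None
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git metadata.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # a symlink may be dangling; skip it
            rel = os.path.relpath(path, root).replace(os.sep, "/")
            if exclude_re and exclude_re.search(rel):
                continue
            size_kb = os.path.getsize(path) / 1024
            if size_kb > max_kb:
                results.append((rel, size_kb))
    return sorted(results)
```

Calling `find_large_files(".", max_kb=1024)` from the repository root lists the candidates. Note that it also sees untracked files, which the hook itself would not.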

- id: check-case-conflict
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
files: (?!.*paddle)^.*$
- id: end-of-file-fixer
files: \.md$
- id: trailing-whitespace
files: \.md$
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|py|md)$
- repo: https://github.com/Lucas-C/pre-commit-hooks
rev: v1.0.1
rev: v1.5.1
hooks:
- id: forbid-crlf
files: \.md$
- id: remove-crlf
files: \.md$
- id: forbid-tabs
files: \.md$
- id: remove-tabs
files: \.md$
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|py|md)$
- repo: local
hooks:
- id: clang-format
@@ -31,7 +27,7 @@ repos:
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|cuh|proto)$
# For Python files
- repo: https://github.com/psf/black.git
@SWHL (Collaborator), May 27, 2024:

I'm just curious: why is Flake8 used? Why not use Black alone?

Collaborator:

@SWHL These folders are excluded because they contain numerous bugs that are too time-consuming to fix and will be addressed at a later date.

rev: 23.3.0
rev: 24.4.2
hooks:
- id: black
files: (.*\.(py|pyi|bzl)|BUILD|.*\.BUILD|WORKSPACE)$
@@ -47,4 +43,3 @@ repos:
- --show-source
- --statistics
exclude: ^benchmark/|^test_tipc/
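
To see what the `files` and `exclude` patterns in this config select, here is a small sketch. It assumes pre-commit applies both patterns with `re.search` against the repository-relative path, and that a hook with no `files` key matches everything. The function `is_selected` and the sample paths are illustrative only.

```python
import re

# Patterns as they appear in the .pre-commit-config.yaml diff above.
SOURCE_AND_MD = r"\.(c|cc|cxx|cpp|cu|h|hpp|hxx|py|md)$"
EXCLUDE = r"^benchmark/|^test_tipc/"


def is_selected(path, files=r"", exclude=r"^$"):
    """Approximate the hook filter: `files` must match and `exclude` must not."""
    return bool(re.search(files, path)) and not re.search(exclude, path)


# Hypothetical paths, chosen only to exercise the patterns.
for path in ["ppocr/utils/utility.py", "README.md", "configs/det/det.yml"]:
    print(path, is_selected(path, files=SOURCE_AND_MD))

for path in ["benchmark/run.py", "test_tipc/web/index.py", "tools/train.py"]:
    print(path, is_selected(path, exclude=EXCLUDE))
```

Because `exclude` is anchored with `^`, it only skips the top-level `benchmark/` and `test_tipc/` directories, not nested directories that happen to share those names.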

2 changes: 1 addition & 1 deletion MANIFEST.in
@@ -7,4 +7,4 @@ recursive-include ppocr/postprocess *.py
recursive-include tools/infer *.py
recursive-include tools __init__.py
recursive-include ppocr/utils/e2e_utils *.py
recursive-include ppstructure *.py
recursive-include ppstructure *.py
4 changes: 2 additions & 2 deletions README.md
@@ -207,12 +207,12 @@ PaddleOCR is being oversight by a [PMC](https://github.com/PaddlePaddle/PaddleOC
<details open>
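<summary>PP-Structure document analysis</summary>

- Layout analysis + table recognition
- Layout analysis + table recognition
<div align="center">
<img src="./ppstructure/docs/table/ppstructure.GIF" width="800">
</div>

- SER (Semantic Entity Recognition)
- SER (Semantic Entity Recognition)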
<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185310636-6ce02f7c-790d-479f-b163-ea97a5a04808.jpg" width="600">
</div>
10 changes: 5 additions & 5 deletions README_en.md
@@ -119,11 +119,11 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel
- [Mobile](./deploy/lite/readme.md)
- [Paddle2ONNX](./deploy/paddle2onnx/readme.md)
- [PaddleCloud](./deploy/paddlecloud/README.md)
- [Benchmark](./doc/doc_en/benchmark_en.md)
- [Benchmark](./doc/doc_en/benchmark_en.md)
- [PP-Structure 🔥](./ppstructure/README.md)
- [Quick Start](./ppstructure/docs/quickstart_en.md)
- [Model Zoo](./ppstructure/docs/models_list_en.md)
- [Model training](./doc/doc_en/training_en.md)
- [Model training](./doc/doc_en/training_en.md)
- [Layout Analysis](./ppstructure/layout/README.md)
- [Table Recognition](./ppstructure/table/README.md)
- [Key Information Extraction](./ppstructure/kie/README.md)
@@ -136,7 +136,7 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel
- [Text recognition](./doc/doc_en/algorithm_overview_en.md)
- [End-to-end OCR](./doc/doc_en/algorithm_overview_en.md)
- [Table Recognition](./doc/doc_en/algorithm_overview_en.md)
- [Key Information Extraction](./doc/doc_en/algorithm_overview_en.md)
- [Key Information Extraction](./doc/doc_en/algorithm_overview_en.md)
- [Add New Algorithms to PaddleOCR](./doc/doc_en/add_new_algorithm_en.md)
- Data Annotation and Synthesis
- [Semi-automatic Annotation Tool: PPOCRLabel](https://github.com/PFCCLab/PPOCRLabel/blob/main/README.md)
@@ -188,7 +188,7 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel
<details open>
<summary>PP-StructureV2</summary>

- layout analysis + table recognition
- layout analysis + table recognition
<div align="center">
<img src="./ppstructure/docs/table/ppstructure.GIF" width="800">
</div>
@@ -209,7 +209,7 @@ PaddleOCR support a variety of cutting-edge algorithms related to OCR, and devel
- RE (Relation Extraction)
<div align="center">
<img src="https://user-images.githubusercontent.com/25809855/186094813-3a8e16cc-42e5-4982-b9f4-0134dfb5688d.png" width="600">
</div>
</div>

<div align="center">
<img src="https://user-images.githubusercontent.com/14270174/185393805-c67ff571-cf7e-4217-a4b0-8b396c4f22bb.jpg" width="600">
2 changes: 1 addition & 1 deletion applications/PCB字符识别/PCB字符识别.md
@@ -546,7 +546,7 @@ python3 tools/infer/predict_system.py \
--use_gpu=True
```

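The results are saved as follows: the text detection and recognition visualizations go to the `det_rec_infer/` directory, and the predictions go to `det_rec_infer/system_results.txt` in this format: `0018.jpg [{"transcription": "E295", "points": [[88, 33], [137, 33], [137, 40], [88, 40]]}]`
The results are saved as follows: the text detection and recognition visualizations go to the `det_rec_infer/` directory, and the predictions go to `det_rec_infer/system_results.txt` in this format: `0018.jpg [{"transcription": "E295", "points": [[88, 33], [137, 33], [137, 40], [88, 40]]}]`

2) Then convert the data saved in step one into the format required for end-to-end evaluation: edit `tools/end2end/convert_ppocr_label.py` and, in the `convert_label` function, set the input label path, the Mode, and the output label path, so that both the ground-truth (GT) labels and the predicted labels are converted.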
```
2 changes: 1 addition & 1 deletion applications/PCB字符识别/gen_data/corpus/text.txt
@@ -27,4 +27,4 @@ K06
KIEY
NZQJ
UN1B
6X4
6X4
4 changes: 2 additions & 2 deletions applications/中文表格识别.md
@@ -456,7 +456,7 @@ display(HTML('<html><body><table><tr><td colspan="5">alleadersh</td><td rowspan=

The prediction results are as follows:
```
val_9.jpg: {'attributes': ['Scanned', 'Little', 'Black-and-White', 'Clear', 'Without-Obstacles', 'Horizontal'], 'output': [1, 1, 1, 1, 1, 1]}
val_9.jpg: {'attributes': ['Scanned', 'Little', 'Black-and-White', 'Clear', 'Without-Obstacles', 'Horizontal'], 'output': [1, 1, 1, 1, 1, 1]}
```


@@ -466,7 +466,7 @@ val_9.jpg: {'attributes': ['Scanned', 'Little', 'Black-and-White', 'Clear', 'Wi

The prediction results are as follows:
```
val_3253.jpg: {'attributes': ['Photo', 'Little', 'Black-and-White', 'Blurry', 'Without-Obstacles', 'Tilted'], 'output': [0, 1, 1, 0, 1, 0]}
val_3253.jpg: {'attributes': ['Photo', 'Little', 'Black-and-White', 'Blurry', 'Without-Obstacles', 'Tilted'], 'output': [0, 1, 1, 0, 1, 0]}
```

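Comparing the two images, the first is fairly clear and its table attributes indicate a table that is easy to recognize, so its table recognition result can be trusted more. The second is blurry and tilted, so its table recognition result may contain errors and needs further manual verification. Table attribute recognition therefore helps combine "manual" and "intelligent" processing, safeguarding the accuracy of table recognition in production.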
@@ -434,16 +434,16 @@ python3 -m paddle.distributed.launch --gpus '0' tools/eval.py -c configs/rec/PP-

```
output/rec/
├── best_accuracy.pdopt
├── best_accuracy.pdparams
├── best_accuracy.states
├── config.yml
├── iter_epoch_3.pdopt
├── iter_epoch_3.pdparams
├── iter_epoch_3.states
├── latest.pdopt
├── latest.pdparams
├── latest.states
├── best_accuracy.pdopt
├── best_accuracy.pdparams
├── best_accuracy.states
├── config.yml
├── iter_epoch_3.pdopt
├── iter_epoch_3.pdparams
├── iter_epoch_3.states
├── latest.pdopt
├── latest.pdparams
├── latest.states
└── train.log
```

2 changes: 1 addition & 1 deletion applications/包装生产日期识别.md
@@ -243,7 +243,7 @@ def get_cropus(f):
elif 0.7 < rad < 0.8:
f.write('20{:02d}-{:02d}-{:02d}'.format(year, month, day))
elif 0.8 < rad < 0.9:
f.write('20{:02d}.{:02d}.{:02d}'.format(year, month, day))
f.write('20{:02d}.{:02d}.{:02d}'.format(year, month, day))
else:
f.write('{:02d}:{:02d}:{:02d} {:02d}'.format(hours, minute, second, file_id2))

6 changes: 3 additions & 3 deletions applications/印章弯曲文字识别.md
@@ -409,7 +409,7 @@ def crop_seal_from_img(label_file, data_dir, save_dir, save_gt_path):
if __name__ == "__main__":
if __name__ == "__main__":
# Data processing
gen_extract_label("./seal_labeled_datas", "./seal_labeled_datas/Label.txt", "./seal_ppocr_gt/seal_det_img.txt", "./seal_ppocr_gt/seal_ppocr_img.txt")
@@ -523,7 +523,7 @@ def gen_xml_label(mode='train'):
xml_file = open(("./seal_VOC/Annotations" + '/' + i_name + '.xml'), 'w')
xml_file.write('<annotation>\n')
xml_file.write(' <folder>seal_VOC</folder>\n')
xml_file.write(' <filename>' + str(img_name) + '</filename>\n')
xml_file.write(' <filename>' + str(img_name) + '</filename>\n')
xml_file.write(' <path>' + 'Annotations/' + str(img_name) + '</path>\n')
xml_file.write(' <size>\n')
xml_file.write(' <width>' + str(width) + '</width>\n')
@@ -553,7 +553,7 @@ def gen_xml_label(mode='train'):
xml_file.write(' <ymax>'+str(ymax)+'</ymax>\n')
xml_file.write(' </bndbox>\n')
xml_file.write(' </object>\n')
xml_file.write('</annotation>')
xml_file.write('</annotation>')
xml_file.close()
print(f'{mode} xml save done!')
14 changes: 7 additions & 7 deletions applications/多模态表单识别.md
@@ -110,12 +110,12 @@ tar -xf XFUND.tar

```bash
/home/aistudio/PaddleOCR/ppstructure/vqa/XFUND
└─ zh_train/ training set
├── image/ image folder
├── xfun_normalize_train.json annotation data
└─ zh_val/ validation set
├── image/ image folder
├── xfun_normalize_val.json annotation data
└─ zh_train/ training set
├── image/ image folder
├── xfun_normalize_train.json annotation data
└─ zh_val/ validation set
├── image/ image folder
├── xfun_normalize_val.json annotation data

```

@@ -805,7 +805,7 @@ CUDA_VISIBLE_DEVICES=0 python3 tools/infer_vqa_token_ser_re.py \
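The visualized prediction images and a prediction text file are saved in the directory configured by the config.Global.save_res_path field. The text file is named infer_results.txt. Each line holds the result for one image, in the format shown below: the test image path comes first, followed by the test result, that is, the key fields and their corresponding value fields.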

```
test_imgs/t131.jpg {"政治面税": "群众", "性别": "男", "籍贯": "河北省邯郸市", "婚姻状况": "亏末婚口已婚口已娇", "通讯地址": "邯郸市阳光苑7号楼003", "民族": "汉族", "毕业院校": "河南工业大学", "户口性质": "口农村城镇", "户口地址": "河北省邯郸市", "联系电话": "13288888888", "健康状况": "健康", "姓名": "小六", "好高cm": "180", "出生年月": "1996年8月9日", "文化程度": "本科", "身份证号码": "458933777777777777"}
test_imgs/t131.jpg {"政治面税": "群众", "性别": "男", "籍贯": "河北省邯郸市", "婚姻状况": "亏末婚口已婚口已娇", "通讯地址": "邯郸市阳光苑7号楼003", "民族": "汉族", "毕业院校": "河南工业大学", "户口性质": "口农村城镇", "户口地址": "河北省邯郸市", "联系电话": "13288888888", "健康状况": "健康", "姓名": "小六", "好高cm": "180", "出生年月": "1996年8月9日", "文化程度": "本科", "身份证号码": "458933777777777777"}
````
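
A minimal sketch for reading such a result line back into Python. It assumes the image path contains no whitespace and is separated from the JSON object by whitespace, as in the sample line above. The helper name `parse_result_line` is hypothetical, and the sample is a shortened copy of that line.

```python
import json


def parse_result_line(line):
    """Split one result line into (image_path, fields).

    Assumes "<path><whitespace><JSON object>", as in the sample above.
    """
    image_path, payload = line.strip().split(maxsplit=1)
    return image_path, json.loads(payload)


# Shortened copy of the sample line, keeping three of its key/value pairs.
sample = 'test_imgs/t131.jpg {"性别": "男", "民族": "汉族", "联系电话": "13288888888"}'
path, fields = parse_result_line(sample)
print(path, fields["联系电话"])
```

Splitting only on the first run of whitespace keeps any spaces inside the JSON values intact.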

Display the prediction results