1. Add latex_ocr formula recognition to the ppstructure pipeline; 2. Add PDF-to-Markdown conversion #13868
Open: ztyf-lq wants to merge 4 commits into PaddlePaddle:main from ztyf-lq:new_branch
Thank you for the contribution.
@liuhongen1234567, could you please review this PR?
GreatV reviewed on Sep 13, 2024:
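I suggest updating the documentation to explain how to use this. Because our documentation site is still being migrated, two places need to be updated: ppstructure and docs.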
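Hello, I do plan to update the documentation. I may be using ppocr to reproduce other projects in the near term, so the documentation update will arrive in October at the latest.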
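Preface
Dear ppocr maintainers, I am a ppocr user. I rely on ppocr in my daily work and study, and I am impressed by how capable it is. Contributing to ppocr is something I have long wanted to do. I would be grateful if you could take the time to check whether the code I have written is something ppocr needs.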
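The changes are as follows:
1. Add latex_ocr formula recognition to the ppstructure pipeline:
   a. Modify the StructureSystem class in ppstructure/predict_system.py to add the latex_ocr model and handle regions whose layout type is formula;
   b. Because docx does not support inserting LaTeX formulas, skip LaTeX formulas in the convert_info_docx function in ppstructure/recovery/recovery_to_doc.py;
   c. In the draw_structure_result function in ppstructure/utility.py, skip LaTeX formulas when visualizing OCR results.
2. Add PDF-to-Markdown conversion:
   a. Add the file recovery_to_markdown.py under the ppstructure/recovery directory; it converts ppstructure recognition results into a Markdown file. For text regions, two ways of splitting paragraphs are currently provided. The first treats two leading spaces at the start of a line as the marker for a new paragraph. The second handles paragraphs that have no leading spaces; it relies on the fact that the last line of a paragraph is usually not a "full" line and leaves empty space at its end;
   b. Call the function that converts ppstructure recognition results to a Markdown file from ppstructure/predict_system.py.
3. New parameters and options:
   a. Add the parameters required by the latex_ocr formula recognition model;
   b. Add a recovery_to_markdown option to enable or disable converting ppstructure recognition results to a Markdown file;
   c. Add a formula option to enable or disable LaTeX formula recognition.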
If my code turns out to be what ppocr needs, I will follow up on the maintainers' suggestions and add a PDF-to-Markdown tutorial to the layout recovery documentation.
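For readers of such a tutorial, the two paragraph-splitting approaches for text regions could be sketched roughly as below. This is an illustration under stated assumptions, not the contents of recovery_to_markdown.py: the `Line` type, the function names, the tolerance parameters, and the use of bounding-box edges to detect indentation and short lines are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    text: str
    x0: float  # left edge of the line's bounding box
    x1: float  # right edge of the line's bounding box


def split_by_indent(lines: List[Line], region_x0: float, tol: float) -> List[str]:
    """Method 1: a line indented past the region's left edge starts a paragraph."""
    paragraphs: List[str] = []
    for line in lines:
        indented = (line.x0 - region_x0) > tol
        if indented or not paragraphs:
            paragraphs.append(line.text.strip())
        else:
            paragraphs[-1] += " " + line.text.strip()
    return paragraphs


def split_by_short_line(lines: List[Line], region_x1: float, tol: float) -> List[str]:
    """Method 2: a line that stops short of the right edge ends its paragraph."""
    paragraphs: List[str] = []
    current: List[str] = []
    for line in lines:
        current.append(line.text.strip())
        if (region_x1 - line.x1) > tol:
            paragraphs.append(" ".join(current))
            current = []
    if current:  # trailing lines that never reached a short line
        paragraphs.append(" ".join(current))
    return paragraphs
```

Lines are joined with a space here, which suits Latin-script text; for Chinese text an empty separator would be the natural choice.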