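Nougat: Intelligent Parsing of Academic PDF Documents with Optical Neural Networks

This is the official repository for Nougat, an academic document PDF parser that understands LaTeX math and tables.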

Project page: https://facebookresearch.github.io/nougat/

1. Installation

From pip:

pip install nougat-ocr

From repository:

pip install git+https://github.com/facebookresearch/nougat

Note for Windows: if you want to use a GPU, first install the correct PyTorch version by following the official PyTorch installation instructions.

Calling the model through an API or generating a dataset requires extra dependencies.
Install them with

pip install "nougat-ocr[api]" or pip install "nougat-ocr[dataset]"

1.2 Get predictions for a PDF

1.2.1 CLI

To get predictions for a PDF run

$ nougat path/to/file.pdf -o output_directory

A path to a directory, or to a file in which each line is the path to a PDF, can also be passed as a positional argument

$ nougat path/to/directory -o output_directory

usage: nougat [-h] [--batchsize BATCHSIZE] [--checkpoint CHECKPOINT] [--model MODEL] [--out OUT]
              [--recompute] [--markdown] [--no-skipping] pdf [pdf ...]

positional arguments:
  pdf                   PDF(s) to process.

options:
  -h, --help            show this help message and exit
  --batchsize BATCHSIZE, -b BATCHSIZE
                        Batch size to use.
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Path to checkpoint directory.
  --model MODEL_TAG, -m MODEL_TAG
                        Model tag to use.
  --out OUT, -o OUT     Output directory.
  --recompute           Recompute already computed PDF, discarding previous predictions.
  --full-precision      Use float32 instead of bfloat16. Can speed up CPU conversion for some setups.
  --no-markdown         Do not add postprocessing step for markdown compatibility.
  --markdown            Add postprocessing step for markdown compatibility (default).
  --no-skipping         Don't apply failure detection heuristic.
  --pages PAGES, -p PAGES
                        Provide page numbers like '1-4,7' for pages 1 through 4 and page 7. Only works for single PDFs.
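
To make the --pages syntax concrete, here is a minimal Python sketch that expands a specification such as '1-4,7' into explicit page numbers. It only illustrates the format described in the help text and is not Nougat's own parser; the function name parse_pages is hypothetical.

```python
from typing import List, Set


def parse_pages(spec: str) -> List[int]:
    """Expand a page spec like '1-4,7' into sorted, unique page numbers.

    Hypothetical helper for illustration; Nougat's own parsing may differ.
    """
    pages: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            first_text, last_text = part.split("-", 1)
            first, last = int(first_text), int(last_text)
            if first > last:
                raise ValueError(f"invalid page range: {part!r}")
            pages.update(range(first, last + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


if __name__ == "__main__":
    print(parse_pages("1-4,7"))
```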

The default model tag is 0.1.0-small. If you want to use the base model, use 0.1.0-base.

$ nougat path/to/file.pdf -o output_directory -m 0.1.0-base
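
When several PDFs need converting from a script, the command above can be assembled and launched from Python. The sketch below uses only flags listed in the help text (-o, -m, -p, --no-skipping); the helper names build_nougat_command and run_nougat are hypothetical, and it assumes the nougat executable is on PATH.

```python
import subprocess
from typing import List, Optional


def build_nougat_command(
    pdf: str,
    out_dir: str,
    model: Optional[str] = None,
    pages: Optional[str] = None,
    no_skipping: bool = False,
) -> List[str]:
    """Assemble the argument list for one `nougat` call (hypothetical helper)."""
    cmd = ["nougat", pdf, "-o", out_dir]
    if model:
        cmd += ["-m", model]  # e.g. "0.1.0-base"
    if pages:
        cmd += ["-p", pages]  # e.g. "1-4,7"; only valid for a single PDF
    if no_skipping:
        cmd.append("--no-skipping")
    return cmd


def run_nougat(pdf: str, out_dir: str, **options) -> None:
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_nougat_command(pdf, out_dir, **options), check=True)


if __name__ == "__main__":
    # Prints the command instead of running it, so no model is downloaded.
    print(" ".join(build_nougat_command("path/to/file.pdf", "output_directory",
                                        model="0.1.0-base")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.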

In the output directory, each PDF is saved as a .mmd file, a lightweight markup format that is mostly compatible with Mathpix Markdown (Nougat makes use of LaTeX tables).

Note: On some devices the failure detection heuristic is not working properly. If you experience a lot of [MISSING_PAGE] responses, try to run with the --no-skipping flag. Related: #11, #67
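
To judge whether the heuristic is misfiring, it can help to count the missing-page markers in the generated .mmd files. The following is a rough sketch, not part of Nougat: the function names are hypothetical, and it matches on the prefix "[MISSING_PAGE" on the assumption that some versions may append extra detail to the marker.

```python
from pathlib import Path
from typing import Dict

# Prefix match so that suffixed variants of the marker are counted too (assumption).
MARKER_PREFIX = "[MISSING_PAGE"


def count_missing_pages(text: str) -> int:
    """Count missing-page markers in one document's markup text."""
    return text.count(MARKER_PREFIX)


def scan_output_dir(out_dir: str) -> Dict[str, int]:
    """Map each .mmd file name in out_dir to its number of missing-page markers."""
    return {
        path.name: count_missing_pages(path.read_text(encoding="utf-8"))
        for path in sorted(Path(out_dir).glob("*.mmd"))
    }


if __name__ == "__main__":
    for name, missing in scan_output_dir("output_directory").items():
        print(f"{name}: {missing} missing page(s)")
```

If most files report a high count, rerunning with --no-skipping is the first thing to try.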

1.2.2 API

With the extra dependencies installed, app.py is used to start an API. Call

$ nougat_api

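Send a POST request to http://127.0.0.1:8503/predict/ to get a prediction for a PDF file. The endpoint also accepts the parameters start and stop, which limit the computation to a selected page range (boundaries included).

The response is a string containing the markup text of the document.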

curl -X 'POST' \
  'http://127.0.0.1:8503/predict/' \
  -H 'accept: application/json' \
  -H 'Content-Type: multipart/form-data' \
  -F 'file=@<PDFFILE.pdf>;type=application/pdf'

To limit the conversion to pages 1 to 5, use the start/stop parameters in the request URL: http://127.0.0.1:8503/predict/?start=1&stop=5
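
The same request can be made from Python. The sketch below uses only the standard library and mirrors the curl call: a multipart upload with a form field named file, plus optional start and stop query parameters. The helper names are hypothetical, and the code assumes the server is running locally on port 8503 as described above.

```python
import os
import uuid
from typing import Optional, Tuple
from urllib.parse import urlencode
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:8503/predict/"


def predict_url(start: Optional[int] = None, stop: Optional[int] = None,
                base: str = BASE_URL) -> str:
    """Build the endpoint URL, adding start/stop only when they are given."""
    params = {k: v for k, v in (("start", start), ("stop", stop)) if v is not None}
    return base + ("?" + urlencode(params) if params else "")


def encode_pdf_upload(pdf_bytes: bytes, filename: str) -> Tuple[bytes, str]:
    """Encode one PDF as a multipart/form-data body with the field name 'file'."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + pdf_bytes + tail, f"multipart/form-data; boundary={boundary}"


def predict(pdf_path: str, start: Optional[int] = None,
            stop: Optional[int] = None) -> str:
    """POST a PDF to the local API and return the raw response text."""
    with open(pdf_path, "rb") as fh:
        body, content_type = encode_pdf_upload(fh.read(), os.path.basename(pdf_path))
    request = Request(
        predict_url(start, stop),
        data=body,
        method="POST",
        headers={"Content-Type": content_type, "accept": "application/json"},
    )
    with urlopen(request) as response:
        # The body may be a JSON-encoded string; decode with json.loads if so.
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(predict_url(1, 5))
```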

2. Dataset

2.1 Generate a dataset

To generate a dataset you need

  1. A directory containing the PDFs
  2. A directory containing the .html files (.tex files processed by LaTeXML) with the same folder structure
  3. A binary file of pdffigures2 and a corresponding environment variable export PDFFIGURES_PATH="/path/to/binary.jar"

Next run

python -m nougat.dataset.split_htmls_to_pages --html path/html/root --pdfs path/pdf/root --out path/paired/output --figure path/pdffigures/outputs

Additional arguments include

Argument             Description
--recompute          recompute all splits
--markdown MARKDOWN  Markdown output dir
--workers WORKERS    How many processes to use
--dpi DPI            What resolution the pages will be saved at
--timeout TIMEOUT    max time per paper in seconds
--tesseract          Tesseract OCR prediction for each page

Finally, create a jsonl file that contains all the image paths, markdown text and meta information.

python -m nougat.dataset.create_index --dir path/paired/output --out index.jsonl

For each jsonl file you also need to generate a seek map for faster data loading:

python -m nougat.dataset.gen_seek file.jsonl
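
A seek map speeds up loading because a reader can jump straight to a given record instead of scanning the jsonl file line by line. The sketch below illustrates the general idea with byte offsets. It is not Nougat's gen_seek implementation and makes no claim about the on-disk format of the .seek.map files; the function names are hypothetical.

```python
import json
from typing import Any, List


def build_offsets(jsonl_path: str) -> List[int]:
    """Return the byte offset at which each line of a jsonl file starts."""
    offsets: List[int] = []
    position = 0
    with open(jsonl_path, "rb") as fh:
        for line in fh:
            offsets.append(position)
            position += len(line)  # binary mode, so len() is the byte length
    return offsets


def read_record(jsonl_path: str, offsets: List[int], index: int) -> Any:
    """Load a single record by seeking directly to the start of its line."""
    with open(jsonl_path, "rb") as fh:
        fh.seek(offsets[index])
        return json.loads(fh.readline().decode("utf-8"))


if __name__ == "__main__":
    offsets = build_offsets("index.jsonl")
    print(len(offsets), "records indexed")
```

Computing the offsets once and storing them means random access during training costs one seek and one line read per sample.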

The resulting directory structure can look as follows:

root/
├── images
├── train.jsonl
├── train.seek.map
├── test.jsonl
├── test.seek.map
├── validation.jsonl
└── validation.seek.map

Note that the .mmd and .json files in path/paired/output (here images) are no longer required.
Removing them can be useful when pushing to an S3 bucket, since it halves the number of files.

2.2 Training

To train or fine-tune a Nougat model, run

python train.py --config config/train_nougat.yaml

2.3 Evaluation

Run

python test.py --checkpoint path/to/checkpoint --dataset path/to/test.jsonl --save_path path/to/results.json

To get the results for the different text modalities, run

python -m nougat.metrics path/to/results.json

2.4 FAQ

  • Why am I only getting [MISSING_PAGE]?

    Nougat was trained on scientific papers found on arXiv and PMC. Is the document you’re processing similar to that?
    What language is the document in? Nougat works best with English papers; other Latin-based languages might work. Chinese, Russian, Japanese, etc. will not work.
    If these requirements are fulfilled it might be because of false positives in the failure detection, when computing on CPU or older GPUs (#11). Try passing the --no-skipping flag for now.

  • Where can I download the model checkpoint from?

    They are uploaded on GitHub in the release section. You can also download them during the first execution of the program. Choose the preferred model by passing --model 0.1.0-{base,small}

Reference:
https://github.com/facebookresearch/nougat
