- Metadata-Version: 2.4
- Name: olmocr
- Version: 0.1.67
- Author-email: Allen Institute for Artificial Intelligence <jakep@allenai.org>
- License: Apache License
- Version 2.0, January 2004
- https://www.apache.org/licenses/
-
- TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
-
- 1. Definitions.
-
- "License" shall mean the terms and conditions for use, reproduction,
- and distribution as defined by Sections 1 through 9 of this document.
-
- "Licensor" shall mean the copyright owner or entity authorized by
- the copyright owner that is granting the License.
-
- "Legal Entity" shall mean the union of the acting entity and all
- other entities that control, are controlled by, or are under common
- control with that entity. For the purposes of this definition,
- "control" means (i) the power, direct or indirect, to cause the
- direction or management of such entity, whether by contract or
- otherwise, or (ii) ownership of fifty percent (50%) or more of the
- outstanding shares, or (iii) beneficial ownership of such entity.
-
- "You" (or "Your") shall mean an individual or Legal Entity
- exercising permissions granted by this License.
-
- "Source" form shall mean the preferred form for making modifications,
- including but not limited to software source code, documentation
- source, and configuration files.
-
- "Object" form shall mean any form resulting from mechanical
- transformation or translation of a Source form, including but
- not limited to compiled object code, generated documentation,
- and conversions to other media types.
-
- "Work" shall mean the work of authorship, whether in Source or
- Object form, made available under the License, as indicated by a
- copyright notice that is included in or attached to the work
- (an example is provided in the Appendix below).
-
- "Derivative Works" shall mean any work, whether in Source or Object
- form, that is based on (or derived from) the Work and for which the
- editorial revisions, annotations, elaborations, or other modifications
- represent, as a whole, an original work of authorship. For the purposes
- of this License, Derivative Works shall not include works that remain
- separable from, or merely link (or bind by name) to the interfaces of,
- the Work and Derivative Works thereof.
-
- "Contribution" shall mean any work of authorship, including
- the original version of the Work and any modifications or additions
- to that Work or Derivative Works thereof, that is intentionally
- submitted to Licensor for inclusion in the Work by the copyright owner
- or by an individual or Legal Entity authorized to submit on behalf of
- the copyright owner. For the purposes of this definition, "submitted"
- means any form of electronic, verbal, or written communication sent
- to the Licensor or its representatives, including but not limited to
- communication on electronic mailing lists, source code control systems,
- and issue tracking systems that are managed by, or on behalf of, the
- Licensor for the purpose of discussing and improving the Work, but
- excluding communication that is conspicuously marked or otherwise
- designated in writing by the copyright owner as "Not a Contribution."
-
- "Contributor" shall mean Licensor and any individual or Legal Entity
- on behalf of whom a Contribution has been received by Licensor and
- subsequently incorporated within the Work.
-
- 2. Grant of Copyright License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- copyright license to reproduce, prepare Derivative Works of,
- publicly display, publicly perform, sublicense, and distribute the
- Work and such Derivative Works in Source or Object form.
-
- 3. Grant of Patent License. Subject to the terms and conditions of
- this License, each Contributor hereby grants to You a perpetual,
- worldwide, non-exclusive, no-charge, royalty-free, irrevocable
- (except as stated in this section) patent license to make, have made,
- use, offer to sell, sell, import, and otherwise transfer the Work,
- where such license applies only to those patent claims licensable
- by such Contributor that are necessarily infringed by their
- Contribution(s) alone or by combination of their Contribution(s)
- with the Work to which such Contribution(s) was submitted. If You
- institute patent litigation against any entity (including a
- cross-claim or counterclaim in a lawsuit) alleging that the Work
- or a Contribution incorporated within the Work constitutes direct
- or contributory patent infringement, then any patent licenses
- granted to You under this License for that Work shall terminate
- as of the date such litigation is filed.
-
- 4. Redistribution. You may reproduce and distribute copies of the
- Work or Derivative Works thereof in any medium, with or without
- modifications, and in Source or Object form, provided that You
- meet the following conditions:
-
- (a) You must give any other recipients of the Work or
- Derivative Works a copy of this License; and
-
- (b) You must cause any modified files to carry prominent notices
- stating that You changed the files; and
-
- (c) You must retain, in the Source form of any Derivative Works
- that You distribute, all copyright, patent, trademark, and
- attribution notices from the Source form of the Work,
- excluding those notices that do not pertain to any part of
- the Derivative Works; and
-
- (d) If the Work includes a "NOTICE" text file as part of its
- distribution, then any Derivative Works that You distribute must
- include a readable copy of the attribution notices contained
- within such NOTICE file, excluding those notices that do not
- pertain to any part of the Derivative Works, in at least one
- of the following places: within a NOTICE text file distributed
- as part of the Derivative Works; within the Source form or
- documentation, if provided along with the Derivative Works; or,
- within a display generated by the Derivative Works, if and
- wherever such third-party notices normally appear. The contents
- of the NOTICE file are for informational purposes only and
- do not modify the License. You may add Your own attribution
- notices within Derivative Works that You distribute, alongside
- or as an addendum to the NOTICE text from the Work, provided
- that such additional attribution notices cannot be construed
- as modifying the License.
-
- You may add Your own copyright statement to Your modifications and
- may provide additional or different license terms and conditions
- for use, reproduction, or distribution of Your modifications, or
- for any such Derivative Works as a whole, provided Your use,
- reproduction, and distribution of the Work otherwise complies with
- the conditions stated in this License.
-
- 5. Submission of Contributions. Unless You explicitly state otherwise,
- any Contribution intentionally submitted for inclusion in the Work
- by You to the Licensor shall be under the terms and conditions of
- this License, without any additional terms or conditions.
- Notwithstanding the above, nothing herein shall supersede or modify
- the terms of any separate license agreement you may have executed
- with Licensor regarding such Contributions.
-
- 6. Trademarks. This License does not grant permission to use the trade
- names, trademarks, service marks, or product names of the Licensor,
- except as required for reasonable and customary use in describing the
- origin of the Work and reproducing the content of the NOTICE file.
-
- 7. Disclaimer of Warranty. Unless required by applicable law or
- agreed to in writing, Licensor provides the Work (and each
- Contributor provides its Contributions) on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
- implied, including, without limitation, any warranties or conditions
- of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
- PARTICULAR PURPOSE. You are solely responsible for determining the
- appropriateness of using or redistributing the Work and assume any
- risks associated with Your exercise of permissions under this License.
-
- 8. Limitation of Liability. In no event and under no legal theory,
- whether in tort (including negligence), contract, or otherwise,
- unless required by applicable law (such as deliberate and grossly
- negligent acts) or agreed to in writing, shall any Contributor be
- liable to You for damages, including any direct, indirect, special,
- incidental, or consequential damages of any character arising as a
- result of this License or out of the use or inability to use the
- Work (including but not limited to damages for loss of goodwill,
- work stoppage, computer failure or malfunction, or any and all
- other commercial damages or losses), even if such Contributor
- has been advised of the possibility of such damages.
-
- 9. Accepting Warranty or Additional Liability. While redistributing
- the Work or Derivative Works thereof, You may choose to offer,
- and charge a fee for, acceptance of support, warranty, indemnity,
- or other liability obligations and/or rights consistent with this
- License. However, in accepting such obligations, You may act only
- on Your own behalf and on Your sole responsibility, not on behalf
- of any other Contributor, and only if You agree to indemnify,
- defend, and hold each Contributor harmless for any liability
- incurred by, or claims asserted against, such Contributor by reason
- of your accepting any such warranty or additional liability.
-
- END OF TERMS AND CONDITIONS
-
- APPENDIX: How to apply the Apache License to your work.
-
- To apply the Apache License to your work, attach the following
- boilerplate notice, with the fields enclosed by brackets "{}"
- replaced with your own identifying information. (Don't include
- the brackets!) The text should be enclosed in the appropriate
- comment syntax for the file format. We also recommend that a
- file or class name and description of purpose be included on the
- same "printed page" as the copyright notice for easier
- identification within third-party archives.
-
- Copyright {yyyy} {name of copyright owner}
-
- Licensed under the Apache License, Version 2.0 (the "License");
- you may not use this file except in compliance with the License.
- You may obtain a copy of the License at
-
- https://www.apache.org/licenses/LICENSE-2.0
-
- Unless required by applicable law or agreed to in writing, software
- distributed under the License is distributed on an "AS IS" BASIS,
- WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- See the License for the specific language governing permissions and
- limitations under the License.
-
- Project-URL: Homepage, https://github.com/allenai/olmocr
- Project-URL: Repository, https://github.com/allenai/olmocr
- Project-URL: Changelog, https://github.com/allenai/olmocr/blob/main/CHANGELOG.md
- Classifier: Intended Audience :: Science/Research
- Classifier: Development Status :: 3 - Alpha
- Classifier: License :: OSI Approved :: Apache Software License
- Classifier: Programming Language :: Python :: 3
- Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
- Requires-Python: >=3.11
- Description-Content-Type: text/markdown
- License-File: LICENSE
- Requires-Dist: cached-path
- Requires-Dist: smart_open
- Requires-Dist: pypdf>=5.2.0
- Requires-Dist: pypdfium2
- Requires-Dist: cryptography
- Requires-Dist: lingua-language-detector
- Requires-Dist: Pillow
- Requires-Dist: ftfy
- Requires-Dist: bleach
- Requires-Dist: markdown2
- Requires-Dist: filelock
- Requires-Dist: orjson
- Requires-Dist: requests
- Requires-Dist: zstandard
- Requires-Dist: boto3
- Requires-Dist: httpx
- Requires-Dist: torch>=2.5.1
- Requires-Dist: transformers==4.46.2
- Requires-Dist: img2pdf
- Requires-Dist: beaker-py
- Provides-Extra: gpu
- Requires-Dist: sgl-kernel==0.0.3.post1; extra == "gpu"
- Requires-Dist: sglang[all]==0.4.2; extra == "gpu"
- Provides-Extra: dev
- Requires-Dist: ruff; extra == "dev"
- Requires-Dist: mypy; extra == "dev"
- Requires-Dist: black; extra == "dev"
- Requires-Dist: isort; extra == "dev"
- Requires-Dist: pytest; extra == "dev"
- Requires-Dist: pytest-sphinx; extra == "dev"
- Requires-Dist: pytest-cov; extra == "dev"
- Requires-Dist: twine>=1.11.0; extra == "dev"
- Requires-Dist: build; extra == "dev"
- Requires-Dist: setuptools; extra == "dev"
- Requires-Dist: wheel; extra == "dev"
- Requires-Dist: Sphinx<7.1.0,>=4.3.0; extra == "dev"
- Requires-Dist: furo==2023.7.26; extra == "dev"
- Requires-Dist: myst-parser<2.1,>=1.0; extra == "dev"
- Requires-Dist: sphinx-copybutton==0.5.2; extra == "dev"
- Requires-Dist: sphinx-autobuild==2021.3.14; extra == "dev"
- Requires-Dist: sphinx-autodoc-typehints==1.23.3; extra == "dev"
- Requires-Dist: packaging; extra == "dev"
- Requires-Dist: necessary; extra == "dev"
- Requires-Dist: peft; extra == "dev"
- Requires-Dist: datasets; extra == "dev"
- Requires-Dist: omegaconf; extra == "dev"
- Requires-Dist: spacy; extra == "dev"
- Provides-Extra: bench
- Requires-Dist: tinyhost; extra == "bench"
- Requires-Dist: fuzzysearch; extra == "bench"
- Requires-Dist: rapidfuzz; extra == "bench"
- Requires-Dist: sequence_align; extra == "bench"
- Requires-Dist: syntok; extra == "bench"
- Requires-Dist: openai; extra == "bench"
- Requires-Dist: google-genai; extra == "bench"
- Requires-Dist: playwright; extra == "bench"
- Requires-Dist: mistralai; extra == "bench"
- Requires-Dist: lxml; extra == "bench"
- Requires-Dist: flask; extra == "bench"
- Provides-Extra: train
- Requires-Dist: torch; extra == "train"
- Requires-Dist: torchvision; extra == "train"
- Requires-Dist: accelerate; extra == "train"
- Requires-Dist: datasets; extra == "train"
- Requires-Dist: peft; extra == "train"
- Requires-Dist: wandb; extra == "train"
- Requires-Dist: omegaconf; extra == "train"
- Requires-Dist: s3fs; extra == "train"
- Requires-Dist: necessary; extra == "train"
- Requires-Dist: einops; extra == "train"
- Requires-Dist: transformers>=4.45.1; extra == "train"
- Provides-Extra: elo
- Requires-Dist: numpy; extra == "elo"
- Requires-Dist: scipy; extra == "elo"
- Requires-Dist: pandas; extra == "elo"
- Requires-Dist: matplotlib; extra == "elo"
- Dynamic: license-file
- <div align="center">
- <!-- <img src="https://github.com/allenai/OLMo/assets/8812459/774ac485-a535-4768-8f7c-db7be20f5cc3" width="300"/> -->
- <img src="https://github.com/user-attachments/assets/d70c8644-3e64-4230-98c3-c52fddaeccb6" alt="olmOCR Logo" width="300"/>
- <br/>
- <br>
- <h1>olmOCR</h1>
- </div>
- <p align="center">
- <a href="https://github.com/allenai/OLMo/blob/main/LICENSE">
- <img alt="GitHub License" src="https://img.shields.io/github/license/allenai/OLMo">
- </a>
- <a href="https://github.com/allenai/olmocr/releases">
- <img alt="GitHub release" src="https://img.shields.io/github/release/allenai/olmocr.svg">
- </a>
- <a href="https://olmocr.allenai.org/papers/olmocr.pdf">
- <img alt="Tech Report" src="https://img.shields.io/badge/Paper-olmOCR-blue">
- </a>
- <a href="https://olmocr.allenai.org">
- <img alt="Demo" src="https://img.shields.io/badge/Ai2-Demo-F0529C">
- </a>
- <a href="https://discord.gg/sZq3jTNVNG">
- <img alt="Discord" src="https://img.shields.io/badge/Discord%20-%20blue?style=flat&logo=discord&label=Ai2&color=%235B65E9">
- </a>
- </p>
- A toolkit for training language models to work with PDF documents in the wild.
- Try the online demo: [https://olmocr.allenai.org/](https://olmocr.allenai.org/)
- What is included:
- - A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
- - A side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
- - Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
- - Finetuning code for Qwen2-VL and Molmo-O - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
- - Processing millions of PDFs through a finetuned model using Sglang - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
- - Viewing [Dolma docs](https://github.com/allenai/dolma) created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
- ### Installation
- Requirements:
- - Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 20 GB of GPU RAM
- - 30 GB of free disk space
- You will need to install poppler-utils and additional fonts for rendering PDF images.
- Install dependencies (Ubuntu/Debian)
- ```bash
- sudo apt-get update
- sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
- ```
- Set up a conda environment and install olmocr
- ```bash
- conda create -n olmocr python=3.11
- conda activate olmocr
- git clone https://github.com/allenai/olmocr.git
- cd olmocr
- # For CPU-only operations, e.g. running benchmarks
- pip install -e .
- # For actually converting the files with your own GPU
- pip install -e ".[gpu]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
- ```
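Before installing, it can help to confirm that the machine meets the requirements listed above. The following is a minimal sketch using only the Python standard library. The function name `preflight` is hypothetical, and probing for `pdftoppm` (one of the binaries shipped in poppler-utils) and `nvidia-smi` are assumptions made for illustration, not checks that olmocr itself performs. GPU memory is not measured.

```python
import shutil
import sys

MIN_PYTHON = (3, 11)     # from Requires-Python: >=3.11
MIN_FREE_DISK_GB = 30    # from the requirements list above


def preflight(path: str = ".") -> dict[str, bool]:
    """Return a pass/fail flag for each requirement that can be checked locally."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python>=3.11": sys.version_info >= MIN_PYTHON,
        "disk>=30GB": free_gb >= MIN_FREE_DISK_GB,
        # pdftoppm is installed by poppler-utils; used here as a proxy.
        "poppler-utils": shutil.which("pdftoppm") is not None,
        # nvidia-smi on PATH only hints that an NVIDIA driver is present.
        "nvidia-smi": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in preflight().items():
        print(f"{'ok  ' if ok else 'FAIL'} {name}")
```

A failed flag here is only a hint to look closer; for example, the fonts installed by the `apt-get` command above are not checked at all.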
- ### Local Usage Example
- For quick testing, try the [web demo](https://olmocr.allen.ai/). To run locally, a GPU is required, as inference is powered by [sglang](https://github.com/sgl-project/sglang) under the hood.
- Convert a Single PDF:
- ```bash
- python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
- ```
- Convert an Image file:
- ```bash
- python -m olmocr.pipeline ./localworkspace --pdfs random_page.png
- ```
- Convert Multiple PDFs:
- ```bash
- python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
- ```
- Results will be stored as JSONL files in `./localworkspace/results`.
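The same invocations can be scripted. The sketch below only assembles the command line shown above as an argument list; `build_pipeline_command` and `find_pdfs` are hypothetical helpers, not part of the olmocr API, and no flags are used beyond the `workspace` positional argument and `--pdfs`.

```python
import shlex
import sys
from pathlib import Path


def find_pdfs(folder: str) -> list[str]:
    """Expand *.pdf in Python instead of relying on shell globbing."""
    return sorted(str(p) for p in Path(folder).glob("*.pdf"))


def build_pipeline_command(workspace: str, pdfs: list[str]) -> list[str]:
    """Assemble `python -m olmocr.pipeline WORKSPACE --pdfs ...` as an argv list."""
    if not pdfs:
        raise ValueError("at least one PDF or image path is required")
    return [sys.executable, "-m", "olmocr.pipeline", workspace, "--pdfs", *pdfs]


if __name__ == "__main__":
    pdfs = find_pdfs("tests/gnarly_pdfs")
    if pdfs:
        # Print the command rather than running it.
        print(shlex.join(build_pipeline_command("./localworkspace", pdfs)))
```

Passing the resulting list to `subprocess.run(cmd, check=True)` would launch the conversion; that still requires olmocr to be installed and a GPU to be available.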
- #### Viewing Results
- Extracted text is stored as [Dolma](https://github.com/allenai/dolma)-style JSONL in the `./localworkspace/results` directory.
- ```bash
- cat localworkspace/results/output_*.jsonl
- ```
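For anything beyond a quick `cat`, the JSONL files can be read line by line. This is a minimal sketch that assumes each line is one JSON object with a `text` field, as in the Dolma document format. The helper name `iter_documents` is hypothetical, and the exact set of fields olmocr writes should be checked against your own output.

```python
import glob
import json
from typing import Iterator


def iter_documents(pattern: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of every matching file."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


if __name__ == "__main__":
    for doc in iter_documents("localworkspace/results/output_*.jsonl"):
        # "text" is assumed from the Dolma format; .get avoids a KeyError if absent.
        text = doc.get("text", "")
        print(f"{len(text):>8} chars  {text[:60]!r}")
```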
- View results side-by-side with the original PDFs (uses `dolmaviewer` command):
- ```bash
- python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
- ```
- Now open `./dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html` in your favorite browser.
- 
- ### Multi-node / Cluster Usage
- To convert millions of PDFs using multiple nodes running in parallel, olmOCR supports
- reading your PDFs from AWS S3 and coordinating work through an AWS S3 output bucket.
- For example, you can start this command on your first worker node, and it will set up
- a simple work queue in your AWS bucket and start converting PDFs.
- ```bash
- python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf
- ```
- On any subsequent node, run the following and it will start grabbing items from the same workspace queue.
- ```bash
- python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace
- ```
- If you are at Ai2 and want to linearize millions of PDFs efficiently using [beaker](https://www.beaker.org), just add the `--beaker`
- flag. This will prepare the workspace on your local machine, and then launch N GPU workers in the cluster to start
- converting PDFs.
- For example:
- ```bash
- python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf --beaker --beaker_gpus 4
- ```
- ### Full documentation for the pipeline
- ```bash
- python -m olmocr.pipeline --help
- usage: pipeline.py [-h] [--pdfs PDFS] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP]
-                    [--max_page_retries MAX_PAGE_RETRIES] [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--apply_filter] [--stats] [--model MODEL]
-                    [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM]
-                    [--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN] [--beaker] [--beaker_workspace BEAKER_WORKSPACE] [--beaker_cluster BEAKER_CLUSTER]
-                    [--beaker_gpus BEAKER_GPUS] [--beaker_priority BEAKER_PRIORITY]
-                    workspace
- Manager for running millions of PDFs through a batch inference pipeline
- positional arguments:
-   workspace             The filesystem path where work will be stored, can be a local folder, or an s3 path if coordinating work with many workers, s3://bucket/prefix/
- options:
-   -h, --help            show this help message and exit
-   --pdfs PDFS           Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
-   --workspace_profile WORKSPACE_PROFILE
-                         S3 configuration profile for accessing the workspace
-   --pdf_profile PDF_PROFILE
-                         S3 configuration profile for accessing the raw pdf documents
-   --pages_per_group PAGES_PER_GROUP
-                         Aiming for this many pdf pages per work item group
-   --max_page_retries MAX_PAGE_RETRIES
-                         Max number of times we will retry rendering a page
-   --max_page_error_rate MAX_PAGE_ERROR_RATE
-                         Rate of allowable failed pages in a document, 1/250 by default
-   --workers WORKERS     Number of workers to run at a time
-   --apply_filter        Apply basic filtering to English pdfs which are not forms, and not likely seo spam
-   --stats               Instead of running any job, reports some statistics about the current workspace
-   --model MODEL         List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the
-                         one which is fastest to access
-   --model_max_context MODEL_MAX_CONTEXT
-                         Maximum context length that the model was fine tuned under
-   --model_chat_template MODEL_CHAT_TEMPLATE
-                         Chat template to pass to sglang server
-   --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
-                         Dimension on longest side to use for rendering the pdf pages
-   --target_anchor_text_len TARGET_ANCHOR_TEXT_LEN
-                         Maximum amount of anchor text to use (characters)
-   --beaker              Submit this job to beaker instead of running locally
-   --beaker_workspace BEAKER_WORKSPACE
-                         Beaker workspace to submit to
-   --beaker_cluster BEAKER_CLUSTER
-                         Beaker clusters you want to run on
-   --beaker_gpus BEAKER_GPUS
-                         Number of gpu replicas to run
-   --beaker_priority BEAKER_PRIORITY
-                         Beaker priority level for the job
- ```
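As a worked example of `--max_page_error_rate`: with the default of 1/250, a 500-page document could have at most 2 failed pages. The sketch below illustrates that arithmetic only. The help text does not say whether the pipeline compares with `<=` or `<`, or what happens to a document that exceeds the rate, so the inclusive comparison here is an assumption and `within_error_rate` is a hypothetical helper, not an olmocr function.

```python
from fractions import Fraction

DEFAULT_MAX_PAGE_ERROR_RATE = Fraction(1, 250)  # "1/250 by default" per --help


def within_error_rate(failed_pages: int, total_pages: int,
                      max_rate: Fraction = DEFAULT_MAX_PAGE_ERROR_RATE) -> bool:
    """True if failed/total does not exceed max_rate (inclusive: an assumption)."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    return Fraction(failed_pages, total_pages) <= max_rate


print(within_error_rate(2, 500))   # True:  2/500 == 1/250
print(within_error_rate(3, 500))   # False: 3/500 >  1/250
print(within_error_rate(1, 100))   # False: 1/100 >  1/250
```

`Fraction` is used instead of floats so that the boundary case (exactly 1/250) is compared exactly.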
- ## Team
- <!-- start team -->
- **olmOCR** is developed and maintained by the AllenNLP team, backed by [the Allen Institute for Artificial Intelligence (AI2)](https://allenai.org/).
- AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
- To learn more about who specifically contributed to this codebase, see [our contributors](https://github.com/allenai/olmocr/graphs/contributors) page.
- <!-- end team -->
- ## License
- <!-- start license -->
- **olmOCR** is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).
- A full copy of the license can be found [on GitHub](https://github.com/allenai/olmocr/blob/main/LICENSE).
- <!-- end license -->
- ## Citing
- ```bibtex
- @misc{olmocr,
- title={{olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models}},
- author={Jake Poznanski and Jon Borchardt and Jason Dunkelberger and Regan Huff and Daniel Lin and Aman Rangapur and Christopher Wilhelm and Kyle Lo and Luca Soldaini},
- year={2025},
- eprint={2502.18443},
- archivePrefix={arXiv},
- primaryClass={cs.CL},
- url={https://arxiv.org/abs/2502.18443},
- }
- ```