Browse files

Create olmocr project repository

Sherlck1011 committed 6 months ago
commit
a83781a3f5
100 changed files with 6,287 additions and 0 deletions
  1. CHANGELOG.md (+18 -0)
  2. LICENSE (+201 -0)
  3. Makefile (+17 -0)
  4. README.md (+216 -0)
  5. RELEASE_PROCESS.md (+24 -0)
  6. __pycache__/app.cpython-310.pyc (BIN)
  7. __pycache__/app.cpython-311.pyc (BIN)
  8. api.py (+98 -0)
  9. api_test.py (+8 -0)
  10. app.py (+210 -0)
  11. combine_results.py (+36 -0)
  12. docs/.gitignore (+1 -0)
  13. docs/Makefile (+20 -0)
  14. docs/make.bat (+35 -0)
  15. docs/source/CHANGELOG.md (+18 -0)
  16. docs/source/CONTRIBUTING.md (+168 -0)
  17. docs/source/_static/css/custom.css (+0 -0)
  18. docs/source/_static/favicon.ico (BIN)
  19. docs/source/conf.py (+121 -0)
  20. docs/source/index.md (+27 -0)
  21. docs/source/installation.md (+27 -0)
  22. docs/source/overview.md (+3 -0)
  23. dolma_previews/olmocr_workspace_job_1747418779_input_pdf.html (+140 -0)
  24. dolma_previews/olmocr_workspace_job_1747418950_input_pdf.html (+154 -0)
  25. dolma_previews/olmocr_workspace_job_1747419168_input_pdf.html (+118 -0)
  26. dolma_previews/olmocr_workspace_job_1747419731_input_pdf.html (+118 -0)
  27. dolma_previews/olmocr_workspace_job_1747420686_input_pdf.html (+157 -0)
  28. dolma_previews/olmocr_workspace_job_1747493567_input_pdf.html (+154 -0)
  29. dolma_previews/olmocr_workspace_job_1747493749_input_pdf.html (+157 -0)
  30. dolma_previews/olmocr_workspace_job_1747493917_input_pdf.html (+130 -0)
  31. dolma_previews/olmocr_workspace_job_1747495273_input_pdf.html (+157 -0)
  32. dolma_previews/olmocr_workspace_job_1747495669_input_pdf.html (+157 -0)
  33. dolma_previews/olmocr_workspace_job_1747495750_input_pdf.html (+157 -0)
  34. dolma_previews/olmocr_workspace_job_1747496237_input_pdf.html (+157 -0)
  35. dolma_previews/olmocr_workspace_job_1747496349_input_pdf.html (+157 -0)
  36. dolma_previews/olmocr_workspace_job_1747496456_input_pdf.html (+157 -0)
  37. dolma_previews/olmocr_workspace_job_1747496736_input_pdf.html (+157 -0)
  38. dolma_previews/olmocr_workspace_job_1747496860_input_pdf.html (+157 -0)
  39. dolma_previews/olmocr_workspace_job_1747496960_input_pdf.html (+157 -0)
  40. dolma_previews/olmocr_workspace_job_1747497084_input_pdf.html (+161 -0)
  41. dolma_previews/olmocr_workspace_job_1747497493_input_pdf.html (+157 -0)
  42. dolma_previews/olmocr_workspace_job_1747497590_input_pdf.html (+157 -0)
  43. dolma_previews/olmocr_workspace_job_1747534381_input_pdf.html (+157 -0)
  44. dolma_previews/olmocr_workspace_job_1747534713_input_pdf.html (+154 -0)
  45. dolma_previews/olmocr_workspace_job_1747795702_input_pdf.html (+157 -0)
  46. dolma_previews/olmocr_workspace_job_1747795907_input_pdf.html (+157 -0)
  47. dolma_previews/olmocr_workspace_job_1753002204_input_pdf.html (+238 -0)
  48. dolma_previews/olmocr_workspace_job_1753002474_input_pdf.html (+238 -0)
  49. gantry-requirements.txt (+35 -0)
  50. localworkspace/results/output_03f19a67ca1619f854740bd806a32d7112c3c315.jsonl (+0 -0)
  51. localworkspace/results/output_0640d37e5d5afe1fb4a4e053d7d3389e927e5bf7.jsonl (+0 -0)
  52. localworkspace/results/output_06798e8f7cc26525f138f26354ffab7c63074f2c.jsonl (+0 -0)
  53. localworkspace/results/output_0c3e9a89b35c3045b6a67f7cd5c06009a31d750f.jsonl (+0 -0)
  54. localworkspace/results/output_10dc5d29c3f17870daf918c9555cd0b939acbffe.jsonl (+0 -0)
  55. localworkspace/results/output_1cbf4da516b0dca0de138db476a8a65d2dbc5aab.jsonl (+0 -0)
  56. localworkspace/results/output_21ee5d5d32535bcacd750ef2dace24b98fa42fdb.jsonl (+0 -0)
  57. localworkspace/results/output_225426c1e59a9bf843a4d1088c3c98aa0321642c.jsonl (+0 -0)
  58. localworkspace/results/output_24809642f1ed21aee754e7c58d350b261d121212.jsonl (+0 -0)
  59. localworkspace/results/output_2b4bbfbba141c9173ab5abba31f4a4c140a0fd85.jsonl (+0 -0)
  60. localworkspace/results/output_2ff00bac5e9500c24956e5386f6e7a49b2b55098.jsonl (+0 -0)
  61. localworkspace/results/output_398aeb9cc239880a7222603994af5c4016796381.jsonl (+0 -0)
  62. localworkspace/results/output_5da3510f60e4d62bb38dbf36fb90d4a0034727fa.jsonl (+0 -0)
  63. localworkspace/results/output_662cdaa711447efb75b7c325ea177326afc2747b.jsonl (+0 -0)
  64. localworkspace/results/output_7815bd6305410d3cbbea8287ed60dae1462e6e65.jsonl (+0 -0)
  65. localworkspace/results/output_7e7415b1a884dd4b422626d1f93cc9d5ff33301c.jsonl (+0 -0)
  66. localworkspace/results/output_8450bc4e95932e232e795c885ec59ab601993cab.jsonl (+0 -0)
  67. localworkspace/results/output_95eb6113ad117cc5bc5c734f7ca31625e117229d.jsonl (+0 -0)
  68. localworkspace/results/output_9face5eb793573e747789b627bf1cc4b334b5b93.jsonl (+0 -0)
  69. localworkspace/results/output_a516ff5c967066055babccbea12ff6a88bdfe9b5.jsonl (+0 -0)
  70. localworkspace/results/output_a7cda58bb6cdd49b7ffd2f6d48a871b4e1da7e62.jsonl (+0 -0)
  71. localworkspace/results/output_aef98857329873e434b4b835531b5abd2cfca622.jsonl (+0 -0)
  72. localworkspace/results/output_b3152b4cd8ddb87e2ad8e5fbf7906815031ce44f.jsonl (+0 -0)
  73. localworkspace/results/output_c07c41e4c78e5049d035d0059223ac0adc60be49.jsonl (+0 -0)
  74. localworkspace/results/output_c1e2b4f5c6c4bb6407c21dcae6a8dccdc2ad0e74.jsonl (+0 -0)
  75. localworkspace/results/output_d0cf1cf8644fafcb025a313b4bec083ea97e8c8d.jsonl (+0 -0)
  76. localworkspace/results/output_dbac13d5d8d14af821606b2b6fcec79288c911ad.jsonl (+0 -0)
  77. localworkspace/results/output_e4811c9442eb8e0a3b6177e544c95e0299d41166.jsonl (+0 -0)
  78. localworkspace/results/output_f5bd195da84dc4c9a132080ffb1a40239bb6d12b.jsonl (+0 -0)
  79. localworkspace/results/output_f89f7b1c93bc7bae613c7002942c0c65ba3a03d7.jsonl (+0 -0)
  80. localworkspace/work_index_list.csv.zstd (BIN)
  81. olmocr-pipeline-debug.log (+40 -0)
  82. olmocr.egg-info/PKG-INFO (+511 -0)
  83. olmocr.egg-info/SOURCES.txt (+131 -0)
  84. olmocr.egg-info/dependency_links.txt (+1 -0)
  85. olmocr.egg-info/requires.txt (+81 -0)
  86. olmocr.egg-info/top_level.txt (+6 -0)
  87. olmocr/__init__.py (+1 -0)
  88. olmocr/__pycache__/__init__.cpython-310.pyc (BIN)
  89. olmocr/__pycache__/__init__.cpython-311.pyc (BIN)
  90. olmocr/__pycache__/check.cpython-311.pyc (BIN)
  91. olmocr/__pycache__/image_utils.cpython-311.pyc (BIN)
  92. olmocr/__pycache__/metrics.cpython-311.pyc (BIN)
  93. olmocr/__pycache__/pipeline.cpython-310.pyc (BIN)
  94. olmocr/__pycache__/pipeline.cpython-311.pyc (BIN)
  95. olmocr/__pycache__/s3_utils.cpython-311.pyc (BIN)
  96. olmocr/__pycache__/version.cpython-310.pyc (BIN)
  97. olmocr/__pycache__/version.cpython-311.pyc (BIN)
  98. olmocr/__pycache__/work_queue.cpython-311.pyc (BIN)
  99. olmocr/bench/README.md (+116 -0)
  100. olmocr/bench/__init__.py (+0 -0)

+ 18 - 0
CHANGELOG.md

@@ -0,0 +1,18 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## Unreleased
+
+## [v0.1.60](https://github.com/allenai/olmocr/releases/tag/v0.1.60) - 2025-03-17
+
+## [v0.1.58](https://github.com/allenai/olmocr/releases/tag/v0.1.58) - 2025-02-15
+
+## [v0.1.53](https://github.com/allenai/olmocr/releases/tag/v0.1.53) - 2025-02-14
+
+- Fixed git checks
+
+- Added gemini and claude runners and a viewer.

+ 201 - 0
LICENSE

@@ -0,0 +1,201 @@
+                                 Apache License
+                           Version 2.0, January 2004
+                        https://www.apache.org/licenses/
+
+   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+   1. Definitions.
+
+      "License" shall mean the terms and conditions for use, reproduction,
+      and distribution as defined by Sections 1 through 9 of this document.
+
+      "Licensor" shall mean the copyright owner or entity authorized by
+      the copyright owner that is granting the License.
+
+      "Legal Entity" shall mean the union of the acting entity and all
+      other entities that control, are controlled by, or are under common
+      control with that entity. For the purposes of this definition,
+      "control" means (i) the power, direct or indirect, to cause the
+      direction or management of such entity, whether by contract or
+      otherwise, or (ii) ownership of fifty percent (50%) or more of the
+      outstanding shares, or (iii) beneficial ownership of such entity.
+
+      "You" (or "Your") shall mean an individual or Legal Entity
+      exercising permissions granted by this License.
+
+      "Source" form shall mean the preferred form for making modifications,
+      including but not limited to software source code, documentation
+      source, and configuration files.
+
+      "Object" form shall mean any form resulting from mechanical
+      transformation or translation of a Source form, including but
+      not limited to compiled object code, generated documentation,
+      and conversions to other media types.
+
+      "Work" shall mean the work of authorship, whether in Source or
+      Object form, made available under the License, as indicated by a
+      copyright notice that is included in or attached to the work
+      (an example is provided in the Appendix below).
+
+      "Derivative Works" shall mean any work, whether in Source or Object
+      form, that is based on (or derived from) the Work and for which the
+      editorial revisions, annotations, elaborations, or other modifications
+      represent, as a whole, an original work of authorship. For the purposes
+      of this License, Derivative Works shall not include works that remain
+      separable from, or merely link (or bind by name) to the interfaces of,
+      the Work and Derivative Works thereof.
+
+      "Contribution" shall mean any work of authorship, including
+      the original version of the Work and any modifications or additions
+      to that Work or Derivative Works thereof, that is intentionally
+      submitted to Licensor for inclusion in the Work by the copyright owner
+      or by an individual or Legal Entity authorized to submit on behalf of
+      the copyright owner. For the purposes of this definition, "submitted"
+      means any form of electronic, verbal, or written communication sent
+      to the Licensor or its representatives, including but not limited to
+      communication on electronic mailing lists, source code control systems,
+      and issue tracking systems that are managed by, or on behalf of, the
+      Licensor for the purpose of discussing and improving the Work, but
+      excluding communication that is conspicuously marked or otherwise
+      designated in writing by the copyright owner as "Not a Contribution."
+
+      "Contributor" shall mean Licensor and any individual or Legal Entity
+      on behalf of whom a Contribution has been received by Licensor and
+      subsequently incorporated within the Work.
+
+   2. Grant of Copyright License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      copyright license to reproduce, prepare Derivative Works of,
+      publicly display, publicly perform, sublicense, and distribute the
+      Work and such Derivative Works in Source or Object form.
+
+   3. Grant of Patent License. Subject to the terms and conditions of
+      this License, each Contributor hereby grants to You a perpetual,
+      worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+      (except as stated in this section) patent license to make, have made,
+      use, offer to sell, sell, import, and otherwise transfer the Work,
+      where such license applies only to those patent claims licensable
+      by such Contributor that are necessarily infringed by their
+      Contribution(s) alone or by combination of their Contribution(s)
+      with the Work to which such Contribution(s) was submitted. If You
+      institute patent litigation against any entity (including a
+      cross-claim or counterclaim in a lawsuit) alleging that the Work
+      or a Contribution incorporated within the Work constitutes direct
+      or contributory patent infringement, then any patent licenses
+      granted to You under this License for that Work shall terminate
+      as of the date such litigation is filed.
+
+   4. Redistribution. You may reproduce and distribute copies of the
+      Work or Derivative Works thereof in any medium, with or without
+      modifications, and in Source or Object form, provided that You
+      meet the following conditions:
+
+      (a) You must give any other recipients of the Work or
+          Derivative Works a copy of this License; and
+
+      (b) You must cause any modified files to carry prominent notices
+          stating that You changed the files; and
+
+      (c) You must retain, in the Source form of any Derivative Works
+          that You distribute, all copyright, patent, trademark, and
+          attribution notices from the Source form of the Work,
+          excluding those notices that do not pertain to any part of
+          the Derivative Works; and
+
+      (d) If the Work includes a "NOTICE" text file as part of its
+          distribution, then any Derivative Works that You distribute must
+          include a readable copy of the attribution notices contained
+          within such NOTICE file, excluding those notices that do not
+          pertain to any part of the Derivative Works, in at least one
+          of the following places: within a NOTICE text file distributed
+          as part of the Derivative Works; within the Source form or
+          documentation, if provided along with the Derivative Works; or,
+          within a display generated by the Derivative Works, if and
+          wherever such third-party notices normally appear. The contents
+          of the NOTICE file are for informational purposes only and
+          do not modify the License. You may add Your own attribution
+          notices within Derivative Works that You distribute, alongside
+          or as an addendum to the NOTICE text from the Work, provided
+          that such additional attribution notices cannot be construed
+          as modifying the License.
+
+      You may add Your own copyright statement to Your modifications and
+      may provide additional or different license terms and conditions
+      for use, reproduction, or distribution of Your modifications, or
+      for any such Derivative Works as a whole, provided Your use,
+      reproduction, and distribution of the Work otherwise complies with
+      the conditions stated in this License.
+
+   5. Submission of Contributions. Unless You explicitly state otherwise,
+      any Contribution intentionally submitted for inclusion in the Work
+      by You to the Licensor shall be under the terms and conditions of
+      this License, without any additional terms or conditions.
+      Notwithstanding the above, nothing herein shall supersede or modify
+      the terms of any separate license agreement you may have executed
+      with Licensor regarding such Contributions.
+
+   6. Trademarks. This License does not grant permission to use the trade
+      names, trademarks, service marks, or product names of the Licensor,
+      except as required for reasonable and customary use in describing the
+      origin of the Work and reproducing the content of the NOTICE file.
+
+   7. Disclaimer of Warranty. Unless required by applicable law or
+      agreed to in writing, Licensor provides the Work (and each
+      Contributor provides its Contributions) on an "AS IS" BASIS,
+      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+      implied, including, without limitation, any warranties or conditions
+      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+      PARTICULAR PURPOSE. You are solely responsible for determining the
+      appropriateness of using or redistributing the Work and assume any
+      risks associated with Your exercise of permissions under this License.
+
+   8. Limitation of Liability. In no event and under no legal theory,
+      whether in tort (including negligence), contract, or otherwise,
+      unless required by applicable law (such as deliberate and grossly
+      negligent acts) or agreed to in writing, shall any Contributor be
+      liable to You for damages, including any direct, indirect, special,
+      incidental, or consequential damages of any character arising as a
+      result of this License or out of the use or inability to use the
+      Work (including but not limited to damages for loss of goodwill,
+      work stoppage, computer failure or malfunction, or any and all
+      other commercial damages or losses), even if such Contributor
+      has been advised of the possibility of such damages.
+
+   9. Accepting Warranty or Additional Liability. While redistributing
+      the Work or Derivative Works thereof, You may choose to offer,
+      and charge a fee for, acceptance of support, warranty, indemnity,
+      or other liability obligations and/or rights consistent with this
+      License. However, in accepting such obligations, You may act only
+      on Your own behalf and on Your sole responsibility, not on behalf
+      of any other Contributor, and only if You agree to indemnify,
+      defend, and hold each Contributor harmless for any liability
+      incurred by, or claims asserted against, such Contributor by reason
+      of your accepting any such warranty or additional liability.
+
+   END OF TERMS AND CONDITIONS
+
+   APPENDIX: How to apply the Apache License to your work.
+
+      To apply the Apache License to your work, attach the following
+      boilerplate notice, with the fields enclosed by brackets "{}"
+      replaced with your own identifying information. (Don't include
+      the brackets!)  The text should be enclosed in the appropriate
+      comment syntax for the file format. We also recommend that a
+      file or class name and description of purpose be included on the
+      same "printed page" as the copyright notice for easier
+      identification within third-party archives.
+
+   Copyright {yyyy} {name of copyright owner}
+
+   Licensed under the Apache License, Version 2.0 (the "License");
+   you may not use this file except in compliance with the License.
+   You may obtain a copy of the License at
+
+       https://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.

+ 17 - 0
Makefile

@@ -0,0 +1,17 @@
+.PHONY : docs
+docs :
+	rm -rf docs/build/
+	sphinx-autobuild -b html --watch olmocr/ docs/source/ docs/build/
+
+.PHONY : run-checks
+run-checks :
+	isort --check .
+	black --check .
+	ruff check .
+	mypy .
+	CUDA_VISIBLE_DEVICES='' pytest -v --color=yes --doctest-modules tests/ olmocr/
+
+.PHONY : build
+build :
+	rm -rf *.egg-info/
+	python -m build

+ 216 - 0
README.md

@@ -0,0 +1,216 @@
+<div align="center">
+  <!-- <img src="https://github.com/allenai/OLMo/assets/8812459/774ac485-a535-4768-8f7c-db7be20f5cc3" width="300"/> -->
+<img src="https://github.com/user-attachments/assets/d70c8644-3e64-4230-98c3-c52fddaeccb6" alt="olmOCR Logo" width="300"/>
+<br/>
+  <br>
+  <h1>olmOCR</h1>
+</div>
+<p align="center">
+  <a href="https://github.com/allenai/OLMo/blob/main/LICENSE">
+    <img alt="GitHub License" src="https://img.shields.io/github/license/allenai/OLMo">
+  </a>
+  <a href="https://github.com/allenai/olmocr/releases">
+    <img alt="GitHub release" src="https://img.shields.io/github/release/allenai/olmocr.svg">
+  </a>
+  <a href="https://olmocr.allenai.org/papers/olmocr.pdf">
+    <img alt="Tech Report" src="https://img.shields.io/badge/Paper-olmOCR-blue">
+  </a>
+  <a href="https://olmocr.allenai.org">
+    <img alt="Demo" src="https://img.shields.io/badge/Ai2-Demo-F0529C">
+  </a>
+  <a href="https://discord.gg/sZq3jTNVNG">
+    <img alt="Discord" src="https://img.shields.io/badge/Discord%20-%20blue?style=flat&logo=discord&label=Ai2&color=%235B65E9">
+  </a>
+</p>
+
+A toolkit for training language models to work with PDF documents in the wild.
+
+Try the online demo: [https://olmocr.allenai.org/](https://olmocr.allenai.org/)
+
+What is included:
+ - A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
 - A side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
+ - Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
+ - Finetuning code for Qwen2-VL and Molmo-O - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
+ - Processing millions of PDFs through a finetuned model using Sglang - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
+ - Viewing [Dolma docs](https://github.com/allenai/dolma) created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
+
+### Installation
+
+Requirements:
+ - Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 20 GB of GPU RAM
+ - 30GB of free disk space
+
+You will need to install poppler-utils and additional fonts for rendering PDF images.
+
+Install dependencies (Ubuntu/Debian)
+```bash
+sudo apt-get update
+sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
+```
+
+Set up a conda environment and install olmocr
+```bash
+conda create -n olmocr python=3.11
+conda activate olmocr
+
+git clone https://github.com/allenai/olmocr.git
+cd olmocr
+
+# For CPU-only operations, ex. running benchmarks
+pip install -e .
+
+# For actually converting the files with your own GPU
+pip install -e .[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
+```
+
+### Local Usage Example
+
+For quick testing, try the [web demo](https://olmocr.allen.ai/). To run locally, a GPU is required, as inference is powered by [sglang](https://github.com/sgl-project/sglang) under the hood.
+
+Convert a single PDF:
+```bash
+python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
+```
+
+Convert an image file:
+```bash
+python -m olmocr.pipeline ./localworkspace --pdfs random_page.png
+```
+
+Convert multiple PDFs:
+```bash
+python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
+```
+
+Results will be stored as JSON in `./localworkspace`.
+
+#### Viewing Results
+
+Extracted text is stored as [Dolma](https://github.com/allenai/dolma)-style JSONL inside of the `./localworkspace/results` directory.
+
+```bash
+cat localworkspace/results/output_*.jsonl
+```
+
+View results side-by-side with the original PDFs (uses `dolmaviewer` command):
+
+```bash
+python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
+```
+
+Now open `./dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html` in your favorite browser.
+
+![image](https://github.com/user-attachments/assets/128922d1-63e6-4d34-84f2-d7901237da1f)
+
+
+### Multi-node / Cluster Usage
+
+If you want to convert millions of PDFs, using multiple nodes running in parallel, then olmOCR supports
+reading your PDFs from AWS S3, and coordinating work using an AWS S3 output bucket.
+
+For example, you can start this command on your first worker node, and it will set up
+a simple work queue in your AWS bucket and start converting PDFs.
+
+```bash
+python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf
+```
+
+Now on any subsequent nodes, just run this and they will start grabbing items from the same workspace queue.
+```bash
+python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace
+```
+
+If you are at Ai2 and want to linearize millions of PDFs efficiently using [beaker](https://www.beaker.org), just add the `--beaker`
+flag. This will prepare the workspace on your local machine, and then launch N GPU workers in the cluster to start
+converting PDFs.
+
+For example:
+```bash
+python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf --beaker --beaker_gpus 4
+```
+
+### Full documentation for the pipeline
+
+```bash
+python -m olmocr.pipeline --help
+usage: pipeline.py [-h] [--pdfs PDFS] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP]
+                   [--max_page_retries MAX_PAGE_RETRIES] [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--apply_filter] [--stats] [--model MODEL]
+                   [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM]
+                   [--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN] [--beaker] [--beaker_workspace BEAKER_WORKSPACE] [--beaker_cluster BEAKER_CLUSTER]
+                   [--beaker_gpus BEAKER_GPUS] [--beaker_priority BEAKER_PRIORITY]
+                   workspace
+
+Manager for running millions of PDFs through a batch inference pipeline
+
+positional arguments:
+  workspace             The filesystem path where work will be stored, can be a local folder, or an s3 path if coordinating work with many workers, s3://bucket/prefix/
+
+options:
+  -h, --help            show this help message and exit
+  --pdfs PDFS           Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
+  --workspace_profile WORKSPACE_PROFILE
+                        S3 configuration profile for accessing the workspace
+  --pdf_profile PDF_PROFILE
+                        S3 configuration profile for accessing the raw pdf documents
+  --pages_per_group PAGES_PER_GROUP
+                        Aiming for this many pdf pages per work item group
+  --max_page_retries MAX_PAGE_RETRIES
+                        Max number of times we will retry rendering a page
+  --max_page_error_rate MAX_PAGE_ERROR_RATE
+                        Rate of allowable failed pages in a document, 1/250 by default
+  --workers WORKERS     Number of workers to run at a time
+  --apply_filter        Apply basic filtering to English pdfs which are not forms, and not likely seo spam
+  --stats               Instead of running any job, reports some statistics about the current workspace
+  --model MODEL         List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the
+                        one which is fastest to access
+  --model_max_context MODEL_MAX_CONTEXT
+                        Maximum context length that the model was fine tuned under
+  --model_chat_template MODEL_CHAT_TEMPLATE
+                        Chat template to pass to sglang server
+  --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
+                        Dimension on longest side to use for rendering the pdf pages
+  --target_anchor_text_len TARGET_ANCHOR_TEXT_LEN
+                        Maximum amount of anchor text to use (characters)
+  --beaker              Submit this job to beaker instead of running locally
+  --beaker_workspace BEAKER_WORKSPACE
+                        Beaker workspace to submit to
+  --beaker_cluster BEAKER_CLUSTER
+                        Beaker clusters you want to run on
+  --beaker_gpus BEAKER_GPUS
+                        Number of gpu replicas to run
+  --beaker_priority BEAKER_PRIORITY
+                        Beaker priority level for the job
+```
+
+
+## Team
+
+<!-- start team -->
+
+**olmOCR** is developed and maintained by the AllenNLP team, backed by [the Allen Institute for Artificial Intelligence (AI2)](https://allenai.org/).
+AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
+To learn more about who specifically contributed to this codebase, see [our contributors](https://github.com/allenai/olmocr/graphs/contributors) page.
+
+<!-- end team -->
+
+## License
+
+<!-- start license -->
+
+**olmOCR** is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).
+A full copy of the license can be found [on GitHub](https://github.com/allenai/olmocr/blob/main/LICENSE).
+
+<!-- end license -->
+
+## Citing
+
+```bibtex
+@misc{olmocr,
+      title={{olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models}},
+      author={Jake Poznanski and Jon Borchardt and Jason Dunkelberger and Regan Huff and Daniel Lin and Aman Rangapur and Christopher Wilhelm and Kyle Lo and Luca Soldaini},
+      year={2025},
+      eprint={2502.18443},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2502.18443},
+}
+```
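The README's "Viewing Results" section dumps the Dolma-style JSONL with `cat`; a minimal Python sketch for pulling out just the extracted text (assuming, as `api.py` later in this commit does, that each line is a JSON object carrying a `text` field):

```python
import json
from pathlib import Path


def read_dolma_results(results_dir):
    """Yield the extracted text of each record in the output JSONL files."""
    for output_file in sorted(Path(results_dir).glob("output_*.jsonl")):
        with open(output_file, "r", encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines between records
                record = json.loads(line)
                yield record.get("text", "")
```

For example, `list(read_dolma_results("localworkspace/results"))` would collect the text of every converted document in one pass.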

+ 24 - 0
RELEASE_PROCESS.md

@@ -0,0 +1,24 @@
+# GitHub Release Process
+
+## Steps
+
+1. Update the version in `olmocr/version.py`.
+
+2. Run the release script:
+
+    ```bash
+    ./scripts/release.sh
+    ```
+
+    This will commit the changes to the CHANGELOG and `version.py` files and then create a new tag in git
+    which will trigger a workflow on GitHub Actions that handles the rest.
+
+## Fixing a failed release
+
+If for some reason the GitHub Actions release workflow failed with an error that needs to be fixed, you'll have to delete both the tag and corresponding release from GitHub. After you've pushed a fix, delete the tag from your local clone with
+
+```bash
+git tag -l | xargs git tag -d && git fetch -t
+```
+
+Then repeat the steps above.

BIN
__pycache__/app.cpython-310.pyc


BIN
__pycache__/app.cpython-311.pyc


+ 98 - 0
api.py

@@ -0,0 +1,98 @@
+from fastapi import FastAPI, File, UploadFile
+from fastapi.responses import JSONResponse
+import os
+import subprocess
+import uvicorn
+import json
+import shutil
+
+from pathlib import Path
+
+app = FastAPI()
+
+# 确保工作目录存在
+WORKSPACE = "./workspace"
+os.makedirs(WORKSPACE, exist_ok=True)
+
+@app.post("/olmocr/")
+async def olmocr(file: UploadFile = File(...)):
+    """使用olmocr将pdf处理成具有格式化的文本"""
+    if file.content_type != "application/pdf":
+        print(file.content_type)
+        return JSONResponse(
+            status_code=400,
+            content={"message": "只支持 PDF 文件"}
+        )
+        
+    # 生成保存路径
+    file_path = os.path.join(WORKSPACE, str(file.filename))
+    # 保存文件
+    try:
+        contents = await file.read()
+        with open(file_path, "wb") as f:
+            f.write(contents)
+        pdf_path = os.path.join(WORKSPACE, str(file.filename))
+        
+        # 构建命令并执行
+        cmd = ["python", "-m", "olmocr.pipeline", WORKSPACE, "--pdfs", pdf_path]
+        
+         # 执行命令,等待完成
+        process = subprocess.run(
+            cmd,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            text=True,
+            check=True
+        )
+        # Command output
+        log_text = process.stdout
+
+        # Locate the results directory
+        results_dir = os.path.join(WORKSPACE, "results")
+        # Find the output files
+        output_files = list(Path(results_dir).glob("output_*.jsonl"))
+        if not output_files:
+            return JSONResponse(
+                status_code=500,
+                content={"message": f"No output file was produced\n\nLog output:\n{log_text}"},
+            )
+        # Read the JSONL file
+        output_file = output_files[0]
+
+        with open(output_file, "r") as f:
+            content = f.read().strip()
+            if not content:
+                return JSONResponse(
+                    status_code=500,
+                    content={"message": f"Output file is empty\n\nLog output:\n{log_text}"},
+                )
+
+            # Parse the JSON
+            result = json.loads(content)
+            extracted_text = result.get("text", "No text content found")
+        return {
+            "message": extracted_text,
+        }
+    except Exception as e:
+        return JSONResponse(
+            status_code=500,
+            content={"message": f"文件上传失败: {str(e)}"}
+        )
+    finally:
+        # Empty the WORKSPACE folder
+        try:
+            # Close the upload first (if still open)
+            await file.close()
+
+            # Delete every file and folder under WORKSPACE
+            for filename in os.listdir(WORKSPACE):
+                file_path = os.path.join(WORKSPACE, filename)
+                try:
+                    if os.path.isfile(file_path) or os.path.islink(file_path):
+                        os.unlink(file_path)
+                    elif os.path.isdir(file_path):
+                        shutil.rmtree(file_path)
+                except Exception as e:
+                    print(f"Error deleting {file_path}: {e}")
+
+        except Exception as e:
+            print(f"Error cleaning up the workspace: {e}")
+    
+    
+    
+    
+if __name__ == "__main__":
+    uvicorn.run(app, host="0.0.0.0", port=8000)
+    

+ 8 - 0
api_test.py

@@ -0,0 +1,8 @@
+import requests
+
+url = "http://127.0.0.1:8000/olmocr/"
+with open("./tests/gnarly_pdfs/UNETR.pdf", "rb") as f:
+    files = {"file": ("UNETR.pdf", f, "application/pdf")}
+    response = requests.post(url, files=files)
+print(response.json())
+

+ 210 - 0
app.py

@@ -0,0 +1,210 @@
+import os
+import json
+import gradio as gr
+import subprocess
+import pandas as pd
+from pathlib import Path
+import shutil
+import time
+import re
+
+# Create the workspace directory
+WORKSPACE_DIR = "olmocr_workspace"
+os.makedirs(WORKSPACE_DIR, exist_ok=True)
+
+def modify_html_for_better_display(html_content):
+    """修改HTML以便在Gradio中更好地显示"""
+    if not html_content:
+        return html_content
+    
+    # Widen the container
+    html_content = html_content.replace('<div class="container">', 
+                                       '<div class="container" style="max-width: 100%; width: 100%;">')
+    
+    # Increase the text size
+    html_content = html_content.replace('<style>', 
+                                       '<style>\nbody {font-size: 16px;}\n.text-content {font-size: 16px; line-height: 1.5;}\n')
+    
+    # Adjust the size ratio between the image and text panes
+    html_content = html_content.replace('<div class="row">', 
+                                       '<div class="row" style="display: flex; flex-wrap: wrap;">')
+    html_content = html_content.replace('<div class="col-md-6">', 
+                                       '<div class="col-md-6" style="flex: 0 0 50%; max-width: 50%; padding: 15px;">')
+    
+    # Add spacing between pages
+    html_content = html_content.replace('<div class="page">', 
+                                       '<div class="page" style="margin-bottom: 30px; border-bottom: 1px solid #ccc; padding-bottom: 20px;">')
+    
+    # Enlarge images
+    html_content = re.sub(r'<img([^>]*)style="([^"]*)"', 
+                         r'<img\1style="max-width: 100%; height: auto; \2"', 
+                         html_content)
+    
+    # Add zoom controls
+    zoom_controls = """
+    <div style="position: fixed; bottom: 20px; right: 20px; background: #fff; padding: 10px; border-radius: 5px; box-shadow: 0 0 10px rgba(0,0,0,0.2); z-index: 1000;">
+        <button onclick="document.body.style.zoom = parseFloat(document.body.style.zoom || 1) + 0.1;" style="margin-right: 5px;">放大</button>
+        <button onclick="document.body.style.zoom = parseFloat(document.body.style.zoom || 1) - 0.1;">缩小</button>
+    </div>
+    """
+    html_content = html_content.replace('</body>', f'{zoom_controls}</body>')
+    
+    return html_content
+
+def process_pdf(pdf_file):
+    """处理PDF文件并返回结果"""
+    if pdf_file is None:
+        return "请上传PDF文件", "", None, None
+    
+    # Create a unique working directory
+    timestamp = int(time.time())
+    work_dir = os.path.join(WORKSPACE_DIR, f"job_{timestamp}")
+    os.makedirs(work_dir, exist_ok=True)
+    
+    # Copy the PDF file
+    pdf_path = os.path.join(work_dir, "input.pdf")
+    shutil.copy(pdf_file, pdf_path)
+    
+    # Build the command
+    cmd = ["python", "-m", "olmocr.pipeline", work_dir, "--pdfs", pdf_path]
+    
+    try:
+        # Run the command and wait for it to finish
+        process = subprocess.run(
+            cmd,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            text=True,
+            check=True
+        )
+        
+        # Command output
+        log_text = process.stdout
+        
+        # Check the results directory
+        results_dir = os.path.join(work_dir, "results")
+        if not os.path.exists(results_dir):
+            return f"Processing finished, but no results directory was created\n\nLog output:\n{log_text}"
+
+        # Find the output files
+        output_files = list(Path(results_dir).glob("output_*.jsonl"))
+        if not output_files:
+            return f"Processing finished, but no output file was found\n\nLog output:\n{log_text}"
+        
+        # Read the JSONL file
+        output_file = output_files[0]
+        with open(output_file, "r") as f:
+            content = f.read().strip()
+            if not content:
+                return f"Output file is empty\n\nLog output:\n{log_text}"
+
+            # Parse the JSON
+            result = json.loads(content)
+            extracted_text = result.get("text", "No text content found")
+
+            # Generate the HTML preview
+            try:
+                preview_cmd = ["python", "-m", "olmocr.viewer.dolmaviewer", str(output_file)]
+                subprocess.run(preview_cmd, check=True)
+            except Exception as e:
+                log_text += f"\nFailed to generate HTML preview: {str(e)}"
+            
+            # Find the HTML files
+            html_files = list(Path("dolma_previews").glob("*.html"))
+            html_content = ""
+            if html_files:
+                try:
+                    with open(html_files[0], "r", encoding="utf-8") as hf:
+                        html_content = hf.read()
+                        # Tweak the HTML for better display
+                        html_content = modify_html_for_better_display(html_content)
+                except Exception as e:
+                    log_text += f"\nFailed to read HTML preview: {str(e)}"
+
+            # Build the metadata table
+            metadata = result.get("metadata", {})
+            meta_rows = []
+            for key, value in metadata.items():
+                meta_rows.append([key, value])
+
+            df = pd.DataFrame(meta_rows, columns=["Property", "Value"])
+            
+            # return log_text, extracted_text, html_content, df
+            return extracted_text
+        
+    except subprocess.CalledProcessError as e:
+        return f"Command failed: {e.stderr}"
+    except Exception as e:
+        return f"An error occurred during processing: {str(e)}"
+
+# Build the Gradio interface
+with gr.Blocks(title="鼎盛方圆 PDF Extraction Tool") as app:
+    gr.Markdown("# 鼎盛方圆 PDF Text Extraction Tool")
+    
+    with gr.Row():
+        with gr.Column(scale=1):
+            pdf_input = gr.File(label="Upload PDF file", file_types=[".pdf"])
+            process_btn = gr.Button("Process PDF", variant="primary")
+        
+        with gr.Column(scale=2):
+            tabs = gr.Tabs()
+            with tabs:
+                with gr.TabItem("Extracted Text"):
+                    text_output = gr.Textbox(label="Extracted text", lines=20, interactive=True)
+                # with gr.TabItem("HTML Preview", id="html_preview_tab"):
+                #     # Use a larger HTML component
+                #     html_output = gr.HTML(label="HTML preview", elem_id="html_preview_container")
+                # with gr.TabItem("Metadata"):
+                #     meta_output = gr.DataFrame(label="Document metadata")
+                # with gr.TabItem("Logs"):
+                #     log_output = gr.Textbox(label="Processing log", lines=15, interactive=False)
+    
+    # Use CSS to customize the HTML preview tab and content size
+    gr.HTML("""
+    <style>
+    #html_preview_container {
+        height: 800px;
+        width: 100%; 
+        overflow: auto;
+        border: 1px solid #ddd;
+        border-radius: 4px;
+    }
+    #html_preview_container iframe {
+        width: 100%;
+        height: 100%;
+        border: none;
+    }
+    </style>
+    """)
+    
+    # Add usage instructions
+    gr.Markdown("""
+    ## Instructions
+    1. Upload a PDF file
+    2. Click the "Process PDF" button
+    3. Wait for processing to finish
+    4. View the extracted text
+
+    ## Notes
+    - Processing may take a few minutes; please be patient
+    - The first run downloads the model (about 7 GB)
+    """)
+    
+    # Bind the button event - runs in blocking mode
+    process_btn.click(
+        fn=process_pdf,
+        inputs=pdf_input,
+        # outputs=[log_output, text_output, html_output, meta_output],
+        outputs=[text_output],
+        api_name="process"
+    )
+
+# Launch the app
+if __name__ == "__main__":
+    app.launch(
+        server_name='0.0.0.0',
+        server_port=5000,
+        share=False
+    )
+
+
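The `re.sub` call in `modify_html_for_better_display` can be exercised in isolation. A minimal sketch of what the image-widening substitution does (the sample snippet below is made up for illustration):

```python
import re

def widen_images(html: str) -> str:
    # Prepend responsive sizing to every inline <img ... style="..."> attribute,
    # mirroring the substitution used in modify_html_for_better_display.
    return re.sub(
        r'<img([^>]*)style="([^"]*)"',
        r'<img\1style="max-width: 100%; height: auto; \2"',
        html,
    )

print(widen_images('<img src="p1.png" style="width: 300px">'))
# → <img src="p1.png" style="max-width: 100%; height: auto; width: 300px">
```

Any fixed width declared inline is kept, but `max-width: 100%` takes effect first, so images shrink to fit the Gradio pane.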

+ 36 - 0
combine_results.py

@@ -0,0 +1,36 @@
+import os
+import json
+
+def merge_jsonl_files(input_folder, output_file):
+    """
+    Merge the contents of all .jsonl files in the given folder into one output file.
+    :param input_folder: path to the folder containing the .jsonl files
+    :param output_file: output file path (e.g. results.txt)
+    """
+    # Make sure the output directory exists
+    out_dir = os.path.dirname(output_file)
+    if out_dir:
+        os.makedirs(out_dir, exist_ok=True)
+
+    with open(output_file, 'w', encoding='utf-8') as out_f:
+        # Walk over every file in the folder
+        for filename in os.listdir(input_folder):
+            if filename.endswith('.jsonl'):
+                filepath = os.path.join(input_folder, filename)
+                try:
+                    with open(filepath, 'r', encoding='utf-8') as in_f:
+                        for line in in_f:
+                            # Write the raw line as-is (preserving the JSONL format)
+                            out_f.write(line)
+                            # To extract a specific field instead, parse the JSON:
+                            # data = json.loads(line)
+                            # out_f.write(data.get('text', '') + '\n')
+                    print(f"Merged: {filename}")
+                except Exception as e:
+                    print(f"Error processing {filepath}: {e}")
+
+    print(f"All .jsonl files merged into {output_file}")
+
+if __name__ == "__main__":
+    # Usage example
+    input_folder = "./localworkspace/results/"  # replace with the folder containing your .jsonl files
+    output_file = "./results.txt"        # output file path
+    
+    merge_jsonl_files(input_folder, output_file)
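The commented-out alternative in `merge_jsonl_files` (parsing each line and keeping only its `text` field) can be written as a small standalone helper. This is a hypothetical sketch, not part of the repository:

```python
import json

def extract_texts(jsonl_lines):
    # Parse each JSONL line and keep only its "text" field,
    # skipping blank lines.
    texts = []
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        texts.append(record.get("text", ""))
    return texts

print(extract_texts(['{"text": "page one"}', '{"text": "page two"}']))
# → ['page one', 'page two']
```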

+ 1 - 0
docs/.gitignore

@@ -0,0 +1 @@
+build

+ 20 - 0
docs/Makefile

@@ -0,0 +1,20 @@
+# Minimal makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line, and also
+# from the environment for the first two.
+SPHINXOPTS    ?= -W
+SPHINXBUILD   ?= sphinx-build
+SOURCEDIR     = source
+BUILDDIR      = build
+
+# Put it first so that "make" without argument is like "make help".
+help:
+	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
+
+.PHONY: help Makefile
+
+# Catch-all target: route all unknown targets to Sphinx using the new
+# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
+%: Makefile
+	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

+ 35 - 0
docs/make.bat

@@ -0,0 +1,35 @@
+@ECHO OFF
+
+pushd %~dp0
+
+REM Command file for Sphinx documentation
+
+if "%SPHINXBUILD%" == "" (
+	set SPHINXBUILD=sphinx-build
+)
+set SOURCEDIR=source
+set BUILDDIR=build
+
+if "%1" == "" goto help
+
+%SPHINXBUILD% >NUL 2>NUL
+if errorlevel 9009 (
+	echo.
+	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
+	echo.installed, then set the SPHINXBUILD environment variable to point
+	echo.to the full path of the 'sphinx-build' executable. Alternatively you
+	echo.may add the Sphinx directory to PATH.
+	echo.
+	echo.If you don't have Sphinx installed, grab it from
+	echo.https://www.sphinx-doc.org/
+	exit /b 1
+)
+
+%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+goto end
+
+:help
+%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
+
+:end
+popd

+ 18 - 0
docs/source/CHANGELOG.md

@@ -0,0 +1,18 @@
+# Changelog
+
+All notable changes to this project will be documented in this file.
+
+The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
+and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## Unreleased
+
+## [v0.1.60](https://github.com/allenai/olmocr/releases/tag/v0.1.60) - 2025-03-17
+
+## [v0.1.58](https://github.com/allenai/olmocr/releases/tag/v0.1.58) - 2025-02-15
+
+## [v0.1.53](https://github.com/allenai/olmocr/releases/tag/v0.1.53) - 2025-02-14
+
+- Fixed git checks
+
+- Added gemini and claude runners and a viewer.

+ 168 - 0
docs/source/CONTRIBUTING.md

@@ -0,0 +1,168 @@
+# Contributing
+
+Thanks for considering contributing! Please read this document to learn the various ways you can contribute to this project and how to go about doing it.
+
+## Bug reports and feature requests
+
+### Did you find a bug?
+
+First, do [a quick search](https://github.com/allenai/olmocr/issues) to see whether your issue has already been reported.
+If your issue has already been reported, please comment on the existing issue.
+
+Otherwise, open [a new GitHub issue](https://github.com/allenai/olmocr/issues).  Be sure to include a clear title
+and description.  The description should include as much relevant information as possible.  The description should
+explain how to reproduce the erroneous behavior as well as the behavior you expect to see.  Ideally you would include a
+code sample or an executable test case demonstrating the expected behavior.
+
+### Do you have a suggestion for an enhancement or new feature?
+
+We use GitHub issues to track feature requests. Before you create a feature request:
+
+* Make sure you have a clear idea of the enhancement you would like. If you have a vague idea, consider discussing
+it first on a GitHub issue.
+* Check the documentation to make sure your feature does not already exist.
+* Do [a quick search](https://github.com/allenai/olmocr/issues) to see whether your feature has already been suggested.
+
+When creating your request, please:
+
+* Provide a clear title and description.
+* Explain why the enhancement would be useful. It may be helpful to highlight the feature in other libraries.
+* Include code examples to demonstrate how the enhancement would be used.
+
+## Making a pull request
+
+When you're ready to contribute code to address an open issue, please follow these guidelines to help us be able to review your pull request (PR) quickly.
+
+1. **Initial setup** (only do this once)
+
+    <details><summary>Expand details 👇</summary><br/>
+
+    If you haven't already done so, please [fork](https://help.github.com/en/enterprise/2.13/user/articles/fork-a-repo) this repository on GitHub.
+
+    Then clone your fork locally with
+
+        git clone https://github.com/USERNAME/olmocr.git
+
+    or 
+
+        git clone git@github.com:USERNAME/olmocr.git
+
+    At this point the local clone of your fork only knows that it came from *your* repo, github.com/USERNAME/olmocr.git, but doesn't know anything about the *main* repo, [https://github.com/allenai/olmocr](https://github.com/allenai/olmocr). You can see this by running
+
+        git remote -v
+
+    which will output something like this:
+
+        origin https://github.com/USERNAME/olmocr.git (fetch)
+        origin https://github.com/USERNAME/olmocr.git (push)
+
+    This means that your local clone can only track changes from your fork, but not from the main repo, and so you won't be able to keep your fork up-to-date with the main repo over time. Therefore you'll need to add another "remote" to your clone that points to [https://github.com/allenai/olmocr](https://github.com/allenai/olmocr). To do this, run the following:
+
+        git remote add upstream https://github.com/allenai/olmocr.git
+
+    Now if you do `git remote -v` again, you'll see
+
+        origin https://github.com/USERNAME/olmocr.git (fetch)
+        origin https://github.com/USERNAME/olmocr.git (push)
+        upstream https://github.com/allenai/olmocr.git (fetch)
+        upstream https://github.com/allenai/olmocr.git (push)
+
+    Finally, you'll need to create a Python 3 virtual environment suitable for working on this project. There are a number of tools out there that make working with virtual environments easier.
+    The most direct way is with the [`venv` module](https://docs.python.org/3.7/library/venv.html) in the standard library, but if you're new to Python or you don't already have a recent Python 3 version installed on your machine,
+    we recommend [Miniconda](https://docs.conda.io/en/latest/miniconda.html).
+
+    On Mac, for example, you can install Miniconda with [Homebrew](https://brew.sh/):
+
+        brew install miniconda
+
+    Then you can create and activate a new Python environment by running:
+
+        conda create -n olmocr python=3.9
+        conda activate olmocr
+
+    Once your virtual environment is activated, you can install your local clone in "editable mode" with
+
+        pip install -U pip setuptools wheel
+        pip install -e .[dev]
+
+    The "editable mode" comes from the `-e` argument to `pip`, and essential just creates a symbolic link from the site-packages directory of your virtual environment to the source code in your local clone. That way any changes you make will be immediately reflected in your virtual environment.
+
+    </details>
+
+2. **Ensure your fork is up-to-date**
+
+    <details><summary>Expand details 👇</summary><br/>
+
+    Once you've added an "upstream" remote pointing to [https://github.com/allenai/olmocr](https://github.com/allenai/olmocr), keeping your fork up-to-date is easy:
+
+        git checkout main  # if not already on main
+        git pull --rebase upstream main
+        git push
+
+    </details>
+
+3. **Create a new branch to work on your fix or enhancement**
+
+    <details><summary>Expand details 👇</summary><br/>
+
+    Committing directly to the main branch of your fork is not recommended. It will be easier to keep your fork clean if you work on a separate branch for each contribution you intend to make.
+
+    You can create a new branch with
+
+        # replace BRANCH with whatever name you want to give it
+        git checkout -b BRANCH
+        git push -u origin BRANCH
+
+    </details>
+
+4. **Test your changes**
+
+    <details><summary>Expand details 👇</summary><br/>
+
+    Our continuous integration (CI) testing runs [a number of checks](https://github.com/allenai/olmocr/actions) for each pull request on [GitHub Actions](https://github.com/features/actions). You can run most of these tests locally, which is something you should do *before* opening a PR to help speed up the review process and make it easier for us.
+
+    First, you should run [`isort`](https://github.com/PyCQA/isort) and [`black`](https://github.com/psf/black) to make sure your code is formatted consistently.
+    Many IDEs support code formatters as plugins, so you may be able to set up isort and black to run automatically every time you save.
+    For example, [`black.vim`](https://github.com/psf/black/tree/master/plugin) will give you this functionality in Vim. But both `isort` and `black` are also easy to run directly from the command line.
+    Just run this from the root of your clone:
+
+        isort .
+        black .
+
+    Our CI also uses [`ruff`](https://github.com/astral-sh/ruff) to lint the code base and [`mypy`](http://mypy-lang.org/) for type-checking. You should run both of these next with
+
+        ruff check .
+
+    and
+
+        mypy .
+
+    We also strive to maintain high test coverage, so most contributions should include additions to [the unit tests](https://github.com/allenai/olmocr/tree/main/tests). These tests are run with [`pytest`](https://docs.pytest.org/en/latest/), which you can use to locally run any test modules that you've added or changed.
+
+    For example, if you've fixed a bug in `olmocr/a/b.py`, you can run the tests specific to that module with
+
+        pytest -v tests/a/b_test.py
+
+    If your contribution involves additions to any public part of the API, we require that you write docstrings
+    for each function, method, class, or module that you add.
+    See the [Writing docstrings](#writing-docstrings) section below for details on the syntax.
+    You should test to make sure the API documentation can build without errors by running
+
+        make docs
+
+    If the build fails, it's most likely due to small formatting issues. If the error message isn't clear, feel free to comment on this in your pull request.
+
+    And finally, please update the [CHANGELOG](https://github.com/allenai/olmocr/blob/main/CHANGELOG.md) with notes on your contribution in the "Unreleased" section at the top.
+
+    After all of the above checks have passed, you can now open [a new GitHub pull request](https://github.com/allenai/olmocr/pulls).
+    Make sure you have a clear description of the problem and the solution, and include a link to relevant issues.
+
+    We look forward to reviewing your PR!
+
+    </details>
+
+### Writing docstrings
+
+We use [Sphinx](https://www.sphinx-doc.org/en/master/index.html) to build our API docs, which automatically parses all docstrings
+of public classes and methods using the [autodoc](https://www.sphinx-doc.org/en/master/usage/extensions/autodoc.html) extension.
+Please refer to autodoc's documentation to learn about the docstring syntax.
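
Since `sphinx.ext.napoleon` is enabled in `docs/source/conf.py`, Google-style docstrings should render correctly. A minimal sketch (the function itself is hypothetical, purely to illustrate the format):

```python
def count_tokens(text: str, separator: str = " ") -> int:
    """Count non-empty tokens in a string.

    Args:
        text: The input string to tokenize.
        separator: The substring used to split ``text``.

    Returns:
        The number of non-empty tokens.
    """
    # Split on the separator and drop empty pieces from repeated separators.
    return len([t for t in text.split(separator) if t])
```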

+ 0 - 0
docs/source/_static/css/custom.css


BIN
docs/source/_static/favicon.ico


+ 121 - 0
docs/source/conf.py

@@ -0,0 +1,121 @@
+# Configuration file for the Sphinx documentation builder.
+#
+# This file only contains a selection of the most common options. For a full
+# list see the documentation:
+# https://www.sphinx-doc.org/en/master/usage/configuration.html
+
+import logging
+import os
+import sys
+from datetime import datetime
+
+# -- Path setup --------------------------------------------------------------
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#
+
+sys.path.insert(0, os.path.abspath("../../"))
+
+from olmocr import VERSION, VERSION_SHORT  # noqa: E402
+
+# -- Project information -----------------------------------------------------
+
+project = "olmocr"
+copyright = f"{datetime.today().year}, Allen Institute for Artificial Intelligence"
+author = "Allen Institute for Artificial Intelligence"
+version = VERSION_SHORT
+release = VERSION
+
+
+# -- General configuration ---------------------------------------------------
+
+# Add any Sphinx extension module names here, as strings. They can be
+# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
+# ones.
+extensions = [
+    "sphinx.ext.autodoc",
+    "sphinx.ext.napoleon",
+    "myst_parser",
+    "sphinx.ext.intersphinx",
+    "sphinx.ext.viewcode",
+    "sphinx.ext.doctest",
+    "sphinx_copybutton",
+    "sphinx_autodoc_typehints",
+]
+
+# Tell myst-parser to assign header anchors for h1-h3.
+myst_heading_anchors = 3
+
+suppress_warnings = ["myst.header"]
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ["_templates"]
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+# This pattern also affects html_static_path and html_extra_path.
+exclude_patterns = ["_build"]
+
+source_suffix = [".rst", ".md"]
+
+intersphinx_mapping = {
+    "python": ("https://docs.python.org/3", None),
+    # Uncomment these if you use them in your codebase:
+    #  "torch": ("https://pytorch.org/docs/stable", None),
+    #  "datasets": ("https://huggingface.co/docs/datasets/master/en", None),
+    #  "transformers": ("https://huggingface.co/docs/transformers/master/en", None),
+}
+
+# By default, sort documented members by type within classes and modules.
+autodoc_member_order = "groupwise"
+
+# Include default values when documenting parameter types.
+typehints_defaults = "comma"
+
+
+# -- Options for HTML output -------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+#
+html_theme = "furo"
+
+html_title = f"olmocr v{VERSION}"
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ["_static"]
+
+html_css_files = ["css/custom.css"]
+
+html_favicon = "_static/favicon.ico"
+
+html_theme_options = {
+    "footer_icons": [
+        {
+            "name": "GitHub",
+            "url": "https://github.com/allenai/olmocr",
+            "html": """
+                <svg stroke="currentColor" fill="currentColor" stroke-width="0" viewBox="0 0 16 16">
+                    <path fill-rule="evenodd" d="M8 0C3.58 0 0 3.58 0 8c0 3.54 2.29 6.53 5.47 7.59.4.07.55-.17.55-.38 0-.19-.01-.82-.01-1.49-2.01.37-2.53-.49-2.69-.94-.09-.23-.48-.94-.82-1.13-.28-.15-.68-.52-.01-.53.63-.01 1.08.58 1.23.82.72 1.21 1.87.87 2.33.66.07-.52.28-.87.51-1.07-1.78-.2-3.64-.89-3.64-3.95 0-.87.31-1.59.82-2.15-.08-.2-.36-1.02.08-2.12 0 0 .67-.21 2.2.82.64-.18 1.32-.27 2-.27.68 0 1.36.09 2 .27 1.53-1.04 2.2-.82 2.2-.82.44 1.1.16 1.92.08 2.12.51.56.82 1.27.82 2.15 0 3.07-1.87 3.75-3.65 3.95.29.25.54.73.54 1.48 0 1.07-.01 1.93-.01 2.2 0 .21.15.46.55.38A8.013 8.013 0 0 0 16 8c0-4.42-3.58-8-8-8z"></path>
+                </svg>
+            """,  # noqa: E501
+            "class": "",
+        },
+    ],
+}
+
+# -- Hack to get rid of stupid warnings from sphinx_autodoc_typehints --------
+
+
+class ShutupSphinxAutodocTypehintsFilter(logging.Filter):
+    def filter(self, record: logging.LogRecord) -> bool:
+        if "Cannot resolve forward reference" in record.msg:
+            return False
+        return True
+
+
+logging.getLogger("sphinx.sphinx_autodoc_typehints").addFilter(ShutupSphinxAutodocTypehintsFilter())

+ 27 - 0
docs/source/index.md

@@ -0,0 +1,27 @@
+# **olmocr**
+
+```{toctree}
+:maxdepth: 2
+:hidden:
+:caption: Getting started
+
+installation
+overview
+```
+
+```{toctree}
+:hidden:
+:caption: Development
+
+CHANGELOG
+CONTRIBUTING
+License <https://raw.githubusercontent.com/allenai/olmocr/main/LICENSE>
+GitHub Repository <https://github.com/allenai/olmocr>
+```
+
+## Indices and tables
+
+```{eval-rst}
+* :ref:`genindex`
+* :ref:`modindex`
+```

+ 27 - 0
docs/source/installation.md

@@ -0,0 +1,27 @@
+Installation
+============
+
+**olmocr** supports Python >= 3.8.
+
+## Installing with `pip`
+
+**olmocr** is available [on PyPI](https://pypi.org/project/olmocr/). Just run
+
+```bash
+pip install olmocr
+```
+
+## Installing from source
+
+To install **olmocr** from source, first clone [the repository](https://github.com/allenai/olmocr):
+
+```bash
+git clone https://github.com/allenai/olmocr.git
+cd olmocr
+```
+
+Then run
+
+```bash
+pip install -e .
+```

+ 3 - 0
docs/source/overview.md

@@ -0,0 +1,3 @@
+Overview
+========
+

File diffs are limited because there are too many
+ 140 - 0
dolma_previews/olmocr_workspace_job_1747418779_input_pdf.html


+ 154 - 0
dolma_previews/olmocr_workspace_job_1747418950_input_pdf.html


+ 118 - 0
dolma_previews/olmocr_workspace_job_1747419168_input_pdf.html


+ 118 - 0
dolma_previews/olmocr_workspace_job_1747419731_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747420686_input_pdf.html


+ 154 - 0
dolma_previews/olmocr_workspace_job_1747493567_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747493749_input_pdf.html


+ 130 - 0
dolma_previews/olmocr_workspace_job_1747493917_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747495273_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747495669_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747495750_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747496237_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747496349_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747496456_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747496736_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747496860_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747496960_input_pdf.html


+ 161 - 0
dolma_previews/olmocr_workspace_job_1747497084_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747497493_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747497590_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747534381_input_pdf.html


+ 154 - 0
dolma_previews/olmocr_workspace_job_1747534713_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747795702_input_pdf.html


+ 157 - 0
dolma_previews/olmocr_workspace_job_1747795907_input_pdf.html


+ 238 - 0
dolma_previews/olmocr_workspace_job_1753002204_input_pdf.html


+ 238 - 0
dolma_previews/olmocr_workspace_job_1753002474_input_pdf.html


+ 35 - 0
gantry-requirements.txt

@@ -0,0 +1,35 @@
+torchvision
+cached-path
+smart_open
+pypdf
+pypdfium2
+lingua-language-detector
+Pillow
+ruff
+mypy>=1.0,<1.5
+black>=23.0,<24.0
+isort>=5.12,<5.13
+pytest
+pytest-sphinx
+pytest-cov
+twine>=1.11.0
+build
+setuptools
+wheel
+Sphinx>=4.3.0,<7.1.0
+furo==2023.7.26
+myst-parser>=1.0,<2.1
+sphinx-copybutton==0.5.2
+sphinx-autobuild==2021.3.14
+sphinx-autodoc-typehints==1.23.3
+packaging
+necessary
+accelerate>=0.34.2
+datasets==3.0.0
+peft
+wandb
+omegaconf
+s3fs
+transformers>=4.45.1
+bitsandbytes
+ftfy

+ 0 - 0
localworkspace/results/output_03f19a67ca1619f854740bd806a32d7112c3c315.jsonl


+ 0 - 0
localworkspace/results/output_0640d37e5d5afe1fb4a4e053d7d3389e927e5bf7.jsonl


+ 0 - 0
localworkspace/results/output_06798e8f7cc26525f138f26354ffab7c63074f2c.jsonl


+ 0 - 0
localworkspace/results/output_0c3e9a89b35c3045b6a67f7cd5c06009a31d750f.jsonl


+ 0 - 0
localworkspace/results/output_10dc5d29c3f17870daf918c9555cd0b939acbffe.jsonl


+ 0 - 0
localworkspace/results/output_1cbf4da516b0dca0de138db476a8a65d2dbc5aab.jsonl


+ 0 - 0
localworkspace/results/output_21ee5d5d32535bcacd750ef2dace24b98fa42fdb.jsonl


+ 0 - 0
localworkspace/results/output_225426c1e59a9bf843a4d1088c3c98aa0321642c.jsonl


+ 0 - 0
localworkspace/results/output_24809642f1ed21aee754e7c58d350b261d121212.jsonl


+ 0 - 0
localworkspace/results/output_2b4bbfbba141c9173ab5abba31f4a4c140a0fd85.jsonl


+ 0 - 0
localworkspace/results/output_2ff00bac5e9500c24956e5386f6e7a49b2b55098.jsonl


+ 0 - 0
localworkspace/results/output_398aeb9cc239880a7222603994af5c4016796381.jsonl


+ 0 - 0
localworkspace/results/output_5da3510f60e4d62bb38dbf36fb90d4a0034727fa.jsonl


+ 0 - 0
localworkspace/results/output_662cdaa711447efb75b7c325ea177326afc2747b.jsonl


+ 0 - 0
localworkspace/results/output_7815bd6305410d3cbbea8287ed60dae1462e6e65.jsonl


+ 0 - 0
localworkspace/results/output_7e7415b1a884dd4b422626d1f93cc9d5ff33301c.jsonl


+ 0 - 0
localworkspace/results/output_8450bc4e95932e232e795c885ec59ab601993cab.jsonl


+ 0 - 0
localworkspace/results/output_95eb6113ad117cc5bc5c734f7ca31625e117229d.jsonl


+ 0 - 0
localworkspace/results/output_9face5eb793573e747789b627bf1cc4b334b5b93.jsonl


+ 0 - 0
localworkspace/results/output_a516ff5c967066055babccbea12ff6a88bdfe9b5.jsonl


+ 0 - 0
localworkspace/results/output_a7cda58bb6cdd49b7ffd2f6d48a871b4e1da7e62.jsonl


+ 0 - 0
localworkspace/results/output_aef98857329873e434b4b835531b5abd2cfca622.jsonl


+ 0 - 0
localworkspace/results/output_b3152b4cd8ddb87e2ad8e5fbf7906815031ce44f.jsonl


+ 0 - 0
localworkspace/results/output_c07c41e4c78e5049d035d0059223ac0adc60be49.jsonl


+ 0 - 0
localworkspace/results/output_c1e2b4f5c6c4bb6407c21dcae6a8dccdc2ad0e74.jsonl


+ 0 - 0
localworkspace/results/output_d0cf1cf8644fafcb025a313b4bec083ea97e8c8d.jsonl


+ 0 - 0
localworkspace/results/output_dbac13d5d8d14af821606b2b6fcec79288c911ad.jsonl


+ 0 - 0
localworkspace/results/output_e4811c9442eb8e0a3b6177e544c95e0299d41166.jsonl


+ 0 - 0
localworkspace/results/output_f5bd195da84dc4c9a132080ffb1a40239bb6d12b.jsonl


+ 0 - 0
localworkspace/results/output_f89f7b1c93bc7bae613c7002942c0c65ba3a03d7.jsonl


BIN
localworkspace/work_index_list.csv.zstd


+ 40 - 0
olmocr-pipeline-debug.log


+ 511 - 0
olmocr.egg-info/PKG-INFO

@@ -0,0 +1,511 @@
+Metadata-Version: 2.4
+Name: olmocr
+Version: 0.1.67
+Author-email: Allen Institute for Artificial Intelligence <jakep@allenai.org>
+License:                                  Apache License
+                                   Version 2.0, January 2004
+                                https://www.apache.org/licenses/
+        
+           TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+        
+           1. Definitions.
+        
+              "License" shall mean the terms and conditions for use, reproduction,
+              and distribution as defined by Sections 1 through 9 of this document.
+        
+              "Licensor" shall mean the copyright owner or entity authorized by
+              the copyright owner that is granting the License.
+        
+              "Legal Entity" shall mean the union of the acting entity and all
+              other entities that control, are controlled by, or are under common
+              control with that entity. For the purposes of this definition,
+              "control" means (i) the power, direct or indirect, to cause the
+              direction or management of such entity, whether by contract or
+              otherwise, or (ii) ownership of fifty percent (50%) or more of the
+              outstanding shares, or (iii) beneficial ownership of such entity.
+        
+              "You" (or "Your") shall mean an individual or Legal Entity
+              exercising permissions granted by this License.
+        
+              "Source" form shall mean the preferred form for making modifications,
+              including but not limited to software source code, documentation
+              source, and configuration files.
+        
+              "Object" form shall mean any form resulting from mechanical
+              transformation or translation of a Source form, including but
+              not limited to compiled object code, generated documentation,
+              and conversions to other media types.
+        
+              "Work" shall mean the work of authorship, whether in Source or
+              Object form, made available under the License, as indicated by a
+              copyright notice that is included in or attached to the work
+              (an example is provided in the Appendix below).
+        
+              "Derivative Works" shall mean any work, whether in Source or Object
+              form, that is based on (or derived from) the Work and for which the
+              editorial revisions, annotations, elaborations, or other modifications
+              represent, as a whole, an original work of authorship. For the purposes
+              of this License, Derivative Works shall not include works that remain
+              separable from, or merely link (or bind by name) to the interfaces of,
+              the Work and Derivative Works thereof.
+        
+              "Contribution" shall mean any work of authorship, including
+              the original version of the Work and any modifications or additions
+              to that Work or Derivative Works thereof, that is intentionally
+              submitted to Licensor for inclusion in the Work by the copyright owner
+              or by an individual or Legal Entity authorized to submit on behalf of
+              the copyright owner. For the purposes of this definition, "submitted"
+              means any form of electronic, verbal, or written communication sent
+              to the Licensor or its representatives, including but not limited to
+              communication on electronic mailing lists, source code control systems,
+              and issue tracking systems that are managed by, or on behalf of, the
+              Licensor for the purpose of discussing and improving the Work, but
+              excluding communication that is conspicuously marked or otherwise
+              designated in writing by the copyright owner as "Not a Contribution."
+        
+              "Contributor" shall mean Licensor and any individual or Legal Entity
+              on behalf of whom a Contribution has been received by Licensor and
+              subsequently incorporated within the Work.
+        
+           2. Grant of Copyright License. Subject to the terms and conditions of
+              this License, each Contributor hereby grants to You a perpetual,
+              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+              copyright license to reproduce, prepare Derivative Works of,
+              publicly display, publicly perform, sublicense, and distribute the
+              Work and such Derivative Works in Source or Object form.
+        
+           3. Grant of Patent License. Subject to the terms and conditions of
+              this License, each Contributor hereby grants to You a perpetual,
+              worldwide, non-exclusive, no-charge, royalty-free, irrevocable
+              (except as stated in this section) patent license to make, have made,
+              use, offer to sell, sell, import, and otherwise transfer the Work,
+              where such license applies only to those patent claims licensable
+              by such Contributor that are necessarily infringed by their
+              Contribution(s) alone or by combination of their Contribution(s)
+              with the Work to which such Contribution(s) was submitted. If You
+              institute patent litigation against any entity (including a
+              cross-claim or counterclaim in a lawsuit) alleging that the Work
+              or a Contribution incorporated within the Work constitutes direct
+              or contributory patent infringement, then any patent licenses
+              granted to You under this License for that Work shall terminate
+              as of the date such litigation is filed.
+        
+           4. Redistribution. You may reproduce and distribute copies of the
+              Work or Derivative Works thereof in any medium, with or without
+              modifications, and in Source or Object form, provided that You
+              meet the following conditions:
+        
+              (a) You must give any other recipients of the Work or
+                  Derivative Works a copy of this License; and
+        
+              (b) You must cause any modified files to carry prominent notices
+                  stating that You changed the files; and
+        
+              (c) You must retain, in the Source form of any Derivative Works
+                  that You distribute, all copyright, patent, trademark, and
+                  attribution notices from the Source form of the Work,
+                  excluding those notices that do not pertain to any part of
+                  the Derivative Works; and
+        
+              (d) If the Work includes a "NOTICE" text file as part of its
+                  distribution, then any Derivative Works that You distribute must
+                  include a readable copy of the attribution notices contained
+                  within such NOTICE file, excluding those notices that do not
+                  pertain to any part of the Derivative Works, in at least one
+                  of the following places: within a NOTICE text file distributed
+                  as part of the Derivative Works; within the Source form or
+                  documentation, if provided along with the Derivative Works; or,
+                  within a display generated by the Derivative Works, if and
+                  wherever such third-party notices normally appear. The contents
+                  of the NOTICE file are for informational purposes only and
+                  do not modify the License. You may add Your own attribution
+                  notices within Derivative Works that You distribute, alongside
+                  or as an addendum to the NOTICE text from the Work, provided
+                  that such additional attribution notices cannot be construed
+                  as modifying the License.
+        
+              You may add Your own copyright statement to Your modifications and
+              may provide additional or different license terms and conditions
+              for use, reproduction, or distribution of Your modifications, or
+              for any such Derivative Works as a whole, provided Your use,
+              reproduction, and distribution of the Work otherwise complies with
+              the conditions stated in this License.
+        
+           5. Submission of Contributions. Unless You explicitly state otherwise,
+              any Contribution intentionally submitted for inclusion in the Work
+              by You to the Licensor shall be under the terms and conditions of
+              this License, without any additional terms or conditions.
+              Notwithstanding the above, nothing herein shall supersede or modify
+              the terms of any separate license agreement you may have executed
+              with Licensor regarding such Contributions.
+        
+           6. Trademarks. This License does not grant permission to use the trade
+              names, trademarks, service marks, or product names of the Licensor,
+              except as required for reasonable and customary use in describing the
+              origin of the Work and reproducing the content of the NOTICE file.
+        
+           7. Disclaimer of Warranty. Unless required by applicable law or
+              agreed to in writing, Licensor provides the Work (and each
+              Contributor provides its Contributions) on an "AS IS" BASIS,
+              WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
+              implied, including, without limitation, any warranties or conditions
+              of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
+              PARTICULAR PURPOSE. You are solely responsible for determining the
+              appropriateness of using or redistributing the Work and assume any
+              risks associated with Your exercise of permissions under this License.
+        
+           8. Limitation of Liability. In no event and under no legal theory,
+              whether in tort (including negligence), contract, or otherwise,
+              unless required by applicable law (such as deliberate and grossly
+              negligent acts) or agreed to in writing, shall any Contributor be
+              liable to You for damages, including any direct, indirect, special,
+              incidental, or consequential damages of any character arising as a
+              result of this License or out of the use or inability to use the
+              Work (including but not limited to damages for loss of goodwill,
+              work stoppage, computer failure or malfunction, or any and all
+              other commercial damages or losses), even if such Contributor
+              has been advised of the possibility of such damages.
+        
+           9. Accepting Warranty or Additional Liability. While redistributing
+              the Work or Derivative Works thereof, You may choose to offer,
+              and charge a fee for, acceptance of support, warranty, indemnity,
+              or other liability obligations and/or rights consistent with this
+              License. However, in accepting such obligations, You may act only
+              on Your own behalf and on Your sole responsibility, not on behalf
+              of any other Contributor, and only if You agree to indemnify,
+              defend, and hold each Contributor harmless for any liability
+              incurred by, or claims asserted against, such Contributor by reason
+              of your accepting any such warranty or additional liability.
+        
+           END OF TERMS AND CONDITIONS
+        
+           APPENDIX: How to apply the Apache License to your work.
+        
+              To apply the Apache License to your work, attach the following
+              boilerplate notice, with the fields enclosed by brackets "{}"
+              replaced with your own identifying information. (Don't include
+              the brackets!)  The text should be enclosed in the appropriate
+              comment syntax for the file format. We also recommend that a
+              file or class name and description of purpose be included on the
+              same "printed page" as the copyright notice for easier
+              identification within third-party archives.
+        
+           Copyright {yyyy} {name of copyright owner}
+        
+           Licensed under the Apache License, Version 2.0 (the "License");
+           you may not use this file except in compliance with the License.
+           You may obtain a copy of the License at
+        
+               https://www.apache.org/licenses/LICENSE-2.0
+        
+           Unless required by applicable law or agreed to in writing, software
+           distributed under the License is distributed on an "AS IS" BASIS,
+           WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+           See the License for the specific language governing permissions and
+           limitations under the License.
+        
+Project-URL: Homepage, https://github.com/allenai/olmocr
+Project-URL: Repository, https://github.com/allenai/olmocr
+Project-URL: Changelog, https://github.com/allenai/olmocr/blob/main/CHANGELOG.md
+Classifier: Intended Audience :: Science/Research
+Classifier: Development Status :: 3 - Alpha
+Classifier: License :: OSI Approved :: Apache Software License
+Classifier: Programming Language :: Python :: 3
+Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
+Requires-Python: >=3.11
+Description-Content-Type: text/markdown
+License-File: LICENSE
+Requires-Dist: cached-path
+Requires-Dist: smart_open
+Requires-Dist: pypdf>=5.2.0
+Requires-Dist: pypdfium2
+Requires-Dist: cryptography
+Requires-Dist: lingua-language-detector
+Requires-Dist: Pillow
+Requires-Dist: ftfy
+Requires-Dist: bleach
+Requires-Dist: markdown2
+Requires-Dist: filelock
+Requires-Dist: orjson
+Requires-Dist: requests
+Requires-Dist: zstandard
+Requires-Dist: boto3
+Requires-Dist: httpx
+Requires-Dist: torch>=2.5.1
+Requires-Dist: transformers==4.46.2
+Requires-Dist: img2pdf
+Requires-Dist: beaker-py
+Provides-Extra: gpu
+Requires-Dist: sgl-kernel==0.0.3.post1; extra == "gpu"
+Requires-Dist: sglang[all]==0.4.2; extra == "gpu"
+Provides-Extra: dev
+Requires-Dist: ruff; extra == "dev"
+Requires-Dist: mypy; extra == "dev"
+Requires-Dist: black; extra == "dev"
+Requires-Dist: isort; extra == "dev"
+Requires-Dist: pytest; extra == "dev"
+Requires-Dist: pytest-sphinx; extra == "dev"
+Requires-Dist: pytest-cov; extra == "dev"
+Requires-Dist: twine>=1.11.0; extra == "dev"
+Requires-Dist: build; extra == "dev"
+Requires-Dist: setuptools; extra == "dev"
+Requires-Dist: wheel; extra == "dev"
+Requires-Dist: Sphinx<7.1.0,>=4.3.0; extra == "dev"
+Requires-Dist: furo==2023.7.26; extra == "dev"
+Requires-Dist: myst-parser<2.1,>=1.0; extra == "dev"
+Requires-Dist: sphinx-copybutton==0.5.2; extra == "dev"
+Requires-Dist: sphinx-autobuild==2021.3.14; extra == "dev"
+Requires-Dist: sphinx-autodoc-typehints==1.23.3; extra == "dev"
+Requires-Dist: packaging; extra == "dev"
+Requires-Dist: necessary; extra == "dev"
+Requires-Dist: peft; extra == "dev"
+Requires-Dist: datasets; extra == "dev"
+Requires-Dist: omegaconf; extra == "dev"
+Requires-Dist: spacy; extra == "dev"
+Provides-Extra: bench
+Requires-Dist: tinyhost; extra == "bench"
+Requires-Dist: fuzzysearch; extra == "bench"
+Requires-Dist: rapidfuzz; extra == "bench"
+Requires-Dist: sequence_align; extra == "bench"
+Requires-Dist: syntok; extra == "bench"
+Requires-Dist: openai; extra == "bench"
+Requires-Dist: google-genai; extra == "bench"
+Requires-Dist: playwright; extra == "bench"
+Requires-Dist: mistralai; extra == "bench"
+Requires-Dist: lxml; extra == "bench"
+Requires-Dist: flask; extra == "bench"
+Provides-Extra: train
+Requires-Dist: torch; extra == "train"
+Requires-Dist: torchvision; extra == "train"
+Requires-Dist: accelerate; extra == "train"
+Requires-Dist: datasets; extra == "train"
+Requires-Dist: peft; extra == "train"
+Requires-Dist: wandb; extra == "train"
+Requires-Dist: omegaconf; extra == "train"
+Requires-Dist: s3fs; extra == "train"
+Requires-Dist: necessary; extra == "train"
+Requires-Dist: einops; extra == "train"
+Requires-Dist: transformers>=4.45.1; extra == "train"
+Provides-Extra: elo
+Requires-Dist: numpy; extra == "elo"
+Requires-Dist: scipy; extra == "elo"
+Requires-Dist: pandas; extra == "elo"
+Requires-Dist: matplotlib; extra == "elo"
+Dynamic: license-file
+
+<div align="center">
+  <!-- <img src="https://github.com/allenai/OLMo/assets/8812459/774ac485-a535-4768-8f7c-db7be20f5cc3" width="300"/> -->
+<img src="https://github.com/user-attachments/assets/d70c8644-3e64-4230-98c3-c52fddaeccb6" alt="olmOCR Logo" width="300"/>
+<br/>
+  <br>
+  <h1>olmOCR</h1>
+</div>
+<p align="center">
+  <a href="https://github.com/allenai/OLMo/blob/main/LICENSE">
+    <img alt="GitHub License" src="https://img.shields.io/github/license/allenai/OLMo">
+  </a>
+  <a href="https://github.com/allenai/olmocr/releases">
+    <img alt="GitHub release" src="https://img.shields.io/github/release/allenai/olmocr.svg">
+  </a>
+  <a href="https://olmocr.allenai.org/papers/olmocr.pdf">
+    <img alt="Tech Report" src="https://img.shields.io/badge/Paper-olmOCR-blue">
+  </a>
+  <a href="https://olmocr.allenai.org">
+    <img alt="Demo" src="https://img.shields.io/badge/Ai2-Demo-F0529C">
+  </a>
+  <a href="https://discord.gg/sZq3jTNVNG">
+    <img alt="Discord" src="https://img.shields.io/badge/Discord%20-%20blue?style=flat&logo=discord&label=Ai2&color=%235B65E9">
+  </a>
+</p>
+
+A toolkit for training language models to work with PDF documents in the wild.
+
+Try the online demo: [https://olmocr.allenai.org/](https://olmocr.allenai.org/)
+
+What is included:
+ - A prompting strategy to get really good natural text parsing using ChatGPT 4o - [buildsilver.py](https://github.com/allenai/olmocr/blob/main/olmocr/data/buildsilver.py)
+ - A side-by-side eval toolkit for comparing different pipeline versions - [runeval.py](https://github.com/allenai/olmocr/blob/main/olmocr/eval/runeval.py)
+ - Basic filtering by language and SEO spam removal - [filter.py](https://github.com/allenai/olmocr/blob/main/olmocr/filter/filter.py)
+ - Finetuning code for Qwen2-VL and Molmo-O - [train.py](https://github.com/allenai/olmocr/blob/main/olmocr/train/train.py)
+ - Processing millions of PDFs through a finetuned model using Sglang - [pipeline.py](https://github.com/allenai/olmocr/blob/main/olmocr/pipeline.py)
+ - Viewing [Dolma docs](https://github.com/allenai/dolma) created from PDFs - [dolmaviewer.py](https://github.com/allenai/olmocr/blob/main/olmocr/viewer/dolmaviewer.py)
+
+### Installation
+
+Requirements:
+ - Recent NVIDIA GPU (tested on RTX 4090, L40S, A100, H100) with at least 20 GB of GPU RAM
+ - 30GB of free disk space
+
+You will need to install poppler-utils and additional fonts for rendering PDF images.
+
+Install dependencies (Ubuntu/Debian)
+```bash
+sudo apt-get update
+sudo apt-get install poppler-utils ttf-mscorefonts-installer msttcorefonts fonts-crosextra-caladea fonts-crosextra-carlito gsfonts lcdf-typetools
+```
+
+Set up a conda environment and install olmocr
+```bash
+conda create -n olmocr python=3.11
+conda activate olmocr
+
+git clone https://github.com/allenai/olmocr.git
+cd olmocr
+
+# For CPU-only operations, ex. running benchmarks
+pip install -e .
+
+# For actually converting the files with your own GPU
+pip install -e .[gpu] --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
+```
+
+### Local Usage Example
+
+For quick testing, try the [web demo](https://olmocr.allen.ai/). To run locally, a GPU is required, as inference is powered by [sglang](https://github.com/sgl-project/sglang) under the hood.
+Convert a Single PDF:
+```bash
+python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/horribleocr.pdf
+```
+
+Convert an Image file:
+```bash
+python -m olmocr.pipeline ./localworkspace --pdfs random_page.png
+```
+
+Convert Multiple PDFs:
+```bash
+python -m olmocr.pipeline ./localworkspace --pdfs tests/gnarly_pdfs/*.pdf
+```
+Results will be stored as JSON in `./localworkspace`.
+
+#### Viewing Results
+
+Extracted text is stored as [Dolma](https://github.com/allenai/dolma)-style JSONL inside of the `./localworkspace/results` directory.
+
+```bash
+cat localworkspace/results/output_*.jsonl
+```
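As a rough sketch of working with these files programmatically (the exact record schema may vary between versions; `id`, `text`, and `metadata` are assumed field names here), the Dolma-style JSONL can be loaded with a few lines of Python:

```python
import json
import os
import tempfile

def load_dolma_records(path):
    """Read a Dolma-style JSONL file: one JSON object per line."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Tiny self-contained demo using an inline record instead of a real results file.
sample = '{"id": "doc1", "text": "Page one text", "metadata": {"Source-File": "horribleocr.pdf"}}\n'
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as tmp:
    tmp.write(sample)
records = load_dolma_records(tmp.name)
os.unlink(tmp.name)
print(records[0]["text"])  # -> Page one text
```

In practice you would point `load_dolma_records` at `localworkspace/results/output_*.jsonl` and iterate over the `text` field of each record.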
+
+View results side-by-side with the original PDFs (uses `dolmaviewer` command):
+
+```bash
+python -m olmocr.viewer.dolmaviewer localworkspace/results/output_*.jsonl
+```
+
+Now open `./dolma_previews/tests_gnarly_pdfs_horribleocr_pdf.html` in your favorite browser.
+
+![image](https://github.com/user-attachments/assets/128922d1-63e6-4d34-84f2-d7901237da1f)
+
+
+### Multi-node / Cluster Usage
+
+If you want to convert millions of PDFs across multiple nodes running in parallel, olmOCR
+supports reading your PDFs from AWS S3 and coordinating the work through an S3 output bucket.
+
+For example, you can start this command on your first worker node, and it will set up
+a simple work queue in your AWS bucket and start converting PDFs.
+
+```bash
+python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf
+```
+
+Now on any subsequent nodes, just run this and they will start grabbing items from the same workspace queue.
+```bash
+python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace
+```
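The coordination pattern behind this is essentially a claim-then-process loop over a shared work queue. The following is an illustrative in-memory sketch only, not the actual olmOCR implementation (which lives in `olmocr/work_queue.py` and uses S3 objects as the shared state):

```python
import random

class WorkQueue:
    """Toy model of a shared work queue: each worker claims an unclaimed
    item, processes it, then marks it done. In olmOCR the shared state
    would be objects in an S3 bucket rather than this in-memory dict."""

    def __init__(self, items):
        self.state = {item: "pending" for item in items}

    def claim(self):
        pending = [k for k, v in self.state.items() if v == "pending"]
        if not pending:
            return None
        item = random.choice(pending)
        self.state[item] = "claimed"  # with S3, a lock object would be written here
        return item

    def complete(self, item):
        self.state[item] = "done"

queue = WorkQueue(["group_0001", "group_0002", "group_0003"])
while (item := queue.claim()) is not None:
    queue.complete(item)  # a real worker would render and convert the PDFs here

print(all(v == "done" for v in queue.state.values()))  # -> True
```

Because each worker only claims pending items, any number of nodes can join the same workspace without stepping on each other's work.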
+
+If you are at Ai2 and want to linearize millions of PDFs efficiently using [beaker](https://www.beaker.org), just add the `--beaker`
+flag. This will prepare the workspace on your local machine, and then launch N GPU workers in the cluster to start
+converting PDFs.
+
+For example:
+```bash
+python -m olmocr.pipeline s3://my_s3_bucket/pdfworkspaces/exampleworkspace --pdfs s3://my_s3_bucket/jakep/gnarly_pdfs/*.pdf --beaker --beaker_gpus 4
+```
+
+### Full documentation for the pipeline
+
+```bash
+python -m olmocr.pipeline --help
+usage: pipeline.py [-h] [--pdfs PDFS] [--workspace_profile WORKSPACE_PROFILE] [--pdf_profile PDF_PROFILE] [--pages_per_group PAGES_PER_GROUP]
+                   [--max_page_retries MAX_PAGE_RETRIES] [--max_page_error_rate MAX_PAGE_ERROR_RATE] [--workers WORKERS] [--apply_filter] [--stats] [--model MODEL]
+                   [--model_max_context MODEL_MAX_CONTEXT] [--model_chat_template MODEL_CHAT_TEMPLATE] [--target_longest_image_dim TARGET_LONGEST_IMAGE_DIM]
+                   [--target_anchor_text_len TARGET_ANCHOR_TEXT_LEN] [--beaker] [--beaker_workspace BEAKER_WORKSPACE] [--beaker_cluster BEAKER_CLUSTER]
+                   [--beaker_gpus BEAKER_GPUS] [--beaker_priority BEAKER_PRIORITY]
+                   workspace
+
+Manager for running millions of PDFs through a batch inference pipeline
+
+positional arguments:
+  workspace             The filesystem path where work will be stored, can be a local folder, or an s3 path if coordinating work with many workers, s3://bucket/prefix/
+
+options:
+  -h, --help            show this help message and exit
+  --pdfs PDFS           Path to add pdfs stored in s3 to the workspace, can be a glob path s3://bucket/prefix/*.pdf or path to file containing list of pdf paths
+  --workspace_profile WORKSPACE_PROFILE
+                        S3 configuration profile for accessing the workspace
+  --pdf_profile PDF_PROFILE
+                        S3 configuration profile for accessing the raw pdf documents
+  --pages_per_group PAGES_PER_GROUP
+                        Aiming for this many pdf pages per work item group
+  --max_page_retries MAX_PAGE_RETRIES
+                        Max number of times we will retry rendering a page
+  --max_page_error_rate MAX_PAGE_ERROR_RATE
+                        Rate of allowable failed pages in a document, 1/250 by default
+  --workers WORKERS     Number of workers to run at a time
+  --apply_filter        Apply basic filtering to English pdfs which are not forms, and not likely seo spam
+  --stats               Instead of running any job, reports some statistics about the current workspace
+  --model MODEL         List of paths where you can find the model to convert this pdf. You can specify several different paths here, and the script will try to use the
+                        one which is fastest to access
+  --model_max_context MODEL_MAX_CONTEXT
+                        Maximum context length that the model was fine tuned under
+  --model_chat_template MODEL_CHAT_TEMPLATE
+                        Chat template to pass to sglang server
+  --target_longest_image_dim TARGET_LONGEST_IMAGE_DIM
+                        Dimension on longest side to use for rendering the pdf pages
+  --target_anchor_text_len TARGET_ANCHOR_TEXT_LEN
+                        Maximum amount of anchor text to use (characters)
+  --beaker              Submit this job to beaker instead of running locally
+  --beaker_workspace BEAKER_WORKSPACE
+                        Beaker workspace to submit to
+  --beaker_cluster BEAKER_CLUSTER
+                        Beaker clusters you want to run on
+  --beaker_gpus BEAKER_GPUS
+                        Number of gpu replicas to run
+  --beaker_priority BEAKER_PRIORITY
+                        Beaker priority level for the job
+```
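For instance, the `--max_page_error_rate` option (1/250 by default) means a document is only rejected when its share of failed pages exceeds that rate. A hypothetical re-implementation of that check, for illustration only:

```python
def document_passes(failed_pages, total_pages, max_page_error_rate=1 / 250):
    """Return True if the document's failed-page rate is within the allowed
    threshold (hypothetical reconstruction of the pipeline's check)."""
    if total_pages == 0:
        return True
    return failed_pages / total_pages <= max_page_error_rate

print(document_passes(0, 500))  # -> True
print(document_passes(2, 500))  # -> True  (2/500 == 1/250, right at the limit)
print(document_passes(3, 500))  # -> False (3/500 > 1/250)
```

So with the default setting, a 500-page document tolerates up to 2 failed pages before being dropped.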
+
+
+## Team
+
+<!-- start team -->
+
+**olmOCR** is developed and maintained by the AllenNLP team, backed by [the Allen Institute for Artificial Intelligence (AI2)](https://allenai.org/).
+AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.
+To learn more about who specifically contributed to this codebase, see [our contributors](https://github.com/allenai/olmocr/graphs/contributors) page.
+
+<!-- end team -->
+
+## License
+
+<!-- start license -->
+
+**olmOCR** is licensed under [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).
+A full copy of the license can be found [on GitHub](https://github.com/allenai/olmocr/blob/main/LICENSE).
+
+<!-- end license -->
+
+## Citing
+
+```bibtex
+@misc{olmocr,
+      title={{olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models}},
+      author={Jake Poznanski and Jon Borchardt and Jason Dunkelberger and Regan Huff and Daniel Lin and Aman Rangapur and Christopher Wilhelm and Kyle Lo and Luca Soldaini},
+      year={2025},
+      eprint={2502.18443},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2502.18443},
+}
+```

+ 131 - 0
olmocr.egg-info/SOURCES.txt

@@ -0,0 +1,131 @@
+LICENSE
+README.md
+pyproject.toml
+olmocr/__init__.py
+olmocr/check.py
+olmocr/datatypes.py
+olmocr/image_utils.py
+olmocr/loadertest.py
+olmocr/metrics.py
+olmocr/pipeline.py
+olmocr/py.typed
+olmocr/repeatdetect.py
+olmocr/s3_utils.py
+olmocr/version.py
+olmocr/work_queue.py
+olmocr.egg-info/PKG-INFO
+olmocr.egg-info/SOURCES.txt
+olmocr.egg-info/dependency_links.txt
+olmocr.egg-info/requires.txt
+olmocr.egg-info/top_level.txt
+olmocr/bench/__init__.py
+olmocr/bench/benchmark.py
+olmocr/bench/convert.py
+olmocr/bench/prompts.py
+olmocr/bench/report.py
+olmocr/bench/review_app.py
+olmocr/bench/review_app_latex.py
+olmocr/bench/tests.py
+olmocr/bench/utils.py
+olmocr/bench/katex/__init__.py
+olmocr/bench/katex/render.py
+olmocr/bench/miners/check_headers_footers.py
+olmocr/bench/miners/check_multicolumn.py
+olmocr/bench/miners/check_old_scans_math.py
+olmocr/bench/miners/cleanup_data.py
+olmocr/bench/miners/cleanup_urls.py
+olmocr/bench/miners/delete_rejected.py
+olmocr/bench/miners/download_math.py
+olmocr/bench/miners/mine_diffs.py
+olmocr/bench/miners/mine_headers_footers.py
+olmocr/bench/miners/mine_long_tiny_text.py
+olmocr/bench/miners/mine_math.py
+olmocr/bench/miners/mine_multi_column.py
+olmocr/bench/miners/mine_old_scan_pdf.py
+olmocr/bench/miners/mine_old_scans.py
+olmocr/bench/miners/mine_old_scans_math.py
+olmocr/bench/miners/mine_reading_order.py
+olmocr/bench/miners/mine_tables_gemini.py
+olmocr/bench/miners/mine_tables_gpt.py
+olmocr/bench/miners/pick_mediod.py
+olmocr/bench/runners/__init__.py
+olmocr/bench/runners/run_chatgpt.py
+olmocr/bench/runners/run_claude.py
+olmocr/bench/runners/run_docling.py
+olmocr/bench/runners/run_gemini.py
+olmocr/bench/runners/run_gotocr.py
+olmocr/bench/runners/run_marker.py
+olmocr/bench/runners/run_mineru.py
+olmocr/bench/runners/run_mistral.py
+olmocr/bench/runners/run_olmocr_pipeline.py
+olmocr/bench/runners/run_rolmocr.py
+olmocr/bench/runners/run_server.py
+olmocr/bench/runners/run_transformers.py
+olmocr/bench/scripts/difference_viewer.py
+olmocr/bench/scripts/run_difference.py
+olmocr/bench/scripts/url_matcher.py
+olmocr/bench/synth/__init__.py
+olmocr/bench/synth/mine_html_templates.py
+olmocr/bench/synth/test_mine.py
+olmocr/data/__init__.py
+olmocr/data/buildsilver.py
+olmocr/data/buildsilverdatasummary.py
+olmocr/data/buildtestset.py
+olmocr/data/convertsilver_birr.py
+olmocr/data/convertsilver_openai.py
+olmocr/data/renderpdf.py
+olmocr/data/runopenaibatch.py
+olmocr/eval/__init__.py
+olmocr/eval/buildelo.py
+olmocr/eval/evalhtml.py
+olmocr/eval/evalhtml_template.html
+olmocr/eval/runeval.py
+olmocr/eval/scoreelo.py
+olmocr/eval/dolma_refine/aligners.py
+olmocr/eval/dolma_refine/metrics.py
+olmocr/eval/dolma_refine/registry.py
+olmocr/eval/dolma_refine/segmenters.py
+olmocr/filter/__init__.py
+olmocr/filter/coherency.py
+olmocr/filter/filter.py
+olmocr/prompts/__init__.py
+olmocr/prompts/anchor.py
+olmocr/prompts/prompts.py
+olmocr/train/__init__.py
+olmocr/train/dataloader.py
+olmocr/train/dataprep.py
+olmocr/train/fixqwen25vlcheckpoint.py
+olmocr/train/inference.py
+olmocr/train/loaddataset.py
+olmocr/train/train.py
+olmocr/train/utils.py
+olmocr/train/core/__init__.py
+olmocr/train/core/adapters.py
+olmocr/train/core/cli.py
+olmocr/train/core/compression.py
+olmocr/train/core/config.py
+olmocr/train/core/errors.py
+olmocr/train/core/loggers.py
+olmocr/train/core/paths.py
+olmocr/train/core/state.py
+olmocr/train/hf/__init__.py
+olmocr/train/hf/convertjsontoparquet.py
+olmocr/train/hf/hfhub_upload.py
+olmocr/train/hf/warc_parser.py
+olmocr/train/molmo/__init__.py
+olmocr/train/molmo/config_molmo.py
+olmocr/train/molmo/image_processing_molmo.py
+olmocr/train/molmo/modeling_molmo.py
+olmocr/train/molmo/preprocessing_molmo.py
+olmocr/viewer/__init__.py
+olmocr/viewer/dolmaviewer.py
+olmocr/viewer/dolmaviewer_template.html
+tests/test_anchor.py
+tests/test_dataloader.py
+tests/test_dataprep.py
+tests/test_filter.py
+tests/test_integration.py
+tests/test_molmo.py
+tests/test_s3_work_queue.py
+tests/test_sglang.py
+tests/test_tests.py

+ 1 - 0
olmocr.egg-info/dependency_links.txt

@@ -0,0 +1 @@
+

+ 81 - 0
olmocr.egg-info/requires.txt

@@ -0,0 +1,81 @@
+cached-path
+smart_open
+pypdf>=5.2.0
+pypdfium2
+cryptography
+lingua-language-detector
+Pillow
+ftfy
+bleach
+markdown2
+filelock
+orjson
+requests
+zstandard
+boto3
+httpx
+torch>=2.5.1
+transformers==4.46.2
+img2pdf
+beaker-py
+
+[bench]
+tinyhost
+fuzzysearch
+rapidfuzz
+sequence_align
+syntok
+openai
+google-genai
+playwright
+mistralai
+lxml
+flask
+
+[dev]
+ruff
+mypy
+black
+isort
+pytest
+pytest-sphinx
+pytest-cov
+twine>=1.11.0
+build
+setuptools
+wheel
+Sphinx<7.1.0,>=4.3.0
+furo==2023.7.26
+myst-parser<2.1,>=1.0
+sphinx-copybutton==0.5.2
+sphinx-autobuild==2021.3.14
+sphinx-autodoc-typehints==1.23.3
+packaging
+necessary
+peft
+datasets
+omegaconf
+spacy
+
+[elo]
+numpy
+scipy
+pandas
+matplotlib
+
+[gpu]
+sgl-kernel==0.0.3.post1
+sglang[all]==0.4.2
+
+[train]
+torch
+torchvision
+accelerate
+datasets
+peft
+wandb
+omegaconf
+s3fs
+necessary
+einops
+transformers>=4.45.1

+ 6 - 0
olmocr.egg-info/top_level.txt

@@ -0,0 +1,6 @@
+dolma_previews
+localworkspace
+olmocr
+olmocr_workspace
+test_pdf
+workspace

+ 1 - 0
olmocr/__init__.py

@@ -0,0 +1 @@
+from .version import VERSION, VERSION_SHORT

BIN
olmocr/__pycache__/__init__.cpython-310.pyc


BIN
olmocr/__pycache__/__init__.cpython-311.pyc


BIN
olmocr/__pycache__/check.cpython-311.pyc


BIN
olmocr/__pycache__/image_utils.cpython-311.pyc


BIN
olmocr/__pycache__/metrics.cpython-311.pyc


BIN
olmocr/__pycache__/pipeline.cpython-310.pyc


BIN
olmocr/__pycache__/pipeline.cpython-311.pyc


BIN
olmocr/__pycache__/s3_utils.cpython-311.pyc


BIN
olmocr/__pycache__/version.cpython-310.pyc


BIN
olmocr/__pycache__/version.cpython-311.pyc


BIN
olmocr/__pycache__/work_queue.cpython-311.pyc


+ 116 - 0
olmocr/bench/README.md

@@ -0,0 +1,116 @@
+# olmOCR-Bench
+
+We developed olmOCR-Bench to automatically and effectively evaluate the document-level OCR quality of various tools.
+
+olmOCR-Bench works by testing various "facts" about document pages at the PDF level.
+Our intention is that each "fact" is very simple, unambiguous, and machine-checkable. For example, once your document has been OCRed, we may check that a particular sentence appears somewhere on the page.
+
+We stay away from soft metrics like edit-distance comparisons, because they can assign lower scores to parses that differ from the reference yet are still correct. For example, on a document containing multiple distinct articles, you want the text of each article to be grouped together, but the relative order of the two articles may not be critical. Conversely, some documents hinge on critical details, such as swapping x and y in an equation, which can make all the difference in understanding but would register as only a single-character edit under an edit-distance metric.
+
+olmOCR-bench operates directly on single-page PDFs. We make this choice because PDFs preserve some digital metadata and information which may be helpful to some OCR systems. Almost any other format can be converted to a PDF, but not the reverse, so we try to preserve the original documents where possible.
+
+## Benchmark Principles
+
+As we created olmOCR-bench, we also kept a few general rules in mind:
+
+- We expect your OCR system to output a plain-text Unicode document in a reading order that would be considered natural.
+- Documents from the benchmark should fit on a standard A4 piece of paper and still be readable to a human.
+- Markdown syntax is allowed, but ignored. Ex. if we are looking for the word "enlightenment" to appear on a page, and your system outputs "**\*\*enlightenment\*\***" in Markdown bold, that still counts. 
+- olmOCR-bench is not position sensitive, ex. we check that a sentence or math equation appears anywhere on a page. The exception to this is header/footer tests where we want to find simple page numbers appearing in the first or last few characters of a page.
+- Tables can be in either Markdown syntax, or as an html `<table>`.
+- Math equations must render with [KaTeX](https://katex.org/) and be delimited with $, $$, \\(, or \\[.
+- Math equations are not position sensitive either, so if we are checking for $ 3x^2 $ to appear on a page, then outputting $ \int_a^b{ 3x ^ 2dx} $ counts.
+- We normalize all Unicode to NFC before running the benchmark, so if your OCR model outputs é vs e + ◌́, either way should not affect your benchmark score.
+- We normalize all the different variants of hyphens to the ASCII -, all the variants of double quotes to ASCII ", and all variants of single quotes/apostrophes to ASCII '. You should score the same on the benchmark whether you output - or —.
+- All facts checked about documents are pass/fail. We want it to be very clear if your OCR system fails a test, and if so, what output would make it pass.
+
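The normalization rules above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code, and the character tables below are illustrative assumptions rather than the exhaustive sets the benchmark uses:

```python
import unicodedata

# Illustrative (not exhaustive) tables of hyphen/quote variants folded to ASCII.
HYPHENS = dict.fromkeys(map(ord, "\u2010\u2011\u2012\u2013\u2014\u2015"), "-")
DQUOTES = dict.fromkeys(map(ord, "\u201c\u201d\u201e\u00ab\u00bb"), '"')
SQUOTES = dict.fromkeys(map(ord, "\u2018\u2019\u201a\u201b"), "'")

def normalize(text: str) -> str:
    """NFC-normalize, then map hyphen and quote variants to their ASCII forms."""
    text = unicodedata.normalize("NFC", text)
    return text.translate({**HYPHENS, **DQUOTES, **SQUOTES})
```

Under these rules, `e + ◌́` and the precomposed `é` compare equal, and an em dash scores the same as a plain hyphen.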
+## olmOCR-Bench Fact classes
+
+- Text presence
+  - This task makes sure that a given small piece of text (ex. 1-3 sentence level) is present within
+    a parsed document. Soft/fuzzy matching is allowed, as well as specifying if the text must be in the first N or last N characters of the document. Case sensitive by default.
+- Text absence
+  - This task makes sure that a given piece of text does NOT appear in the OCR'ed version of a document. We generally want our OCR systems to filter out content like headers/footers/page numbers from documents. The same fuzzy matching as in Text Presence tests is allowed.
+- Natural Reading Order
+  - This task ensures that blocks of text which are present have a defined order relative to one another. For example,
+  on a document that contains multiple news articles on one page, you'd want to see that the first sentence of the 
+  first article appears after the heading of that article. But, you may be okay with swapping the order of those 
+  two articles.
+- Table Accuracy
+  - Both Markdown and HTML based tables are supported. These tests check that a cell with a given text exists somewhere in the table, and that its neighbors have certain properties. Ex. A cell exists on this page with text "4.5%" and above that is a cell with the text "2.4%"
+- Math Formula Accuracy
+  - We render a given LaTeX-style equation using KaTeX in a headless browser, then check whether it exists anywhere in the final OCRed document. Matching is performed at a relative symbol level, ex. in "\f\relax{x} = \int_{-\infty}^\infty x^2dx" we check that a ∫ appears to the left of an x, an x appears to the left of dx, etc...
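The table-accuracy relations above can be illustrated with a hypothetical helper (not the benchmark's implementation) that parses a Markdown table into a grid and looks up the cell directly above a given value:

```python
def parse_markdown_table(md: str) -> list[list[str]]:
    """Parse a simple Markdown table into a grid of cell strings."""
    grid = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if all(c and set(c) <= set(":- ") for c in cells):
            continue  # skip the |---|---| separator row
        grid.append(cells)
    return grid

def cell_above(grid: list[list[str]], value: str):
    """Return the text of the cell directly above the first cell equal to value."""
    for r, row in enumerate(grid):
        for c, cell in enumerate(row):
            if cell == value and r > 0 and c < len(grid[r - 1]):
                return grid[r - 1][c]
    return None
```

A test like "a cell with text 4.5% exists, and above it is a cell with text 2.4%" then reduces to checking `cell_above(grid, "4.5%") == "2.4%"`.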
+  
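The relative symbol-level matching idea can be sketched as follows. This is a simplified stand-in for the real matcher, which first renders equations with KaTeX before comparing symbols:

```python
def symbols_in_order(ocr_text: str, symbols: list[str]) -> bool:
    """Check that each symbol occurs in ocr_text, in left-to-right order."""
    pos = 0
    for sym in symbols:
        idx = ocr_text.find(sym, pos)
        if idx == -1:
            return False
        pos = idx + len(sym)
    return True
```

For the integral example above, we would check that `\int` appears before `x`, and `x` before `dx`, in the OCRed output.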
+## Downloading and running the benchmark
+
+Currently the full benchmark data is located here, but it's private until we are done reviewing and checking all of the tests:
+https://huggingface.co/datasets/allenai/olmOCR-bench
+
+To run a benchmark, first install the bench requirements
+```bash
+conda create -n olmocr python=3.11
+conda activate olmocr
+
+git clone https://github.com/allenai/olmocr.git
+cd olmocr
+
+pip install -e .[bench]
+
+# Now clone the benchmark data
+git clone https://huggingface.co/datasets/allenai/olmOCR-bench
+```
+
+Convert your documents
+```bash
+# convert using a single OCR-engine, see the olmocr/bench/runners directory for options
+python -m olmocr.bench.convert olmocr_pipeline --dir ./olmOCR-bench/bench_data
+
+# or use convert_all.sh to run OCR with many common frameworks all at once, API keys will be required
+./olmocr/bench/scripts/convert_all.sh
+```
+
+Now run the benchmark
+```bash
+python -m olmocr.bench.benchmark --dir ./olmOCR-bench/bench_data
+```
+
+## Previewing the benchmark questions
+
+We have an internal data annotation tool that can be used to review the questions in the benchmark, and make edits.
+
+<img width="700" alt="image" src="https://github.com/user-attachments/assets/dd24fd88-a642-4379-b5a1-9911717bf5b1" />
+
+
+```bash
+python -m olmocr.bench.review_app --port 5000 --debug ./olmOCR-bench/bench_data/multi_column.jsonl --force
+```
+
+## How the tests were made
+
+Several categories of tests have been made so far:
+1. arxiv_math -> We downloaded recent math papers from arXiv, filtered to those which had a single tex source file and a rendered PDF, using https://github.com/allenai/olmocr/blob/main/olmocr/bench/miners/download_math.py. Then we matched the text on a PDF page to the location in the tex source most likely to match it, using a dynamic programming matching algorithm in https://github.com/allenai/olmocr/blob/main/olmocr/bench/miners/mine_math.py. From there, LaTeX equations from the matching page were parsed out, and we checked that they rendered in KaTeX before adding them as test cases. We did a final quick manual scan over the data to remove any cases where the LaTeX parsing may have failed egregiously.
+2. headers_footers -> We sampled documents from our internal crawled PDF repository (the same from which olmOCR-mix was derived, though the likelihood of duplicates is low, as there are 200M+ PDFs in this set). Then we used [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO) to identify regions of the pages which were marked as headers/footers using the abandon category. We then got the text of those header/footer regions by extracting them and prompting Gemini, and we added them as test cases whose text should be absent. Manual review was then performed to remove mistakenly filtered text, and to set conditions such as limiting the search area to the first N or last N characters. Ex. if a page number "5" appears at the bottom of a page, you want to test that your OCR system does not output a "5" in the last 20 characters of the page, but "5" could appear earlier in the actual body text.
+3. table_tests -> We sampled documents from our internal crawled PDF repository and found those which had tables using gemini-flash-2.0 (https://github.com/allenai/olmocr/blob/main/olmocr/bench/miners/mine_tables_gemini.py). On pages that had tables, we then further asked gemini-flash-2.0 to tell us the relationships between randomly chosen cells. Those tests were then manually checked.
+4. multi_column -> We sampled documents from our internal crawled PDF repository manually, to find documents which had multi-column layouts and multiple articles on one page. Then, we used claude-sonnet-3.7 to render those pages to html, and from that html, we extracted text segments which were before/after one another. Then we manually reviewed each entry.
+5. old_scans -> We sampled documents from the Library of Congress which contained handwriting or typewritten text. Then we prioritized creating rules that check for reading order. (TODO)
+6. old_scans_math -> We found old math textbooks in the public domain on the Internet Archive. We then extracted random pages from them, OCRed them, filtered down to pages which contained equations, and picked several random equations from each page to use as test cases. We then manually checked each test case to verify that it accurately captured what was on the page.
+7. long_tiny_text -> We found documents from the Internet Archive which contained a large amount of dense small print on a single page. Ex. pages from a dictionary, or pages of references from an academic paper. We then generated test cases using an LLM, and verified them manually.
+
+
+## TODO List for release
+ - [X] Check all tests for duplicates
+ - [X] Make absence tests not case-sensitive by default
+ - [X] Check that we have URLs for all tests
+ - [X] Write a script to verify that all baseline tests that actually have weird unicodes have exemptions
+ - [X] Review math equations in old_scans_math.jsonl using chat gpt script
+ - [X] Add test category of long_texts which are still ~1 standard printed page, but with dense/small text
+ - [X] Review multicolumn_tests, make sure they are correct, clean, and don't have order tests between regions
+ - [X] Run automated check of multicolumn tests for: #1 sub/super scripts #2 max diffs calibrations #3 mixing across different distinct regions of text 
+ - [X] Remove [] and other special symbols from old_scans
+ - [X] Full review of old_scans, somehow, chatgpt or prolific
+ - [X] Adjust scoring to weight each test category equally in final score distribution
+ - [X] Double check marker inline math outputs
+ - [ ] Remove any PII documents
+ - [ ] Run against final set of comparison tools, and check list of all-pass and all-fail tests

+ 0 - 0
olmocr/bench/__init__.py


Some files were not shown because too many files changed