This is the first run of PressArchiver on this install. Choose a password for the global admin account (admin@local). You can change it later from Settings.
Reset your password
Enter the email address tied to your account. If it's registered we'll email you a reset link within a minute.
Choose a new password
Pick a new password for your PressArchiver account. The previous password will stop working immediately, and any other devices you were signed in on will be logged out.
⚠Verify your email —…. OCR and detection are blocked until you confirm.
?›Year?›Number?
All saved💰…
‹
Library
no image
↖Selection
Auto-Segmentation
Choose engine:
Manual Segmentation
Type
Clicks = points · Dbl/Enter = done · Esc = cancel
Band ±px
Straight separator
Cleanup
⚠Danger Zone
This action cannot be undone.
OCR Engine
Model
Baselines
Confidence
Symbols80%
Words80%
⚠Danger Zone
Regions and grouping are kept.
Manual Grouping
Ungrouped Regions: 0/0
Auto Grouping
AI Articles Detection
Model
—
Confidence
—
—
Draw lines · Enter = done · Esc = cancel
Auto Deskew
Choose Deskew Algorithm
Manual Deskew
Rotation: 0.00
Export
Border Width
px
Font Size
px
Visibility
‹
🗞
Open a JPG / PNG image
⚡Bulk Operations — 0 pages selected
BL · select
Line Width
px
Brush Size
px
Eraser Size
px
›
›
0
regions
Structure0 regions
Groups
Regions0
Baselines0
Tokens consumed:
· app ? · front ?
📐 Set Polygon Height…
📐 Set All Polygons Height…
Region type
↕️ Split horizontally
↔️ Split vertically
⊞ Merge to one region
✕ Delete
⊞ Merge into one group
↺ Ungroup
✂ Split into two groups
Press Archiver—
All newspapers
0selected
📚
Select a newspaper on the left
Or create a new one
📁 Library:—
Change library folder
The folder will be created if it doesn't exist. The current library stays on disk — the server will switch to the new one.
Confirm
New newspaper
Add New Issue
Add new year (volume)
If a number is filled, label defaults to that number and year_from / year_to are set to it. Custom label wins when both are present.
Add to Ground Truth Set
Assign a Year
Move pages to issue
Pick a target issue below. Issues are grouped by volume; the
"Unsorted" bucket holds issues without a year/volume assignment.
+ Create new issue and move here
Export validation report
Critical issues block a valid export. Medium / Warning are advisory.
Export settings
Include
PDF settings
Smaller files at lower resolution. A value larger than the original is ignored (no upscaling).
Folder names
Saved in this browser.
Export
Edit newspaper
ExampleItalic grey = example hint, not a saved value.
Basic info
Language & script
Identifiers
For DDB (German Newspaper Portal), at least one ZDB identifier is required.
Holding institution
Where the physical original is held — used for DFG-Viewer rights block.
Default rights / license
Applied to every issue unless overridden. For pre-1910 newspapers, Public Domain Mark is recommended.
NDNP (US / Library of Congress)
For Chronicling America (US) export only. The LCCN goes in the Identifiers section above (type LCCN). Leave blank for non-US titles.
Edit issue
ExampleItalic grey = example hint, not a saved value.
Identification
Printed page numbers
As printed on the paper — may differ from scan page count (e.g. "[1]" or "iii").
External links
Persistent URL is required by DDB (goes into dv:presentation / dv:reference).
Digitization provenance
Documents who and when produced the scan (for digiprovMD).
⬇
Upload pages
Newspaper:—
→ Direct upload to issue:
⬇
Drop a folder or archive here
Supported: JPG, PNG, TIFF, BMP. Drop or choose files to continue.
⬇Drop more files here, or…
📄 PDF render quality
Each PDF becomes one issue; every page is rendered to JPG at this resolution. Higher DPI uses more disk + time.
How should we process these files?
Tip: click a Year/Month/Day/Issue cell to edit it. Drag or Shift+click to select a range, type to fill it, or drag the corner handle down. Manual edits win over the pattern.
0 files
#
Filename
Size
Type
Folder
Year
Month
Day
Issue №
📄Inbox0Drop into issues →
Issues
⌖
PickYear
Click or drag across the characters that hold this field.
Snap grabs the whole value between separators — handles files where this field has different lengths (issue 7 vs 123).
This file →—Other file →—Will insert →—
Tip: select inside one field. Literals you type into the field afterwards are kept (e.g. [N1-2]20).
Polygon height (px)
⚙ Detection settings
CONFconfidence threshold
IOUNMS threshold
MAX_DETmax detections
IMGSZmodel input size
DEVICEcpu / cuda:0
Crop adjacent pagespread
CONFconfidence threshold
IOUNMS threshold
MAX_DETmax detections
IMGSZmodel input size
Crop adjacent pagespread
Modal deployment
APPmodal app name
GPUpicks a class in the app
First-time setup: see modal_apps/README.md for
modal token new → modal volume put
→ modal deploy modal_apps/yolo.py.
Modal deployment
APPmodal app name
GPUpicks a class in the app
Layout
CONFdrop regions below this
OCR
LANGScomma list, e.g. en,de
First-time setup: modal deploy modal_apps/surya_app.py.
License: GPL-3.0 code + Open Rail-M models (free under $2M
revenue + $2M funding, no competing product — see
modal_apps/README.md).
Datalab API
MODEaccuracy / cost trade-off
Put your API key in keys/datalab_key.txt
(single line, get one from datalab.to/app/keys).
Returns one TextRegion per block with one TextLine spanning the
whole region (no per-line bboxes from the API). Re-run any other
OCR engine to get proper line-level breakdown afterwards.
License: same $2M revenue + $2M funding + no-competing-product
clause as self-host Surya.
Modal deployment
APPmodal app name
GPUpicks a class in the app
Layout
Toggles Eynollah's DPI-gated processing mode by patching
cache_images to force dpi=400
(≥ Eynollah's internal threshold of 298). Real effect varies:
• narrow multi-column pages (~2000-4500 px wide, 5+ cols)
where Eynollah would otherwise upscale → ~25% speedup from
skipping the upscale.
• most other pages (3-col, wider scans, or those that
hit the height-cap=8000 guard) → no-op or slightly different output
(marginalia filter skipped → a few extra "regions" Eynollah would
otherwise drop).
Default off. Try on a slow page and compare timings &
regions in the Activity Log before enabling globally.
Auto-resize large pages
if long-side >px → resize topx
⚠ force_high_dpi will be ignored while resize is on (telling
Eynollah "already high-DPI" on a freshly-downscaled image kills
small-text recognition).
Speeds up Eynollah on huge scans (~3× on 12000+ px pages).
Mid-size text (8+ pt) survives the LANCZOS downscale fine, but
small captions and page numbers may lose recognition. Off
by default; opt-in for batches of uniform layout newspapers. The
Activity Log surfaces a "resized from W×H" chip per page so you
can audit which calls were downscaled.
Post-process
% of page height
Eynollah occasionally mistakes the page border (or a thick framing
bar) for a single huge separator. This drops any separator whose
smaller bbox side exceeds the threshold. Acts on the response
client-side — no re-run.
First-time setup: modal deploy modal_apps/eynollah_app.py.
Heavier than YOLO/Surya — ~8 Keras checkpoints, cold start
60-90s, warm run 30-60s on L4. License: Apache-2.0 code +
Apache-2.0 weights (SBB Berlin). No revenue / funding /
no-compete clauses — clean for SaaS.
Same engine as Eynollah, but its internal deskew step is always
off (≈ 29s/page saved). Straighten pages in Scans Prep
first; on already-straight scans Eynollah's deskew is wasted work.
Runs in a separate Modal app so it can't disturb the production
Eynollah.
Patches cache_images to force
dpi=400 so Eynollah skips its column-based upscale. Same opt-in fast
path as the Eynollah panel; default off.
Post-process
% of page height
First-time setup: modal deploy modal_apps/eynollah_basic_app.py.
Reuses the shared eynollah-weights
volume — no re-download. License same as Eynollah (Apache-2.0).
Same engine as Eynollah Basic (deskew always off), but the Modal
container holds 2 instances and processes 2 pages at once,
so one page's CPU steps overlap another's GPU work — a
density/cost win (~2× pages per container-second). Batch runs
send pages a few at a time, so even a single issue uses both
instances and runs ~2× faster. (The very first pages of a cold batch
still pay a one-time warm-up.)
No deskew and no force-high-dpi controls — both are off by design under
Multi (concurrency-safety; deskew is done in Scans Prep).
Post-process
% of page height
First-time setup: modal deploy modal_apps/eynollah_multi_app.py.
Reuses the shared eynollah-weights
volume — no re-download. License same as Eynollah (Apache-2.0).
Fast bounding-box detector trained on Eynollah ground truth (text /
heading / image / table). Box-only, whole-page at 2048 px.
Admin-only, experimental — use it to compare a trained model's
output against Eynollah.
Model is trained at 2048; RTMDet is fully-convolutional, so inferring at
3072/4096 sharpens dense column gutters without re-training (~0.4–0.8 s,
still 50–100× faster than Eynollah).
Score0.30
Confidence cutoff — lower finds more (and more false) regions. Models are
listed newest-first; the latest trained run is selected by default.