📝

PDF to Text

Extract plain text from PDF files with options for layout preservation, multi-column handling, and OCR for scanned documents. Free and private: all processing happens in your browser.

PDF text extraction requires a specialized library.

For now, use the PDF Split tool to extract individual pages, then copy text from your PDF viewer.

The PDF to Text tool extracts readable text from PDF files in plain text format. Useful for copying content out of PDFs for editing, searching, analysis, or repurposing. Works on text-based PDFs (where text is embedded as actual characters) directly; for scanned PDFs (images of text), optional OCR converts the images back to text.

Upload your PDF to get the extracted text. Three layout options are available: plain (raw text only, the simplest), with line breaks (preserves paragraph structure), and with columns (attempts to detect multi-column layouts and reorder the text into the correct reading order). Headers, footers, and page numbers can be stripped for cleaner output. All processing runs in your browser using PDF.js or similar libraries. Extracted text can be copied to the clipboard or downloaded as a .txt file.

Features at a glance

Embedded text extraction

Extracts text directly from PDFs that have embedded text.

OCR for scanned PDFs

Optional OCR converts scanned images of text back to readable text.

Layout preservation

Options to maintain paragraph structure, multi-column reading order, or get plain concatenated text.

Header/footer stripping

Automatic detection and removal of repeating headers and footers.

Page range selection

Extract text from specific pages or the whole document.

Text statistics

Word count, character count, and reading time estimate for extracted text.

Multiple output formats

Plain text (.txt), Markdown, or structured JSON with page boundaries.

Client-side only

PDFs never leave your browser — confidential content stays local.
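As an illustration of the text statistics feature listed above, the following TypeScript sketch computes word count, character count, and a reading time estimate. The function name `textStats`, the `TextStats` shape, and the default of 200 words per minute are assumptions made for this example; the tool's actual rules may differ.

```typescript
// Hypothetical helper; names and the 200 wpm default are assumptions.
interface TextStats {
  words: number;
  characters: number;
  readingMinutes: number;
}

function textStats(text: string, wordsPerMinute: number = 200): TextStats {
  const trimmed = text.trim();
  // Split on runs of whitespace; an empty string has zero words.
  const words = trimmed === "" ? 0 : trimmed.split(/\s+/).length;
  // Count Unicode code points rather than UTF-16 units, so an emoji counts once.
  const characters = Array.from(text).length;
  // Round up so any non-empty text reports at least one minute.
  const readingMinutes = Math.ceil(words / wordsPerMinute);
  return { words, characters, readingMinutes };
}
```

Counting code points instead of `text.length` is a design choice here; either convention is defensible as long as it is applied consistently.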

Step by step

1. Upload PDF: Drag or click to select the PDF to extract text from.
2. Choose extraction mode: Plain text, preserve layout, or columns-aware.
3. Set options: Strip headers/footers, specific pages, OCR if needed for scans.
4. Extract: Click extract; text appears in output area.
5. Copy or download: Copy to clipboard or save as .txt file.

Where this helps

Content reuse

→Quote extraction: Copy sections of PDF content for quoting in documents or articles.
→Translation prep: Extract text for translation services that require plain text input.
→Summary generation: Get text to summarize manually or via AI tools.

Research

→Academic papers: Extract abstracts, methodology, or specific sections from research PDFs.
→Legal documents: Copy clauses or sections from contracts for analysis.
→Book chapters: Extract text from scanned books for study notes.

Data analysis

→Text mining: Extract content from many PDFs as input for text analysis tools.
→Search indexing: Build a searchable index from PDF content by first extracting the text.
→Keyword analysis: Extract text to feed keyword density analyzers or SEO tools.

Worked examples

Simple text PDF

Standard document extraction.

Input

10-page text-based PDF

Output

full text content extracted with paragraph breaks preserved

Multi-column paper

Academic journal layout.

Input

research paper with 2 columns

Output

text reordered to read each column top-to-bottom, left column then right

Scanned book

OCR required.

Input

scanned book PDF, OCR enabled

Output

text recovered via OCR, roughly 95% or better accuracy on clean scans

Strip headers

Clean content only.

Input

document with page numbers and header

Output

main body text without repeating headers and page numbers

Specific pages

Just abstract and conclusion.

Input

pages 1 and last

Output

only those pages’ text extracted

Under the hood

PDF text extraction depends on how the text is stored:

1. Embedded text (most PDFs): text is stored as character codes with position information. Extraction reads these directly.

2. Image-based PDFs (scanned documents): text exists only as pixels in images. Requires OCR to convert back to text.

3. Hybrid PDFs: some pages have embedded text, others are scans. Extraction reads embedded text directly and uses OCR for the scanned pages.

For embedded text, PDF.js reads the text stream from each page. Each text run has position (x, y), font, and content. Raw extraction gives the text in the order it appears in the PDF stream, which may not match reading order for complex layouts (multi-column, sidebars, tables).
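To make the text-run model concrete, here is a minimal TypeScript sketch. It assumes items shaped like those returned by PDF.js `page.getTextContent()`, where each item carries a `str` string and a six-number `transform` matrix whose last two entries give the run's origin. The names `PdfTextItemLike`, `TextRun`, `toTextRuns`, and `rawStreamText` are hypothetical, and this is not necessarily how the tool itself is written.

```typescript
// Only the fields used here; real PDF.js items carry more (font name, width, ...).
interface PdfTextItemLike {
  str: string;
  transform: number[]; // [a, b, c, d, e, f]; e and f are the x/y origin
}

interface TextRun {
  text: string;
  x: number;
  y: number;
}

// Convert items to positioned runs, dropping whitespace-only items.
function toTextRuns(items: PdfTextItemLike[]): TextRun[] {
  return items
    .filter((item) => item.str.trim() !== "")
    .map((item) => ({
      text: item.str,
      x: item.transform[4] ?? 0, // missing matrix entries default to 0
      y: item.transform[5] ?? 0,
    }));
}

// "Raw" extraction: keep the order in which runs appear in the stream.
function rawStreamText(items: PdfTextItemLike[]): string {
  return toTextRuns(items).map((run) => run.text).join(" ");
}
```

With PDF.js loaded, the items would typically come from something like `(await page.getTextContent()).items`. Note that `rawStreamText` does no sorting, so a PDF whose stream stores "World" before "Hello" yields them in that order.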

Reading order reconstruction: sort text runs by position. Typically top-to-bottom, left-to-right. For multi-column layouts, detect column boundaries first, then read each column top-to-bottom.
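The reading-order idea can be sketched as follows. This is a deliberately naive TypeScript illustration, not the tool's algorithm: it assumes at most two equal-width columns, tests only the page midpoint as a column boundary, and uses the default PDF convention in which y grows upward. All names (`PositionedRun`, `readColumn`, `findColumnSplit`, `reconstructReadingOrder`) are hypothetical.

```typescript
interface PositionedRun {
  text: string;
  x: number;     // left edge of the run
  y: number;     // baseline; larger y is higher on the page
  width: number; // horizontal extent of the run
}

// Read one column: group runs into lines by similar y, then order
// lines top-to-bottom and runs within a line left-to-right.
function readColumn(runs: PositionedRun[], lineTolerance: number = 2): string {
  const sorted = runs.slice().sort((a, b) => b.y - a.y);
  const lines: PositionedRun[][] = [];
  let current: PositionedRun[] = [];
  let currentY = 0;
  for (const run of sorted) {
    if (current.length > 0 && Math.abs(currentY - run.y) <= lineTolerance) {
      current.push(run);
    } else {
      current = [run];
      currentY = run.y;
      lines.push(current);
    }
  }
  return lines
    .map((line) => line.sort((a, b) => a.x - b.x).map((r) => r.text).join(" "))
    .join("\n");
}

// Accept the page midpoint as a column boundary only if no run crosses it
// and there is text on both sides. Returns null for single-column pages.
function findColumnSplit(runs: PositionedRun[], pageWidth: number): number | null {
  const mid = pageWidth / 2;
  const crosses = runs.some((r) => r.x < mid && r.x + r.width > mid);
  const hasLeft = runs.some((r) => r.x + r.width <= mid);
  const hasRight = runs.some((r) => r.x >= mid);
  return !crosses && hasLeft && hasRight ? mid : null;
}

function reconstructReadingOrder(runs: PositionedRun[], pageWidth: number): string {
  const split = findColumnSplit(runs, pageWidth);
  if (split === null) return readColumn(runs);
  const left = runs.filter((r) => r.x + r.width <= split);
  const right = runs.filter((r) => r.x >= split);
  return readColumn(left) + "\n" + readColumn(right);
}
```

A full-width title or table that spans the midpoint makes this sketch fall back to single-column reading, which is one reason real column detection needs to work on regions of a page rather than the page as a whole.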

Layout options:
- Plain: all text concatenated, some whitespace preserved
- Line breaks: preserve newlines for paragraph structure
- Columns: attempt to detect and preserve column structure
- Table extraction: detect tabular data and output as CSV-like format (imperfect)

Headers/footers: these often appear on every page. Text that repeats at the same position across pages is likely a header or footer and can optionally be stripped.
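A minimal sketch of that repetition heuristic, in TypeScript, might look like the following. It assumes each page has already been reduced to lines with a vertical position; the names `PageLine`, `lineKey`, and `stripRepeatedLines`, the 60% page share, and the 2-unit position bucket are all assumptions chosen for illustration.

```typescript
interface PageLine {
  text: string;
  y: number; // vertical position of the line on its page
}

// Lines are considered "the same" when their trimmed text matches and their
// y falls in the same coarse bucket. Values that straddle a bucket edge are
// treated as different, which a real implementation would want to smooth out.
function lineKey(line: PageLine, yTolerance: number): string {
  return Math.round(line.y / yTolerance) + "|" + line.text.trim();
}

// Remove lines that recur on at least `minShare` of the pages (and on at
// least two pages), since those are likely headers or footers.
function stripRepeatedLines(
  pages: PageLine[][],
  minShare: number = 0.6,
  yTolerance: number = 2
): PageLine[][] {
  const pageCounts = new Map<string, number>();
  for (const page of pages) {
    const seen = new Set<string>();
    for (const line of page) seen.add(lineKey(line, yTolerance));
    // Count each key once per page, however often it occurs on that page.
    seen.forEach((key) => pageCounts.set(key, (pageCounts.get(key) ?? 0) + 1));
  }
  const threshold = Math.max(2, Math.ceil(pages.length * minShare));
  return pages.map((page) =>
    page.filter((line) => (pageCounts.get(lineKey(line, yTolerance)) ?? 0) < threshold)
  );
}
```

Because the comparison uses exact text, running page numbers ("1", "2", "3") are not caught by this sketch; they change on every page and need a separate rule.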

Page numbers: detected similarly, as numbers at consistent top or bottom positions across pages.
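One way to approximate page-number detection is to combine a text pattern with a position check. The TypeScript sketch below is illustrative only: the accepted patterns, the 10% margin band, and the names `PositionedLine`, `isPageNumberLine`, and `stripPageNumbers` are assumptions, and it checks a single page rather than consistency across pages.

```typescript
interface PositionedLine {
  text: string;
  y: number; // vertical position, measured from the bottom of the page
}

// Accepts short forms such as "12", "Page 12", "- 12 -", "12 / 40",
// and "Page 12 of 40". Anything with other words is rejected.
function isPageNumberLine(text: string): boolean {
  return /^(?:page\s+)?[-\s]*\d{1,4}[-\s]*(?:(?:\/|of)\s*\d{1,4})?$/i.test(text.trim());
}

// Drop page-number-like lines, but only inside the top or bottom margin band,
// so that a bare number in the body (for example a table cell) survives.
function stripPageNumbers(
  lines: PositionedLine[],
  pageHeight: number,
  marginShare: number = 0.1
): PositionedLine[] {
  const margin = pageHeight * marginShare;
  return lines.filter((line) => {
    const inMargin = line.y <= margin || line.y >= pageHeight - margin;
    return !(inMargin && isPageNumberLine(line.text));
  });
}
```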

OCR (for scanned PDFs): uses Tesseract.js or similar JavaScript OCR library. Slower than text extraction but handles scanned documents. Accuracy varies by scan quality — clean scans get 95%+, poor scans can be 70-85%.
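The accuracy percentages above are not defined in detail. One common convention, assumed here, is character-level accuracy: one minus the edit distance between the OCR output and a known-correct reference, divided by the reference length. Under that convention, 95% accuracy on a 2,000-character page still means roughly 100 wrong characters. The TypeScript sketch below uses hypothetical names (`editDistance`, `characterAccuracy`) and is only a way to measure accuracy on your own samples.

```typescript
// Levenshtein distance: minimum number of single-character insertions,
// deletions, and substitutions needed to turn `a` into `b`.
function editDistance(a: string, b: string): number {
  const s = Array.from(a);
  const t = Array.from(b);
  let prev: number[] = [];
  for (let j = 0; j <= t.length; j++) prev.push(j);
  for (let i = 1; i <= s.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[t.length];
}

// Character accuracy in the range 0..1, relative to the reference text.
function characterAccuracy(reference: string, ocrOutput: string): number {
  const refLength = Array.from(reference).length;
  if (refLength === 0) return ocrOutput === "" ? 1 : 0;
  return Math.max(0, 1 - editDistance(reference, ocrOutput) / refLength);
}
```

For example, `characterAccuracy("hello world", "hel1o world")` has one substituted character out of eleven, so it comes out at about 0.91.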

Character encoding: PDF text can use various encodings (WinAnsi, MacRoman, Unicode). Extraction normalizes to UTF-8. Some PDFs with custom encodings may produce garbled text and need a manual encoding fix.

Font handling: PDFs may use custom fonts with non-standard character mappings. Extraction reads Unicode equivalents via the font’s ToUnicode map. If ToUnicode is missing, extraction may fail or produce garbage.

Performance: text extraction from a 100-page PDF takes roughly 2-10 seconds depending on complexity. OCR of the same document is far slower, ranging from about 30 seconds to 10-20 minutes depending on hardware and scan quality.

Common problems and solutions

⚠Garbled output

Some PDFs use custom fonts without a proper ToUnicode mapping, so text comes out as random characters. This is hard to fix without the original font; OCR of the rendered pages may be the only recourse.

⚠Reading order wrong

Multi-column layouts, tables, or text boxes can produce disordered output in plain extraction mode. Try column-aware mode, or accept the raw output and reorder manually.

⚠OCR slow on large documents

OCR is computationally expensive. A 100-page scanned PDF may take 10-20 minutes. For big OCR jobs, use desktop or server-side tools such as Adobe Acrobat or Google Cloud Document AI.

⚠OCR accuracy varies

Clean scans: 95-99% accurate. Poor scans (low resolution, skewed, noisy): 70-85%. Pre-process scans (deskew, enhance contrast) before OCR for best results.

⚠Embedded vs image text

The tool tries embedded text first. If a page appears blank after extraction, it’s probably an image; enable OCR for those pages.
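The "blank after extraction" check can be sketched as a simple threshold on the amount of visible text per page. In this TypeScript illustration the function name `pagesNeedingOcr` and the 20-character threshold are assumptions; a page with only a stray page number would also be flagged, which is usually what you want for scans with a thin text layer.

```typescript
// Given the embedded text extracted from each page (index 0 = page 1),
// return the 1-based numbers of pages that look like images and
// are therefore candidates for OCR.
function pagesNeedingOcr(pageTexts: string[], minChars: number = 20): number[] {
  const result: number[] = [];
  pageTexts.forEach((text, index) => {
    // Ignore whitespace so pages containing only spaces or newlines count as blank.
    const visible = text.replace(/\s+/g, "");
    if (visible.length < minChars) result.push(index + 1);
  });
  return result;
}
```

For a hybrid PDF, the returned page numbers are the ones to send through OCR while the remaining pages keep their embedded text.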

⚠Special characters missing

Some symbols, non-Latin scripts, or ligatures may be encoded in ways PDF extractors don’t handle. Check output for missing or replacement characters.
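A quick way to check output for these problems is to count Unicode replacement characters (U+FFFD) and fold typographic ligatures back into plain letters. The TypeScript sketch below uses the standard `String.prototype.normalize` with the NFKC form; the function name `inspectExtractedText` and its return shape are hypothetical. Be aware that NFKC also rewrites other compatibility characters (for example superscript digits and full-width letters), so treat it as an optional cleanup step rather than a safe default.

```typescript
interface TextInspection {
  normalized: string;       // text after NFKC normalization
  replacementCount: number; // number of U+FFFD characters found
}

function inspectExtractedText(text: string): TextInspection {
  // NFKC folds compatibility characters such as the "fi" ligature (U+FB01)
  // into their plain equivalents ("f" followed by "i").
  const normalized = text.normalize("NFKC");
  // U+FFFD marks spots where a character could not be decoded.
  const matches = normalized.match(/\uFFFD/g);
  return { normalized, replacementCount: matches ? matches.length : 0 };
}
```

A non-zero `replacementCount` is a hint that some characters were lost during extraction and the affected passages should be checked against the original PDF.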

⚠Awkward page breaks

Text flows across pages in PDFs. Extraction may put awkward line breaks at page boundaries. Post-process to rejoin sentences that cross pages.
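The post-processing step could look something like this TypeScript sketch. The rule it applies is an assumption for illustration: if a page ends without sentence-ending punctuation and the next page starts with a lowercase letter, the two are joined as one sentence; otherwise a paragraph break is inserted. The function name `joinPages` is hypothetical.

```typescript
// Join per-page text into one string, repairing sentences and hyphenated
// words that were split by a page boundary.
function joinPages(pages: string[]): string {
  let out = "";
  for (const page of pages) {
    const text = page.trim();
    if (text === "") continue; // skip blank pages
    if (out === "") {
      out = text;
      continue;
    }
    const endsSentence = /[.!?:;"')\]]$/.test(out);
    const startsLower = /^[a-z]/.test(text);
    if (/[A-Za-z]-$/.test(out) && startsLower) {
      // Word hyphenated across the break: drop the hyphen and join directly.
      // Caveat: this also merges genuine compounds such as "well-" + "known".
      out = out.slice(0, -1) + text;
    } else if (!endsSentence && startsLower) {
      // Sentence continues on the next page: join with a single space.
      out += " " + text;
    } else {
      // Otherwise treat the page boundary as a paragraph break.
      out += "\n\n" + text;
    }
  }
  return out;
}
```

The lowercase test only covers unaccented Latin letters, so text in other scripts would always get a paragraph break at page boundaries with this sketch.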

How it compares

Compared to Adobe Acrobat Pro text extraction, this tool is free. Acrobat has superior handling of complex layouts and non-standard fonts; this tool covers common cases well.

Compared to pdftotext (command line), this tool has a browser UI with layout options. pdftotext is well suited to automation; this tool is better suited to interactive extraction.

Compared to copying text manually from a PDF viewer, this tool processes the whole document at once with layout cleanup. Manual copying works page by page and is slow for long documents.

Questions and answers

▶How do I extract text from a PDF?

Upload your PDF, pick extraction options (layout, OCR for scans), and the tool extracts all readable text. Copy to clipboard or download as .txt.

▶Does it work on scanned PDFs?

Yes, with OCR enabled. Scanned PDFs store text as images, and OCR converts those images back to text. Accuracy depends on scan quality; clean scans work well.

▶Is the layout preserved?

It can be. Three modes are available: plain (raw text), line breaks (preserves paragraphs), and columns (attempts multi-column reading order). Choose based on your document type and needs.

▶Why does my extraction produce gibberish?

PDFs whose custom fonts lack ToUnicode maps produce garbled text on extraction. OCR of the rendered pages is one fallback; a commercial tool such as Adobe Acrobat, which has better font handling, is another.

▶Can I extract specific pages only?

Yes. Specify page ranges (e.g., 1-5, 10) to extract only those pages.
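A range specification such as "1-5, 10" can be turned into a list of page numbers with a small parser. The TypeScript sketch below is an assumption about reasonable behavior, not the tool's documented rules: it accepts comma-separated single pages and inclusive ranges, removes duplicates, sorts the result, and throws on anything malformed or outside the document. The name `parsePageRanges` is hypothetical.

```typescript
// Parse a spec like "1-5, 10" into sorted, de-duplicated 1-based page numbers.
function parsePageRanges(spec: string, pageCount: number): number[] {
  const pages = new Set<number>();
  for (const part of spec.split(",")) {
    const token = part.trim();
    if (token === "") continue; // tolerate stray commas
    const match = /^(\d+)(?:\s*-\s*(\d+))?$/.exec(token);
    if (!match) throw new Error(`Invalid page range: "${token}"`);
    const start = parseInt(match[1], 10);
    // A single page "10" is treated as the range 10-10.
    const end = match[2] !== undefined ? parseInt(match[2], 10) : start;
    if (start < 1 || end < start || end > pageCount) {
      throw new Error(`Page range out of bounds: "${token}"`);
    }
    for (let p = start; p <= end; p++) pages.add(p);
  }
  return Array.from(pages).sort((a, b) => a - b);
}
```

For a 12-page document, `parsePageRanges("1-5, 10", 12)` returns `[1, 2, 3, 4, 5, 10]`. Throwing on bad input rather than silently clamping is a design choice that makes typos visible to the user.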

▶Does it preserve tables?

Basic table detection is attempted but imperfect. PDF tables often don’t have clear structure, so extracted tables may lose formatting. For critical table data, manually verify or use specialized PDF table extractors.

▶Is my PDF private?

Yes. All extraction runs in your browser. PDFs never leave your machine.

▶How accurate is OCR?

Depends on scan quality. Clean, high-resolution scans: 95-99% accurate. Low-quality scans: 70-85%. Always review OCR output for errors before relying on it.

Useful references

PDF.js — Mozilla JavaScript library used for in-browser PDF parsing.
Tesseract.js — JavaScript port of Tesseract OCR engine for in-browser OCR.
pdftotext utility — Command-line PDF text extraction tool.
PDF structure — Adobe PDF reference covering text extraction details.
Google Cloud Document AI — Enterprise document processing service with superior OCR and layout understanding.
