🔗

Extract URLs from Text

Extract every URL from pasted text with deduplication, validation, and export to CSV, JSON, or newline-separated list. Free and private: all processing happens in your browser.

Paste any text containing URLs

The Extract URLs tool pulls every URL out of any pasted text — HTML source, log files, email threads, documents, code files, chat transcripts. Every developer and content worker hits this problem: an HTML source has dozens of links you want to audit, a log file contains request URLs for traffic analysis, an email thread has shared links you want to collect, a content export has URLs embedded in free text fields. Manual copying is slow; this tool extracts all of them in milliseconds.

Output is deduplicated and optionally validated. The regex catches http://, https://, ftp://, and other common schemes, plus www. links that omit the protocol. Duplicate URLs across the text are collapsed into unique entries (with a count of how many times each appeared). Optional cleanup removes URL parameters, fragments, and trailing punctuation that often accompanies URLs in sentences (in "visit https://example.com." the URL is https://example.com, not https://example.com.). Export formats include plain list, CSV with counts, JSON array, or a sitemap-style XML. All processing runs locally in your browser: no request is made to any URL, and no data is sent to a server.

Extract URLs from Text — key features

HTTP, HTTPS, and FTP

Extracts URLs with common schemes. Optional support for custom schemes (mailto, tel, ssh, etc.).

HTML mode

For HTML input, parses anchor tags correctly using a DOM parser for robust extraction with optional relative-to-absolute resolution.

Deduplication

Collapses duplicate URLs across the text with normalization of trailing slash and host case.

Strip query/fragment

Optional canonical mode removes ?query strings and #fragments to extract unique resource URLs only.

Trailing punctuation removal

Strips sentence punctuation that often follows URLs in prose for cleaner extraction.

Multiple export formats

Plain list, CSV with occurrence counts, JSON array, or XML sitemap.

Counts and stats

Shows total URLs found, unique URLs, and occurrence counts per URL.

Client-side only

URLs are extracted in your browser with no external requests — safe for sensitive internal logs.

Using the Extract URLs from Text tool

1. Paste the text
Drop any text containing URLs into the input: log file, HTML source, email, document.
2. Choose mode
Plain text mode for log files and documents; HTML mode for web page source code.
3. Set options
Toggle canonical mode (strip query/fragment), scheme filtering (http only, https only, or all), and deduplication preferences.
4. Extract
Click extract. The tool shows every unique URL with occurrence counts and format options.
5. Copy or download
Copy the list to clipboard or download as CSV, JSON, or sitemap XML for further processing.
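For readers who want to reproduce the CSV and JSON exports in a script, here is a minimal Python sketch. It is an illustration, not the tool's own browser code. The function names `to_csv` and `to_json` and the `url,count` column header are assumptions.

```python
import csv
import io
import json
from collections import Counter


def to_csv(counts: Counter) -> str:
    """Serialize URL occurrence counts as CSV, most frequent first."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["url", "count"])  # hypothetical header names
    for url, n in counts.most_common():
        writer.writerow([url, n])
    return buf.getvalue()


def to_json(counts: Counter) -> str:
    """Serialize the unique URLs as a JSON array, in first-seen order."""
    return json.dumps(list(counts), indent=2)


counts = Counter(["https://a.com", "https://b.com", "https://a.com"])
print(to_csv(counts))
print(to_json(counts))
```

Using the `csv` module instead of manual string joins keeps URLs that contain commas or quotes correctly escaped.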

Common use cases for the Extract URLs from Text tool

SEO and content

→Outbound link audit: Extract all links from a blog post or article to audit destinations, check for broken links, or verify link quality.
→Content migration: Extract URLs from old CMS exports to build a redirect map when migrating to a new URL structure.
→Competitor link analysis: Pull all outgoing links from a competitor’s HTML to understand their external linking strategy.

DevOps and logs

→Traffic analysis: Extract request URLs from access logs to identify top pages, detect anomalous paths, or build traffic reports.
→Security scanning: Pull URLs from phishing emails or suspicious logs for threat intelligence and IOC analysis.
→API endpoint discovery: Extract every unique API endpoint from mixed log data for documentation or rate limit planning.

Research and archival

→Citation collection: Extract all cited URLs from a paper or report for bibliography verification.
→Social media analysis: Pull URLs shared in posts or comments to analyze link-sharing patterns.
→Email thread analysis: Collect every link shared across a long email thread for reference.

Extract URLs from Text — examples

Mixed prose

URLs extracted from a paragraph.

Input

Check out https://example.com and https://tooleras.com. Also www.github.com has great content.

Output

https://example.com
https://tooleras.com
www.github.com

HTML source

Extracted from anchor tags.

Input

<a href="https://page1.com">page 1</a> and <a href="https://page2.com">page 2</a>

Output

https://page1.com
https://page2.com

With query strings

Canonical mode strips parameters.

Input

https://shop.com/item?id=42&ref=home and https://shop.com/item?id=42&ref=promo

Output

canonical: https://shop.com/item (1 unique)
with params: 2 distinct URLs

From log file

Extracting access log URLs.

Input

192.168.1.1 - GET /api/users?page=1 HTTP/1.1 200
192.168.1.2 - GET /api/users?page=2 HTTP/1.1 200

Output

/api/users?page=1
/api/users?page=2
(relative paths extracted separately in path-only mode)
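Path-only extraction from access logs like the one above can be approximated by matching the request line. The sketch below is in Python and makes assumptions: the pattern `REQUEST_RE`, the list of HTTP methods, and the function name are illustrative and not taken from the tool.

```python
import re

# Assumed shape of a request line: METHOD <path> HTTP/<version>
REQUEST_RE = re.compile(
    r"\b(?:GET|POST|PUT|PATCH|DELETE|HEAD|OPTIONS)\s+(\S+)\s+HTTP/\d(?:\.\d)?"
)


def extract_request_paths(log_text: str) -> list[str]:
    """Return the request target (path plus query) of every log line that matches."""
    return REQUEST_RE.findall(log_text)


log = (
    "192.168.1.1 - GET /api/users?page=1 HTTP/1.1 200\n"
    "192.168.1.2 - GET /api/users?page=2 HTTP/1.1 200\n"
)
print(extract_request_paths(log))
```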

Punctuation cleanup

Trailing sentence punctuation removed.

Input

See https://example.com. It works.

Output

https://example.com
(the trailing period is stripped since it is sentence punctuation, not URL content)

Under the hood

URL extraction regex needs to handle several edge cases correctly.

The core pattern: (https?|ftp):\/\/[^\s<>"\(\)]+

Schemes: http, https, ftp are the common ones. mailto, tel, and other custom schemes can be added via options. The tool also catches www.-prefixed URLs (lacking explicit protocol) by default for common web text.
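A minimal Python sketch of this matching step follows. It uses the core pattern quoted above and adds an alternative for bare www. links. The exact way the tool combines the two is not documented here, so the combined pattern `URL_RE` is an assumption.

```python
import re

# Core pattern from the text, plus an assumed branch for www.-prefixed links.
URL_RE = re.compile(r'\b(?:(?:https?|ftp)://|www\.)[^\s<>"()]+', re.IGNORECASE)


def extract_urls(text: str) -> list[str]:
    """Return raw matches in order of appearance (no cleanup, no deduplication)."""
    return URL_RE.findall(text)


print(extract_urls("See https://example.com and www.github.com now"))
```

Matches are raw at this stage: a URL followed directly by a comma or period still carries that character until a cleanup pass removes it.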

Path and query: everything up to whitespace, angle brackets, quotes, or parentheses is treated as part of the URL. Punctuation at the very end is typically trimmed (periods, commas, semicolons, colons) because URLs in prose often end right before sentence punctuation.

Trailing punctuation heuristic: URLs rarely end in a period or comma, so the tool strips these by default. But some URLs legitimately contain trailing slashes or punctuation-like characters. The heuristic is configurable.
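The heuristic can be sketched in Python as a right-strip over a configurable character set. The default set below mirrors the characters named above (periods, commas, semicolons, colons). The function and parameter names are hypothetical.

```python
DEFAULT_TRAILING = ".,;:"


def strip_trailing_punctuation(url: str, chars: str = DEFAULT_TRAILING) -> str:
    """Remove sentence punctuation from the end of a matched URL.

    Trailing slashes are kept because "/" is not in the default set.
    Pass chars="" to disable the heuristic.
    """
    return url.rstrip(chars)


print(strip_trailing_punctuation("https://example.com."))
print(strip_trailing_punctuation("https://example.com/path/,"))
```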

Deduplication: normalize trailing slashes (https://example.com and https://example.com/ treated as same) and case on scheme/host (HTTPS://Example.COM and https://example.com same). Case-sensitivity of the path is preserved because URLs are case-sensitive below the host level per RFC 3986.
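One way to sketch these normalization rules in Python uses `urllib.parse`. How the tool treats trailing slashes on deeper paths is not stated, so removing a single trailing slash from any path is an assumption here, as is the function name `normalize`.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit


def normalize(url: str) -> str:
    """Lowercase scheme and host, drop one trailing slash, keep path case."""
    parts = urlsplit(url)  # urlsplit already lowercases the scheme
    path = parts.path[:-1] if parts.path.endswith("/") else parts.path
    # Note: lowercasing netloc also lowercases any userinfo; fine for a sketch.
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, parts.fragment))


urls = [
    "https://example.com",
    "https://example.com/",
    "HTTPS://Example.COM",
    "https://example.com/Page",
    "https://example.com/page",
]
counts = Counter(normalize(u) for u in urls)
print(counts)
```

The first three inputs collapse into one entry, while `/Page` and `/page` stay distinct because path case is preserved.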

Fragment and query handling: by default, fragments (#section) and query strings (?utm_source=...) are preserved. Options strip them for canonical URL extraction where you want to see unique resources rather than unique parameterized views.
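Canonical mode amounts to dropping two components of a parsed URL. A minimal Python sketch, with `canonical` as a hypothetical function name:

```python
from urllib.parse import urlsplit, urlunsplit


def canonical(url: str) -> str:
    """Return the URL without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


urls = [
    "https://shop.com/item?id=42&ref=home",
    "https://shop.com/item?id=42&ref=promo",
]
print({canonical(u) for u in urls})  # both collapse to one resource URL
```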

Anchor tag extraction: for HTML input, parsing <a href="..."> with a DOM parser is more robust than regex. The tool offers an HTML mode that uses DOMParser for accurate anchor extraction (including relative URLs resolved to absolute if a base URL is provided).
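The tool does this in the browser with DOMParser. The Python sketch below shows the same idea using the standard library's `html.parser` as a stand-in, with `urljoin` for relative-to-absolute resolution. The class and function names are assumptions.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags, optionally resolving against a base URL."""

    def __init__(self, base_url: Optional[str] = None):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        self.links.append(urljoin(self.base_url, href) if self.base_url else href)


def extract_hrefs(html: str, base_url: Optional[str] = None) -> list[str]:
    parser = AnchorCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


html = '<a href="https://page1.com">page 1</a> and <a href="/page">page</a>'
print(extract_hrefs(html, base_url="https://example.com/dir/"))
```

Without a base URL the relative href is returned unchanged, which matches the behavior described for HTML mode.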

Shortened URLs: the tool extracts but does not resolve short URLs (bit.ly, t.co, etc.). Resolution requires HTTP requests, which are out of scope for a client-side extractor.

Performance: regex extraction on multi-megabyte text typically completes in under a second. HTML parsing is slower but still fast for normal documents.

Common problems and solutions

⚠Trailing punctuation included

URLs followed by periods, commas, or closing brackets often include that punctuation in the match. The tool strips common trailing punctuation, but rare URLs that legitimately end in such characters are affected. Verify output.

⚠Parentheses in URLs

Wikipedia URLs sometimes have parentheses in the path. Regex that treats ) as a terminator truncates these. The tool has a Wikipedia-safe mode that allows balanced parentheses.
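How the Wikipedia-safe mode is implemented is not described, but one common approach is to let parentheses into the match and then trim closing ones that have no opening partner. A Python sketch under that assumption (all names are hypothetical):

```python
import re

# Like the core pattern, but parentheses are allowed inside the match.
LOOSE_URL_RE = re.compile(r'(?:https?|ftp)://[^\s<>"]+')


def trim_unbalanced_parens(url: str) -> str:
    """Drop trailing ')' characters that have no matching '(' in the URL."""
    while url.endswith(")") and url.count(")") > url.count("("):
        url = url[:-1]
    return url


def extract_paren_safe(text: str) -> list[str]:
    return [trim_unbalanced_parens(m) for m in LOOSE_URL_RE.findall(text)]


text = "(see https://en.wikipedia.org/wiki/Rust_(programming_language)) and https://example.com/a)"
print(extract_paren_safe(text))
```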

⚠Relative URLs missed

In HTML, <a href="/page"> is a relative URL. Plain text mode doesn’t match /page without scheme; HTML mode with a base URL can resolve these to absolute.

⚠Query parameter dedup

https://example.com/page and https://example.com/page?utm=x are different URLs that point to the same resource. Enable canonical mode to collapse them.

⚠URL-encoded characters

%20 (space), %2F (slash), and other encoded characters are valid URL content. The extractor captures them; decode them separately if you need readable output.
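Decoding for display can be done separately with the standard library. A small Python sketch follows; the function name `readable` is an assumption.

```python
from urllib.parse import unquote


def readable(url: str) -> str:
    """Percent-decode a URL for display purposes only."""
    return unquote(url)


print(readable("https://example.com/my%20file.txt"))
print(readable("https://example.com/a%2Fb"))
```

Keep the encoded form for anything that will be requested or compared: decoding %2F turns an encoded slash into a real path separator, so the decoded string is not always equivalent to the original URL.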

⚠International domain names

IDN domains often appear in punycode (xn--...) form in URLs. The extractor handles both punycode and Unicode forms if the scheme supports them.
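To compare the two forms after extraction, a host can be converted between Unicode and punycode. The Python sketch below uses the built-in `idna` codec, which implements the older IDNA 2003 rules. The function names are hypothetical, and this is not the tool's own code.

```python
def host_to_punycode(host: str) -> str:
    """Convert a Unicode host name to its ASCII (xn--) form."""
    return host.encode("idna").decode("ascii")


def host_to_unicode(host: str) -> str:
    """Convert an ASCII (xn--) host name back to Unicode."""
    return host.encode("ascii").decode("idna")


print(host_to_punycode("münchen.de"))
print(host_to_unicode("xn--mnchen-3ya.de"))
```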

⚠Mailto and tel links

mailto:user@example.com and tel:+15551234 are technically URIs but not web URLs. Enable the extended scheme option to include them; otherwise they are skipped.
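An extended scheme option can be sketched as a pattern builder that adds extra schemes to the web pattern. This Python example is an assumption about how such an option could work; `build_pattern` and its parameter are hypothetical names.

```python
import re

WEB = r'(?:https?|ftp)://[^\s<>"()]+'


def build_pattern(extra_schemes: tuple[str, ...] = ()) -> re.Pattern:
    """Return a URL regex; extra schemes such as mailto or tel use 'scheme:' without '//'."""
    if not extra_schemes:
        return re.compile(WEB)
    extra = "|".join(re.escape(s) for s in extra_schemes)
    return re.compile(rf'{WEB}|\b(?:{extra}):[^\s<>"()]+')


text = "Mail mailto:user@example.com or call tel:+15551234 or visit https://example.com"
print(build_pattern().findall(text))
print(build_pattern(("mailto", "tel")).findall(text))
```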

How it compares

Compared to Unix grep with a URL regex, this tool has a visual UI, automatic deduplication, and multiple export formats. grep wins for scripting and log processing; this tool wins for interactive extraction from pasted text.

Compared to a custom script in Python or JavaScript, this tool is faster when you need a one-time extraction right now. For automated pipelines, write code.

Compared to specialized SEO crawlers (Screaming Frog, Sitebulb), this tool extracts URLs from text you already have rather than crawling a live site. Crawlers work at a different level of the stack; this tool is for working with captured content.

Questions and answers

▶How do I extract all URLs from a document?

Paste the document text into the input, choose plain text or HTML mode, and click extract. Every URL is found, deduplicated, and displayed with occurrence counts. Works on log files, HTML source, emails, and documents of any reasonable size.

▶Can the tool extract relative URLs?

In HTML mode, yes — if you provide a base URL, relative hrefs (/page, ./image.png) are resolved to absolute URLs. In plain text mode, only URLs with a scheme (http://, https://, ftp://) or www. prefix are extracted.

▶Are duplicate URLs removed?

Yes. The tool deduplicates by default, collapsing multiple occurrences of the same URL into one entry with an occurrence count. Normalization rules handle trailing slashes and host case so https://example.com/ and HTTPS://EXAMPLE.COM match.

▶What is canonical URL mode?

Canonical mode strips query strings (?key=value) and fragments (#section) from extracted URLs, leaving only the resource URL. Useful when you want unique pages rather than unique parameterized views. Turn off canonical mode if you need to preserve parameters (for traffic analysis, for example).

▶Can I export as a sitemap?

Yes. After extraction, pick sitemap XML as the output format. The tool produces a standard sitemap.xml structure with your URLs that you can save and submit to search engines.
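A sitemap of this kind can be sketched in Python with `xml.etree.ElementTree`. The namespace is the one defined by the Sitemap XML protocol. The function name and the decision to emit only `<loc>` entries are assumptions, since the tool's exact output is not shown.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: list[str]) -> str:
    """Build a minimal sitemap.xml string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url  # special characters are escaped on output
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


print(build_sitemap(["https://example.com/", "https://example.com/a?x=1&y=2"]))
```

Building the tree with a library instead of string concatenation ensures that characters such as & in query strings are escaped as the protocol requires.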

▶Does it make HTTP requests to verify URLs?

No. The tool only extracts URLs from text — it does not fetch them or check their status. For broken-link checking, use a dedicated tool that makes HEAD requests to each URL. All operations here are purely text processing.

▶Is it safe for internal logs?

Yes. Extraction runs in your browser with no network activity. URLs — including internal API paths, auth tokens in query strings, and private resources — never leave your machine. For regulated environments, always confirm with your security team.

▶How large a file can I process?

Modern browsers handle up to several megabytes of text in under one second. For much larger files (log archives, bulk exports), use a command-line grep with a URL regex for efficient streaming extraction.
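If a script is more convenient than grep, the same streaming idea can be sketched in Python: process one line at a time so the whole file never has to fit in memory. The generator name `iter_urls` and the file name in the usage comment are hypothetical.

```python
import re
from typing import Iterable, Iterator

URL_RE = re.compile(r'(?:https?|ftp)://[^\s<>"()]+')


def iter_urls(lines: Iterable[str]) -> Iterator[str]:
    """Yield URLs one at a time from any iterable of lines (for example, an open file)."""
    for line in lines:
        yield from URL_RE.findall(line)


# Usage with a large file (hypothetical path):
# with open("access.log", encoding="utf-8", errors="replace") as fh:
#     for url in iter_urls(fh):
#         print(url)

print(list(iter_urls(["a https://x.com\n", "no url\n", "c ftp://y.org/f\n"])))
```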

Additional resources

RFC 3986 URI Generic Syntax — Internet standard defining URL syntax and structure.
WHATWG URL Standard — Living standard for URL parsing used by modern browsers, more lenient than RFC 3986.
MDN URL constructor — Browser URL API useful for programmatic URL parsing and normalization.
Sitemap XML protocol — Reference for sitemap XML format, one of this tool’s export formats.
IANA URI schemes — Official list of registered URI schemes (http, https, ftp, mailto, and many more).
