Extract URLs from Text
Extract every URL from pasted text with deduplication, validation, and export to CSV, JSON, or newline-separated list. Free and private: all processing happens in your browser.
The Extract URLs tool pulls every URL out of any pasted text — HTML source, log files, email threads, documents, code files, chat transcripts. Every developer and content worker hits this problem: an HTML source has dozens of links you want to audit, a log file contains request URLs for traffic analysis, an email thread has shared links you want to collect, a content export has URLs embedded in free text fields. Manual copying is slow; this tool extracts all of them in milliseconds.
Output is deduplicated and optionally validated. The regex catches http://, https://, ftp://, and other common schemes, plus www. links that omit the protocol. Duplicate URLs across the text are collapsed into unique entries (with a count of how many times each appeared). Optional cleanup removes URL parameters, fragments, and trailing punctuation that often accompanies URLs in sentences ("visit https://example.com." → the URL is https://example.com, not https://example.com.). Export formats include plain list, CSV with counts, JSON array, or sitemap-style XML. All processing runs locally in your browser: no request is made to any URL, and no data is sent to a server.
Extract URLs from Text — key features
HTTP, HTTPS, and FTP
Extracts URLs with common schemes. Optional support for custom schemes (mailto, tel, ssh, etc.).
HTML mode
For HTML input, parses anchor tags correctly using a DOM parser for robust extraction with optional relative-to-absolute resolution.
Deduplication
Collapses duplicate URLs across the text with normalization of trailing slash and host case.
Strip query/fragment
Optional canonical mode removes ?query strings and #fragments to extract unique resource URLs only.
Trailing punctuation removal
Strips sentence punctuation that often follows URLs in prose for cleaner extraction.
Multiple export formats
Plain list, CSV with occurrence counts, JSON array, or XML sitemap.
Counts and stats
Shows total URLs found, unique URLs, and occurrence counts per URL.
Client-side only
URLs are extracted in your browser with no external requests — safe for sensitive internal logs.
How to use Extract URLs from Text
1. Paste the text
Drop any text containing URLs into the input: a log file, HTML source, email, or document.
2. Choose mode
Plain text mode for log files and documents; HTML mode for web page source code.
3. Set options
Toggle canonical mode (strip query/fragment), scheme filtering (http only, https only, or all), and deduplication preferences.
4. Extract
Click Extract. The tool shows every unique URL with occurrence counts and format options.
5. Copy or download
Copy the list to clipboard or download as CSV, JSON, or sitemap XML for further processing.
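As a rough illustration of the CSV and JSON exports mentioned in step 5, here is a minimal Python sketch. The function names (to_csv, to_json) and the column headers (url, count) are assumptions made for this example, not the tool's actual output schema.

```python
import csv
import io
import json
from collections import Counter

def to_csv(counts):
    """Render URL occurrence counts as CSV text, most frequent first."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["url", "count"])  # assumed header names
    for url, n in counts.most_common():
        writer.writerow([url, n])
    return buf.getvalue()

def to_json(counts):
    """Render the unique URLs as a JSON array, in first-seen order."""
    return json.dumps(list(counts), indent=2)

counts = Counter(["https://a.example", "https://b.example", "https://a.example"])
print(to_csv(counts))
print(to_json(counts))
```

The csv module quotes any URL that contains a comma, so the file stays parseable even for unusual URLs.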
Common use cases for Extract URLs from Text
SEO and content
- Outbound link audit: Extract all links from a blog post or article to audit destinations, check for broken links, or verify link quality.
- Content migration: Extract URLs from old CMS exports to build a redirect map when migrating to a new URL structure.
- Competitor link analysis: Pull all outgoing links from a competitor’s HTML to understand their external linking strategy.
DevOps and logs
- Traffic analysis: Extract request URLs from access logs to identify top pages, detect anomalous paths, or build traffic reports.
- Security scanning: Pull URLs from phishing emails or suspicious logs for threat intelligence and IOC analysis.
- API endpoint discovery: Extract every unique API endpoint from mixed log data for documentation or rate limit planning.
Research and archival
- Citation collection: Extract all cited URLs from a paper or report for bibliography verification.
- Social media analysis: Pull URLs shared in posts or comments to analyze link-sharing patterns.
- Email thread analysis: Collect every link shared across a long email thread for reference.
Extract URLs from Text — examples
Mixed prose
URLs extracted from a paragraph.
Input: Check out https://example.com and https://tooleras.com. Also www.github.com has great content.
Output: https://example.com https://tooleras.com www.github.com
HTML source
Extracted from anchor tags.
Input: <a href="https://page1.com">page 1</a> and <a href="https://page2.com">page 2</a>
Output: https://page1.com https://page2.com
With query strings
Canonical mode strips parameters.
Input: https://shop.com/item?id=42&ref=home and https://shop.com/item?id=42&ref=promo
Output (canonical mode): https://shop.com/item (1 unique)
Output (with parameters): 2 distinct URLs
From log file
Extracting access log URLs.
Input:
192.168.1.1 - GET /api/users?page=1 HTTP/1.1 200
192.168.1.2 - GET /api/users?page=2 HTTP/1.1 200
Output: /api/users?page=1 /api/users?page=2 (relative paths extracted separately in path-only mode)
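One way to approximate the path-only extraction in this example is to anchor on the HTTP method and protocol version. This is only a sketch: the pattern and the function name extract_request_paths are assumptions for illustration, and real access-log formats vary.

```python
import re

# Matches "<METHOD> <path> HTTP/<version>" as it appears in the sample log lines.
REQUEST_LINE = re.compile(
    r"\b(?:GET|POST|PUT|PATCH|DELETE|HEAD|OPTIONS)\s+(\S+)\s+HTTP/\d"
)

def extract_request_paths(log_text):
    """Return the request path (with query string) from each request line."""
    return REQUEST_LINE.findall(log_text)

log = (
    "192.168.1.1 - GET /api/users?page=1 HTTP/1.1 200\n"
    "192.168.1.2 - GET /api/users?page=2 HTTP/1.1 200\n"
)
print(extract_request_paths(log))
```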
Punctuation cleanup
Trailing sentence punctuation removed.
Input: See https://example.com. It works.
Output: https://example.com (the trailing period is stripped since it is sentence punctuation, not URL content)
Technical details
URL extraction regex needs to handle several edge cases correctly.
The core pattern: (https?|ftp):\/\/[^\s<>"\(\)]+
Schemes: http, https, ftp are the common ones. mailto, tel, and other custom schemes can be added via options. The tool also catches www.-prefixed URLs (lacking explicit protocol) by default for common web text.
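The pattern and scheme handling described above can be sketched in Python as follows. The function name extract_urls and its parameters are hypothetical. The scheme list only covers schemes written with "://", so mailto and tel would need a separate branch.

```python
import re

# Character class from the core pattern: stop at whitespace, angle brackets,
# double quotes, or parentheses.
URL_BODY = r'[^\s<>"()]+'

def extract_urls(text, schemes=("http", "https", "ftp"), include_www=True):
    """Find scheme-prefixed URLs and, optionally, bare www. links."""
    scheme_alt = "|".join(re.escape(s) for s in schemes)
    pattern = rf"(?:{scheme_alt})://{URL_BODY}"
    if include_www:
        # www. links that omit the protocol
        pattern += rf"|\bwww\.{URL_BODY}"
    return re.findall(pattern, text, flags=re.IGNORECASE)

sample = "Check out https://example.com and ftp://files.example.org/a.txt or www.github.com today"
print(extract_urls(sample))
```

Trailing sentence punctuation is deliberately left in the matches here; trimming it is a separate step.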
Path and query: everything up to whitespace, angle brackets, quotes, or parentheses is treated as part of the URL. Punctuation at the very end is typically trimmed (periods, commas, semicolons, colons) because URLs in prose often end right before sentence punctuation.
Trailing punctuation heuristic: URLs rarely end in a period or comma, so the tool strips these by default. But some URLs legitimately contain trailing slashes or punctuation-like characters. The heuristic is configurable.
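A minimal sketch of this heuristic, assuming the trimmed set is exactly the four characters named above (period, comma, semicolon, colon). The name trim_trailing_punctuation and the chars parameter are illustrative, standing in for whatever configuration the tool exposes.

```python
DEFAULT_TRAILING = ".,;:"

def trim_trailing_punctuation(url, chars=DEFAULT_TRAILING):
    """Strip sentence punctuation from the end of a matched URL.

    Only the end of the string is touched, so punctuation inside the
    URL and trailing slashes are preserved.
    """
    return url.rstrip(chars)

print(trim_trailing_punctuation("https://example.com."))
print(trim_trailing_punctuation("https://example.com/docs/"))
```

Passing an empty string for chars disables trimming for URLs that legitimately end in one of these characters.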
Deduplication: trailing slashes are normalized (https://example.com and https://example.com/ are treated as the same URL), as is the case of the scheme and host (HTTPS://Example.COM and https://example.com are the same). The case of the path is preserved because, per RFC 3986, URL components other than the scheme and host are case-sensitive.
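These normalization rules might look like the following in Python, using the standard urllib.parse module. The helper names normalize and count_unique are assumptions for this sketch, and lowercasing the whole network location is a simplification that would also lowercase any embedded credentials.

```python
from collections import Counter
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Lowercase scheme and host, drop trailing slashes, keep path case."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, parts.fragment)
    )

def count_unique(urls):
    """Collapse duplicates into normalized entries with occurrence counts."""
    return Counter(normalize(u) for u in urls)

found = ["https://example.com", "https://example.com/", "HTTPS://Example.COM", "https://example.com/Path"]
for url, n in count_unique(found).items():
    print(n, url)
```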
Fragment and query handling: by default, fragments (#section) and query strings (?utm_source=...) are preserved. Options strip them for canonical URL extraction where you want to see unique resources rather than unique parameterized views.
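Canonical mode amounts to rebuilding each URL without its query and fragment. A small sketch, with canonicalize as a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Drop the ?query and #fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

urls = [
    "https://shop.com/item?id=42&ref=home",
    "https://shop.com/item?id=42&ref=promo",
]
unique = {canonicalize(u) for u in urls}
print(unique)
```

Both parameterized URLs collapse to the same resource URL, which is why canonical mode reports one unique entry for them.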
Anchor tag extraction: for HTML input, parsing <a href="..."> with a DOM parser is more robust than regex. The tool offers HTML mode that uses DOMParser for accurate anchor extraction (including relative URLs resolved to absolute if a base URL is provided).
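The same idea can be sketched outside the browser. The example below uses Python's standard html.parser as a stand-in for DOMParser, so it illustrates the approach rather than the tool's own code; AnchorCollector and extract_hrefs are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags, optionally resolving them."""

    def __init__(self, base_url=None):
        super().__init__()
        self.base_url = base_url
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative hrefs only when a base URL was given.
                resolved = urljoin(self.base_url, value) if self.base_url else value
                self.hrefs.append(resolved)

def extract_hrefs(html, base_url=None):
    collector = AnchorCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.hrefs

html = '<a href="https://page1.com">page 1</a> and <a href="/page">page 2</a>'
print(extract_hrefs(html, base_url="https://example.com/dir/"))
```

Without a base URL, relative hrefs such as /page are returned unchanged.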
Shortened URLs: the tool extracts but does not resolve short URLs (bit.ly, t.co, etc.). Resolution requires HTTP requests, which are out of scope for a client-side extractor.
Performance: regex extraction on multi-megabyte text runs in milliseconds. HTML parsing is slower but still fast for normal documents.
Common problems and solutions
⚠Trailing punctuation included
URLs followed by periods, commas, or closing brackets often include that punctuation in the match. The tool strips common trailing punctuation, but rare URLs that legitimately end in such characters are affected. Verify output.
⚠Parentheses in URLs
Wikipedia URLs sometimes have parentheses in the path. Regex that treats ) as a terminator truncates these. The tool has a Wikipedia-safe mode that allows balanced parentheses.
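One plausible way to allow balanced parentheses is to let the match include them and then trim only unmatched closing parentheses from the end. This is a sketch of that idea under those assumptions, not a description of how the tool's Wikipedia-safe mode is implemented; the names below are hypothetical.

```python
import re

# Like the core pattern, but parentheses are allowed inside the match.
URL_WITH_PARENS = re.compile(r'https?://[^\s<>"]+')

def trim_unbalanced_parens(url):
    """Remove trailing ')' characters that have no matching '(' in the URL."""
    while url.endswith(")") and url.count(")") > url.count("("):
        url = url[:-1]
    return url

def extract_paren_safe(text):
    return [trim_unbalanced_parens(m) for m in URL_WITH_PARENS.findall(text)]

text = "(see https://en.wikipedia.org/wiki/Python_(programming_language)) and (https://example.com/a)"
print(extract_paren_safe(text))
```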
⚠Relative URLs missed
In HTML, <a href="/page"> is a relative URL. Plain text mode doesn’t match /page without scheme; HTML mode with a base URL can resolve these to absolute.
⚠Query parameter dedup
https://example.com/page and https://example.com/page?utm=x are different URLs that point to the same resource. Enable canonical mode to collapse them.
⚠URL-encoded characters
%20 (space), %2F (slash), and other encoded characters are valid URL content. The extractor captures them; decode them separately if you need readable output.
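If readable output is needed, decoding can be done as a separate step after extraction. A short sketch using Python's urllib.parse.unquote; the helper name decode_for_display is an assumption.

```python
from urllib.parse import unquote

def decode_for_display(url):
    """Percent-decode a URL for human reading only."""
    return unquote(url)

raw = "https://example.com/my%20file%2Fv2.txt"
print(decode_for_display(raw))
```

Keep the raw form for anything that will be requested or compared: decoding %2F turns an encoded slash into a real path separator, which changes the URL's meaning.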
⚠International domain names
IDN domains often appear in punycode (xn--...) form in URLs. The extractor handles both punycode and Unicode forms if the scheme supports them.
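To compare or deduplicate the two forms, a host can be converted between them. The sketch below uses Python's built-in idna codec, which implements the older IDNA 2003 rules, so treat it as an illustration rather than a complete IDN solution; the function names are hypothetical.

```python
def host_to_punycode(host):
    """Convert a Unicode hostname to its ASCII (xn--) form."""
    return host.encode("idna").decode("ascii")

def host_to_unicode(host):
    """Convert an ASCII (xn--) hostname back to Unicode."""
    return host.encode("ascii").decode("idna")

ascii_host = host_to_punycode("münchen.de")
print(ascii_host)
print(host_to_unicode(ascii_host))
```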
⚠Mailto and tel links
mailto:user@example.com and tel:+15551234 are technically URIs but not web URLs. Enable the extended scheme option to include them; otherwise they are skipped.
Extract URLs from Text — comparisons and alternatives
Compared to Unix grep with a URL regex, this tool has a visual UI, automatic deduplication, and multiple export formats. grep wins for scripting and log processing; this tool wins for interactive extraction from pasted text.
Compared to a custom script in Python or JavaScript, this tool is quicker for a one-time extraction because there is nothing to write or run. For automated pipelines, write code.
Compared to specialized SEO crawlers (Screaming Frog, Sitebulb), this tool extracts URLs from text you already have rather than crawling a live site. Crawlers work at a different level of the stack; this tool is for working with captured content.
Frequently asked questions about Extract URLs from Text
▶How do I extract all URLs from a document?
Paste the document text into the input, choose plain text or HTML mode, and click extract. Every URL is found, deduplicated, and displayed with occurrence counts. Works on log files, HTML source, emails, and documents of any reasonable size.
▶Can the tool extract relative URLs?
In HTML mode, yes — if you provide a base URL, relative hrefs (/page, ./image.png) are resolved to absolute URLs. In plain text mode, only URLs with a scheme (http://, https://, ftp://) or www. prefix are extracted.
▶Are duplicate URLs removed?
Yes. The tool deduplicates by default, collapsing multiple occurrences of the same URL into one entry with an occurrence count. Normalization rules handle trailing slashes and host case so https://example.com/ and HTTPS://EXAMPLE.COM match.
▶What is canonical URL mode?
Canonical mode strips query strings (?key=value) and fragments (#section) from extracted URLs, leaving only the resource URL. Useful when you want unique pages rather than unique parameterized views. Turn off canonical mode if you need to preserve parameters (for traffic analysis, for example).
▶Can I export as a sitemap?
Yes. After extraction, pick sitemap XML as the output format. The tool produces a standard sitemap.xml structure with your URLs that you can save and submit to search engines.
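For reference, a minimal sitemap built from a list of URLs could be generated like this. The sketch emits only the required loc element for each URL and escapes XML special characters; to_sitemap is a hypothetical name, and the tool's actual output may include more fields.

```python
from xml.sax.saxutils import escape

def to_sitemap(urls):
    """Build a minimal sitemap.xml document from a list of URLs."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for url in urls:
        # escape() converts &, <, and > to XML entities.
        lines.append(f"  <url><loc>{escape(url)}</loc></url>")
    lines.append("</urlset>")
    return "\n".join(lines)

sitemap = to_sitemap(["https://shop.com/item?id=42&ref=home", "https://example.com/"])
print(sitemap)
```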
▶Does it make HTTP requests to verify URLs?
No. The tool only extracts URLs from text — it does not fetch them or check their status. For broken-link checking, use a dedicated tool that makes HEAD requests to each URL. All operations here are purely text processing.
▶Is it safe for internal logs?
Yes. Extraction runs in your browser with no network activity. URLs — including internal API paths, auth tokens in query strings, and private resources — never leave your machine. For regulated environments, always confirm with your security team.
▶How large a file can I process?
Several megabytes of text can be processed in a modern browser in under one second. For much larger files (log archives, bulk exports), use command-line grep with a URL regex for efficient streaming extraction.
Additional resources
- RFC 3986 URI Generic Syntax — Internet standard defining URL syntax and structure.
- WHATWG URL Standard — Living standard for URL parsing used by modern browsers, more lenient than RFC 3986.
- MDN URL constructor — Browser URL API useful for programmatic URL parsing and normalization.
- Sitemap XML protocol — Reference for sitemap XML format, one of this tool’s export formats.
- IANA URI schemes — Official list of registered URI schemes (http, https, ftp, mailto, and many more).
Related tools
Extract Emails from Text
Extract every email address from pasted text with validation, deduplication, and clean export to CSV or line-separated list.
Extract Phone Numbers
Extract phone numbers from pasted text in US, UK, EU, and international formats, deduplicate, and export to CSV, JSON, or plain list.
Find and Replace
Find and replace text with regex support, case sensitivity, whole-word matching, and preview of all changes before applying.
Line Counter
Count lines in text with separate totals for blank lines, non-blank lines, words, characters, and paragraphs for detailed statistics.
Regex Tester
Test and debug regular expressions with live matching and explanation.
Remove Duplicate Lines
Remove duplicate lines from text with case-sensitive or case-insensitive matching, preserving original order or sorting the result.