How to Extract Text from a PDF in JavaScript (2026)
Extracting text from a PDF is a common requirement for search, data processing, and document analysis. In a JavaScript environment, you have two primary options: an open-source library like PDF.js, or a commercial SDK like Aoexl that adds layout reconstruction and OCR on top of basic extraction.
This guide provides complete code examples for both approaches and explains when to use each.
1. Using PDF.js (Open Source)
PDF.js is an industry-standard open-source library maintained by Mozilla. It is excellent for basic text extraction from well-structured PDFs.
Setup
Include the library from a CDN:
<script type="module">
  import * as pdfjsLib from 'https://mozilla.github.io/pdf.js/build/pdf.mjs';
  // Point PDF.js at its worker script so parsing runs off the main thread.
  pdfjsLib.GlobalWorkerOptions.workerSrc = 'https://mozilla.github.io/pdf.js/build/pdf.worker.mjs';
</script>
Extraction Logic
The following function loads a PDF and iterates through every page to extract selectable text content:
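// Run this inside the module script above so that pdfjsLib is in scope.
async function extractText(pdfUrl) {
  // getDocument() returns a loading task; its promise resolves to the document.
  const loadingTask = pdfjsLib.getDocument(pdfUrl);
  const pdf = await loadingTask.promise;
  let fullText = "";
  // PDF.js page numbers start at 1.
  for (let i = 1; i <= pdf.numPages; i++) {
    const page = await pdf.getPage(i);
    const textContent = await page.getTextContent();
    // Each item is one positioned text chunk; join them with spaces.
    const pageText = textContent.items.map(item => item.str).join(" ");
    fullText += pageText + "\n\n";
  }
  return fullText;
}
extractText('https://example.com/sample.pdf')
  .then(text => console.log(text))
  .catch(err => console.error("Text extraction failed:", err));
Limitations of PDF.js: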
- Complex Layouts: It extracts text chunks as positioned elements; reconstructing the reading order (columns, tables) can be difficult.
- Scanned Documents: No built-in OCR (Optical Character Recognition) support.
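- Performance: Parsing runs in a web worker when workerSrc is configured as in the setup above, but large documents can still take noticeable time and memory when every page is processed in sequence.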
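
To make the layout limitation concrete, here is a minimal sketch of rebuilding visual lines from the items that getTextContent() returns. It assumes each item has the usual PDF.js shape, with the text in str and the position in transform (index 4 is x, index 5 is y). The function name groupItemsIntoLines and the yTolerance default are illustrative choices, not part of PDF.js.

```javascript
// Group PDF.js text items into visual lines by their vertical position.
// Assumption: each item looks like { str, transform }, where transform[4]
// is the x offset and transform[5] is the y offset of the text chunk.
function groupItemsIntoLines(items, yTolerance = 2) {
  // Ignore empty and whitespace-only chunks.
  const visible = items.filter(
    (item) => typeof item.str === "string" && item.str.trim() !== ""
  );

  // In the default PDF coordinate system y grows upward, so sorting by
  // descending y walks the page from top to bottom.
  const sorted = [...visible].sort((a, b) => b.transform[5] - a.transform[5]);

  const lines = [];
  for (const item of sorted) {
    const y = item.transform[5];
    const current = lines[lines.length - 1];
    if (current && Math.abs(current.y - y) <= yTolerance) {
      // Close enough to the current line's y: same visual line.
      current.items.push(item);
    } else {
      lines.push({ y, items: [item] });
    }
  }

  // Within each line, order the chunks from left to right and join them.
  return lines.map((line) =>
    line.items
      .sort((a, b) => a.transform[4] - b.transform[4])
      .map((item) => item.str)
      .join(" ")
  );
}
```

Inside extractText, you could call groupItemsIntoLines(textContent.items).join("\n") in place of the single join(" "). This only handles single-column pages: text in side-by-side columns that shares a baseline is still merged into one line, which is the kind of case the list above describes.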
2. Using the Aoexl SDK (Commercial)
The Aoexl SDK provides a higher-level API designed for commercial applications. It uses heuristic grouping to reconstruct text lines and blocks, which makes it better suited than raw text chunks to complex document layouts.
Setup
Load the SDK into your project:
<script src="https://cdn.aoexl.com/sdk/aoexl-viewer.js"></script>
<div id="viewer" style="height: 100vh"></div>
Extraction Logic
The Aoexl SDK provides the textLinesForPageIndex method, which returns structured line objects with text and bounding box information.
import { AoexlViewer } from '@aoexl/sign';
async function extractWithAoexl(container, pdfUrl) {
  // Load the document into the viewer container.
  const instance = await AoexlViewer.load({
    container: container,
    document: pdfUrl
  });
  const pageIndex = 0; // First page (page indexes are zero-based)
  const textLines = await instance.textLinesForPageIndex(pageIndex);
  // Each line object carries its text in `contents`.
  const text = textLines.map(line => line.contents).join('\n');
  console.log("Extracted Text:", text);
  return text;
}
Why Choose Aoexl over Open Source?
- Structural Integrity: Automatically detects paragraphs, columns, and lists.
- OCR Support: Built-in engine to extract text from scanned images and non-selectable PDFs.
- Unified API: The same code works across Web, iOS, and Android.
- Additional Tools: Easily combine extraction with redaction, annotation, or signing workflows.
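
The Aoexl example in this section reads only the first page. A rough sketch of extending it to a whole document follows, using only the textLinesForPageIndex method and the contents property shown earlier. The helper name extractAllPagesWithAoexl is hypothetical, and pageCount is passed in by the caller because this guide does not cover how the SDK exposes the page count.

```javascript
// Collect text from every page of an already loaded Aoexl instance.
// Assumptions: `instance` is the object resolved by AoexlViewer.load(), and
// `pageCount` is supplied by the caller (how to read it from the SDK is
// outside the scope of this sketch).
async function extractAllPagesWithAoexl(instance, pageCount) {
  const pages = [];
  for (let pageIndex = 0; pageIndex < pageCount; pageIndex++) {
    // Same call as in the single-page example, repeated per page.
    const textLines = await instance.textLinesForPageIndex(pageIndex);
    pages.push(textLines.map((line) => line.contents).join("\n"));
  }
  // Separate pages with a blank line, matching the PDF.js example.
  return pages.join("\n\n");
}
```

Pages are requested one at a time to keep the sketch simple. Whether requesting them in parallel is safe or faster depends on the SDK and is not something this guide establishes.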
Conclusion
For simple personal projects or strictly linear documents, PDF.js is a great choice. However, if your application needs to handle "real-world" PDFs with complex layouts, tables, or scanned content, the Aoexl SDK provides the reliability and structural awareness required for professional document processing.