Skip to content

sona-gj/pdf-highlighter

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

PDF Highlighter POC — README

A drop-in React component that renders a PDF (including pre-signed S3 URLs) and highlights all occurrences of one or more search terms with pixel-accurate overlays and hover tooltips. Built on top of react-pdf (pdf.js).


What you get

  • <PdfHighlighter /> component:

    • Displays a PDF inline in your React app.
    • Highlights matches for one or many search terms.
    • Ultra-precise geometry using the rendered text layer + DOM Ranges (correct across split spans, kerning, ligatures, and line wraps).
    • Transparent yellow (or per-term colors), tooltip on hover.
    • Works with pre-signed S3 URLs (1-hour validity etc.).

Quick start

1) Install

npm i react-pdf pdfjs-dist

Node ≥ 18 recommended.

2) PDF worker (required)

react-pdf must use the same pdf.js worker version you installed.

// e.g. in your PdfHighlighter.jsx
import { pdfjs } from "react-pdf";
import workerSrc from "pdfjs-dist/build/pdf.worker.mjs?url";
pdfjs.GlobalWorkerOptions.workerSrc = workerSrc;

3) Import text/annotation CSS (Vite/CRA)

import "react-pdf/dist/Page/AnnotationLayer.css";
import "react-pdf/dist/Page/TextLayer.css";

4) Use the component

import PdfHighlighter from "./components/PdfHighlighter";

export default function Example() {
  const url = "/docs/sample.pdf"; // or pre-signed S3 URL
  const terms = ["submit", "Virginia Tech"]; // one or many

  return <PdfHighlighter url={url} searchTexts={terms} scale={1.25} />;
}

That’s it—highlights should render inline.


Component API

<PdfHighlighter
  url={string}                  // PDF url (supports S3 pre-signed)
  searchText={string}           // single term (optional)
  searchTexts={string[]}        // multiple terms (optional)
  scale={number}                // zoom factor; default 1.25
/>

Notes:

  • Pass either searchText or searchTexts (the latter wins if both provided).
  • Terms are trimmed and de-duplicated.
  • Matching is case-insensitive by default and tolerant of line breaks and end-of-line hyphens (e.g., sub-\nmit → “submit”).

How it works (architecture)

  1. react-pdf:

    • Streams bytes to the pdf.js worker, renders the page graphics onto a <canvas>.
    • Builds a text layer (.react-pdf__Page__textContent) of absolutely positioned <span>s so text is selectable.
  2. Our overlay:

    • For each page, we add an absolutely positioned overlay <div> above the text layer.

    • We walk the text layer DOM (text nodes in visual order) to build a virtual page string and an index mapping each virtual character → {node, offset}.

    • For each search term:

      • We normalize whitespace (so multi-line queries work) and find all matches.
      • For each match we create a DOM Range, call range.getClientRects() to obtain the actual on-screen rectangles, and draw translucent <div>s at those coordinates.
    • A MutationObserver recomputes highlights when the text layer finishes rendering or reflows (e.g., after zoom).

This yields pixel-perfect highlights without guessing character widths.


S3 pre-signed URLs

Alignment is identical regardless of origin. Ensure:

  • Content-Type: application/pdf

  • Bucket CORS allows your app origin (GET/HEAD).

  • Optional response headers when generating the URL:

    • response-content-type=application/pdf
    • response-content-disposition=inline

Example CORS rule (minimal):

[
  {
    "AllowedOrigins": ["https://your-app.example"],
    "AllowedMethods": ["GET", "HEAD"],
    "AllowedHeaders": ["*"],
    "ExposeHeaders": [
      "Accept-Ranges",
      "Content-Range",
      "Content-Length",
      "ETag"
    ],
    "MaxAgeSeconds": 300
  }
]

Styling & customization

  • Color per term — the component hashes each term to an HSL color. Swap the colorForTerm(term) helper to enforce your design tokens.
  • Tooltip — shown on hover; easy to replace with your UI library (e.g., Radix Tooltip). The tooltip content is the matched term (React escapes text by default).

Performance guidance

This POC is fast for typical docs. For very long PDFs:

  • Compute on visible pages only Add an IntersectionObserver to run highlight computation for pages within/near the viewport.
  • Debounce inputs Debounce searchText(s) changes (150–250ms) to avoid recomputing on every keystroke.
  • Avoid unnecessary re-renders Keep url, scale, and searchTexts stable where possible; memoize term arrays.

Accessibility

  • Tooltips supplement but do not replace native selection/accessibility of the text layer.
  • Highlight overlays use pointer-events: auto only for hover; they otherwise allow text selection beneath.
  • If keyboard tooltip access is required, use a11y-ready tooltip components.

Browser support

  • Requires browsers that support Range.getClientRects() (all modern evergreen browsers).
  • Mobile Safari works, but heavy PDFs + many highlights may benefit from the perf tips above.

Troubleshooting

“Setting up fake worker” / worker version mismatch Ensure you import the worker from your installed pdfjs-dist and remove any CDN worker lines.

import workerSrc from "pdfjs-dist/build/pdf.worker.mjs?url";
pdfjs.GlobalWorkerOptions.workerSrc = workerSrc;

Text loads but “Found 0 matches” The text layer may not be ready. The component uses a MutationObserver to recompute on text layer population; if you fork it, keep that logic. Also confirm .react-pdf__Page__textContent exists in the DOM.

Highlights offset / misaligned Ensure:

  • The overlay is a sibling inside the same per-page wrapper with position: relative.
  • You’re importing the correct TextLayer.css.
  • You’re not wrapping the page with transforms that change its bounding box.

CORS errors with S3 Set bucket CORS as above and verify the URL is not expired. Avoid file:// URLs; serve via HTTP.

Scanned PDFs (no text) No text layer → nothing to match. You’d need OCR (out of scope for this POC).


Security & reliability

  • Pre-signed S3 URLs are short-lived; the component simply reads them—it doesn’t store or proxy.
  • React escapes tooltip text; do not dangerously set HTML.
  • Do not log full pre-signed URLs in production logs.

File layout (reference)

src/
  components/
    PdfHighlighter.jsx       # the component (overlay + observer + rendering)
    PdfHighlighter.css       # highlight + tooltip styles
  utils/
    domHighlights.js         # multi-term, line-break/ hyphen-aware DOM geometry

Integrating into your app

  1. Add the files above.
  2. Ensure the worker import and CSS imports exist exactly once (e.g., inside the component).
  3. Render:
<PdfHighlighter
  url={s3Url} // or public path
  searchTexts={["submit", "Yes/True", "Virginia Tech"]}
  scale={1.25}
/>
  1. (Optional) Wrap in your design system shell and wire search inputs to searchTexts.

Roadmap (nice-to-haves)

  • Case-sensitive / whole-word toggles (regex in the finder).
  • Next/Prev navigation & scroll-into-view.
  • Persist/export highlight geometry (page, rects) for analytics/audit.
  • Compute-on-visibility (IntersectionObserver) for very long docs.

About

Pixel-accurate multi-term PDF highlighting for React. Renders a PDF (incl. pre-signed S3 URLs) and highlights multiple search terms across all pages with tooltips using react-pdf/pdf.js.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages

  • JavaScript 84.4%
  • CSS 13.5%
  • HTML 2.1%