Find missing words in Spotlight search
Why can’t Spotlight find words in the content of some of my PDF documents?
PDF documents can contain two notional layers: the first containing images, perhaps of printed pages that were originally scanned in, and the second containing laid-out text that might have been generated by OCR from those scanned images. Those created directly from Word and Pages normally have only the second layer, together with any images embedded in their text.
Currently, when Spotlight indexes PDF documents, it only uses text from that second layer, where present, and doesn’t perform OCR on any page images. It also faces the problem that both layers are intended to be accurate visual representations of the document, rather than to provide structured access to its text contents, although that’s starting to change with the growing use of newer standards such as PDF/A.
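One rough way to see whether a particular PDF is affected is to check which of its pages have any extractable text at all. The sketch below is only an illustration, not how Spotlight's own importer works: it assumes the third-party pypdf library is installed, and the function names pages_without_text and extract_page_texts are made up for this example. Pages it reports as empty are likely to be image-only, so Spotlight would probably find nothing to index on them, although pypdf's text extraction may not match Spotlight's exactly.

```python
from typing import List


def pages_without_text(page_texts: List[str], min_chars: int = 1) -> List[int]:
    """Return 1-based numbers of pages whose text layer is effectively empty.

    page_texts holds the text extracted from each page, in page order.
    min_chars is an illustrative threshold (an assumption, not a standard).
    """
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len((text or "").strip()) < min_chars
    ]


def extract_page_texts(path: str) -> List[str]:
    """Extract each page's text layer using pypdf (assumed to be installed)."""
    # Imported here so that pages_without_text() has no third-party dependency.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() reads only the text layer; it does not OCR page images.
    return [page.extract_text() or "" for page in reader.pages]
```

For a hypothetical file named scan.pdf, calling pages_without_text(extract_page_texts("scan.pdf")) returns the list of page numbers that appear to have no text layer. If that list covers most of the document, running the PDF through an OCR tool to add a text layer is the likely remedy.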