Editing pdfs

Semi-automatically

R
PDF
Author

Christian Knudsen

Published

June 22, 2025

You have a pdf. One of the nice, text based ones, where you can copy the text. But for some reason you want to remove some of the text. Preferably without investing in expensive software from Adobe.

So, is there a better way. Maybe even a way that can be scriptet so you can remove unnessecary text from multiple pdfs.

Such a way exist. There are some restrictions. Some pdfs are encrypted or password protected. And the solution below will not fix that.

A program, qpdf, that can perform transformations on PDFs. Without changing the content.

This is a command-line program. That is daunting for some. But it means that we are able to script it.

We could do that directly in a shell-script. But I’m an R kind of guy.

Begin by installing.

sudo apt update
sudo apt install qpdf

There is some documentation on the repo of qpdf that might be useful if you are on a windows macine.

PDFs are compressed. The process consist of three steps:

  1. Decompres
  2. Remove the fluff
  3. (re)Compres the file

Lets script it. Begin by setting parameters and temporary files (and now we’re back in R):

input_pdf <- "fluffy.pdf"
output_pdf <- "non-fluffy.pdf"
fluff <- "\\(he/him\\)"
temp_pdf <- tempfile(fileext = ".pdf")

Note that fluff is a regular expression, and since we are in R, we need to double-escape the parentheses.

Next, we construct the string we need to run to decompress the file:

cli_decompress <- paste0(
  "qpdf --qdf --object-streams=disable '", 
  input_pdf, "' ' ", temp_pdf, "'")

The result looks like this:

cli_decompress

We send it to the commandline:

system(cli_decompress)

Now construct what we should send to the commandline to do a case-insentive search and replace, using sed:

cmd_sed <- paste0("sed -i 's/", fluff, "//Ig' '", temp_pdf, "'")

Send that to the commandline:

system(cmd_sed)

Finally we re-compress the pdf, constructing the command line command:

cmd_compress <- paste0("qpdf '", temp_pdf, "' '", output_pdf)
)
cmd_compress

And run it.

system(cmd_compress)

Hey presto, fluffy text has been removed.

This can now be automated, removing other kinds fluff, and iterating over more than one pdf. This is left as an exercise for the interested student.