Editing PDFs

Semi-automatically

R
PDF
Author

Christian Knudsen

Published

June 22, 2025

You have a PDF. One of the nice, text-based ones, where you can copy the text. But for some reason you want to remove some of that text. Is there a way to do it without investing in expensive software from Adobe? Maybe even a way that can be scripted, so you can remove unnecessary text from multiple PDFs?

Such a way exists. There are some restrictions: some PDFs are encrypted or password protected, and the solution below will not fix that.

We are not able to do it directly in R, but need a helper program, qpdf. Since it runs on the command line, we can script it. We could do that directly in Bash, but I’m an R kind of guy.

Begin by installing qpdf:

sudo apt update
sudo apt install qpdf

There is some documentation in the qpdf repo that might be useful if you are on a Windows machine.

PDFs are compressed. The process consists of three steps:

  1. Decompress the file
  2. Remove the fluff
  3. Recompress the file

Let's script it. Begin by setting parameters and temporary files:

input_pdf <- "fluffy.pdf"
output_pdf <- "non-fluffy.pdf"
fluff <- "\\(he/him\\)"
temp_pdf <- tempfile(fileext = ".pdf")

Note that fluff is a regular expression. The parentheses are escaped with a backslash so that they match literally, and since we are in an R string, each backslash has to be doubled.
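To see what the double escaping does, the pattern can be tried out inside R before it goes anywhere near a PDF. This is only a sketch: the sample strings are made up for illustration, and R's own regex engine stands in for the command line tool used later, so small syntax differences between the two are possible.

```r
fluff <- "\\(he/him\\)"

# cat() prints the string the way a regex engine receives it,
# with single backslashes in front of the parentheses
cat(fluff, "\n")

# Hypothetical sample lines, only for illustration
sample_text <- c("Sam Smith (he/him)", "Sam Smith (HE/HIM)", "Sam Smith")

# The escaped parentheses match literally; ignore.case = TRUE
# gives a case-insensitive match
matches <- grepl(fluff, sample_text, ignore.case = TRUE)
cleaned <- gsub(fluff, "", sample_text, ignore.case = TRUE)
```

Here matches is TRUE for the first two strings only, and cleaned holds the same lines with the pronoun block removed.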

Next, we construct the string we need to run to decompress the file:

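# Both file names are wrapped in single quotes for the shell.
# No space may follow the opening quote, or it becomes part of the file name.
cli_decompress <- paste0(
  "qpdf --qdf --object-streams=disable '",
  input_pdf, "' '", temp_pdf, "'")

We send that to the command line: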

system(cli_decompress)
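system() returns the exit status of the command, and it is easy to miss a failure if that value is thrown away. A minimal sketch of a wrapper that stops on failure is shown here. The name run_cmd and its ok_status argument are my own invention, not part of R or qpdf. The qpdf manual documents exit status 3 as "warnings but no errors", so depending on how strict you want to be, 3 could be added to the accepted codes.

```r
# Hypothetical helper: run a shell command and stop if it fails.
# `ok_status` lists the exit codes that count as success.
run_cmd <- function(cmd, ok_status = 0L) {
  status <- system(cmd)
  if (!status %in% ok_status) {
    stop("Command failed with exit status ", status, ": ", cmd)
  }
  invisible(status)
}
```

It would be used in place of a bare system() call, for example run_cmd(cli_decompress), or run_cmd(cli_decompress, ok_status = c(0L, 3L)) to tolerate warnings.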

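Now construct the command for a case-insensitive search and replace, using sed. The -E flag makes sed read the pattern as an extended regular expression, where an escaped parenthesis is a literal one, and # is used as the delimiter because the pattern itself contains a slash:

cmd_sed <- paste0("sed -i -E 's#", fluff, "##Ig' '", temp_pdf, "'")

Send that to the command line: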

system(cmd_sed)
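The search and replace can be rehearsed on a plain text file before trusting it with a PDF. The sketch below assumes GNU sed (the I flag is a GNU extension), reads the pattern as an extended regular expression via -E, and uses # as the delimiter so that the slash in the pattern does not end the expression early. The file contents are made up for illustration.

```r
# A throwaway text file standing in for the decompressed PDF
demo_file <- tempfile(fileext = ".txt")
writeLines(c("Sam Smith (he/him)", "Sam Smith (HE/HIM), editor"), demo_file)

fluff <- "\\(he/him\\)"

# -E: extended regex, so \( and \) are literal parentheses
# #:  delimiter, so the / inside the pattern needs no escaping
# I:  case-insensitive match, g: every match on a line
demo_sed <- paste0("sed -i -E 's#", fluff, "##Ig' ", shQuote(demo_file))
system(demo_sed)

result <- readLines(demo_file)
```

If result still contains the fluff, the pattern is the thing to fix. In a real PDF the text may also be stored with backslashes in front of the parentheses, or split into several pieces, in which case the pattern has to be adapted to what the decompressed file actually contains.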

Finally we re-compress the PDF, constructing the command:

cmd_compress <- paste0("qpdf '", temp_pdf, "' '", output_pdf, "'")

And run it.

system(cmd_compress)

Hey presto, fluffy text has been removed.

This can now be automated, removing other fluff, and iterating over more than one pdf. This is left as an exercise for the interested student.
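One possible starting point for that exercise, as a sketch rather than a finished tool. All three function names and the folder names "in" and "out" are assumptions made for illustration. The pattern is pasted into the sed expression as is, so it must not contain # or a single quote.

```r
# Hypothetical helper: build the three shell commands for one file.
build_commands <- function(input_pdf, output_pdf, fluff, temp_pdf) {
  c(
    decompress = paste("qpdf --qdf --object-streams=disable",
                       shQuote(input_pdf), shQuote(temp_pdf)),
    strip      = paste0("sed -i -E 's#", fluff, "##Ig' ", shQuote(temp_pdf)),
    compress   = paste("qpdf", shQuote(temp_pdf), shQuote(output_pdf))
  )
}

# Hypothetical helper: run the three steps for one PDF.
remove_fluff <- function(input_pdf, output_pdf, fluff) {
  temp_pdf <- tempfile(fileext = ".pdf")
  on.exit(unlink(temp_pdf))  # clean up the temporary file afterwards
  cmds <- build_commands(input_pdf, output_pdf, fluff, temp_pdf)
  for (cmd in cmds) system(cmd)
  invisible(output_pdf)
}

# Assumed layout: input PDFs in "in/", cleaned copies written to "out/"
clean_folder <- function(in_dir = "in", out_dir = "out",
                         fluff = "\\(he/him\\)") {
  dir.create(out_dir, showWarnings = FALSE)
  pdfs <- list.files(in_dir, pattern = "\\.pdf$", full.names = TRUE)
  for (pdf in pdfs) {
    remove_fluff(pdf, file.path(out_dir, basename(pdf)), fluff)
  }
  invisible(pdfs)
}
```

Keeping the command construction in its own function makes it possible to print and inspect the commands before anything is run. shQuote() takes care of file names with spaces.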