Editing pdfs
Semi-automatically
You have a pdf. One of the nice, text-based ones, where you can copy the text. But for some reason you want to remove some of that text. Preferably without investing in expensive software from Adobe.
So, is there a better way? Maybe even a way that can be scripted, so you can remove unnecessary text from multiple pdfs?
Such a way exists. There are some restrictions: some pdfs are encrypted or password protected, and the solution below will not fix that.
The tool is qpdf, a program that can perform transformations on PDFs without changing their content.
This is a command-line program. That is daunting for some. But it means that we are able to script it.
We could do that directly in a shell script. But I’m an R kind of guy.
Begin by installing qpdf. On Debian or Ubuntu:
sudo apt update
sudo apt install qpdf
There is some documentation in the qpdf repo that might be useful if you are on a Windows machine.
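Since everything below is driven from R, it can be worth confirming first that R can actually find the command-line tools. This is only a minimal sketch, assuming a Unix-like system; `tools_available` is a made-up helper name, not something that ships with qpdf or R.

```r
# Hypothetical helper: report which command-line tools R can find on the PATH.
tools_available <- function(tools) {
  found <- Sys.which(tools)   # returns "" for tools that are not found
  out <- nzchar(found)
  names(out) <- tools         # named logical vector, one entry per tool
  out
}

status <- tools_available(c("qpdf", "sed"))
print(status)

if (!all(status)) {
  message("Missing: ", paste(names(status)[!status], collapse = ", "))
}
```

If either entry comes back FALSE, the system() calls later on will fail, so it is cheaper to find out here.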
PDFs are compressed, so the process consists of three steps:
- Decompress the file
- Remove the fluff
- (Re)compress the file
Let's script it. Begin by setting parameters and a temporary file (and now we’re back in R):
input_pdf <- "fluffy.pdf"
output_pdf <- "non-fluffy.pdf"
fluff <- "\\(he/him\\)"
temp_pdf <- tempfile(fileext = ".pdf")
Note that fluff is a regular expression, and since we are in R, we need to double-escape the parentheses.
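To see what the double escaping actually does, here is a small illustration using R's own regex functions. The sample text is made up. One difference to keep in mind: sed's default (basic) regular expressions treat `\(` and `\)` as a grouping construct rather than as literal parentheses, so the same string can behave differently there, and it is worth checking the result on your own files.

```r
fluff <- "\\(he/him\\)"

# cat() shows the string as the regex engine sees it:
# one backslash in front of each parenthesis.
cat(fluff, "\n")

# In R's regex functions the escaped parentheses match literal ( and ).
example_text <- "Jane Doe (He/Him)"   # made-up sample text
cleaned <- gsub(fluff, "", example_text, ignore.case = TRUE)
print(cleaned)
```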
Next, we construct the string we need to run to decompress the file:
cli_decompress <- paste0(
  "qpdf --qdf --object-streams=disable '",
  input_pdf, "' '", temp_pdf, "'")
The result looks like this:
cli_decompress
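Pasting single quotes around file names by hand works for simple names, but breaks if a name itself contains a single quote. Base R's shQuote() handles the quoting for us. The following is a sketch of the same decompress command built that way, not a replacement the rest of this post depends on; the file names are the same example values as above.

```r
input_pdf <- "fluffy.pdf"
temp_pdf <- tempfile(fileext = ".pdf")

# shQuote() wraps each argument in quotes and escapes embedded single quotes.
# type = "sh" forces POSIX-shell quoting regardless of platform.
cli_decompress <- paste(
  "qpdf --qdf --object-streams=disable",
  shQuote(input_pdf, type = "sh"),
  shQuote(temp_pdf, type = "sh")
)
print(cli_decompress)
```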
We send it to the command line:
system(cli_decompress)
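Now construct the command for a case-insensitive search and replace using sed (GNU sed, since both the I flag and a bare -i are GNU extensions). The pattern itself contains a slash, so we use | rather than / as the delimiter in the s command:
cmd_sed <- paste0("sed -i 's|", fluff, "||Ig' '", temp_pdf, "'")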
Send that to the command line:
system(cmd_sed)
Finally, we re-compress the pdf. Construct the command, remembering to close the quote around the output file name:
cmd_compress <- paste0("qpdf '", temp_pdf, "' '", output_pdf, "'")
cmd_compress
And run it.
system(cmd_compress)
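The system() calls above run silently past failures unless we look at the exit status they return. Below is a minimal sketch of a wrapper that stops on an unexpected status; `run_checked` and `ok_codes` are made-up names. As far as I read the qpdf documentation, qpdf exits with 0 on success, 2 on errors, and 3 when there were only warnings, and a QDF file whose text was edited by hand may well produce warnings. So for the final qpdf call you may want to allow `ok_codes = c(0L, 3L)`. Treat that as an assumption to verify against your qpdf version.

```r
# Hypothetical wrapper: run a shell command and stop on an unexpected exit status.
run_checked <- function(cmd, ok_codes = 0L) {
  status <- system(cmd)   # with intern = FALSE, system() returns the exit status
  if (!(status %in% ok_codes)) {
    stop("Command failed with status ", status, ": ", cmd)
  }
  invisible(status)
}

# Harmless demonstration with standard shell commands instead of qpdf.
run_checked("true")
result <- tryCatch(run_checked("false"), error = function(e) conditionMessage(e))
print(result)
```

In the script above, each system(...) call could then be swapped for run_checked(...).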
Hey presto, fluffy text has been removed.
This can now be automated, removing other kinds of fluff and iterating over more than one pdf. This is left as an exercise for the interested student.
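For the impatient student, here is one possible starting point. It is a sketch under assumptions, not the definitive solution: `fluff_commands` and `remove_fluff` are made-up helper names, the directories `pdfs` and `cleaned` are placeholders, and the sed call assumes GNU sed and patterns that do not contain the | character.

```r
# Hypothetical helper: build the three shell commands for one pdf.
# `fluff` may hold several patterns; each becomes its own sed -e expression.
fluff_commands <- function(input_pdf, output_pdf, fluff,
                           temp_pdf = tempfile(fileext = ".pdf")) {
  q <- function(x) shQuote(x, type = "sh")
  sed_exprs <- paste("-e", q(paste0("s|", fluff, "||Ig")), collapse = " ")
  c(
    decompress = paste("qpdf --qdf --object-streams=disable",
                       q(input_pdf), q(temp_pdf)),
    remove     = paste0("sed -i ", sed_exprs, " ", q(temp_pdf)),
    compress   = paste("qpdf", q(temp_pdf), q(output_pdf))
  )
}

# Hypothetical helper: run the three steps in order for one pdf.
remove_fluff <- function(input_pdf, output_pdf, fluff) {
  for (cmd in fluff_commands(input_pdf, output_pdf, fluff)) {
    system(cmd)
  }
}

# Assumed directory names; adjust to taste.
in_dir <- "pdfs"
out_dir <- "cleaned"
inputs <- list.files(in_dir, pattern = "\\.pdf$", full.names = TRUE)
if (length(inputs) > 0) dir.create(out_dir, showWarnings = FALSE)

for (f in inputs) {
  remove_fluff(f, file.path(out_dir, basename(f)), fluff = "\\(he/him\\)")
}
```

Keeping the command construction separate from the execution makes it easy to print the commands and inspect them before letting them loose on a whole directory.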