David learns how to create a PDF with Python

The project

The project is to automate, to some extent, the creation of PDFs with photos of my dog Moncho.

Each photo also comes with some text:


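La foto de hoy es un modelo 3D de perrito que puedes descargar en la asset store por el módico precio de 5.99 pollitos. Aquí se puede ver en un fondo de prueba. Oferta disponible por tiempo limitado. #fotoMoncho

(In English: "Today's photo is a 3D model of a puppy that you can download from the asset store for the modest price of 5.99 chicks. Here it is shown on a test background. Offer available for a limited time. #fotoMoncho")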

This is a rough idea of the layout, done in Inkscape in 2 minutes:

[Image: layout mockup]

pypdf

Let's try pypdf. I have 2 sources of information.

The docs: https://pypdf.readthedocs.io/en/stable/index.html

And this website, which I suspect might be AI-generated since it is from 2025:
https://realpython.com/creating-modifying-pdf/

Read an existing pdf

I created a PDF with Inkscape, so let's see if I can use it as a template.

I followed the instructions from the docs. I didn't actually use pathlib, since the PDF was in the same folder as the script.

    from pypdf import PdfReader

    pdf_path = "prueba maquetación inkscape.pdf"
    pdf_reader = PdfReader(pdf_path)
    print(len(pdf_reader.pages))  # 1

The PDF has only one page, so the output is 1. At least it reads the PDF.
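If the PDF lived in a different folder, pathlib could build the path before handing it to PdfReader. A minimal sketch, assuming a made-up folder called templates and two hypothetical helpers (pdf_in_folder and count_pages) that are not part of pypdf:

```python
from pathlib import Path


def pdf_in_folder(folder, filename):
    """Build the full path to a PDF stored in `folder` (hypothetical helper)."""
    return Path(folder) / filename


def count_pages(pdf_path):
    """Return the number of pages, using the same PdfReader call as above."""
    # Imported here so the path helper also works where pypdf is not installed.
    from pypdf import PdfReader

    return len(PdfReader(pdf_path).pages)


# "templates" is a made-up folder name, used only for illustration.
pdf_path = pdf_in_folder("templates", "prueba maquetación inkscape.pdf")
print(pdf_path.name)  # prueba maquetación inkscape.pdf
```

Calling count_pages(pdf_path) would then behave like the snippet above, provided the file really exists at that location.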

Extract the text

Reference: https://realpython.com/creating-modifying-pdf/#extracting-text-from-a-page

Welp, this gets more complicated. The example looked plain and simple, but it didn't work for me.

The example extracts the text simply and neatly, but all I got was the description of a function, like this:

[Image: screenshot of the output]

The mistake was that I hadn't written the method call properly, with the parentheses at the end. On line 10 of my script I had written print(page.extract_text).
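The difference between naming a method and calling it can be reproduced without any PDF at all. A small sketch, using a stand-in class (FakePage is made up for illustration and is not part of pypdf):

```python
class FakePage:
    """Stand-in for a PDF page; not part of pypdf."""

    def extract_text(self):
        return "some text"


page = FakePage()

# Without parentheses: Python prints a description of the bound method.
print(page.extract_text)

# With parentheses: the method runs and its return value is printed.
print(page.extract_text())  # some text
```

The first print shows something starting with "<bound method", which is the kind of output that hints at missing parentheses.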

    from pypdf import PdfReader

    pdf_path = "prueba maquetación inkscape.pdf"
    pdf_reader = PdfReader(pdf_path)

    for page in pdf_reader.pages:
        print(page.extract_text())

And the output:

06 jul 2025, 11:54
La foto de hoy es un modelo 3D de perrito que    
puedes descargar en la asset store por el módico 
precio de 5.99 pollitos. Aquí se puede ver en un 
fondo de prueba. Oferta disponible por tiempo    
limitado. #fotoMoncho

Extract image from a pdf

For this we also need to install Pillow, which can be done with the command pip install pypdf[image]

Reference: https://pypdf.readthedocs.io/en/stable/user/extract-images.html

With this snippet I can create a PNG with Moncho's image extracted from the PDF.

    from pypdf import PdfReader

    reader = PdfReader("prueba maquetación inkscape.pdf")
    page = reader.pages[0]

    # Save every image found on the page, prefixing its name with a counter
    for count, image_file_object in enumerate(page.images):
        with open(str(count) + image_file_object.name, "wb") as fp:
            fp.write(image_file_object.data)

The file is named 0x9.png and is written to the same folder as the script.

[Image: the extracted picture]
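The file name comes from joining the loop counter and the image's own name. A small sketch of that naming, assuming the embedded image is called x9.png (an assumption inferred from the resulting file name), plus a hypothetical clearer_name variant that pads the counter and adds a separator:

```python
def output_name(count, image_name):
    """Same naming as the snippet: the counter followed by the image name."""
    return str(count) + image_name


def clearer_name(count, image_name):
    """Hypothetical alternative: zero-padded counter and an underscore."""
    return f"{count:02d}_{image_name}"


# "x9.png" is an assumption, inferred from the resulting file name 0x9.png.
print(output_name(0, "x9.png"))   # 0x9.png
print(clearer_name(0, "x9.png"))  # 00_x9.png
```

The second form makes it easier to tell the counter apart from the image name, at the cost of no longer matching the docs example.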

Copying a pdf

We need to use the PDF writer for this, so first I'm going to try just copying the existing PDF.

    from pypdf import PdfReader, PdfWriter

    # Read source file
    reader = PdfReader("prueba maquetación inkscape.pdf")
    page = reader.pages[0]

    # Write the new file
    writer = PdfWriter()
    writer.add_page(page)
    writer.write("write.pdf")

It works! =D

Substituting the image

For this I need PIL (Python Imaging Library). It seems to be included already, but I still need to add the import.

The replace has to be called on the writer's page, not on the reader's.

    from pypdf import PdfReader, PdfWriter
    from PIL import Image

    # Read source file
    reader = PdfReader("prueba maquetación inkscape.pdf")
    page = reader.pages[0]

    # Write the new file
    writer = PdfWriter()
    writer.add_page(page)
    writer.pages[0].images[0].replace(Image.open("PruebaMonchoEntrada.jpg"), quality=100)
    writer.write("write.pdf")

And it works: now I have a copy of the PDF with a different image!

[Image: the new PDF with the replaced picture]

Reference:

https://pypdf.readthedocs.io/en/stable/modules/PageObject.html#pypdf._page.PageObject.images
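Since the goal of the project is to automate this for several photos, the replacement could be wrapped in a function. This is only a sketch under assumptions: output_pdf_for and make_pdf are hypothetical helpers, the template is assumed to have a single page with the photo as its first image, and make_pdf simply repeats the calls already used above.

```python
from pathlib import Path


def output_pdf_for(photo_path, out_dir="."):
    """Name the output PDF after the photo, e.g. foto.jpg -> foto.pdf."""
    return Path(out_dir) / (Path(photo_path).stem + ".pdf")


def make_pdf(template_pdf, photo_path, output_pdf):
    """Copy the template's first page, swap its first image, and save it."""
    # Imported here so the naming helper works without pypdf or Pillow.
    from PIL import Image
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(template_pdf)
    writer = PdfWriter()
    writer.add_page(reader.pages[0])
    # As noted above, replace is called on the writer's page.
    writer.pages[0].images[0].replace(Image.open(photo_path), quality=100)
    writer.write(output_pdf)


print(output_pdf_for("PruebaMonchoEntrada.jpg"))  # PruebaMonchoEntrada.pdf
```

A loop over a folder of photos could then call make_pdf("prueba maquetación inkscape.pdf", photo, output_pdf_for(photo)) for each one. The text would still be the template's, since replacing it is a separate problem.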

Substituting the text

It seems that extract_text() does just that: it extracts the text, but it doesn't let me modify it.

I also tried page.get_contents(), but it returns {}

[Image: screenshot of the output]

Buuuut we can try to use get_object and look for the specific objects.

In this case we're looking for 2 text boxes, so I wrote this code:

    from pypdf import PdfReader

    # Read source file
    reader = PdfReader("prueba maquetación inkscape.pdf")
    page = reader.pages[0]

    # Print every entry of the page dictionary
    for key in page:
        print(key, page[key])
    print("----")

    # Print the first objects of the file by number
    for x in range(0, 9):
        print(x, reader.get_object(x))
    print("----")

[Image: screenshot of the output]

It looks like a tree.

I couldn't get to the specific text objects, and the documentation itself points somewhere else.
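A tree like this is easier to read when printed with indentation. The sketch below walks plain dicts and lists as stand-ins. The sample data is made up and is not real output from the PDF. As far as I understand, pypdf's dictionary and array objects behave like dicts and lists, but indirect references would probably need to be resolved first, which this sketch does not attempt.

```python
def walk(obj, depth=0, max_depth=3):
    """Return indented lines describing a nested dict/list structure."""
    indent = "  " * depth
    if depth >= max_depth:
        return [indent + "..."]  # stop here to avoid endless nesting
    if isinstance(obj, dict):
        items = list(obj.items())
    elif isinstance(obj, list):
        items = [(f"[{i}]", value) for i, value in enumerate(obj)]
    else:
        return [f"{indent}{obj}"]
    lines = []
    for key, value in items:
        if isinstance(value, (dict, list)):
            lines.append(f"{indent}{key}:")
            lines.extend(walk(value, depth + 1, max_depth))
        else:
            lines.append(f"{indent}{key}: {value}")
    return lines


# Made-up data shaped loosely like a page dictionary; not real output.
sample = {
    "/Type": "/Page",
    "/Resources": {"/XObject": {"/x9": "image"}},
    "/MediaBox": [0, 0, 595, 842],
}
print("\n".join(walk(sample)))
```

The max_depth limit is there because real PDF structures can point back to their parents, so an unlimited walk might never finish.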

Borb

https://github.com/jorisschellekens/borb

It has really good documentation in borb-examples, pretty useful!

https://github.com/jorisschellekens/borb-examples

This would be the next step, later.