The project is to automate, to some extent, the creation of PDFs with photos of my dog Moncho.
They also come with some text, for example:
Today's photo is a 3D puppy model that you can download from the asset store for the modest price of 5.99 chicks. Here you can see it on a test background. Offer available for a limited time. #fotoMoncho
This is a rough idea of the layout, done in Inkscape in 2 minutes.
Let's try pypdf. I have 2 sources of information:
The docs: https://pypdf.readthedocs.io/en/stable/index.html
And this website which, since it is from 2025, I suspect might be AI-generated:
https://realpython.com/creating-modifying-pdf/
I created a PDF with Inkscape, so let's see if I can use it as a template.
I followed the instructions from the docs. I didn't actually use pathlib, since the PDF was in the same folder as the script.
Since the PDF only has one page, the reader reports just 1 page. But at least it reads the PDF.
Reference: https://realpython.com/creating-modifying-pdf/#extracting-text-from-a-page
Welp, this gets more complicated. The example was plain and simple, but it doesn't work for me.
The example extracts the text, simple and neat; all I get is a complicated function description like this:
And the mistake was that I didn't write the method call properly, with the parentheses at the end (on line 10 I wrote print(page.extract_text) instead of print(page.extract_text())).
And the output:
For this we also need to install Pillow, with the command pip install pypdf[image]
Reference: https://pypdf.readthedocs.io/en/stable/user/extract-images.html
With this snippet I can create a PNG with Moncho's image extracted from the PDF.
The file is named 0x9.png and is written to the same folder as the script.
We need to use the PDF writer for this, so first I'm going to try just copying the existing PDF.
It works! =D
For this I need PIL (the Python Imaging Library, installed as Pillow), which seems to be included already, but I still need to add the import.
You need to call replace on the image taken from the writer, not from the reader.
And it works, now I have a copy pdf with a different image!
Reference:
https://pypdf.readthedocs.io/en/stable/modules/PageObject.html#pypdf._page.PageObject.images
It seems that extract_text() just… well, extracts the text, but doesn't let me modify it.
I also tried page.get_contents(), but it returns {}.
Buuuut we can try to use "getObject" to find the specific objects.
In this case we're looking for 2 text boxes, so I wrote this code:
It seems to be a tree.
I couldn't get to the specifics, and the documentation itself points somewhere else:
https://github.com/jorisschellekens/borb
It has really good documentation in borb-examples, pretty useful!
https://github.com/jorisschellekens/borb-examples
This would be the next step, later.