# David learns how to create a pdf with python

## The project

The project is to automate, to some extent, creating PDFs with photos of my dog Moncho. Each photo also comes with some text, like this one (in Spanish):

:::info
![aca49540dd1eb4e9](https://hackmd.io/_uploads/rkHfmedBxx.jpg)

La foto de hoy es un modelo 3D de perrito que puedes descargar en la asset store por el módico precio de 5.99 pollitos. Aquí se puede ver en un fondo de prueba. Oferta disponible por tiempo limitado.
#fotoMoncho
:::

This is a rough idea of the layout, done in Inkscape in 2 minutes:

![imagen](https://hackmd.io/_uploads/ryzMUlOBxx.png)

## pypdf

Let's try pypdf. I have 2 sources of information.

The docs:
https://pypdf.readthedocs.io/en/stable/index.html

And this website, which I suspect may be AI-generated since it is from 2025:
https://realpython.com/creating-modifying-pdf/

### Read an existing pdf

I created a PDF with Inkscape, so let's see if I can use it as a template. I followed the instructions from the docs. I didn't actually use `pathlib`, since the PDF was in the same folder as the script.

```python=
from pypdf import PdfReader

pdf_path = "prueba maquetación inkscape.pdf"

pdf_reader = PdfReader(pdf_path)

print(len(pdf_reader.pages))  # 1
```

The file has a single page, so it prints 1. But at least it _reads_ the PDF.

### Extract the text

Reference: https://realpython.com/creating-modifying-pdf/#extracting-text-from-a-page

Welp, this gets more complicated.
The example was plain and simple, but it didn't work for me. The example extracts the text neatly, while all I got was a complex function representation like this:

![imagen](https://hackmd.io/_uploads/HkLMReOHeg.png)

The mistake was mine: I didn't write the method call properly, with the parentheses at the end (on line 10 I wrote `print(page.extract_text)`). Without the parentheses, Python prints the method object itself instead of calling it.

```python=
from pypdf import PdfReader

pdf_path = "prueba maquetación inkscape.pdf"

pdf_reader = PdfReader(pdf_path)

# Note the parentheses: extract_text() is a method call
for page in pdf_reader.pages:
    print(page.extract_text())
```

And the output:

```
06 jul 2025, 11:54 La foto de hoy es un modelo 3D de perrito que puedes descargar en la asset store por el módico precio de 5.99 pollitos. Aquí se puede ver en un fondo de prueba. Oferta disponible por tiempo limitado. #fotoMoncho
```

### Extract image from a pdf

:::warning
For this we also need to install Pillow, using the command `pip install pypdf[image]`
:::

Reference: https://pypdf.readthedocs.io/en/stable/user/extract-images.html

With this snippet I can create a PNG with Moncho's image extracted from the PDF.

```python=
from pypdf import PdfReader

reader = PdfReader("prueba maquetación inkscape.pdf")

page = reader.pages[0]

# Save every image on the page, prefixing its name with its index
for count, image_file_object in enumerate(page.images):
    with open(str(count) + image_file_object.name, "wb") as fp:
        fp.write(image_file_object.data)
```

The file is named 0x9.png and is written to the same folder as the script.

![imagen](https://hackmd.io/_uploads/SJu8bZdBll.png)

### Copying a pdf

We need the PDF writer for this, so first I'm just going to try to copy the existing PDF.

```python=
from pypdf import PdfReader, PdfWriter

# Read source file
reader = PdfReader("prueba maquetación inkscape.pdf")
page = reader.pages[0]

# Write the new file
writer = PdfWriter()
writer.add_page(page)
writer.write("write.pdf")
```

It works! =D

### Substituting the image

For this I need PIL (the Python Imaging Library, provided by Pillow). It seems to be installed already, but I need to add the import.
:::info
You need to call `replace` on the writer's page, not on the reader's.
:::

```python=
from pypdf import PdfReader, PdfWriter
from PIL import Image

# Read source file
reader = PdfReader("prueba maquetación inkscape.pdf")
page = reader.pages[0]

# Write the new file, swapping the first image of the copied page
writer = PdfWriter()
writer.add_page(page)
writer.pages[0].images[0].replace(Image.open("PruebaMonchoEntrada.jpg"), quality=100)
writer.write("write.pdf")
```

And it works, now I have a copy of the PDF with a different image!

![imagen](https://hackmd.io/_uploads/S1n0WzuHex.png)

Reference: https://pypdf.readthedocs.io/en/stable/modules/PageObject.html#pypdf._page.PageObject.images

### Substituting the text

It seems that `extract_text()` just... well, extracts the text, but doesn't let me modify it. I also tried `page.get_contents()`, but it returns {}

![imagen](https://hackmd.io/_uploads/BkdP_zdBex.png)

Buuuut we can try to use `get_object` to find the specific objects. In this case we're looking for 2 text boxes, so I wrote this code:

```python=
from pypdf import PdfReader

# Read source file
reader = PdfReader("prueba maquetación inkscape.pdf")
page = reader.pages[0]

# Print every entry of the page dictionary
for key in page:
    print(key, page[key])
    print("----")

# Print the first few objects of the file by object number
for x in range(0, 9):
    print(x, reader.get_object(x))
    print("----")
```

![imagen](https://hackmd.io/_uploads/By46aMOrex.png)

It looks like a tree. I couldn't get to the specific objects, and the documentation itself points somewhere else.

## Borb

https://github.com/jorisschellekens/borb

It has really good documentation in borb-examples, pretty useful!
https://github.com/jorisschellekens/borb-examples

This will be the next step, later.

### My notes

Since Borb's documentation is so good, I'm going to work through it and keep my own notes.

#### Importing

Things have to be imported from `borb.pdf`; it's funny that the module name looks like it has a file _extension_.

The little hello-world example imports only a few classes (I guess to keep everything faster).

`Document` represents a document.
`Page` is a page in the document.
`PageLayout` is an abstract class that defines how things are laid out on a page.
`SingleColumnLayout` is an implementation of `PageLayout`.
`Paragraph` is a gadget for adding text to a PDF.
And then `PDF` takes care of turning the `Document` into an actual file.

:::info
A bit of OOP: you can see it relies on inheritance and classes, because it deliberately annotates what each variable is. It shows in the line
`l: PageLayout = SingleColumnLayout(p)`
where `PageLayout` is the abstract class and `SingleColumnLayout(p)` calls the constructor of the concrete class.
:::

#### Fonts

It seems to accept only TTF files, and they have to be imported in the script.
To do that you also need to install "fonttools" with `pip install fonttools`.

#### Colors

These have to be imported too.
The basic ones: `RGBColor`, `HexColor`, `HSVColor`, `CMYKColor`
The others: `PantoneColor`, `FarrowAndBallColor`, `X11Color`

https://www.w3schools.com/colors/colors_x11.asp
https://www.farrow-ball.com/eu/paint/all-paint-colours

The difference between `HexColor` and `RGBColor` is that you feed `RGBColor` the red, green and blue values, while `HexColor` takes just the hexadecimal value as a string with the hash sign (for example `"#FF003D"`).

`ColorScheme` is a neat class that takes the other colors and generates complementary and analogous ones.

:::info
Methods available in `ColorScheme`:
- **`analogous_colors`**: Generates colors adjacent to the base color on the color wheel.
- **`complementary_color`**: Provides the color opposite to the base color on the color wheel.
- **`monochromatic`**: Creates a palette based on variations in the lightness or darkness of a single color.
- **`split_complementary_color`**: Returns colors that are adjacent to the complementary color of the base color.
- **`square_colors`**: Generates four colors that are evenly spaced around the color wheel.
- **`tetradic_colors`**: Creates a palette of four colors using two complementary color pairs.
- **`tints`**: Creates lighter variations of the base color by adding white.
- **`triadic_colors`**: Generates three colors that are evenly spaced around the color wheel.
- **`shades`**: Creates darker variations of the base color by adding black.
:::

Letter and word spacing:
https://github.com/borb-pdf/borb-examples/blob/master/02/README.md#23-setting-word_spacing-and-character_spacing
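
Going back to the color notes: to check that I understand what "hex vs RGB", "complementary" and "analogous" mean, here is a small sketch that uses only Python's standard library (`colorsys`), not Borb. The function names (`hex_to_rgb`, `rgb_to_hex`, `rotate_hue`, `complementary`, `analogous`) are my own, hypothetical helpers, and I'm assuming a plain HSV color wheel, which may not be what Borb's `ColorScheme` uses internally.

```python
import colorsys


def hex_to_rgb(hex_color: str) -> tuple:
    """Convert a '#RRGGBB' string into an (r, g, b) tuple of 0-255 ints."""
    value = hex_color.lstrip("#")
    if len(value) != 6:
        raise ValueError(f"expected '#RRGGBB', got {hex_color!r}")
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(rgb: tuple) -> str:
    """Convert an (r, g, b) tuple of 0-255 ints into a '#RRGGBB' string."""
    return "#{:02X}{:02X}{:02X}".format(*rgb)


def rotate_hue(rgb: tuple, degrees: float) -> tuple:
    """Move a color around the HSV color wheel by the given angle."""
    r, g, b = (channel / 255 for channel in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + degrees / 360) % 1.0
    return tuple(round(c * 255) for c in colorsys.hsv_to_rgb(h, s, v))


def complementary(rgb: tuple) -> tuple:
    """The color on the opposite side of the wheel (180 degrees away)."""
    return rotate_hue(rgb, 180)


def analogous(rgb: tuple, step: float = 30) -> list:
    """The two neighbours of a color on the wheel, `step` degrees away."""
    return [rotate_hue(rgb, -step), rotate_hue(rgb, step)]


if __name__ == "__main__":
    base = hex_to_rgb("#FF003D")
    print(base)                              # (255, 0, 61)
    print(rgb_to_hex(complementary(base)))   # opposite color as hex
    print([rgb_to_hex(c) for c in analogous(base)])
```

So the hex string and the RGB triple are the same color written in two ways, and the scheme methods are (at least in this simplified model) just rotations of the hue.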