# The Invisible Teacher: What AI Can Learn from Librarians

## [Brian George Thomas](https://www.linkedin.com/in/briangeorgethomas/)

![image alt](https://i.imgur.com/hFRcXNA.png)

## Artificial intelligence is changing how we read, research, and remember.

But **librarians were organizing the world's knowledge long before there were algorithms to automate it** - building the systems of classification, description, and access that computational methods would later attempt to replicate.

This project models that connection. It’s part essay, part lab notebook: an experiment in *learning through metadata*.

:::success
Each section builds a bridge between how **libraries structure knowledge** and how **AI systems learn from those same structures**.

In doing so, it asks a simple but urgent question:

### What if the way we describe information could also teach us to see it more ethically?
:::

#### On this page, you’ll find examples drawn from archival practice, digital curation, and classroom design.

These interactive demonstrations show **how metadata, tokenization, and fuzzy matching reveal not just how machines learn, but how *we* do**.

---

:::info
## What Is Metadata?

**Metadata** literally means *“data about data.”*
Think of it as the story *about* a piece of information. It gives us the clues that tell us *what* something is, *where* it came from, and *how* it connects to other things.

We use metadata all the time without realizing it:

- The **title, artist, and duration** shown when you stream a song
- The **hashtags and timestamps** on a social media post
- The **author, date, and keywords** attached to a research article
- The hidden **camera data** (location, lighting, date) inside a photo file

> Metadata is the invisible framework that makes discovery possible.
> It turns raw information into knowledge we can search, analyze, and share.


Whether in a **library catalog**, a **digital archive**, or an **app on your phone**, metadata tells the story of data: what it means, who it serves, and why it exists.

**In this project**, I approach metadata as both a *technical system* and a *human practice*: a way of organizing meaning, tracing connections, and re-examining how knowledge itself is built.
:::

---

## Metadata as a Mode of Learning

![image alt](https://images.pexels.com/photos/1370295/pexels-photo-1370295.jpeg)

### Libraries and archives are more than places to store knowledge: they are *laboratories of discovery.*

Every catalog entry, subject heading, or metadata field is an interpretation of the world. It expresses what we value, how we classify information, and who we believe deserves to find it.

### When we design with metadata in mind, we are teaching machines to see.

### When we teach research literacy, we are teaching humans to question.

Metadata sits between these two acts: it is pedagogy encoded in systems. It teaches us what counts as knowable, what relationships matter, and whose voices appear (or disappear) in the record.

---

## How Do Machines “See” a Record?

Let’s look at a simple example.

Text: `"Learning through metadata connects archives, AI, and pedagogy."`

Metadata (a simplified sample):

```json
{
  "author": "Brian G. Thomas",
  "date_created": "2025-11-12",
  "language": "en",
  "keywords": ["metadata", "learning", "archives", "AI", "pedagogy"],
  "model_tokens": 12,
  "readability_score": 58.4
}
```

Each layer adds meaning:

- The text carries the content.
- The metadata adds context: when, by whom, and how it was made.
- Together, they form a teaching system: a human sentence that a machine can learn from.

---

:::info
## What Is Tokenization?

### Understanding Tokenization: The Proxy System for Digital Privacy

Think about how academic libraries manage access to licensed databases.
When you log in through the library portal, you’re not sending your student ID or personal credentials to JSTOR or ProQuest. Instead, the system sends a **proxy** - a temporary credential that says *“this person has access”* without revealing who you are.

> The vendors get what they need.
> You get access without exposure.

**Tokenization** works on the same principle - but applies it to data itself.

When a library system stores your account information, it replaces identifying details with a placeholder like `B7K-9PO-4RQ`. If someone breaches the database, all they find are tokens: keys that unlock nothing. The real information lives separately in a secure vault, accessible only to authorized systems when absolutely necessary.

---

### Why This Matters Beyond Libraries

You rely on tokenization every day:

- When you tap your phone to pay, the terminal receives a tokenized card number.
- When hospitals share patient data for research, tokens can stand in for direct identifiers.
- When institutions exchange information across systems they don’t fully control, tokens can travel in place of the sensitive fields.

> The token travels. The truth stays locked away!

For library workers, tokenization can protect data at every handoff, from interlibrary loans to analytics dashboards. Each API call or data export is a potential exposure point. Tokens keep the work moving while minimizing the risk.

---

### Trust by Design

Tokenization isn’t just about preventing theft; it’s about designing **trust into infrastructure**.

Unlike encryption, which sends a locked box through the mail, tokenization sends only the claim ticket. The box - like a rare manuscript - stays safe in the vault.
:::

---

## Seeing Through Machines: How AI Tokenizes Meaning

If tokenization in libraries protects *who* we are, tokenization in AI reveals *how* machines read what we say. Both depend on substitution: one hides identity; the other abstracts language. In both, meaning becomes a system of stand-ins.

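
The privacy side of that comparison can be made concrete. What follows is a minimal sketch, not a production design, of the vault model described in the tokenization section: a sensitive value is swapped for a random placeholder, and only the vault can map it back. `TokenVault`, `new_token`, the token shape, and the sample record are all hypothetical, chosen for illustration; a real vault would be a separate, access-controlled service rather than an in-memory dictionary.

```python
import secrets
import string

# Hypothetical token alphabet and shape, chosen to resemble "B7K-9PO-4RQ".
ALPHABET = string.ascii_uppercase + string.digits


def new_token(groups=3, size=3):
    """Return a random placeholder made of dash-separated character groups."""
    return "-".join(
        "".join(secrets.choice(ALPHABET) for _ in range(size))
        for _ in range(groups)
    )


class TokenVault:
    """Illustrative in-memory vault (assumption: stands in for a secured service)."""

    def __init__(self):
        self._value_by_token = {}
        self._token_by_value = {}

    def tokenize(self, value):
        """Swap a sensitive value for a token, reusing the token if the value is known."""
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = new_token()
        while token in self._value_by_token:  # regenerate on the rare collision
            token = new_token()
        self._value_by_token[token] = value
        self._token_by_value[value] = token
        return token

    def detokenize(self, token):
        """Look up the original value; only code holding the vault can do this."""
        return self._value_by_token[token]


if __name__ == "__main__":
    vault = TokenVault()
    record = {"patron_id": "P-0042", "item": "QA76.9 .D3"}  # made-up sample data
    # The exported record carries the token, never the real identifier.
    exported = {**record, "patron_id": vault.tokenize(record["patron_id"])}
    print(exported)
    print(vault.detokenize(exported["patron_id"]))
```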

---

### From Privacy Tokens to Language Tokens

When you tap your phone to pay, your real card number stays hidden behind a token. When you type a sentence into an AI model, your words are also transformed, not for security but for computation.

Here’s how that looks in miniature:

Text: `"Every schema is a syllabus. Every search interface is a lesson plan."`

Machine-readable version (simplified; the token IDs are illustrative and depend on the tokenizer):

```json
{
  "text": "Every schema is a syllabus. Every search interface is a lesson plan.",
  "token_count": 14,
  "tokens": [2196, 3457, 318, 257, 33063, 13, 2196, 1079, 6744, 318, 257, 6324, 1219, 13]
}
```

## To a human, the text is language. To a machine, it’s a pattern of numbers.

---

# What "Reading" Means for a Model

![image alt](https://images.pexels.com/photos/415071/pexels-photo-415071.jpeg)

## Let’s unpack what’s happening!

```python
# Requires the third-party tiktoken package (pip install tiktoken).
import tiktoken

# Load the cl100k_base byte-pair encoding.
enc = tiktoken.get_encoding("cl100k_base")
text = "Every schema is a syllabus. Every search interface is a lesson plan."

# encode() maps the string to a list of integer token IDs.
tokens = enc.encode(text)
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")
```

When run, this code shows *how a model tokenizes*: it breaks text into the subword units in its vocabulary and maps each one to an integer ID.

## It doesn’t see ideas or arguments. It sees structure, sequence, and statistical likelihood.

> AI doesn’t read the way we do. It mirrors our syntax, not our consciousness.

---

# Reintroducing Context: Metadata as Meaning

:::success
This is where metadata returns to the story.

Just as library tokens need a vault to preserve identity, language tokens need metadata to preserve context. Without it, meaning collapses into math.

Metadata answers the questions a model can’t ask:

- Who wrote this?
- When and why?
- What larger system of knowledge does it belong to?

## In this sense, metadata is the ethical layer that teaches machines (and humans) to see responsibly.
:::

---

## Fuzzy Matching: Connection Beyond Precision

If metadata restores context, **fuzzy matching** restores connection.
It’s how search systems, and understanding itself, handle imperfection.

A user types *“ekcole”* and still finds results for *“école.”* The system bridges difference, not just between characters, but between languages and histories.

> In a multilingual archive, fuzzy matching becomes an ethics of connection.

These systems are built around variation: spelling, dialect, uncertainty. They remind us that precision isn’t always equitable and that discovery is often interpretive.

Like tokenization, fuzzy matching is a philosophy of trust, acknowledging that meaning often lives in the near miss.

---

:::warning
## Learning Through Making

The process of **creating metadata** mirrors how we learn through making in the classroom.

Designing a metadata schema means asking:

- **Purpose:** What story is this data telling?
- **Scope:** What stays, what goes, and why?
- **Interaction:** How will someone find meaning here later?

Each decision encodes value. Each field expresses judgment.

> Metadata is scholarship rendered in code: an argument expressed through structure.
:::

---

## Ethical Pedagogy and Digital Literacy

When students use AI ethically, as an assistant that helps them deepen literary analysis, trace historical sources, or design digital exhibits, they are performing acts of metadata creation. They decide what counts, what connects, and what deserves context.

## AI, guided by ethical pedagogy, becomes a collaborator in inquiry rather than a shortcut for it.

| Teaching Goal | Metadata Parallel |
|---------------|-------------------|
| Encourage interpretation over extraction | Promote user-centered access |
| Emphasize transparency of sources | Document provenance and versioning |
| Support multimodal expression | Design interoperable metadata formats |

> Both teaching and cataloging are forms of care - frameworks that invite discovery without closing meaning.

---

:::info
## Unlearning as Design

Unlearning is the willingness to question the structure itself.

- In **archives**, it means re-describing materials to include silenced voices.
- In **education**, it means rebuilding lessons around curiosity and reflection instead of output.

Metadata becomes a *living curriculum* that is constantly revised, ethically responsive, and shaped by those who search within it.

## To unlearn is to reimagine how we describe.
:::

---

## Toward a Human-Centered Information Future

As I continue my MLIS research in digital curation and archival description, I keep returning to one question:

> How can we design information systems that teach as they store?

The answer lies somewhere between **code and conversation**, between algorithmic discovery and human interpretation.

The next generation of metadata will not only classify knowledge but **cultivate awareness.** Every tag, every field, every query will become part of a global dialogue about meaning.

> Information systems can be classrooms, if we teach them to be.

---

:::success
## References & Further Reading

- [CONTENTdm Record Design Exercises](https://help.oclc.org/Librarian_Toolbox/OCLC_training/Learner_guides/CONTENTdm_learner_guides/Learner_guide%3A_CONTENTdm_Basic_Skills_1_-_Getting_started_with_CONTENTdm)
- [Collection Management in the Digital Age](https://www.illinoismuseums.org/Blogs/13546474)
- [Buckland, M. “What is a Document?” (1997)](https://doi.org/10.1002/(SICI)1097-4571(199709)48:9<804::AID-ASI5>3.0.CO;2-V)
- [Metadata as the new interface between IT and AI](https://www.dataversity.net/articles/why-metadata-is-the-new-interface-between-it-and-ai/)
:::

---

## Reflection Prompt

- What does your metadata say about what you’ve learned?
- What might it conceal?
- How can AI and pedagogy work together to make our descriptions, and our discoveries, more human?

---

> “Information is never finished. It is always being re-authored by those who seek it.”

*- Brian G. Thomas, MLIS Candidate | Educator | Digital Curation Researcher*