---
tags: wuerzburg2019, diacollo
---
# Würzburg Workshop 2019-07-06
[TOC]
# General Questions
- what exactly do y'all mean by *"Wortfeld"*?
- what are its characteristic (necessary and/or sufficient) properties?
- how do I define the term explicitly?
- in a sufficiently abstract manner to apply to all usages of the term?
- in a sufficiently precise manner to be interpretable by a math geek Platonist like myself?
- how can I (operationally) model any given *Wortfeld* given only observable data?
# General Remarks
## ¬∃*x*( `size`(*x*) & ∀*y* `fit`(*x*,*y*) )
> _"there is no one size which fits all"_
- technology is distinct (and distinguishable) from magic (and arguably therefore [insufficiently advanced](https://en.wikipedia.org/wiki/Clarke%27s_three_laws))
- do not expect *any* software tool (including DiaCollo) to answer *all* your (research) questions at the click of a button
- expect to spend a good deal of *time*, *energy*, and *frustration tolerance* learning to use a new tool (stubbornness is a virtue)
- slope of the learning curve tends to rise with the complexity of the tool
- maybe consider a more traditional small-scale sampling study (query sketch after this list):
- choose a source corpus (or corpora); see e.g. kaskade.dwds.de/~jurish/diacollo/corpora/
- use [DDC](#DDC), DiaCollo, [GermaNet](#GermaNet), and/or [Wortprofil](#DWDS-Wortprofil), in addition to your own linguistic competence & domain expertise in order to select a small(-ish) sample of potentially "interesting" corpus hits for further study ("close reading")
- use `#RANDOM #LIMIT[100] #IN p` for a sample of 100 paragraphs
- use `#RANDOM #LIMIT[100] #CNTXT 1` for a sample of 100 sentences with 1 sentence of context (right+left)
- use the dstar or www.dwds.de/r "export" functionality to save sample data for further manual study
- manually sort the sample hits into (research-relevant) categories by "close" reading
- count & report how many hits (sentences, paragraphs, distinct files, ...) you found in each category
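A minimal end-to-end sampling sketch (assuming the `dta` corpus and the illustrative lemma `Angst`; substitute your own content query, filters, and limits):

```
# 100 random paragraph-sized hits for close reading
Angst #RANDOM #LIMIT[100] #IN p

# 100 random sentence-sized hits with 1 sentence of context (right+left)
Angst #RANDOM #LIMIT[100] #CNTXT 1
```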
## accidents will happen
![](https://vignette.wikia.nocookie.net/ttte/images/7/79/TrustThomas47.png)
- "forewarned is forearmed"
- corpus annotation tools *will* make errors (tokenization, EOS-detection, morphology, tagging, lemmatization, ...)
- you may also find "real" bugs
- please exercise [due diligence](https://en.wikipedia.org/wiki/Due_diligence) before reporting them as such
- see slide #64 (~89), *"Forensic Analysis Questions: Bugs"*
## sparse data, sample size, & statistical reliability
- lower-frequency corpus phenomena require *larger* sample sizes for reliable results
- for DiaCollo, this typically means raising the value of the `slice` parameter, or setting it to `0` (zero) for a corpus-global synchronic profile
+ even then, corpora may not provide a sufficient sample for detection of rare phenomena
- check total corpus frequency for your phenomena in [DDC](#DDC) or [LexDB](#dstar-LexDB) if you're concerned (see the count-query sketch after this list)
+ looking at the *"Steckbriefe"*, most of you probably *should* be concerned about this
- see slide #56 (~74): "How large does my corpus need to be ..."
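A hedged sketch of such a frequency check as a DDC count-query (syntax as per the DDC documentation; the term set is a placeholder, substitute your own phenomena):

```
# total corpus frequency of a candidate term set, binned by decade
count({Angst,Furcht,Schrecken}) #by[date/10]
```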
## DWDS corpus annotations
### what we've got
+ ... is corpus-dependent; see [`https://kaskade.dwds.de/dstar/`***CORPUS***`/details.perl`](https://kaskade.dwds.de/dstar/dta/details.perl)
+ corpus metadata (collection label (e.g. `dta`, `zeit`), collection description)
+ corpus segmentation:
- files (e.g. book, volume, article): `#IN file`
- paragraphs: `#IN p`
- sentences: `#IN s` (default)
+ document metadata (by file, e.g. author, date, text-class, ...)
+ token attributes (automatically assigned; query examples after this list)
- surface form (`$Token,$w`)
- ... sometimes (`$Utf8,$u`) and (`$CanonicalToken,$v`) as well
- lemma (`$Lemma,$l`: default for bareword queries)
- part-of-speech tag (`$Pos,$p`) - typically [STTS](https://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html)
+ term expanders
- run-time query transformation "pipelines"
- examples:
* `Hause -> $Lemma=Hause|Lemma -> $Lemma=@{Haus}`
* `Hause|eqlemma -> $Lemma=Hause|eqlemma -> $Lemma=@{Haus,Hause,Häuser,Häusern,Hauses}`
* `Haus|gn-syn -> $Lemma=Haus|gn-syn -> $Lemma=@{Sternzeichen,...,Dynastie,...,Haus}`
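A few hedged query examples putting these attributes and expanders to work (DDC syntax; the same-token `with` operator is per the DDC documentation):

```
Haus              # bareword: queried as $Lemma=Haus by default
$w=Hauses         # literal surface form
Haus with $p=NN   # lemma "Haus" only where tagged as a noun (same token)
Haus|gn-syn       # term expander: GermaNet synonym set
```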
### what we do *not* have
+ searchable morphological structure (segmentation, affixation, free morpheme categories, compounding, etc.)
+ searchable morphosyntactic properties (case, number, gender, animacy, &c)
+ named-entity recognition (beyond the "NE" tag predicted by the PoS-tagger, which tends to be unreliable especially for NEs)
+ syntactic parsing (not even clause- or phrase-boundaries)
+ semantic annotations of any sort
+ word-sense disambiguation ("WSD")
+ DWDS DiaCollo instances' "native" relations (`collocations`,`term-document matrix`) do **not** include:
- literal or normalized surface forms (`$w`,`$u`,`$v`)
- any type (=tuple) containing a functional category PoS-tag (e.g. `ART,APPR,KON,`...)
- for access to these annotations, you may need to use the (**slow and expensive**) [DiaCollo DDC relation](https://kaskade.dwds.de/dstar/dta/diacollo/help.perl#prf-ddc)
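If you do need surface forms or functional-category types, a hedged parameter sketch for the DDC back-end (parameter names as in the example URLs in these notes):

```
# DiaCollo web-form parameters (sketch)
profile = ddc          # slow & expensive DDC back-end
query   = $w=Häuser    # literal surface form: absent from the native indices
groupby = l            # aggregate candidate collocates by lemma
```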
## semasiology vs. onomasiology
- DiaCollo & other corpus-linguistic software tools are basically [*semasiological*](https://en.wiktionary.org/wiki/semasiology) (=word-primary) methods
+ we start from **words** (types); DWDS corpora are also annotated with *lemmata* and *part-of-speech* (STTS)
+ if we [ain't got it](#what-we-do-not-have), you can't query it (directly)
+ if you don't query it, you're [unlikely to find it](http://rg.rg.mpg.de/en/article_id/1038) (Schwandt, 2016)
+ beware lexical ambiguity (no [WSD](#what-we-do-not-have))
- [*onomasiological*](https://en.wiktionary.org/wiki/onomasiology) tools (=concept-primary) are worth pursuing... but don't hold your breath (im(ns)ho, it ain't going to happen any time soon)
+ DWDS corpora are [not annotated](#what-we-do-not-have) with "concepts"
* we don't even have a good universal vocabulary of "concepts" to start from
* even if we start from (say) GermaNet or [GFO](https://en.wikipedia.org/wiki/General_formal_ontology), we don't have the **granularity** or **coverage** required for reliable (statistical) results on "interesting" questions
+ if you have your *own* corpus annotated with your *own* concepts, you *can* feed it to DiaCollo and have it crunch the numbers & generate visualizations
# Other (potentially useful) tools
## DWDS Wortprofil
- static collocation database (synchronic)
- single source corpus (ca. 2.7M tokens)
- supports various syntactic relations (e.g. `has-adjective-attribute`, `has-dative-object`, etc.)
- may be helpful for detecting syntactic phenomena (e.g. predication)
- aggressive *compile-time frequency filters* mean that [data sparsity](#sparse-data-sample-size-amp-statistical-reliability) may bite you here too
- Interface: https://www.dwds.de/wp
- Documentation: https://www.dwds.de/d/ressources#wortprofil
- currently **orphaned**: development stalled, future uncertain
## DDC
- Corpus search engine
- Interfaces:
- www.dwds.de: https://www.dwds.de/r/
- dstar: [https://kaskade.dwds.de/dstar/***CORPUS***/](https://kaskade.dwds.de/dstar/dta/)
- Documentation:
- https://www.dwds.de/d/suche
- **WARNING:** bareword concatenation is handled **differently** by www.dwds.de than by "pure" DDC or dstar
- www.dwds.de: `(A B)` = `("A B")`: phrase query
- DDC & kaskade.dwds.de/dstar/: `(A B)` = `(A && B)`: Boolean conjunction (within a single sentence) - *computationally expensive* and probably *not what you want*
- https://www.dwds.de/d/korpussuche (work in progress)
- http://odo.dwds.de/~jurish/software/ddc/ddc_query.html
- **TODO?**: syntax primer (a minimal sketch follows this list)
- **motivation**: if you can't use the search engine effectively, you won't be able to use DiaCollo's DDC back-end effectively either ... and pretty much all of your research sketches seem to me to require that back-end
- set-valued term-queries
- phrase queries
- NEAR()
- Boolean operations and `#WITHIN`
- term expansion
- subscripts and wildcards (`=1`,`=2`)
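Pending a proper primer, a minimal sketch of the constructs just listed (forms modeled on ddc_query.html and the example queries elsewhere in these notes):

```
{Haus,Wohnung,Heim}          # set-valued term query (disjunction)
"Angst vor $p=NN"            # phrase query: adjacent tokens, in order
"Angst vor #2 $p=NN=2"       # ... with gap of up to 2 tokens; =2 subscripts the target
NEAR(Angst, Krieg, 5)        # co-occurrence within 5 tokens, any order
(Angst && Krieg) #WITHIN p   # Boolean conjunction within one paragraph
Haus|gn-syn                  # term expansion (GermaNet synonyms)
$p=ADJ*                      # wildcard over attribute values
```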
## dstar/hist
- 2d corpus-dependent time series histograms (~ google n-grams)
- supports arbitrary [DDC](#DDC) searches (-> much more powerful than n-grams)
- interfaces (corpus-dependent)
- [kaskade.dwds.de/dstar/***CORPUS***/hist.perl](https://kaskade.dwds.de/dstar/dta/hist.perl) (all available corpora)
- www.dwds.de/r/plot (only selected corpora)
- documentation
- www.dwds.de/d/plot
- kaskade.dwds.de/dstar/dta/help-hist.perl
## dstar/LexDB
- corpus-dependent lexical inventory (all annotated token-level attributes, with frequencies)
- URL: [kaskade.dwds.de/dstar/***CORPUS***/lexdb/](https://kaskade.dwds.de/dstar/dta/lexdb/)
- useful for exploring corpus vocabulary, e.g.
+ which lemma(ta) get(s) assigned to a given surface form?
+ which surface form(s) instantiate a given lemma?
+ which PoS-tags get assigned to a given lemma and/or surface form?
+ how often does a particular term actually occur in a given corpus?
## dta SemCloud
- distributional semantic vector-space model (DTA only)
- terms (=lemmata) x documents (=pages) x categories (=volumes)
- used e.g. for "similar works" recommendations in [DTAQ](http://www.deutschestextarchiv.de/dtaq/)
- URL: https://kaskade.dwds.de/dstar/dta/semcloud/
- see also [these slides](http://kaskade.dwds.de/~moocow/mirror/pubs/jurish2014semantics-talk.pdf)
## Thesauri
### Tips & Caveats
- only (currently) works with [DiaCollo DDC relation](https://kaskade.dwds.de/dstar/dta/diacollo/help.perl#prf-ddc) (or in direct [DDC](#DDC) queries)
- no "user-friendly" search interface in DWDS wrappers
- user creativity & perseverance required
- no WSD - lexical ambiguity can lead to precision errors
- no support for multi-word expressions (MWEs)
+ thesauri may *contain* MWEs (e.g. "Öffentliches_Recht", "übrig haben" in GermaNet), but dstar/DDC can't handle them correctly
+ "term-expansion" does just what the name says: expands a single-term query to a disjunction over a set of single-term queries:
* e.g. `Haus|gn-syn -> {Sternzeichen,...,Dynastie,...,Haus}`
* corpora are *not* annotated with MWEs (so "Öffentliches_Recht" is *not* a single token, and can*not* be queried by a single term)
* upshot: a search for e.g. `Öffentliches_Recht` will find **no hits**
* you can search for a literal phrase using e.g. `"Öffentliches Recht"` but:
- phrase queries ([qc_phrase](http://odo.dwds.de/~moocow/software/ddc/ddc_query.html#rule_qc_phrase)) are *not* single-term queries ([qc_word](http://odo.dwds.de/~moocow/software/ddc/ddc_query.html#rule_qc_word)): each phrase-hit covers **multiple tokens**
- phrase queries are *not* valid within `{...}` (only disjunctions of atomic values for the given token attribute)
- the term expansion mechanism does **not** support generic query transformations (e.g. `haben|gn-sub1 -> ({aufbewahren,...,zusammenhaben} || "übrig haben")`)
- if you need this, you'll have to do it **manually** (sketched below)
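What the manual workaround might look like, sketched (the verb set is illustrative only, *not* an actual `gn-sub1` expansion):

```
# hand-built replacement for haben|gn-sub1:
# single-term set query plus an explicit phrase query
({aufbewahren,besitzen,zusammenhaben} || "übrig haben")
```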
### GermaNet
- curated ontology, Tübingen (well-structured, deep hierarchy)
- requires registration; cost-free for academic use
- official url: http://www.sfs.uni-tuebingen.de/GermaNet/
- user-level [tools](http://www.sfs.uni-tuebingen.de/GermaNet/tools.shtml): GermaNet-Explorer, GernEdiT
- www wrappers:
- DWDS-internal: https://kaskade.dwds.de/germanet/
+ e.g. https://kaskade.dwds.de/germanet/?q=Haus
- CLARIN: https://shibboleth.bbaw.de/proxied/germanet/
- dstar "query lizard"
- e.g. https://kaskade.dwds.de/dstar/dta/lizard.perl?x=gn-asi&q=Haus
- size (v11.0)
+ ca. 123k lexemes (lex2orth: lemmata, MWEs)
+ ca. 143k total lexical senses (lex2syn: lexemes+synsets)
+ ca. 110k synsets
+ ca. 115k hyperonymy/hyponymy relations
### OpenThesaurus
- crowd-sourced ontology project (loosely structured, flat hierarchy)
- official url: https://www.openthesaurus.de/
- www wrappers:
- DWDS (public): http://kaskade.dwds.de/openthesaurus/
+ e.g. http://kaskade.dwds.de/openthesaurus/?q=Haus
- "query lizard" e.g. http://kaskade.dwds.de/dstar/dta/lizard.perl?x=ot-asi&q=Haus
- size (dump from 2019-04-17)
+ ca. 137k lexemes (lex2orth)
+ ca. 167k lexical senses (lex2syn)
+ ca. 43k synsets
+ ca. 16k hyperonymy/hyponymy relations
## DiaCollo
- diachronic collocation profiling tool
- what I've been talking about all morning
- see kaskade.dwds.de/~jurish/diacollo/
### DiaCollo Tips
- [color=#ff0000]**caveat**: many of these techniques require using the (slow & expensive) [DiaCollo DDC relation](https://kaskade.dwds.de/dstar/dta/diacollo/help.perl#prf-ddc)
- **please** use with *caution & consideration*!
- in most cases you probably want to specify `GROUPBY:l`, aggregating candidate collocates by lemma only (disregarding PoS)
- to mitigate data sparsity issues & use a paragraph-wide co-occurrence window, use the [DiaCollo TDF relation](https://kaskade.dwds.de/dstar/dta/diacollo/help.perl#prf-tdf)
- [`{Polen,Ungarn,Böhmen,Schlesien}`](https://kaskade.dwds.de/dstar/dta/diacollo/?query=%7BPolen%2CUngarn%2CB%C3%B6hmen%2CSchlesien%7D&_s=submit&date=&slice=100&score=ld&kbest=10&cutoff=&profile=tdf&format=cloud&groupby=l&eps=0)
- use `$p` (PoS) and [phrase-queries](http://odo.dwds.de/~moocow/software/ddc/ddc_query.html#rule_qc_phrase) to approximate syntactic constraints
- [color=#ff0000][`"Angst vor #2 $p=NN=2" #fmin 1`](https://kaskade.dwds.de/dstar/dta/diacollo/?query=%22Angst+vor+%232+%24p%3DNN%3D2%22+%23fmin+1&_s=submit&date=&slice=100&score=ld&kbest=10&cutoff=&profile=ddc&format=cloud&groupby=l&eps=0)
- [color=#ff0000][`"Bedrohung #4 {weil,deshalb,deswegen,denn} #4 $p=NN=2" #fmin 1`](https://kaskade.dwds.de/dstar/zeit/diacollo/?query=%22Bedrohung+%234+%7Bweil%2Cdeshalb%2Cdeswegen%2Cdenn%7D+%234+%24p%3DNN%3D2%22+%23fmin+1&_s=submit&date=&slice=10&score=ld&kbest=10&cutoff=&profile=ddc&format=cloud&groupby=l&eps=0)
- use `$p` (PoS) and [`NEAR()` queries](http://odo.dwds.de/~moocow/software/ddc/ddc_query.html#rule_qw_set_infl) for finer-grained proximity queries
- [`Balkan && {westlich,umfassen,einschließen,annektieren}`](https://kaskade.dwds.de/dstar/zeit/diacollo/?query=Balkan+%26%26+%7Bwestlich%2Cumfassen%2Ceinschlie%C3%9Fen%2Cannektieren%7D&_s=submit&date=&slice=0&score=ld&kbest=50&cutoff=&profile=tdf&format=cloud&groupby=l&eps=0)
- [color=#f00][`NEAR(Wald|gn-asi, $p=ADJ*=2, 4) #fmin 2`](https://kaskade.dwds.de/dstar/dta/diacollo/?query=NEAR%28Wald%7Cgn-asi%2C%24p%3DADJ*%3D2%2C4%29+%23fmin+2&_s=submit&date=1600%3A1899&slice=50&score=ld&kbest=10&cutoff=&profile=ddc&format=cloud&groupby=l&eps=0)
- to approximate "semantic fields", maybe try
- [set-valued term-queries](http://odo.dwds.de/~moocow/software/ddc/ddc_query.html#rule_qw_set_infl) (`{...}`) e.g.
- [`{unmännlich,Unzucht,widernatürlich}`](https://kaskade.dwds.de/dstar/dta/diacollo/?query=%7Bunm%C3%A4nnlich%2CUnm%C3%A4nnlichkeit%2CSodomie%2CUnzucht%2Cwidernat%C3%BCrlich%7D+&slice=50&kbest=30&profile=tdf&format=cloud&global=1&groupby=l)
- [`{Fleisch,Fisch}` vs. `{Tofu,Soja}`](http://kaskade.dwds.de/dstar/ibk_web_2016c/diacollo/?query=%7BFleisch%2CFisch%7D&_s=submit&bquery=%7BTofu%2CSoja%7D&date=&slice=0&bdate=&bslice=0&score=ld&kbest=50&diff=adiff&profile=diff-2&format=cloud&groupby=&eps=0)
- [thesaurus expansion](#Thesauri) (`|gn-asi`,`|ot-asi`) with an appropriate synset
- [color=#ff0000][`NEAR(Balkan, s44177|gn-asi=2, 4)`](https://kaskade.dwds.de/dstar/zeit/diacollo/?query=near%28Balkan%2C+s44177%7Cgn-asi%3D2%2C+4%29&_s=submit&date=&slice=0&score=ld&kbest=50&cutoff=&profile=ddc&format=html&groupby=l&eps=0)
- [DTA SemCloud](#dta-SemCloud) expansion (`|sem`)
- [color=#ff0000][`NEAR(Autorität@50|sem, $p=NN=2, 4)`,+global](https://kaskade.dwds.de/dstar/dta_beta/diacollo/?query=NEAR%28Autorit%C3%A4t%4050%7Csem%2C+%24p%3DNN%3D2%2C+4%29&_s=submit&date=1500%3A1899&slice=50&score=ld&kbest=50&cutoff=&profile=ddc&format=cloud&global=1&groupby=l&eps=0)
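For orientation, the recurring profile parameters in the example links above (a hedged summary gleaned from those URLs, not an exhaustive reference):

```
query   = "Angst vor #2 $p=NN=2" #fmin 1   # the (DDC) query to be profiled
profile = ddc       # back-end relation: native collocations, tdf, or ddc
slice   = 10        # epoch size in years; 0 = corpus-global profile
score   = ld        # association scoring function (log-Dice)
kbest   = 10        # number of collocates reported per slice
groupby = l         # aggregate candidate collocates by lemma only
```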
---
# See also
- [_"Steckbriefe"_ notes, hints, comments, & snarks](https://hackmd.io/lxSBctp7TKaG8D0QQKd-fg)
- kaskade.dwds.de/~jurish/diacollo/
<!--
Local Variables:
mode: Markdown
coding: utf-8
End:
-->