---
tags: wuerzburg2019, diacollo
---

# Würzburg Workshop 2019-07-06

[TOC]

# General Questions

- what exactly do y'all mean by *"Wortfeld"*?
    - what are its characteristic (necessary and/or sufficient) properties?
    - how do I define the term explicitly?
        - in a sufficiently abstract manner to apply to all usages of the term?
        - in a sufficiently precise manner to be interpretable by a math geek Platonist like myself?
- how can I (operationally) model any given *Wortfeld* given only observable data?

# General Remarks

## ¬∃*x*( `size`(*x*) & ∀*y* `fit`(*x*,*y*) )

> _"there is no one size which fits all"_

- technology is distinct (and distinguishable) from magic (and arguably therefore [insufficiently advanced](https://en.wikipedia.org/wiki/Clarke%27s_three_laws))
- do not expect *any* software tool (including DiaCollo) to answer *all* your (research) questions at the click of a button
- expect to spend a good deal of *time*, *energy*, and *frustration tolerance* learning to use a new tool (stubbornness is a virtue)
    - the slope of the learning curve tends to rise with the complexity of the tool
- maybe consider a more traditional small-scale sampling study:
    - choose a source corpus (or corpora); see e.g. kaskade.dwds.de/~jurish/diacollo/corpora/
    - use [DDC](#DDC), DiaCollo, [GermaNet](#GermaNet), and/or [Wortprofil](#DWDS-Wortprofil), in addition to your own linguistic competence & domain expertise, to select a small(-ish) sample of potentially "interesting" corpus hits for further study ("close reading")
        - use `#RANDOM #LIMIT[100] #IN p` for a sample of 100 paragraphs
        - use `#RANDOM #LIMIT[100] #CNTXT 1` for a sample of 100 sentences with 1 sentence of context (right+left)
    - use the dstar or www.dwds.de/r "export" functionality to save sample data for further manual study
    - manually sort the sample hits into (research-relevant) categories by "close" reading
    - count & report how many hits (sentences, paragraphs, distinct files, ...) you found in each category (a counting sketch follows this list)
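For the counting step, a minimal Python sketch, assuming you saved the exported sample as TSV and added a `category` column by hand during close reading (the file name and column name are placeholders; adapt them to whatever your actual export looks like):

```python
# Tally manually assigned categories in an exported sample.
# Assumption: a TSV export with one hit per row and a hand-added
# "category" column; rows without a category count as UNSORTED.
import csv
from collections import Counter

counts = Counter()
with open("sample.tsv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        counts[(row.get("category") or "").strip() or "UNSORTED"] += 1

for category, n in counts.most_common():
    print(f"{category}\t{n}")
```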
## accidents will happen

![](https://vignette.wikia.nocookie.net/ttte/images/7/79/TrustThomas47.png)

- "forewarned is forearmed"
- corpus annotation tools *will* make errors (tokenization, EOS-detection, morphology, tagging, lemmatization, ...)
- you may also find "real" bugs
    - please exercise [due diligence](https://en.wikipedia.org/wiki/Due_diligence) before reporting them as such
    - see slide #64 (~89), *"Forensic Analysis Questions: Bugs"*

## sparse data, sample size, & statistical reliability

- lower-frequency corpus phenomena require *larger* sample sizes for reliable results
- for DiaCollo, this typically means raising the value of the `slice` parameter, or setting it to `0` (zero) for a corpus-global synchronic profile
    + even then, corpora may not provide a sufficient sample for detection of rare phenomena
- check total corpus frequency for your phenomena in [DDC](#DDC) or [LexDB](#dstar-LexDB) if you're concerned
    + looking at the *"Steckbriefe"*, most of you probably *should* be concerned about this
- see slide #56 (~74): "How large does my corpus need to be ..."

## DWDS corpus annotations

### what we've got

+ ... is corpus-dependent; see [`https://kaskade.dwds.de/dstar/`***CORPUS***`/details.perl`](https://kaskade.dwds.de/dstar/dta/details.perl)
+ corpus metadata (collection label (e.g. `dta`, `zeit`), collection description)
+ corpus segmentation:
    - files (e.g. book, volume, article): `#IN file`
    - paragraphs: `#IN p`
    - sentences: `#IN s` (default)
+ document metadata (by file, e.g. author, date, text-class, ...)
+ token attributes (automatically assigned)
    - surface form (`$Token,$w`)
        - ... sometimes (`$Utf8,$u`) and (`$CanonicalToken,$v`) as well
    - lemma (`$Lemma,$l`: default for bareword queries)
    - part-of-speech tag (`$Pos,$p`)
        - typically [STTS](https://www.ims.uni-stuttgart.de/forschung/ressourcen/lexika/TagSets/stts-table.html)
+ term expanders
    - run-time query transformation "pipelines"
    - examples (a toy re-implementation follows below):
        * `Hause -> $Lemma=Hause|Lemma -> $Lemma=@{Haus}`
        * `Hause|eqlemma -> $Lemma=Hause|eqlemma -> $Lemma=@{Haus,Hause,Häuser,Häusern,Hauses}`
        * `Haus|gn-syn -> $Lemma=Haus|gn-syn -> $Lemma=@{Sternzeichen,...,Dynastie,...,Haus}`
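To make the pipeline mechanics concrete, here is a toy re-implementation of the `|eqlemma` expander from the examples above. The real expanders run server-side in dstar/DDC; the lemma table below is invented for illustration only:

```python
# Toy approximation of `WORD|eqlemma`: map a surface form to its lemma,
# collect all known forms sharing that lemma, and emit a set-valued DDC
# query over $Lemma (cf. the pipeline examples above). The LEMMA table
# stands in for the server-side morphology component.
LEMMA = {"Haus": "Haus", "Hause": "Haus", "Häuser": "Haus",
         "Häusern": "Haus", "Hauses": "Haus"}

def expand_eqlemma(word: str) -> str:
    lemma = LEMMA.get(word, word)
    forms = sorted({w for w, l in LEMMA.items() if l == lemma} | {word})
    return "$Lemma=@{%s}" % ",".join(forms)

print(expand_eqlemma("Hause"))
# -> $Lemma=@{Haus,Hause,Hauses,Häuser,Häusern}
```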
### what we do *not* have

+ searchable morphological structure (segmentation, affixation, free morpheme categories, compounding, etc.)
+ searchable morphosyntactic properties (case, number, gender, animacy, &c)
+ named-entity recognition (beyond the "NE" tag predicted by the PoS-tagger, which tends to be unreliable, in particular for NEs)
+ syntactic parsing (not even clause- or phrase-boundaries)
+ semantic annotations of any sort
+ word-sense disambiguation ("WSD")
+ DWDS DiaCollo instances' "native" relations (`collocations`, `term-document matrix`) do **not** include:
    - literal or normalized surface forms (`$w`,`$u`,`$v`)
    - any type (=tuple) containing a functional-category PoS-tag (e.g. `ART,APPR,KON,`...)
    - for access to these annotations, you may need to use the (**slow and expensive**) [DiaCollo DDC relation](https://kaskade.dwds.de/dstar/dta/diacollo/help.perl#prf-ddc)

## semasiology vs. onomasiology

- DiaCollo & other corpus-linguistic software tools are basically [*semasiological*](https://en.wiktionary.org/wiki/semasiology) (=word-primary) methods
    + we start from **words** (types); DWDS corpora are also annotated with *lemmata* and *part-of-speech* (STTS)
    + if we [ain't got it](#what-we-do-not-have), you can't query it (directly)
    + if you don't query it, you're [unlikely to find it](http://rg.rg.mpg.de/en/article_id/1038) (Schwandt, 2016)
    + beware lexical ambiguity (no [WSD](#what-we-do-not-have))
- [*onomasiological*](https://en.wiktionary.org/wiki/onomasiology) (=concept-primary) tools are worth pursuing... but don't hold your breath (im(ns)ho, it ain't going to happen any time soon)
    + DWDS corpora are [not annotated](#what-we-do-not-have) with "concepts"
        * we don't even have a good universal vocabulary of "concepts" to start from
        * even if we start from (say) GermaNet or [GFO](https://en.wikipedia.org/wiki/General_formal_ontology), we don't have the **granularity** or **coverage** required for reliable (statistical) results on "interesting" questions
    + if you have your *own* corpus annotated with your *own* concepts, you *can* feed it to DiaCollo and have it crunch the numbers & generate visualizations

# Other (potentially useful) tools

## DWDS Wortprofil

- static collocation database (synchronic)
- single source corpus (ca. 2.7M tokens)
- supports various syntactic relations (e.g. `has-adjective-attribute`, `has-dative-object`, etc.)
- may be helpful for detecting syntactic phenomena (e.g. predication)
- aggressive *compile-time frequency filters* mean that [data sparsity](#sparse-data-sample-size-amp-statistical-reliability) may bite you here too
- Interface: https://www.dwds.de/wp
- Documentation: https://www.dwds.de/d/ressources#wortprofil
- currently **orphaned**: development stalled, future uncertain

## DDC

- corpus search engine
- Interfaces:
    - www.dwds.de: https://www.dwds.de/r/
    - dstar: [https://kaskade.dwds.de/dstar/***CORPUS***/](https://kaskade.dwds.de/dstar/dta/)
- Documentation:
    - https://www.dwds.de/d/suche
        - **WARNING:** bareword concatenation is handled **differently** by www.dwds.de than by "pure" DDC or dstar:
            - www.dwds.de: `(A B)` = `("A B")`: phrase query
            - DDC & kaskade.dwds.de/dstar/: `(A B)` = `(A && B)`: Boolean conjunction (within a single sentence)
                - *computationally expensive* and probably *not what you want*
    - https://www.dwds.de/d/korpussuche (work in progress)
    - http://odo.dwds.de/~jurish/software/ddc/ddc_query.html
- **TODO?**: syntax primer (a cheat-sheet sketch follows this list)
    - **motivation**: if you can't use the search engine effectively, you won't be able to use DiaCollo's DDC back-end effectively either ... which pretty much all of your research sketches seem to me to require
    - set-valued term-queries
    - phrase queries
    - NEAR()
    - Boolean operations and `#WITHIN`
    - term expansion
    - subscripts and wildcards (`=1`,`=2`)
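Pending that primer, a compact cheat-sheet: the queries below are taken from (or modeled on) examples elsewhere in these notes, wrapped in Python strings only so that each construct can carry a comment. Consult the DDC documentation linked above for the authoritative grammar.

```python
# One example per DDC query construct (plain strings, nothing executed):
queries = [
    "{Polen,Ungarn,Böhmen,Schlesien}",    # set-valued term query (disjunction)
    '"Angst vor #2 $p=NN=2"',             # phrase query; #2 = bounded token gap, =2 = token subscript
    "NEAR(Wald|gn-asi, $p=ADJ*=2, 4)",    # proximity query; |gn-asi = thesaurus expansion, ADJ* = wildcard
    "Balkan && {westlich,annektieren}",   # Boolean conjunction (sentence-wide by default)
    "Angst && Hoffnung #IN p",            # conjunction scoped to a paragraph (cf. #WITHIN / #IN)
    "Hause|eqlemma #RANDOM #LIMIT[100]",  # term expansion plus a random 100-hit sample
]
print("\n".join(queries))
```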
## dstar/hist

- 2d corpus-dependent time-series histograms (~ Google n-grams)
- supports arbitrary [DDC](#DDC) searches (-> much more powerful than n-grams)
- interfaces (corpus-dependent):
    - [kaskade.dwds.de/dstar/***CORPUS***/hist.perl](https://kaskade.dwds.de/dstar/dta/hist.perl) (all available corpora)
    - www.dwds.de/r/plot (only selected corpora)
- documentation:
    - www.dwds.de/d/plot
    - kaskade.dwds.de/dstar/dta/help-hist.perl

## dstar/LexDB

- corpus-dependent lexical inventory (all annotated token-level attributes, with frequencies)
- URL: [kaskade.dwds.de/dstar/***CORPUS***/lexdb/](https://kaskade.dwds.de/dstar/dta/lexdb/)
- useful for exploring corpus vocabulary, e.g.:
    + which lemma(ta) get(s) assigned to a given surface form?
    + which surface form(s) instantiate a given lemma?
    + which PoS-tags get assigned to a given lemma and/or surface form?
    + how often does a particular term actually occur in a given corpus?

## dta SemCloud

- distributional semantic vector-space model (DTA only)
- terms (=lemmata) × documents (=pages) × categories (=volumes)
- used e.g. for "similar works" recommendations in [DTAQ](http://www.deutschestextarchiv.de/dtaq/)
- URL: https://kaskade.dwds.de/dstar/dta/semcloud/
- see also [these slides](http://kaskade.dwds.de/~moocow/mirror/pubs/jurish2014semantics-talk.pdf)

## Thesauri

### Tips & Caveats

- (currently) only works with the [DiaCollo DDC relation](https://kaskade.dwds.de/dstar/dta/diacollo/help.perl#prf-ddc) (or in direct [DDC](#DDC) queries)
- no "user-friendly" search interface in the DWDS wrappers
    - user creativity & perseverance required
- no WSD
    - lexical ambiguity can lead to precision errors
- no support for multi-word units (MWEs)
    + thesauri may *contain* MWEs (e.g. "Öffentliches_Recht", "übrig haben" in GermaNet), but dstar/DDC can't handle them correctly
    + "term expansion" does just what the name says: expands a single-term query to a disjunction over a set of single-term queries:
        * e.g. `Haus|gn-syn -> {Sternzeichen,...,Dynastie,...,Haus}`
        * corpora are *not* annotated with MWEs (so "Öffentliches_Recht" is *not* a single token, and can*not* be queried by a single term)
        * upshot: a search for e.g. (`Öffentliches_Recht`) will find **no hits**
        * you can search for a literal phrase using e.g. `"Öffentliches Recht"`, but:
            - phrase queries ([qc_phrase](http://odo.dwds.de/~moocow/software/ddc/ddc_query.html#rule_qc_phrase)) are *not* single-term queries ([qc_word](http://odo.dwds.de/~moocow/software/ddc/ddc_query.html#rule_qc_word)): each phrase-hit covers **multiple tokens**
            - phrase queries are *not* valid within `{...}` (only disjunctions of atomic values for the given token attribute)
    + the term-expansion mechanism does **not** support generic query transformations (e.g. `haben|gn-sub1 -> ({aufbewahren,...,zusammenhaben} || "übrig haben")`)
        - if you need this, you'll have to do it **manually**

### GermaNet

- curated ontology, Tübingen (well-structured, deep hierarchy)
- requires registration; cost-free for academic use
- official URL: http://www.sfs.uni-tuebingen.de/GermaNet/
- user-level [tools](http://www.sfs.uni-tuebingen.de/GermaNet/tools.shtml): GermaNet-Explorer, GernEdiT
- www wrappers:
    - DWDS-internal: https://kaskade.dwds.de/germanet/
        + e.g. https://kaskade.dwds.de/germanet/?q=Haus
    - CLARIN: https://shibboleth.bbaw.de/proxied/germanet/
    - dstar "query lizard":
        + e.g. https://kaskade.dwds.de/dstar/dta/lizard.perl?x=gn-asi&q=Haus
- size (v11.0):
    + ca. 123k lexemes (lex2orth: lemmata, MWEs)
    + ca. 143k total lexical senses (lex2syn: lexemes+synsets)
    + ca. 110k synsets
    + ca. 115k hyperonymy/hyponymy relations

### OpenThesaurus

- crowd-sourced ontology project (loosely structured, flat hierarchy)
- official URL: https://www.openthesaurus.de/
- www wrappers:
    - DWDS (public): http://kaskade.dwds.de/openthesaurus/
        + e.g. http://kaskade.dwds.de/openthesaurus/?q=Haus
    - "query lizard":
        + e.g. http://kaskade.dwds.de/dstar/dta/lizard.perl?x=ot-asi&q=Haus
- size (dump from 2019-04-17):
    + ca. 137k lexemes (lex2orth)
    + ca. 167k lexical senses (lex2syn)
    + ca. 43k synsets
    + ca. 16k hyperonymy/hyponymy relations

## DiaCollo

- diachronic collocation profiling tool
- what I've been talking about all morning
- see kaskade.dwds.de/~jurish/diacollo/
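Every example link in the tips below is an ordinary GET URL over the same handful of request parameters, so profiles are easy to rebuild, tweak, and share programmatically. A minimal sketch (Python; the parameter values are copied from the first TDF example below, and their semantics are documented on each instance's help.perl page):

```python
from urllib.parse import urlencode

# Build a shareable DiaCollo profile URL, using the parameters that
# appear in the example links in these notes (see .../diacollo/help.perl
# on the target instance for what each one means).
base = "https://kaskade.dwds.de/dstar/dta/diacollo/"
params = {
    "query":   "{Polen,Ungarn,Böhmen,Schlesien}",  # target query
    "date":    "",         # optional date-range filter, e.g. "1600:1899"
    "slice":   100,        # epoch size in years (0 = one global slice)
    "score":   "ld",       # scoring function (log Dice)
    "kbest":   10,         # k-best collocates per slice
    "profile": "tdf",      # relation: term-document matrix (paragraph window)
    "format":  "cloud",    # output format
    "groupby": "l",        # aggregate collocates by lemma only
}
print(base + "?" + urlencode(params))
```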
### DiaCollo Tips

- [color=#ff0000]**caveat**: many of these techniques require using the (slow & expensive) [DiaCollo DDC relation](https://kaskade.dwds.de/dstar/dta/diacollo/help.perl#prf-ddc) - **please** use with *caution & consideration*!
- you'll usually want to specify `GROUPBY:l` to aggregate candidate collocates by lemma only (disregarding PoS)
- to mitigate data sparsity issues & use a paragraph-wide co-occurrence window, use the [DiaCollo TDF relation](https://kaskade.dwds.de/dstar/dta/diacollo/help.perl#prf-tdf)
    - [`{Polen,Ungarn,Böhmen,Schlesien}`](https://kaskade.dwds.de/dstar/dta/diacollo/?query=%7BPolen%2CUngarn%2CB%C3%B6hmen%2CSchlesien%7D&_s=submit&date=&slice=100&score=ld&kbest=10&cutoff=&profile=tdf&format=cloud&groupby=l&eps=0)
- use `$p` (PoS) and [phrase-queries](http://odo.dwds.de/~moocow/software/ddc/ddc_query.html#rule_qc_phrase) to approximate syntactic constraints
    - [color=#ff0000][`"Angst vor #2 $p=NN=2" #fmin 1`](https://kaskade.dwds.de/dstar/dta/diacollo/?query=%22Angst+vor+%232+%24p%3DNN%3D2%22+%23fmin+1&_s=submit&date=&slice=100&score=ld&kbest=10&cutoff=&profile=ddc&format=cloud&groupby=l&eps=0)
    - [color=#ff0000][`"Bedrohung #4 {weil,deshalb,deswegen,denn} #4 $p=NN=2" #fmin 1`](https://kaskade.dwds.de/dstar/zeit/diacollo/?query=%22Bedrohung+%234+%7Bweil%2Cdeshalb%2Cdeswegen%2Cdenn%7D+%234+%24p%3DNN%3D2%22+%23fmin+1&_s=submit&date=&slice=10&score=ld&kbest=10&cutoff=&profile=ddc&format=cloud&groupby=l&eps=0)
- use `$p` (PoS) and [`NEAR()` queries](http://odo.dwds.de/~moocow/software/ddc/ddc_query.html#rule_qw_set_infl) for finer-grained proximity queries
    - [`Balkan && {westlich,umfassen,einschließen,annektieren}`](https://kaskade.dwds.de/dstar/zeit/diacollo/?query=Balkan+%26%26+%7Bwestlich%2Cumfassen%2Ceinschlie%C3%9Fen%2Cannektieren%7D&_s=submit&date=&slice=0&score=ld&kbest=50&cutoff=&profile=tdf&format=cloud&groupby=l&eps=0)
    - [color=#f00][`NEAR(Wald|gn-asi, $p=ADJ*=2, 4) #fmin 2`](https://kaskade.dwds.de/dstar/dta/diacollo/?query=NEAR%28Wald%7Cgn-asi%2C%24p%3DADJ*%3D2%2C4%29+%23fmin+2&_s=submit&date=1600%3A1899&slice=50&score=ld&kbest=10&cutoff=&profile=ddc&format=cloud&groupby=l&eps=0)
- to approximate "semantic fields", maybe try:
    - [set-valued term-queries](http://odo.dwds.de/~moocow/software/ddc/ddc_query.html#rule_qw_set_infl) (`{...}`), e.g.:
        - [`{unmännlich,Unzucht,widernatürlich}`](https://kaskade.dwds.de/dstar/dta/diacollo/?query=%7Bunm%C3%A4nnlich%2CUnm%C3%A4nnlichkeit%2CSodomie%2CUnzucht%2Cwidernat%C3%BCrlich%7D+&slice=50&kbest=30&profile=tdf&format=cloud&global=1&groupby=l)
        - [`{Fleisch,Fisch}` vs. `{Tofu,Soja}`](http://kaskade.dwds.de/dstar/ibk_web_2016c/diacollo/?query=%7BFleisch%2CFisch%7D&_s=submit&bquery=%7BTofu%2CSoja%7D&date=&slice=0&bdate=&bslice=0&score=ld&kbest=50&diff=adiff&profile=diff-2&format=cloud&groupby=&eps=0)
    - [thesaurus expansion](#Thesauri) (`|gn-asi`,`|ot-asi`) with an appropriate synset
        - [color=#ff0000][`NEAR(Balkan, s44177|gn-asi=2, 4)`](https://kaskade.dwds.de/dstar/zeit/diacollo/?query=near%28Balkan%2C+s44177%7Cgn-asi%3D2%2C+4%29&_s=submit&date=&slice=0&score=ld&kbest=50&cutoff=&profile=ddc&format=html&groupby=l&eps=0)
    - [DTA SemCloud](#dta-SemCloud) expansion (`|sem`)
        - [color=#ff0000][`NEAR(Autorität@50|sem, $p=NN=2, 4)`,+global](https://kaskade.dwds.de/dstar/dta_beta/diacollo/?query=NEAR%28Autorit%C3%A4t%4050%7Csem%2C+%24p%3DNN%3D2%2C+4%29&_s=submit&date=1500%3A1899&slice=50&score=ld&kbest=50&cutoff=&profile=ddc&format=cloud&global=1&groupby=l&eps=0)

---

# See also

- [_"Steckbriefe"_ notes, hints, comments, & snarks](https://hackmd.io/lxSBctp7TKaG8D0QQKd-fg)
- kaskade.dwds.de/~jurish/diacollo/

<!--
Local Variables:
mode: Markdown
coding: utf-8
End:
-->