
Würzburg Workshop 2019-07-06

General Questions

  • what exactly do y'all mean by "Wortfeld"?
    • what are its characteristic (necessary and/or sufficient) properties?
    • how do I define the term explicitly?
      • in a sufficiently abstract manner to apply to all usages of the term?
      • in a sufficiently precise manner to be interpretable by a math geek Platonist like myself?
    • how can I (operationally) model any given Wortfeld given only observable data?

General Remarks

¬∃x( size(x) & ∀y fit(x,y) )

"there is no one size which fits all"

  • technology is distinct (and distinguishable) from magic (and arguably therefore insufficiently advanced)
  • do not expect any software tool (including DiaCollo) to answer all your (research) questions at the click of a button
    • expect to spend a good deal of time, energy, and frustration tolerance learning to use a new tool (stubbornness is a virtue)
    • slope of the learning curve tends to rise with the complexity of the tool
  • maybe consider a more traditional small-scale sampling study:
    • choose a source corpus (or corpora); see e.g. kaskade.dwds.de/~jurish/diacollo/corpora/
    • use DDC, DiaCollo, GermaNet, and/or Wortprofil, in addition to your own linguistic competence & domain expertise in order to select a small(-ish) sample of potentially "interesting" corpus hits for further study ("close reading")
      • use #RANDOM #LIMIT[100] #IN p for a sample of 100 paragraphs (a combined query sketch follows this list)
      • use #RANDOM #LIMIT[100] #CNTXT 1 for a sample of 100 sentences with 1 sentence of context (right+left)
      • use the dstar or www.dwds.de/r "export" functionality to save sample data for further manual study
    • manually sort the sample hits into (research-relevant) categories by "close" reading
      • count & report how many hits (sentences, paragraphs, distinct files, …) you found in each category
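
For example, to draw such a sample for a given search term, the sampling operators above are simply appended to the content query. A minimal sketch, using the arbitrary lemma Haus as a placeholder:

    Haus #RANDOM #LIMIT[100] #IN p

This returns (up to) 100 randomly chosen paragraph-level hits for the lemma Haus, which can then be exported and sorted into categories as described above.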

accidents will happen

  • "forewarned is forearmed"
  • corpus annotation tools will make errors (tokenization, EOS-detection, morphology, tagging, lemmatization, …)
  • you may also find "real" bugs
    • please exercise due diligence before reporting them as such
    • see slide #64 (~89), "Forensic Analysis Questions: Bugs"

sparse data, sample size, & statistical reliability

  • lower-frequency corpus phenomena require larger sample sizes for reliable results
  • for DiaCollo, this typically means raising the value of the slice parameter, or setting it to 0 (zero) for a corpus-global synchronic profile
    • even then, corpora may not provide a sufficient sample for detection of rare phenomena
  • check total corpus frequency for your phenomena in DDC or LexDB if you're concerned (a count-query sketch follows this list)
    • looking at the "Steckbriefe", most of you probably should be concerned about this
  • see slide #56 (~74): "How large does my corpus need to be?"
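
A sketch of such a frequency check, assuming DDC's count-query syntax (count(...) #by[...]; consult your instance's DDC documentation for the exact form), again with Haus as a placeholder term:

    count(Haus) #by[date/10]

This reports the total number of hits for the lemma Haus broken down by decade, which gives a quick impression of whether each time slice contains enough material for reliable statistics.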

DWDS corpus annotations

what we've got

  • is corpus-dependent; see https://kaskade.dwds.de/dstar/CORPUS/details.perl
  • corpus metadata (collection label (e.g. dta, zeit), collection description)
  • corpus segmentation:
    • files (e.g. book, volume, article): #IN file
    • paragraphs: #IN p
    • sentences: #IN s (default)
  • document metadata (by file, e.g. author, date, text-class, …)
  • token attributes (automatically assigned)
    • surface form ($Token,$w)
      • sometimes ($Utf8,$u) and ($CanonicalToken,$v) as well
    • lemma ($Lemma,$l: default for bareword queries)
    • part-of-speech tag ($Pos,$p), typically STTS (a few query sketches follow this list)
  • term expanders
    • run-time query transformation "pipelines"
    • examples:
      • Hause -> $Lemma=Hause|Lemma -> $Lemma=@{Haus}
      • Hause|eqlemma -> $Lemma=Hause|eqlemma -> $Lemma=@{Haus,Hause,Häuser,Häusern,Hauses}
      • Haus|gn-syn -> $Lemma=Haus|gn-syn -> $Lemma=@{Sternzeichen,...,Dynastie,...,Haus}
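
A few sketches of how these attributes surface in DDC queries, assuming the usual $ATTRIBUTE=VALUE and quoted-phrase syntax (the example words are arbitrary):

    Haus
    $w=Hauses
    $p=ADJA
    "$p=ADJA Haus"

The first is a bareword query and therefore targets the lemma (after term expansion); the second matches the literal surface form Hauses; the third matches any token tagged ADJA; the last is a phrase query for an attributive adjective immediately followed by a form of the lemma Haus.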

what we do not have

  • searchable morphological structure (segmentation, affixation, free morpheme categories, compounding, etc.)
  • searchable morphosyntactic properties (case, number, gender, animacy, &c)
  • named-entity recognition (beyond the "NE" tag predicted by the PoS tagger, which tends to be unreliable, particularly for NEs)
  • syntactic parsing (not even clause- or phrase-boundaries)
  • semantic annotations of any sort
  • word-sense disambiguation ("WSD")
  • DWDS DiaCollo instances' "native" relations (collocations, term-document matrix) do not include:
    • literal or normalized surface forms ($w,$u,$v)
    • any type (=tuple) containing a functional category PoS-tag (e.g. ART, APPR, KON, …)
    • for access to these annotations, you may need to use the (slow and expensive) DiaCollo DDC relation (see the sketch after this list)
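
As a rough illustration (plain DDC syntax; whether a given DiaCollo instance exposes the DDC back-end depends on its configuration), queries like the following can only be profiled via the DDC relation, since neither a literal surface form nor a functional PoS category is represented in the native (lemma, PoS) tuples:

    $w=Hause
    $p=APPR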

semasiology vs. onomasiology

  • DiaCollo & other corpus-linguistic software tools are basically semasiological (=word-primary) methods

    • we start from words (types); DWDS corpora are also annotated with lemmata and part-of-speech tags (STTS)
    • if we ain't got it, you can't query it (directly)
    • if you don't query it, you're unlikely to find it (Schwandt, 2016)
    • beware lexical ambiguity (no WSD)
  • onomasiological tools (=concept-primary) are worth pursuing but don't hold your breath (im(ns)ho, it ain't going to happen any time soon)

    • DWDS corpora are not annotated with "concepts"
      • we don't even have a good universal vocabulary of "concepts" to start from
      • even if we start from (say) GermaNet or GFO, we don't have the granularity or coverage required for reliable (statistical) results on "interesting" questions
    • if you have your own corpus annotated with your own concepts, you can feed it to DiaCollo and have it crunch the numbers & generate visualizations

Other (potentially useful) tools

DWDS Wortprofil

  • static collocation database (synchronic)
    • single source corpus (ca. 2.7M tokens)
    • supports various syntactic relations (e.g. has-adjective-attribute, has-dative-object, etc.)
    • may be helpful for detecting syntactic phenomena (e.g. predication)
    • aggressive compile-time frequency filters mean that data sparsity may bite you here too
  • Interface: https://www.dwds.de/wp
  • Documentation: https://www.dwds.de/d/ressources#wortprofil
  • currently orphaned: development stalled, future uncertain

DDC

dstar/hist

dstar/LexDB

  • corpus-dependent lexical inventory (all annotated token-level attributes, with frequencies)
  • URL: kaskade.dwds.de/dstar/CORPUS/lexdb/
  • useful for exploring corpus vocabulary, e.g.
    • which lemma(ta) get(s) assigned to a given surface form?
    • which surface form(s) instantiate a given lemma?
    • which PoS-tags get assigned to a given lemma and/or surface form?
    • how often does a particular term actually occur in a given corpus?

dta SemCloud

Thesauri

Tips & Caveats

  • thesaurus-based term expansion (currently) only works with the DiaCollo DDC relation (or in direct DDC queries)
  • no "user-friendly" search interface in DWDS wrappers
    • user creativity & perseverance required
  • no WSD - lexical ambiguity can lead to precision errors
  • no support for multi-word expressions (MWEs)
    • thesauri may contain MWEs (e.g. "Öffentliches_Recht", "übrig haben" in GermaNet), but dstar/DDC can't handle them correctly
    • "term-expansion" does just what the name says: expands a single-term query to a disjunction over a set of single-term queries:
      • e.g. Haus|gn-syn -> {Sternzeichen,...,Dynastie,...,Haus}
      • corpora are not annotated with MWEs (so "Öffentliches_Recht" is not a single token, and cannot be queried by a single term)
        • upshot: a search for e.g. (Öffentliches_Recht) will find no hits
      • you can search for a literal phrase using e.g. "Öffentliches Recht" but:
        • phrase queries (qc_phrase) are not single-term queries (qc_word): each phrase-hit covers multiple tokens
        • phrase queries are not valid within {...} (only disjunctions of atomic values for the given token attribute)
        • the term expansion mechanism does not support generic query transformations (e.g. haben|gn-sub1 -> ({aufbewahren,...,zusammenhaben} || "übrig haben"))
          • if you need this, you'll have to do it manually (see the sketch below)
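
A sketch of such a manual workaround, spelling out the disjunction yourself; the curly-brace set stands in for the single-lemma expansions (abbreviated here), and the phrase query is attached outside the braces with ||:

    ({aufbewahren,zusammenhaben} || "übrig haben")

Because the phrase sits outside the curly braces, this stays within what DDC can parse; only the contents of {...} are restricted to atomic single-attribute values.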

GermaNet

OpenThesaurus

DiaCollo

DiaCollo Tips


See also
