
Würzburg Workshop 2019-07-06

General Questions

  • what exactly do y'all mean by "Wortfeld"?
    • what are its characteristic (necessary and/or sufficient) properties?
    • how do I define the term explicitly?
      • in a sufficiently abstract manner to apply to all usages of the term?
      • in a sufficiently precise manner to be interpretable by a math geek Platonist like myself?
    • how can I (operationally) model any given Wortfeld given only observable data?

General Remarks

¬∃x( size(x) & ∀y fit(x,y) )

"there is no one size which fits all"

  • technology is distinct (and distinguishable) from magic (and arguably therefore insufficiently advanced)
  • do not expect any software tool (including DiaCollo) to answer all your (research) questions at the click of a button
    • expect to spend a good deal of time, energy, and frustration tolerance learning to use a new tool (stubbornness is a virtue)
    • slope of the learning curve tends to rise with the complexity of the tool
  • maybe consider a more traditional small-scale sampling study:
    • choose a source corpus (or corpora); see e.g. kaskade.dwds.de/~jurish/diacollo/corpora/
    • use DDC, DiaCollo, GermaNet, and/or Wortprofil, in addition to your own linguistic competence & domain expertise in order to select a small(-ish) sample of potentially "interesting" corpus hits for further study ("close reading")
      • use #RANDOM #LIMIT[100] #IN p for a sample of 100 paragraphs (a combined query sketch follows this list)
      • use #RANDOM #LIMIT[100] #CNTXT 1 for a sample of 100 sentences with 1 sentence of context (right+left)
      • use the dstar or www.dwds.de/r "export" functionality to save sample data for further manual study
    • manually sort the sample hits into (research-relevant) categories by "close" reading
      • count & report how many hits (sentences, paragraphs, distinct files, …) you found in each category
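
For example, to draw such a sample for a given search term, the sampling operators above are simply appended to the content query. A minimal sketch, using the arbitrary lemma Haus as a placeholder:

    Haus #RANDOM #LIMIT[100] #IN p

This returns (up to) 100 randomly chosen paragraph-level hits for the lemma Haus, which can then be exported and sorted into categories as described above.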

accidents will happen

  • "forewarned is forearmed"
  • corpus annotation tools will make errors (tokenization, EOS-detection, morphology, tagging, lemmatization, …)
  • you may also find "real" bugs
    • please exercise due diligence before reporting them as such
    • see slide #64 (~89), "Forensic Analysis Questions: Bugs"

sparse data, sample size, & statistical reliability

  • lower-frequency corpus phenomena require larger sample sizes for reliable results
  • for DiaCollo, this typically means raising the value of the slice parameter, or setting it to 0 (zero) for a corpus-global synchronic profile
    • even then, corpora may not provide a sufficient sample for detection of rare phenomena
  • check total corpus frequency for your phenomena in DDC or LexDB if you're concerned (a count-query sketch follows this list)
    • looking at the "Steckbriefe", most of you probably should be concerned about this
  • see slide #56 (~74): "How large does my corpus need to be?"
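
A sketch of such a frequency check, assuming DDC's count-query syntax (count(...) #by[...]; consult your instance's DDC documentation for the exact form), again with Haus as a placeholder term:

    count(Haus) #by[date/10]

This reports the total number of hits for the lemma Haus broken down by decade, which gives a quick impression of whether each time slice contains enough material for reliable statistics.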

DWDS corpus annotations

what we've got

  • is corpus-dependent; see https://kaskade.dwds.de/dstar/CORPUS/details.perl
  • corpus metadata (collection label (e.g. dta, zeit), collection description)
  • corpus segmentation:
    • files (e.g. book, volume, article): #IN file
    • paragraphs: #IN p
    • sentences: #IN s (default)
  • document metadata (by file, e.g. author, date, text-class, …)
  • token attributes (automatically assigned)
    • surface form ($Token,$w)
      • sometimes ($Utf8,$u) and ($CanonicalToken,$v) as well
    • lemma ($Lemma,$l: default for bareword queries)
    • part-of-speech tag ($Pos,$p), typically STTS (a few query sketches follow this list)
  • term expanders
    • run-time query transformation "pipelines"
    • examples:
      • Hause -> $Lemma=Hause|Lemma -> $Lemma=@{Haus}
      • Hause|eqlemma -> $Lemma=Hause|eqlemma -> $Lemma=@{Haus,Hause,Häuser,Häusern,Hauses}
      • Haus|gn-syn -> $Lemma=Haus|gn-syn -> $Lemma=@{Sternzeichen,...,Dynastie,...,Haus}
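
A few sketches of how these attributes surface in DDC queries, assuming the usual $ATTRIBUTE=VALUE and quoted-phrase syntax (the example words are arbitrary):

    Haus
    $w=Hauses
    $p=ADJA
    "$p=ADJA Haus"

The first is a bareword query and therefore targets the lemma (after term expansion); the second matches the literal surface form Hauses; the third matches any token tagged ADJA; the last is a phrase query for an attributive adjective immediately followed by a form of the lemma Haus.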

what we do not have

  • searchable morphological structure (segmentation, affixation, free morpheme categories, compounding, etc.)
  • searchable morphosyntactic properties (case, number, gender, animacy, &c)
  • named-entity recognition (beyond the "NE" tag predicted by the PoS tagger, which tends to be unreliable, particularly for NEs)
  • syntactic parsing (not even clause- or phrase-boundaries)
  • semantic annotations of any sort
  • word-sense disambiguation ("WSD")
  • DWDS DiaCollo instances' "native" relations (collocations, term-document matrix) do not include:
    • literal or normalized surface forms ($w,$u,$v)
    • any type (=tuple) containing a functional category PoS-tag (e.g. ART, APPR, KON, …)
    • for access to these annotations, you may need to use the (slow and expensive) DiaCollo DDC relation (see the sketch after this list)
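
As a rough illustration (plain DDC syntax; whether a given DiaCollo instance exposes the DDC back-end depends on its configuration), queries like the following can only be profiled via the DDC relation, since neither a literal surface form nor a functional PoS category is represented in the native (lemma, PoS) tuples:

    $w=Hause
    $p=APPR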

semasiology vs. onomasiology

  • DiaCollo & other corpus-linguistic software tools are basically semasiological (=word-primary) methods

    • we start from words (types); DWDS corpora are also annotated with lemmata and part-of-speech tags (STTS)
    • if we ain't got it, you can't query it (directly)
    • if you don't query it, you're unlikely to find it (Schwandt, 2016)
    • beware lexical ambiguity (no WSD)
  • onomasiological tools (=concept-primary) are worth pursuing but don't hold your breath (im(ns)ho, it ain't going to happen any time soon)

    • DWDS corpora are not annotated with "concepts"
      • we don't even have a good universal vocabulary of "concepts" to start from
      • even if we start from (say) GermaNet or GFO, we don't have the granularity or coverage required for reliable (statistical) results on "interesting" questions
    • if you have your own corpus annotated with your own concepts, you can feed it to DiaCollo and have it crunch the numbers & generate visualizations

Other (potentially useful) tools

DWDS Wortprofil

  • static collocation database (synchronic)
    • single source corpus (ca. 2.7M tokens)
    • supports various syntactic relations (e.g. has-adjective-attribute, has-dative-object, etc.)
    • may be helpful for detecting syntactic phenomena (e.g. predication)
    • aggressive compile-time frequency filters mean that data sparsity may bite you here too
  • Interface: https://www.dwds.de/wp
  • Documentation: https://www.dwds.de/d/ressources#wortprofil
  • currently orphaned: development stalled, future uncertain

DDC

dstar/hist

dstar/LexDB

  • corpus-dependent lexical inventory (all annotated token-level attributes, with frequencies)
  • URL: kaskade.dwds.de/dstar/CORPUS/lexdb/
  • useful for exploring corpus vocabulary, e.g.
    • which lemma(ta) get(s) assigned to a given surface form?
    • which surface form(s) instantiate a given lemma?
    • which PoS-tags get assigned to a given lemma and/or surface form?
    • how often does a particular term actually occur in a given corpus?

dta SemCloud

Thesauri

Tips & Caveats

  • thesaurus-based term expansion (currently) only works with the DiaCollo DDC relation (or in direct DDC queries)
  • no "user-friendly" search interface in DWDS wrappers
    • user creativity & perseverance required
  • no WSD - lexical ambiguity can lead to precision errors
  • no support for multi-word expressions (MWEs)
    • thesauri may contain MWEs (e.g. "Öffentliches_Recht", "übrig haben" in GermaNet), but dstar/DDC can't handle them correctly
    • "term-expansion" does just what the name says: expands a single-term query to a disjunction over a set of single-term queries:
      • e.g. Haus|gn-syn -> {Sternzeichen,...,Dynastie,...,Haus}
      • corpora are not annotated with MWEs (so "Öffentliches_Recht" is not a single token, and cannot be queried by a single term)
        • upshot: a search for e.g. (Öffentliches_Recht) will find no hits
      • you can search for a literal phrase using e.g. "Öffentliches Recht" but:
        • phrase queries (qc_phrase) are not single-term queries (qc_word): each phrase-hit covers multiple tokens
        • phrase queries are not valid within {...} (only disjunctions of atomic values for the given token attribute)
        • the term expansion mechanism does not support generic query transformations (e.g. haben|gn-sub1 -> ({aufbewahren,...,zusammenhaben} || "übrig haben"))
          • if you need this, you'll have to do it manually (see the sketch below)
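
A sketch of such a manual workaround, spelling out the disjunction yourself; the curly-brace set stands in for the single-lemma expansions (abbreviated here), and the phrase query is attached outside the braces with ||:

    ({aufbewahren,zusammenhaben} || "übrig haben")

Because the phrase sits outside the curly braces, this stays within what DDC can parse; only the contents of {...} are restricted to atomic single-attribute values.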

GermaNet

OpenThesaurus

DiaCollo

DiaCollo Tips


See also
