General Points
- highly context dependent: why, when, and by whom was a text written, and in what context? (LM)
- whether a filing lacks structural anchors is correlated with firm size and time period
- important to differentiate between upper- and lower-case words in some cases, e.g. when looking for uncertainty words, “may” and “May” should not be treated equally (LM)
- important to use an appropriate word list, e.g. if “death” and “mine” are classified as negative, then the mining and health industries will look spuriously very negative (LM)
- Consider a dictionary tailored to the field being studied
- Weighting of words is important (use tf-idf weighting). For example, “loss” appears in a large share of 10-K filings; it has a high frequency, but since it is omnipresent, it is not very informative. When applying tf-idf weights to the Harvard dictionary, results are similar to the LM dictionary (weights attenuate the impact of high-frequency words). See the sketch after this list.
- Better to use UTF-8 in general; SEC filings work well with ASCII.
- Machine learning is better suited when the language includes slang or informal usage, because the dictionary approach cannot identify the tone of such words. With machine learning, it is recommended to inspect the list of words that drive the results.
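For illustration, a minimal sketch of dictionary-restricted tf-idf weighting using scikit-learn's default scheme (not LM's exact weighting formula); the documents and the negative word list are toy placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# toy stand-ins for 10-K filings and for an LM-style negative word list
docs = [
    "net loss increased due to litigation and impairment",
    "loss from operations narrowed while revenue grew",
    "loss of market share offset by strong revenue growth",
]
negative_words = ["loss", "litigation", "impairment"]

# restrict the vectorizer to the word list; idf down-weights "loss",
# which appears in every document, relative to the rarer words
vec = TfidfVectorizer(vocabulary=negative_words)
weights = vec.fit_transform(docs)

for word, idf in zip(vec.get_feature_names_out(), vec.idf_):
    print(f"{word}: idf = {idf:.2f}")
```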
Uncertainty
- LM word lists: negative, positive, uncertainty, litigious, modal strong, and modal weak. Modal weak words are part of the uncertainty list.
- "For example, if you are trying to measure uncertainty in the document, some would argue that you should parse out and focus on the MD&A section of the 10-K filing." (LM, 2016)
General Points on SEC Filings:
- HTML formatting has to be handled carefully to avoid systematic errors (LM)
Adjust word list
- List the most frequent positive and negative words and check whether they pass a “smell test”, i.e. whether they actually fit their categories. This is particularly important in ML approaches, where e.g. “at”, “by”, etc. could be classified as predictive of negative tone, which is not intuitive. -> Adjust the word list if needed
- remove typical filler phrases that are not informative, e.g. when working on transcripts, remove “good morning”, “that’s a good question”, etc.
- remove single character words
- remove numbers (unless you are interested in these)
- account for negations (“are not”, “aren’t”, “could not”, “couldn’t”); see the sketch after this list
- control for spelling variants (“cannot” vs. “can not”)
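A minimal sketch of negation-aware counting, assuming toy word lists and a one-word look-back (a longer look-back window is also common):

```python
import re

NEGATIONS = {"not", "no", "never"}
POSITIVE = {"good", "strong", "improved"}  # toy stand-in for a positive word list

def count_positive(text):
    # expand contractions so "aren't"/"couldn't" become "are not"/"could not"
    tokens = re.findall(r"[a-z]+", text.lower().replace("n't", " not"))
    count = 0
    for i, tok in enumerate(tokens):
        if tok in POSITIVE:
            # skip the hit if the preceding word is a negation
            if i > 0 and tokens[i - 1] in NEGATIONS:
                continue
            count += 1
    return count

print(count_positive("Margins weren't good, but cash flow improved."))  # -> 1
```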
Stemming and Stop Words
- One can stem words, i.e. reduce them to their root form so that different inflections are counted together; however, this makes the text more complicated to analyze/process
- Stemming doesn't usually affect results -> The professor and LM prefer not to stem
- Stop words: One could remove uninformative words, like “and”, “or” etc. However, this is usually not necessary as they are equally distributed across texts. Results with or without removal of stop words are very similar.
- When stemming a text, first tokenize it, as stemming works word-by-word. Note that “\n” gets lost when tokenizing, which makes it hard to reconstruct the text; to avoid this, replace it with a placeholder word before tokenizing, e.g. “\nHereWasALineBreak”. See the sketch below.
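A sketch of the tokenize-then-stem order and the line-break placeholder trick using NLTK (assumes the punkt tokenizer data is installed):

```python
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

text = "Losses increased.\nThe firm makes no forecast."

# protect line breaks with a placeholder word, since word_tokenize drops "\n"
text = text.replace("\n", " HereWasALineBreak ")

stemmer = PorterStemmer()
stems = [stemmer.stem(tok) for tok in word_tokenize(text)]
print(stems)  # "Losses" -> "loss", "makes" -> "make"; the placeholder survives as a token
```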
Other Recommendations:
- Focus on Negativity: compared to positive words, negative words are rarely used ambiguously, especially in a business context, i.e. firms are only negative when necessary. It is therefore advisable to focus on negative terms.
- "Unless a study can convinc- ingly resolve the problems of negation, positive sentiment is best left untested." (LM, 2016)
Readability:
What measures exist:
- Fog index (computed in the sketch after this list)
- Critique: business texts contain many words with more than two syllables that are nevertheless well understood by investors, i.e. one component of the index is misspecified. Furthermore, measuring sentence length is more difficult in financial documents than in non-financial texts, i.e. the second component of the index is likely noisy.
- File size as an alternative for 10-K filings. But note that nowadays these contain many pictures, tables, etc., so one has to clean the file first!
- Common words: calculate for each word the share of filings in which it occurs -> the more filings it appears in, the more ordinary it is
- Use a dictionary of financial terminology to determine the number of unique words from the dictionary found in the text, then divide this by the total number of unique words -> the higher the index, the more relevant the information
- Vocabulary: Divide number of unique words in text by number of unique words in dictionary -> Idea that extensive vocabulary makes text less readable.
- Use the number of words per sentence (WPS)
- delete URLs, document names, and abbreviations to avoid a downward bias in WPS
- delete enumerations to avoid an upward bias
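For concreteness, a sketch of the Fog index (0.4 × (words per sentence + 100 × share of words with three or more syllables)) and WPS on pre-cleaned text; the syllable counter is a crude heuristic:

```python
import re

def count_syllables(word):
    # crude vowel-group heuristic; fine for a rough complex-word share
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fog_index(text):
    # Gunning Fog: 0.4 * (words per sentence + 100 * share of 3+ syllable words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    wps = len(words) / len(sentences)
    complex_share = sum(count_syllables(w) > 2 for w in words) / len(words)
    return 0.4 * (wps + 100 * complex_share)

print(fog_index("The company recorded extraordinary impairments. Sales fell."))
```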
Similarity:
- Tetlock (2011) removes the most common words before applying the similarity measure. Here removing stop words can make sense.
- Tetlock (2011) uses stemming. Here it clearly makes sense, as e.g. “make” and “makes” are obviously similar words and should be counted as the same word.
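A sketch of a similarity measure in this spirit: stem, drop stop words, then compute the cosine similarity of word counts. The exact preprocessing in Tetlock (2011) may differ; this is illustrative:

```python
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS
from sklearn.metrics.pairwise import cosine_similarity

stemmer = PorterStemmer()

def preprocess(text):
    # drop common stop words and stem, so "loss"/"losses" count as the same word
    words = [w for w in re.findall(r"[a-z]+", text.lower())
             if w not in ENGLISH_STOP_WORDS]
    return " ".join(stemmer.stem(w) for w in words)

docs = ["The firm makes a large loss.", "The firm made large losses."]
counts = CountVectorizer().fit_transform([preprocess(d) for d in docs])
print(cosine_similarity(counts)[0, 1])
```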
Cleaning:
- Remove parts that are not important (header, end, cover, tables, graphics, zip files, XML code, exhibits); several of these steps are sketched in code after this list
- Replace “<div>” and “</div>” by “\n” as some files use div instead of \n
- Remove html code
- Remove links
- Remove file-names, e.g. ending with .pdf, .txt,…
- Remove numerations, e.g. “Item 1”, ”Part I”
- Remove numbers
- “-\n” indicates that a word is split across two lines -> replace it with the empty string “” to rejoin the word
- Delete single-character words (debatable)
- Remove sentences written completely in capital letters, as these are typically standardized headings
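A condensed sketch of several of these steps as regex substitutions; the patterns are illustrative approximations, not exact rules:

```python
import re

def clean_filing(text):
    text = re.sub(r"</?div[^>]*>", "\n", text, flags=re.I)   # some files use <div> instead of \n
    text = re.sub(r"<[^>]+>", " ", text)                      # strip remaining HTML tags
    text = re.sub(r"https?://\S+", " ", text)                 # remove links
    text = re.sub(r"\S+\.(pdf|txt|htm|html)\b", " ", text, flags=re.I)  # remove file names
    text = re.sub(r"\b(Item\s+\d+[A-Za-z]?|Part\s+[IVX]+)\b", " ", text)  # remove enumerations
    text = text.replace("-\n", "")                            # rejoin words hyphenated across lines
    text = re.sub(r"\b\d+([.,]\d+)*\b", " ", text)            # remove numbers
    return text
```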
Only relevant when interested in number of sentences:
- Replace commonly abbreviated words with the full word, e.g. “Oct.” with “October”.
- Collapse consecutive dots, e.g. “..”, into a single dot “.”.
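A tiny sketch of these two normalizations; the abbreviation map is illustrative and would need to be extended:

```python
import re

ABBREVIATIONS = {"Oct.": "October", "Nov.": "November"}  # illustrative; extend as needed

def prepare_for_sentence_count(text):
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return re.sub(r"\.{2,}", ".", text)  # collapse ".." (or longer runs) into one dot
```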
Bayesian Updating:
- good for jargon, slang, informal language
- hard to know what's going on "inside" the algorithm ("black box")
- the dictionary approach is more easily reproducible
- Generally, check the list of words that drive the classification; see the sketch below
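The exact "Bayesian updating" setup is not spelled out in these notes; as one concrete instance, a Naive Bayes sketch on toy labeled sentences, including the recommended check of which words drive the classification:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# toy hand-labeled training sentences (illustrative placeholders)
texts = ["great quarter, we beat expectations", "weak demand, guidance was cut",
         "record revenue and strong margins", "impairment charge and layoffs"]
labels = [1, 0, 1, 0]  # 1 = positive tone, 0 = negative tone

vec = CountVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

# inspect which words drive the classification, as recommended above
log_odds = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
words = vec.get_feature_names_out()
for i in np.argsort(log_odds)[::-1][:5]:
    print(words[i], round(float(log_odds[i]), 2))
```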