# ANLY 520 Spring
# Installation of Software
- Which should you use?
- R/RStudio: mostly for users who are comfortable with R and want to keep using RStudio because they have used it a lot in other classes
- Start an R Markdown document and complete the assignment in it
- Anaconda/Jupyter/Spyder: for users who already use Python in this setup a lot
- Start a Jupyter notebook and complete the assignment in it
- Datalore: for people who are newer to Python, have trouble installing software, have an older computer, have spaces in their username, or want to collaborate with teammates or me
- Start a new notebook and complete the assignment in it
- Notes:
- You only need keras and tensorflow if you want to try deep learning in the classification section (an install sketch is below)
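- A typical install from a terminal, if you do want to try it (exact command depends on your Python setup; this is just a sketch):
```
pip install tensorflow keras
```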
# Pycharm
- Install!
- Click learn
- New course
- In marketplace > introduction to python
- Take a screenshot when you are done with the required sections (they are not all due at once)
- Be sure the TERMINAL tab at the bottom is visible in that screenshot
# Processing Text
On class assignments, we all turn in the same code and data source.
## Libraries
```
# import just a function
from urllib.request import urlopen
from bs4 import BeautifulSoup
# import a whole library as a new name
import pandas as pd
# import package as its own name
import nltk
# other packages
import re
# impurity function
RE_SUSPICIOUS = re.compile(r'[&#<>{}\[\]\\]')
def impurity(text, min_len=10):
    """returns the share of suspicious characters in a text"""
    if text is None or len(text) < min_len:
        return 0
    else:
        return len(RE_SUSPICIOUS.findall(text))/len(text)
# rest of stuff
import textacy.preprocessing as tprep
def normalize(text):
text = tprep.normalize.hyphenated_words(text)
text = tprep.normalize.quotation_marks(text)
text = tprep.normalize.unicode(text)
text = tprep.remove.accents(text)
text = tprep.replace.phone_numbers(text)
text = tprep.replace.urls(text)
text = tprep.replace.emails(text)
text = tprep.replace.user_handles(text)
text = tprep.replace.emojis(text)
return text
# install pyspellchecker !!!
from spellchecker import SpellChecker
spell = SpellChecker()
import spacy
nlp = spacy.load("en_core_web_sm")
import textacy
from itertools import chain
from collections import Counter
```
## Find Text
- As a class, we will find a text source to analyze. This text source usually will consist of a webpage or other dataset to examine and clean.
- Import the text into your report.
- If the text is one big long string, first break into sentence segments and store it in a Pandas DataFrame.
```
myurl = "https://www.foxnews.com/sports/patrick-mahomes-fiery-message-win-bills-they-got-what-they-asked-for"
#myurl = "https://www.foxnews.com/lifestyle/newly-elected-school-board-pennsylvania-reclaims-indigenous-mascot-rejects-cancel-culture"
html = urlopen(myurl).read()
soupified = BeautifulSoup(html, "html.parser")
# soupified
# just try get_text()
try_text = soupified.get_text()
try_text[0:100]
```
- Regular expressions
```
# find an exact match for the first time this occurs
text = try_text[
# everything from the end of this sentence and on
re.search("To access the content, check your email and follow the instructions provided.", try_text).end():
# now the end
re.search("CLICK HERE TO GET THE FOX NEWS APP", try_text).start()
]
```
- Breaking down into sentences
```
# break down into sentences and put into DF
sentences = nltk.sent_tokenize(text)
type(sentences)
# convert to dataframe
DF = pd.DataFrame(sentences, columns = ["sentence"])
DF.head()
```
- We've used (a quick sketch of each is below):
- One big string (one variable)
- A list which uses `[]`
- Dictionaries `{}`
- Tuples `()`
- DataFrame from `pandas`
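- A quick sketch of each structure (the values here are made up just for illustration):
```
import pandas as pd  # already imported above
big_string = "Patrick Mahomes had a message after the win."    # one big string
my_list = ["Patrick", "Mahomes", "Bills"]                       # list
my_dict = {"player": "Patrick Mahomes", "opponent": "Bills"}    # dictionary
my_tuple = (0, 100)                                             # tuple
my_df = pd.DataFrame({"sentence": [big_string]})                # DataFrame from pandas
print(type(big_string), type(my_list), type(my_dict), type(my_tuple), type(my_df))
```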
## Length for Proposal
```
# do this on the full text not broken into sentences
len(nltk.word_tokenize(text))
# be sure to import nltk in the proposal
```
## Fix Errors
- Examine the text for errors or problems by looking at the text.
- Legit, just look at the text.
- Look for any type of "garbage" - what counts as garbage depends on what you are doing.
- Use the “impurity” function from class to examine the text for potential issues.
```
DF['score'] = DF['sentence'].apply(impurity)
DF
```
- Remove the noise with a regex function.
- Re-examine the impurity to determine whether the data has been mostly cleaned.
- (Not necessary for this example because the impurity scores look fine.)
- Normalize the rest of the text by using textacy.
```
DF['clean'] = DF['sentence'].apply(normalize)
DF
```
- Examine spelling errors in at least one row of the dataset.
- Any time your text is full of proper names, skip the spell check (it will flag the names as misspellings).
- Mostly, only do this if you have a specific goal in mind.
```
# find all the unique tokens
# set is find unique
# nltk.word_tokenize is break down into words
# " ".join is combine into one long text
# .to_list() is a function to convert to list
clean_tokens = set(nltk.word_tokenize(" ".join(DF['clean'].to_list())))
# what is wrong?
misspelled = spell.unknown(clean_tokens)
for word in misspelled:
# what's the word
print(word)
print("\n")
# Get the one `most likely` answer
print(spell.correction(word))
# Get a list of `likely` options
print(spell.candidates(word))
# make a dictionary of the misspelled word and the correction
# use find and replace in re to fix them
```
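- One way to do the find-and-replace described at the end of that block (a sketch; build the dictionary from your own `misspelled` output):
```
# map each misspelled word to its most likely correction
corrections = {word: spell.correction(word) for word in misspelled
               if spell.correction(word) is not None}
# find and replace each one in the clean text with re
def fix_spelling(text):
    for wrong, right in corrections.items():
        text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text)
    return text
DF['clean'] = DF['clean'].apply(fix_spelling)
```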
## Pre-Processing
- Using spacy and textacy, pre-process the text to end up with a list of tokenized lists.
```
output = []
# only the tagger and lemmatizer
for doc in nlp.pipe(DF['clean'].tolist(), disable=["tok2vec", "ner", "parser"]):
tokens = textacy.extract.words(doc,
filter_stops = True, # default True, no stopwords
filter_punct = True, # default True, no punctuation
filter_nums = True, # default False, no numbers
include_pos = None, # default None = include all
exclude_pos = None, # default None = exclude none
min_freq = 1) # minimum frequency of words
output.append([str(word) for word in tokens]) # close output append
output[0:1]
```
- Create a frequency table of each of the tokens returned in this output. Below is some example code to get us started.
```
# all items
type(output)
# first list
type(output[0])
# first list, first item (this is the issue!)
type(output[0][0])
Counter(chain.from_iterable(output))
```
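- If you want the counts as an actual table, one option is to push the `Counter` output into pandas (a sketch using the same `output` from above):
```
# most_common() returns (token, count) pairs sorted by frequency
freq = Counter(chain.from_iterable(output))
freq_table = pd.DataFrame(freq.most_common(), columns = ["token", "count"])
freq_table.head(20)
```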
## Summarize
Write a paragraph explaining the process of cleaning data for an NLP pipeline. You should explain the errors you found in the dataset and how you fixed them. Explain the information that is gathered by using spacy and textacy and the final output. What did you learn from your frequency table? What is the text document about?
# Information Extraction
## Libraries
```
# libraries
import PyPDF2
import pandas as pd
import nltk
#nltk.download("punkt")
import re
import spacy
# only for datalore
import subprocess
#%%
print(subprocess.getoutput("python -m spacy download en_core_web_sm"))
nlp = spacy.load("en_core_web_sm")
import textacy
import summa
from summa import keywords
from snorkel.preprocess import preprocessor
from snorkel.types import DataPoint
from itertools import combinations
from snorkel.labeling import labeling_function
from snorkel.labeling import PandasLFApplier
import networkx as nx
from matplotlib import pyplot as plt
```
## Import Text
```
# creating a pdf file object
pdfFileObj = open('The_Shadow_Over_Innsmouth.pdf', 'rb')
# creating a pdf reader object
pdfReader = PyPDF2.PdfReader(pdfFileObj)
# how many pages
len(pdfReader.pages)
# creating a page object
pageObj = pdfReader.pages
# extracting text from page
# loop here to get it all
text = []
for page in pageObj:
    text.append(page.extract_text())
# combine the pages into one big string for the rest of the section
book = " ".join(text)
# closing the pdf file object
pdfFileObj.close()
```
## Convert to Sentences and Pandas
- ^ means start with
- [0-9] means any of these digits
- [a-zA-Z] means any alpha latin character lower or upper case
- $ means ends with
- . means any character
- * means zero or more of the previous character (so .* means zero or more of any character); a quick demo of the full pattern is below
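- A quick demo of how these pieces combine in the pattern used below (the token `12Innsmouth` is a made-up example):
```
# starts with a digit, anything in the middle, ends with a letter
print(re.search(r'^[0-9].*[a-zA-Z]$', "12Innsmouth"))
# strip out the digits
print(re.sub(r'[0-9]', '', "12Innsmouth"))
```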
```
# create a place to save the text
saved_words = []
# loop over each word
for word in nltk.word_tokenize(book):
    # if the word starts with a number and ends with a letter
    if re.search(r'^[0-9].*[a-zA-Z]$', word) is not None:
        # take out the numbers and save into our text
        saved_words.append(re.sub(r'[0-9]', '', word))
    # if not then save just the word
    else:
        saved_words.append(word)
book = ' '.join(saved_words)
```
```
DF = pd.DataFrame(
nltk.sent_tokenize(book),
columns = ["sentences"]
)
DF.head()
# for IE, we want sentence and/or paragraph level structure
```
## Part of Speech Tagging
- Tag your data with spacy’s part of speech tagger.
- Convert this data into a Pandas DataFrame.
```
# easier to loop over the big text file than loop over words AND rows in pandas
spacy_pos_tagged = [(str(word), word.tag_, word.pos_) for word in nlp(book)]
# each row represents one token
DF_POS = pd.DataFrame(
spacy_pos_tagged,
columns = ["token", "specific_tag", "upos"]
)
```
- Use the dataframe to calculate the most common parts of speech.
```
DF_POS['upos'].value_counts()
```
- Use the dataframe to calculate if words are considered more than one part of speech (crosstabs or groupby).
```
DF_POS2 = pd.crosstab(DF_POS['token'], DF_POS['upos'])
# convert to true false to add up how many times not zero
DF_POS2['total'] = DF_POS2.astype(bool).sum(axis=1)
#print out the rows that aren't 1
DF_POS2[DF_POS2['total'] > 1]
```
- What is the most common part of speech? ANSWER THIS IN YOUR TEXT
- Do you see words that are multiple parts of speech? ANSWER THIS IN YOUR TEXT
## KPE
- Use textacy to find the key phrases in your text.
- For R/RStudio users: run `library(reticulate)` and then `py_install("networkx < 3.0", pip = T)` in the R console first.
```
# textacy KPE
# build an english language for textacy pipe
en = textacy.load_spacy_lang("en_core_web_sm", disable=("parser",))
# build a processor for textacy using spacy and process text
doc = textacy.make_spacy_doc(book, lang = en)
# text rank algorithm
print([kps for kps, weights in textacy.extract.keyterms.textrank(doc, normalize = "lemma", topn = 5)])
terms = set([term for term, weight in textacy.extract.keyterms.textrank(doc)])
print(textacy.extract.utils.aggregate_term_variants(terms))
```
- Use summa to find the key phrases in your text.
```
#TR_keywords = keywords.keywords(book, scores = True)
#print(TR_keywords[0:10])
```
- What differences do you see in their outputs? COMMENT ON HOW SLOW!
- Using textacy utilities, combine like key phrases. SEE ABOVE
- Do the outputs make sense given your text? ANSWER THIS QUESTION
## NER + Snorkel
- Use spacy to extract named entities.
- Create a summary of your named entities.
```
# easier to loop over the big text file than loop over words AND rows in pandas
spacy_ner_tagged = [(str(word.text), word.label_) for word in nlp(book).ents]
# each row represents one token
DF_NER = pd.DataFrame(
spacy_ner_tagged,
columns = ["token", "entity"]
)
```
```
print(DF_NER['entity'].value_counts())
DF_NER2 = pd.crosstab(DF_NER['token'], DF_NER['entity'])
print(DF_NER2)
# convert to true false to add up how many times not zero
DF_NER2['total'] = DF_NER2.astype(bool).sum(axis=1)
#print out the rows that aren't 1
DF_NER2[DF_NER2['total'] > 1]
```
- Apply Snorkel to your data to show any relationship between names.
### get the data into a good format
```
stored_entities = []
# first get the entities, must be two for relationship matches
def get_entities(x):
"""
Grabs the names using spacy's entity labeler
"""
# get all the entities in this row
processed = nlp(x)
# get the tokens for each sentence
tokens = [word.text for word in processed]
# get all the entities - notice this is only for persons
temp = [(str(ent), ent.label_) for ent in processed.ents if ent.label_ != ""]
# only move on if this row has at least two
if len(temp) > 1:
# finds all the combinations of pairs
temp2 = list(combinations(temp, 2))
# for each pair combination
for (person1, person2) in temp2:
# find the names in the person 1
person1_words = [word.text for word in nlp(person1[0])]
# find the token numbers for person 1
person1_ids = [i for i, val in enumerate(tokens) if val in person1_words]
# output in (start, stop) token tuple format
if len(person1_words) > 1:
person1_ids2 = tuple(idx for idx in person1_ids[0:2])
else:
id_1 = [idx for idx in person1_ids]
person1_ids2 = (id_1[0], id_1[0])
# do the same thing with person 2
person2_words = [word.text for word in nlp(person2[0])]
person2_ids = [i for i, val in enumerate(tokens) if val in person2_words[0:2]]
if len(person2_words) > 1:
person2_ids2 = tuple(idx for idx in person2_ids)
else:
id_2 = [idx for idx in person2_ids[0:2]]
person2_ids2 = (id_2[0], id_2[0])
# store all this in a list
stored_entities.append(
[x, # original text
tokens, # tokens
person1[0], # person 1 name
person2[0], # person 2 name
person1_ids2, # person 1 id token tuple
person2_ids2 # person 2 id token tuple
])
DF['sentences'].apply(get_entities)
# create dataframe in snorkel structure
DF_dev = pd.DataFrame(stored_entities, columns = ["sentence", "tokens", "person1", "person2", "person1_word_idx", "person2_word_idx"])
```
### figure out where to look (between and to the left)
```
# live locate home road roads in at street (locations tied together)
# family terms for people
# get words between the data points
@preprocessor()
def get_text_between(cand: DataPoint) -> DataPoint:
"""
Returns the text between the two person mentions in the sentence
"""
start = cand.person1_word_idx[1] + 1
end = cand.person2_word_idx[0]
cand.between_tokens = cand.tokens[start:end]
return cand
# get words next to the data points
@preprocessor()
def get_left_tokens(cand: DataPoint) -> DataPoint:
"""
Returns tokens in the length 3 window to the left of the person mentions
"""
# TODO: need to pass window as input params
window = 5
end = cand.person1_word_idx[0]
cand.person1_left_tokens = cand.tokens[0:end][-1 - window : -1]
end = cand.person2_word_idx[0]
cand.person2_left_tokens = cand.tokens[0:end][-1 - window : -1]
return cand
```
### figure out what to look for
```
# live locate home road roads in at street (locations tied together)
# family terms for people
found_location = 1
found_family = -1
ABSTAIN = 0
location = {"live", "living", "locate", "located", "home", "road", "roads", "street", "streets", "in", "at", "of"}
@labeling_function(resources=dict(location=location), pre=[get_text_between])
def between_location(x, location):
return found_location if len(location.intersection(set(x.between_tokens))) > 0 else ABSTAIN
@labeling_function(resources=dict(location=location), pre=[get_left_tokens])
def left_location(x, location):
if len(set(location).intersection(set(x.person1_left_tokens))) > 0:
return found_location
elif len(set(location).intersection(set(x.person2_left_tokens))) > 0:
return found_location
else:
return ABSTAIN
family = {"spouse", "wife", "husband", "ex-wife", "ex-husband", "marry",
"married", "father", "mother", "sister", "brother", "son", "daughter",
"grandfather", "grandmother", "uncle", "aunt", "cousin",
"boyfriend", "girlfriend"}
@labeling_function(resources=dict(family=family), pre=[get_text_between])
def between_family(x, family):
return found_family if len(family.intersection(set(x.between_tokens))) > 0 else ABSTAIN
@labeling_function(resources=dict(family=family), pre=[get_left_tokens])
def left_family(x, family):
if len(set(family).intersection(set(x.person1_left_tokens))) > 0:
return found_family
elif len(set(family).intersection(set(x.person2_left_tokens))) > 0:
return found_family
else:
return ABSTAIN
# create a list of functions to run
lfs = [
between_location,
left_location,
between_family,
left_family
]
# build the applier function
applier = PandasLFApplier(lfs)
# run it on the dataset
L_dev = applier.apply(DF_dev)
```
```
L_dev
```
```
DF_combined = pd.concat([DF_dev, pd.DataFrame(L_dev, columns = ["location1", "location2", "family1", "family2"])], axis = 1)
DF_combined
```
```
DF_combined['location_yes'] = DF_combined['location1'] + DF_combined["location2"]
DF_combined['family_yes'] = DF_combined['family1'] + DF_combined["family2"]
print(DF_combined['location_yes'].value_counts())
print(DF_combined['family_yes'].value_counts())
```
- What might you do to improve the default NER extraction?
## Knowledge Graphs
### Slides Version
- Based on the chosen text, add entities to a default spacy model (a minimal EntityRuler sketch follows this list).
- Add a norm_entity, merge_entity, and init_coref pipelines.
- Update and add the alias lookup if necessary for the data.
- Add the name resolver pipeline.
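- The norm_entity, merge_entity, init_coref, and name resolver pieces come from the class slides; as a minimal sketch of the first bullet only, spacy's built-in `EntityRuler` can add custom entities to a default model (the example patterns are names from the Innsmouth text):
```
import spacy
nlp_kg = spacy.load("en_core_web_sm")
# add a rule-based entity matcher ahead of the statistical NER
ruler = nlp_kg.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "PERSON", "pattern": "Zadok Allen"},
    {"label": "GPE", "pattern": "Innsmouth"},
])
doc = nlp_kg("Zadok Allen told me everything he knew about Innsmouth.")
print([(ent.text, ent.label_) for ent in doc.ents])
```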
### Or Use Your Snorkel Output
- Create a co-occurrence graph of the entities linked together in your text.
```
# locations only
DF_loc = DF_combined[DF_combined['location_yes'] > 0]
DF_loc = DF_loc[['person1', 'person2']].reset_index(drop = True)
cooc_loc = DF_loc.groupby(by=["person1", "person2"], as_index=False).size()
# family only
DF_fam = DF_combined[DF_combined['family_yes'] > 0]
DF_fam = DF_fam[['person1', 'person2']].reset_index(drop = True)
cooc_fam = DF_fam.groupby(by=["person1", "person2"], as_index=False).size()
# take out issues where entity 1 == entity 2
cooc_loc = cooc_loc[cooc_loc['person1'] != cooc_loc['person2']]
cooc_fam = cooc_fam[cooc_fam['person1'] != cooc_fam['person2']]
print(cooc_loc.head())
print(cooc_fam.head())
```
- This creates a dataframe of node 1 and then node 2 (entity 1 to entity 2) and then frequency (size)
```
# start by plotting the whole thing for location
cooc_loc_small = cooc_loc[cooc_loc['size']>1]
graph = nx.from_pandas_edgelist(
cooc_loc_small[['person1', 'person2', 'size']] \
.rename(columns={'size': 'weight'}),
source='person1', target='person2', edge_attr=True)
pos = nx.kamada_kawai_layout(graph, weight='weight')
_ = plt.figure(figsize=(20, 20))
nx.draw(graph, pos,
node_size=1000,
node_color='skyblue',
alpha=0.8,
with_labels = True)
plt.title('Graph Visualization', size=15)
for (node1,node2,data) in graph.edges(data=True):
width = data['weight']
_ = nx.draw_networkx_edges(graph,pos,
edgelist=[(node1, node2)],
width=width,
edge_color='#505050',
alpha=0.5)
plt.show()
plt.close()
```
```
# now plot the whole thing for family
graph = nx.from_pandas_edgelist(
cooc_fam[['person1', 'person2', 'size']] \
.rename(columns={'size': 'weight'}),
source='person1', target='person2', edge_attr=True)
pos = nx.kamada_kawai_layout(graph, weight='weight')
_ = plt.figure(figsize=(20, 20))
nx.draw(graph, pos,
node_size=1000,
node_color='skyblue',
alpha=0.8,
with_labels = True)
plt.title('Graph Visualization', size=15)
for (node1,node2,data) in graph.edges(data=True):
width = data['weight']
_ = nx.draw_networkx_edges(graph,pos,
edgelist=[(node1, node2)],
width=width,
edge_color='#505050',
alpha=0.5)
plt.show()
plt.close()
```
# Text Summarization
## Find Text
```
import pysrt
import pandas as pd
import re
from sentence_transformers import SentenceTransformer
# install faiss-cpu
import faiss
import time
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.lsa import LsaSummarizer
import nltk
nltk.download("punkt")
nltk.download("stopwords")
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from pprint import pprint
#tokenize, remove stopwords, non-alphabetic words, lowercase
def preprocess(textstring):
stops = set(stopwords.words('english'))
tokens = word_tokenize(textstring)
return [token.lower() for token in tokens if token.isalpha()
and token not in stops]
import pyLDAvis
import pyLDAvis.gensim_models #don't skip this
import matplotlib.pyplot as plt
from rouge_score import rouge_scorer #rouge-score to install
```
```
subs = pysrt.open("bodies.srt")
DF = pd.DataFrame([
{
"Text": sub.text
} for sub in subs])
DF
```
```
def remove_noise(text):
    # non-greedy matches so we only strip individual tags/brackets
    text = re.sub("<.*?>", " ", text)
    text = re.sub("{.*?}", " ", text)
    text = re.sub(r"\[.*?\]", " ", text)
    text = text.strip()
    return text
DF['clean'] = DF['Text'].apply(remove_noise)
DF = DF[DF['clean'] != ""]
DF
```
## Create a Search Engine
Using each sentence as your “documents”, create a search engine to find specific pieces of text.
```
# this is creating the embeddings
model = SentenceTransformer('msmarco-MiniLM-L-12-v3')
bodies_text_embds = model.encode(DF['clean'].to_list())
```
```
# Create an index using FAISS
index = faiss.IndexFlatL2(bodies_text_embds.shape[1])
index.add(bodies_text_embds)
faiss.write_index(index, 'index_bodies')
bodies_text_embds
```
```
# define a search
def search(query, k):
t=time.time()
query_vector = model.encode([query])
top_k = index.search(query_vector, k)
print('totaltime: {}'.format(time.time()-t))
return [DF['clean'].to_list()[_id] for _id in top_k[1].tolist()[0]]
```
Search for several items.
```
search("cop", 10)
search("gun", 10)
search("car", 10)
```
Examine the results and comment on how well you think the search engine worked.
ANSWER THIS QUESTION
## Create Text Summaries
- Create a human summary of the text.
```
human = "The story starts with the appearance of a dead body in Longharvest Lane in London. This event happens in the same location in four different years – 1890, 1941, 2023 and 2053 – and leads to four different detective investigations that eventually become interlinked with far-reaching consequences."
```
- Create text summaries using LSA, TextRank, and Topic Modeling.
- Most of these methods don't really care what language the text is in
- If you want to stem, you would need a language specific model for that
- May need specific model for parsing for languages without obvious space
- Specific stop words for your language are needed
- Stemming cuts the affixes off a word: warning --> warn, morning --> morn
- In theory, it combines like word forms
- In practice, it's so-so at that task
- The other option is lemmatization, which uses a dictionary look-up to find the root word (see the quick comparison sketch below)
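- A quick comparison of the two (a sketch; the lemmatizer needs a one-time `nltk.download("wordnet")`):
```
from nltk.stem import PorterStemmer, WordNetLemmatizer
# nltk.download("wordnet")  # one-time download for the lemmatizer
stem_demo = PorterStemmer()
lemma_demo = WordNetLemmatizer()
for word in ["warning", "morning", "bodies"]:
    print(word, "-> stem:", stem_demo.stem(word), "| lemma:", lemma_demo.lemmatize(word))
```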
```
# for textrank or lsa
LANGUAGE = "english" # language
stemmer = Stemmer(LANGUAGE) # stemming
# combine the cleaned sentences into one document to summarize
the_text = " ".join(DF['clean'].to_list())
# parse the document
parser = PlaintextParser.from_string(the_text, Tokenizer(LANGUAGE))
# text rank
tr_answer = [] # create a space to save
tr_summary = TextRankSummarizer() # create the summarizer algorithm
for sentence in tr_summary(parser.document, 5):
# need to convert from text rank "sentence" to str
tr_answer.append(str(sentence))
# put them all together
tr_answer = " ".join(tr_answer)
# print it out for our own reading
print(tr_answer)
```
```
lsa_summary = LsaSummarizer(stemmer)
lsa_summary.stop_words = get_stop_words(LANGUAGE)
lsa_answer = []
for sentence in lsa_summary(parser.document, 5):
lsa_answer.append(str(sentence))
lsa_answer = " ".join(lsa_answer)
print(lsa_answer)
```
```
# Create a dictionary representation of the documents.
# tokenized sentences created earlier
sentences = DF['clean'].to_list()
processed_sentences = [preprocess(sent) for sent in DF['clean']]
dictionary = Dictionary(processed_sentences)
corpus = [dictionary.doc2bow(sent) for sent in processed_sentences]
# Train the topic model
LDAmodel = LdaModel(corpus = corpus,
id2word = dictionary,
iterations = 400,
num_topics = 10,
random_state = 100,
update_every = 1,
chunksize = 100,
passes = 10,
alpha = 'auto',
per_word_topics = True)
probs = [LDAmodel.get_document_topics(sentence) for sentence in corpus]
save_probs = []
i = 0
for document in probs:
for (topic, prob) in document:
if topic == 0:
save_probs.append((sentences[i], prob))
i = i + 1
# use a new name so we do not overwrite the subtitle DF
DF_topic = pd.DataFrame(save_probs, columns = ["sentence", "prob"])
topic_summary = " ".join(DF_topic.sort_values(by = ["prob"], ascending = False)[0:5].sentence)
print(topic_summary)
```
```
vis = pyLDAvis.gensim_models.prepare(LDAmodel, corpus, dictionary, n_jobs = 1)
pyLDAvis.save_html(vis, 'LDA_Visualization.html') ##saves the file
```
- Assess those summaries using the Rouge-N analyzer.
```
#print nicely
def print_rouge_score(rouge_score):
for k,v in rouge_score.items():
print (k, 'Precision:', "{:.2f}".format(v.precision), 'Recall:', "{:.2f}".format(v.recall), 'fmeasure:', "{:.2f}".format(v.fmeasure))
# define the scorer
scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
# print the scores
#print_rouge_score(scorer.score(human, tr_answer))
#print_rouge_score(scorer.score(human, lsa_answer))
print_rouge_score(scorer.score(human, topic_summary))
```
- Which summary was the best when compared to the human summary?
ANSWER THIS QUESTION BY LOOKING AT THE SUMMARIES AND MAKING A JUDGMENT
# Classification
## Find Text
- As a class, we will find a text source to analyze to predict a categorical outcome.
- Import the text into your report.
## Library
```
import pysrt
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
# nltk.download("stopwords")
# nltk.download("punkt")
from string import punctuation
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
from sklearn.model_selection import train_test_split
import numpy as np
def document_vectorizer(corpus, model, num_features):
vocabulary = set(model.wv.index_to_key)
def average_word_vectors(words, model, vocabulary, num_features):
feature_vector = np.zeros((num_features,), dtype="float64")
nwords = 0.
for word in words:
if word in vocabulary:
nwords = nwords + 1.
feature_vector = np.add(feature_vector, model.wv[word])
if nwords:
feature_vector = np.divide(feature_vector, nwords)
return feature_vector
features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
for tokenized_sentence in corpus]
return np.array(features)
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import eli5
import lime
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer
import matplotlib.pyplot as plt
```
```
subs = pysrt.open("barbie.srt")
DF = pd.DataFrame([
{
"Text": sub.text
} for sub in subs])
DF.head()
```
## Set up Text
```
# deal with the errors
def clean_errors(text):
text = re.sub("</?[a-z]+>", "", text)
text = re.sub("♪", "", text)
return text
DF['clean'] = DF['Text'].apply(clean_errors)
```
- Do preprocessing on your text to prepare it for the final machine learning model.
```
stops = set(stopwords.words("english"))
stops.add("...")
stops.add("'s")
stops.add("n't")
stops.add("'m")
stops.add("'re")
stops.add("'ll")
stops.add("'ve")
stops.add("'d")
stemmer = PorterStemmer()
def clean_up(text):
    # lower case everything so it's normalized
    text = text.lower()
    # convert curly apostrophes to straight ones so the stopword list matches
    text = re.sub("’", "'", text)
    # remove stop words
    text = " ".join([word for word in nltk.word_tokenize(text) if word not in stops])
    # get rid of punctuation and numbers
    text = " ".join([word for word in nltk.word_tokenize(text) if not word.isdigit() and word not in punctuation])
    # remove leftover quote marks
    text = re.sub("'", "", text)
    text = re.sub("`", "", text)
    # stem each word (stemming the whole string would only touch the last word)
    text = " ".join([stemmer.stem(word) for word in nltk.word_tokenize(text)])
    # take out the white space
    return text.strip()
DF['clean2'] = DF['clean'].apply(clean_up)
```
- What items do you think will be important to clean during preprocessing?
ANSWER THIS QUESTION
## Create Feature Extractions
### Create our labels
#### noun example
```
nouns = r"put|your|nouns|here|with|separators"
def find_noun(text):
    # search anywhere in the text for one of the nouns
    if re.search(nouns, text):
        return 1 # match
    else:
        return 0 # no match
def remove_noun(text):
    text = re.sub(nouns, " ", text)
    return text
# use apply to run these on your dataframe (df / 'text' are placeholders for your own data)
df['noun_included'] = df['text'].apply(find_noun)
df['text_no_noun'] = df['text'].apply(remove_noun)
```
#### sentiment example
```
# remove the empty cells
DF = DF[DF['clean2'].str.len() > 1]
# add fake labels (this is just for class)
# do this on clean with full words not totally processed for sentiment
DF['sent_score'] = [TextBlob(text).sentiment.polarity for text in DF['clean'].to_list()]
# assign a label
DF['sent_label'] = DF['sent_score'] > 0
```
### Balancing
```
DF['sent_label'].value_counts()
# create balanced data
DFB = DF.groupby('sent_label').sample(n = 520)
DFB['sent_label'].value_counts()
```
### Create test-train
```
X_train, X_test, Y_train, Y_test = train_test_split(DF['clean2'], # X values
DF['sent_label'], # Y values
test_size = 0.2, # test size
random_state = 42)
print('Size of Training Data ', X_train.shape[0])
print('Size of Test Data ', X_test.shape[0])
XB_train, XB_test, YB_train, YB_test = train_test_split(DFB['clean2'], # X values
DFB['sent_label'], # Y values
test_size = 0.2, # test size
random_state = 42)
print('Size of Training Data ', XB_train.shape[0])
print('Size of Test Data ', XB_test.shape[0])
```
### Feature extract
- Create a “one-hot” encoding using the count vectorizer and binary options.
```
# build a blank setup
oh = CountVectorizer(binary = True)
# fit the training data
oh_u_train = oh.fit_transform(X_train)
# transform the testing data to look like the training data
oh_u_test = oh.transform(X_test)
oh_u_train.shape
oh_u_test.shape
# build a blank setup
oh = CountVectorizer(binary = True)
# fit the training data
oh_b_train = oh.fit_transform(XB_train)
# transform the testing data to look like the training data
oh_b_test = oh.transform(XB_test)
oh_b_train.shape
oh_b_test.shape
```
- Create the bag of words encoding using the count vectorizer.
```
# build a blank setup
bow = CountVectorizer()
# fit the training data
bow_u_train = bow.fit_transform(X_train)
# transform the testing data to look like the training data
bow_u_test = bow.transform(X_test)
bow_u_train.shape
bow_u_test.shape
# build a blank setup
bow = CountVectorizer()
# fit the training data
bow_b_train = bow.fit_transform(XB_train)
# transform the testing data to look like the training data
bow_b_test = bow.transform(XB_test)
bow_b_train.shape
bow_b_test.shape
```
- Create the TF-IDF normalization using the tfidf vectorizer.
```
# build a blank setup
tfidf = TfidfVectorizer()
# fit the training data
tfidf_u_train = tfidf.fit_transform(X_train)
# transform the testing data to look like the training data
tfidf_u_test = tfidf.transform(X_test)
tfidf_u_train.shape
tfidf_u_test.shape
# build a blank setup
tfidf = TfidfVectorizer()
# fit the training data
tfidf_b_train = tfidf.fit_transform(XB_train)
# transform the testing data to look like the training data
tfidf_b_test = tfidf.transform(XB_test)
tfidf_b_train.shape
tfidf_b_test.shape
```
- Create two word2vec models:
- Use a large number of dimensions that matches your tfidf.
- Use CBOW for one model and skip-gram for the other.
- Use a window size of 5.
```
# word2vec and document_vectorizer expect tokenized sentences (lists of tokens), not raw strings
X_train_tokens = [nltk.word_tokenize(sent) for sent in X_train]
X_test_tokens = [nltk.word_tokenize(sent) for sent in X_test]
XB_train_tokens = [nltk.word_tokenize(sent) for sent in XB_train]
XB_test_tokens = [nltk.word_tokenize(sent) for sent in XB_test]
# train the model on the data
wv_u = Word2Vec(X_train_tokens,
                vector_size = 1600, #dimensions
                window = 5, #window size
                sg = 0, #cbow
                min_count = 1,
                workers = 4)
# generate averaged word vector features from word2vec model
wv_u_train_c = document_vectorizer(corpus = X_train_tokens,
                                   model = wv_u,
                                   num_features = 1600)
# generate averaged word vector features from word2vec model
wv_u_test_c = document_vectorizer(corpus = X_test_tokens,
                                  model = wv_u,
                                  num_features = 1600)
wv_u_train_c.shape
wv_u_test_c.shape
# train the model on the data
wv_b = Word2Vec(XB_train_tokens,
                vector_size = 1000, #dimensions
                window = 5, #window size
                sg = 0, #cbow
                min_count = 1,
                workers = 4)
# generate averaged word vector features from word2vec model
wv_b_train_c = document_vectorizer(corpus = XB_train_tokens,
                                   model = wv_b,
                                   num_features = 1000)
# generate averaged word vector features from word2vec model
wv_b_test_c = document_vectorizer(corpus = XB_test_tokens,
                                  model = wv_b,
                                  num_features = 1000)
wv_b_train_c.shape
wv_b_test_c.shape
```
```
# train the model on the data (same tokenized lists as above)
wv_u = Word2Vec(X_train_tokens,
                vector_size = 1600, #dimensions
                window = 5, #window size
                sg = 1, #skip-gram
                min_count = 1,
                workers = 4)
# generate averaged word vector features from word2vec model
wv_u_train_s = document_vectorizer(corpus = X_train_tokens,
                                   model = wv_u,
                                   num_features = 1600)
# generate averaged word vector features from word2vec model
wv_u_test_s = document_vectorizer(corpus = X_test_tokens,
                                  model = wv_u,
                                  num_features = 1600)
wv_u_train_s.shape
wv_u_test_s.shape
# train the model on the data
wv_b = Word2Vec(XB_train_tokens,
                vector_size = 1000, #dimensions
                window = 5, #window size
                sg = 1, #skip-gram
                min_count = 1,
                workers = 4)
# generate averaged word vector features from word2vec model
wv_b_train_s = document_vectorizer(corpus = XB_train_tokens,
                                   model = wv_b,
                                   num_features = 1000)
# generate averaged word vector features from word2vec model
wv_b_test_s = document_vectorizer(corpus = XB_test_tokens,
                                  model = wv_b,
                                  num_features = 1000)
wv_b_train_s.shape
wv_b_test_s.shape
```
## Classify
- Use at least two classification algorithms to predict the outcome of the data.
- Include the model assessment of these predictions for all models.
### Log Unbalanced
```
## one hot log unbalanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(oh_u_train, Y_train) #training features not X, Y_train
y_log = logreg.predict(oh_u_test) #testing features not X
print(classification_report(Y_test, y_log))
## bow log unbalanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(bow_u_train, Y_train) #training features not X, Y_train
y_log = logreg.predict(bow_u_test) #testing features not X
print(classification_report(Y_test, y_log))
## tfidf log unbalanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(tfidf_u_train, Y_train) #training features not X, Y_train
y_log = logreg.predict(tfidf_u_test) #testing features not X
print(classification_report(Y_test, y_log))
## w cbow log unbalanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(wv_u_train_c, Y_train) #training features not X, Y_train
y_log = logreg.predict(wv_u_test_c) #testing features not X
print(classification_report(Y_test, y_log))
## w skip log unbalanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(wv_u_train_s, Y_train) #training features not X, Y_train
y_log = logreg.predict(wv_u_test_s) #testing features not X
print(classification_report(Y_test, y_log))
```
### Log Balanced
```
## one hot log balanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(oh_b_train, YB_train) #training features not X, YB_train
y_log = logreg.predict(oh_b_test) #testing features not X
print(classification_report(YB_test, y_log))
## bow log balanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(bow_b_train, YB_train) #training features not X, YB_train
y_log = logreg.predict(bow_b_test) #testing features not X
print(classification_report(YB_test, y_log))
## tfidf log balanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(tfidf_b_train, YB_train) #training features not X, YB_train
y_log = logreg.predict(tfidf_b_test) #testing features not X
print(classification_report(YB_test, y_log))
## w cbow log balanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(wv_b_train_c, YB_train) #training features not X, YB_train
y_log = logreg.predict(wv_b_test_c) #testing features not X
print(classification_report(YB_test, y_log))
## w skip log balanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(wv_b_train_s, YB_train) #training features not X, YB_train
y_log = logreg.predict(wv_b_test_s) #testing features not X
print(classification_report(YB_test, y_log))
```
### NB Unbalanced
```
## one hot bayes unbalanced
nb = MultinomialNB()
nb.fit(oh_u_train, Y_train) #training features not X, Y_train
y_nb = nb.predict(oh_u_test) #testing features not X
print(classification_report(Y_test, y_nb))
## bow bayes unbalanced
nb = MultinomialNB()
nb.fit(bow_u_train, Y_train) #training features not X, Y_train
y_nb = nb.predict(bow_u_test) #testing features not X
print(classification_report(Y_test, y_nb))
## tfidf bayes unbalanced
nb = MultinomialNB()
nb.fit(tfidf_u_train, Y_train) #training features not X, Y_train
y_nb = nb.predict(tfidf_u_test) #testing features not X
print(classification_report(Y_test, y_nb))
# MultinomialNB cannot handle negative features, so shift the word2vec values up
wv_u_train_c.min()
wv_u_test_c.min()
wv_u_train_s.min()
wv_u_test_s.min()
wv_u_train_c = wv_u_train_c + 1
wv_u_test_c = wv_u_test_c + 1
wv_u_train_s = wv_u_train_s + 1
wv_u_test_s = wv_u_test_s + 1
## w cbow bayes unbalanced
nb = MultinomialNB()
nb.fit(wv_u_train_c, Y_train) #training features not X, Y_train
y_nb = nb.predict(wv_u_test_c) #testing features not X
print(classification_report(Y_test, y_nb))
## w skip bayes unbalanced
nb = MultinomialNB()
nb.fit(wv_u_train_s, Y_train) #training features not X, Y_train
y_nb = nb.predict(wv_u_test_s) #testing features not X
print(classification_report(Y_test, y_nb))
```
### NB Balanced
```
## one hot bayes balanced
nb = MultinomialNB()
nb.fit(oh_b_train, YB_train) #training features not X, YB_train
y_nb = nb.predict(oh_b_test) #testing features not X
print(classification_report(YB_test, y_nb))
## bow bayes balanced
nb = MultinomialNB()
nb.fit(bow_b_train, YB_train) #training features not X, YB_train
y_nb = nb.predict(bow_b_test) #testing features not X
print(classification_report(YB_test, y_nb))
## tfidf bayes balanced
nb = MultinomialNB()
nb.fit(tfidf_b_train, YB_train) #training features not X, YB_train
y_nb = nb.predict(tfidf_b_test) #testing features not X
print(classification_report(YB_test, y_nb))
# MultinomialNB cannot handle negative features, so shift the word2vec values up
wv_b_train_c.min()
wv_b_test_c.min()
wv_b_train_s.min()
wv_b_test_s.min()
wv_b_train_c = wv_b_train_c + 1
wv_b_test_c = wv_b_test_c + 1
wv_b_train_s = wv_b_train_s + 1
wv_b_test_s = wv_b_test_s + 1
## w cbow bayes balanced
nb = MultinomialNB()
nb.fit(wv_b_train_c, YB_train) #training features not X, YB_train
y_nb = nb.predict(wv_b_test_c) #testing features not X
print(classification_report(YB_test, y_nb))
## w skip bayes balanced
nb = MultinomialNB()
nb.fit(wv_b_train_s, YB_train) #training features not X, YB_train
y_nb = nb.predict(wv_b_test_s) #testing features not X
print(classification_report(YB_test, y_nb))
```
- Write a paragraph summarizing the results from your comparisons. What models are best? Are there any general differences/similarities in prediction you see? How well is each category label classified? What might you do to make the model better?
```
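# refit the one-hot features and Naive Bayes model so the lime pipeline below can reuse oh and nb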
# build a blank setup
oh = CountVectorizer(binary = True)
# fit the training data
oh_u_train = oh.fit_transform(X_train)
# transform the testing data to look like the training data
oh_u_test = oh.transform(X_test)
## one hot bayes unbalanced
nb = MultinomialNB()
nb.fit(oh_u_train, Y_train) #training features not X, Y_train
y_nb = nb.predict(oh_u_test) #testing features not X
print(classification_report(Y_test, y_nb))
```
## Interpret
- Use eli5 to determine what predicts each category label.
```
# refit the logistic regression on the one-hot features so the weights match oh's vocabulary
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(oh_u_train, Y_train)
eli5.show_weights(estimator = logreg,
top = 10,
feature_names = oh.get_feature_names_out())
```
```
# we need to make a pipeline from sklearn
pipeline = make_pipeline(oh, nb)
pipeline.predict_proba(["barbie is terrible"])
# then we build the "explainer" which is a blank model that has the class names
explainer = LimeTextExplainer(class_names = Y_train.sort_values().unique())
# and then we apply the pipeline and explainer
# to new or old text
exp = explainer.explain_instance("barbie is terrible", # text
pipeline.predict_proba, # put in the answers from pipeline
num_features=10) # the number of features
exp = explainer.explain_instance(DF['clean2'].iloc[29], # text
pipeline.predict_proba, # put in the answers from pipeline
num_features=10) # the number of features
exp.as_pyplot_figure()
plt.show()
exp.save_to_file('BARBIE.html')
```
- Interpret the results by writing a paragraph explaining the output from this package.
## Set of Steps
Preprocessing
1) Fix errors in the text data
2) Think about the information that might influence the feature extraction model (stemming, lower casing, acronyms, spelling, stopwords, punctuation, numbers)
3) Balance of the outcomes
Feature Extraction
4) Split the data into training and testing
5) Create our feature extractions
Algorithm / Build Model
6) Apply the algorithm to the feature extraction
Results / Interpretation
7) Examine a classification report to pick the best model
- Remember what "chance" is?
- Check that accuracy > chance (a quick sketch of this check follows this list)
- Look at each label/outcome
- Models that NEVER predict one of the outcomes are likely not useful
- Watch for zero division warning
- F1 score is a weighted average, want both/all categories to be "good"
8) Look at lime, outcomes, eli5 to help with understanding the results
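A quick sketch of the "chance" check in step 7, using the sentiment label from the classification section:
```
# chance = the accuracy you get by always guessing the most common label
chance = DF['sent_label'].value_counts(normalize = True).max()
print("Chance (majority class) accuracy:", round(chance, 2))
# compare this number to the accuracy line in each classification report
```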
# Chatbots
- install rasa like you normally would for your machine set up
## Find Text
- Find a list of movies and themes to pick from.
```
# test we can open the data
# will put this in the endpoints section
import pandas as pd
DF = pd.read_csv("recommend.csv")
DF.head()
DF[DF["type"] == "crime"]["recommendation"].iloc[0]
```
## Train the Chatbot
- As a class, let's train a chatbot that recommends a movie based on a theme we pick.
- Go to Tools > Terminal (a project set-up sketch is below).
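- If you are starting from an empty folder, the usual first step (an assumption about your setup) is to scaffold the project, which creates the data/, actions/, and config files edited below:
```
rasa init --no-prompt
```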
### NLU
> data > nlu
```
version: "3.1"
nlu:
- intent: recommend
examples: |
- recommend
- recommend movie
- I want to watch a movie
- What do you recommend
- Recommend something to watch
- What can I watch tonight
- intent: category_movie
examples: |
- [drama](movie_request)
- [crime](movie_request)
- [scifi](movie_request)
- [horror](movie_request)
- [romantic](movie_request)
- intent: greet
examples: |
- hey
- hello
- hi
- hello there
- good morning
- good evening
- moin
- hey there
- let's go
- hey dude
- goodmorning
- goodevening
- good afternoon
- intent: goodbye
examples: |
- cu
- good by
- cee you later
- good night
- bye
- goodbye
- have a nice day
- see you around
- bye bye
- see you later
- intent: affirm
examples: |
- yes
- y
- indeed
- of course
- that sounds good
- correct
- intent: deny
examples: |
- no
- n
- never
- I don't think so
- don't like that
- no way
- not really
- intent: mood_great
examples: |
- perfect
- great
- amazing
- feeling like a king
- wonderful
- I am feeling very good
- I am great
- I am amazing
- I am going to save the world
- super stoked
- extremely good
- so so perfect
- so good
- so perfect
- intent: mood_unhappy
examples: |
- my day was horrible
- I am sad
- I don't feel very well
- I am disappointed
- super sad
- I'm so sad
- sad
- very sad
- unhappy
- not good
- not very good
- extremly sad
- so saad
- so sad
- intent: bot_challenge
examples: |
- are you a bot?
- are you a human?
- am I talking to a bot?
- am I talking to a human?
```
### Domain
> main folder
```
version: "3.1"
intents:
- greet
- goodbye
- affirm
- deny
- mood_great
- mood_unhappy
- bot_challenge
- recommend
- category_movie
responses:
utter_category:
- text: "What type of movie do you want a recommendation for?"
utter_options:
- text: "Your options are: drama, horror, scifi, crime, orå romantic."
utter_wait:
- text: "I will look up an example for {movie_request}."
utter_greet:
- text: "Hey! How are you?"
utter_cheer_up:
- text: "Here is something to cheer you up:"
image: "https://i.imgur.com/nGF1K8f.jpg"
utter_did_that_help:
- text: "Did that help you?"
utter_happy:
- text: "Great, carry on!"
utter_goodbye:
- text: "Bye"
utter_iamabot:
- text: "I am a bot, powered by Rasa."
entities:
- movie_request
slots:
movie_request:
type: text
mappings:
- type: from_entity
entity: movie_request
actions:
- action_get_recommendation
session_config:
session_expiration_time: 60
carry_over_slots_to_new_session: true
```
### Stories
> data > stories
```
version: "3.1"
stories:
- story: recommend movie
steps:
- intent: recommend
- action: utter_category
- action: utter_options
- intent: category_movie
- action: utter_wait
- action: action_get_recommendation
- story: happy path
steps:
- intent: greet
- action: utter_greet
- intent: mood_great
- action: utter_happy
- story: sad path 1
steps:
- intent: greet
- action: utter_greet
- intent: mood_unhappy
- action: utter_cheer_up
- action: utter_did_that_help
- intent: affirm
- action: utter_happy
- story: sad path 2
steps:
- intent: greet
- action: utter_greet
- intent: mood_unhappy
- action: utter_cheer_up
- action: utter_did_that_help
- intent: deny
- action: utter_goodbye
```
### Rules
> data > rules
```
version: "3.1"
rules:
- rule: Say goodbye anytime the user says goodbye
steps:
- intent: goodbye
- action: utter_goodbye
- rule: Say 'I am a bot' anytime the user challenges
steps:
- intent: bot_challenge
- action: utter_iamabot
```
### Actions
> actions > actions.py
```
# This file contains your custom actions which can be used to run
# custom Python code.
#
# See this guide on how to implement these actions:
# https://rasa.com/docs/rasa/custom-actions
# This is a simple example for a custom action which utters "Hello World!"
from typing import Any, Text, Dict, List
from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher
import pandas as pd
DF = pd.read_csv("recommend.csv")
class ActionMovie(Action):
def name(self) -> Text:
return "action_get_recommendation"
def run(self, dispatcher: CollectingDispatcher,
tracker: Tracker,
domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
# we can get values from slots by `tracker` object
movie_request = tracker.get_slot('movie_request')
answer = DF[DF["type"] == movie_request]["recommendation"].iloc[0]
answer = "Your movie recommendation is ".join(answer)
dispatcher.utter_message(text=answer)
return []
```
### Endpoints
```
# This file contains the different endpoints your bot can use.
# Server where the models are pulled from.
# https://rasa.com/docs/rasa/model-storage#fetching-models-from-a-server
#models:
# url: http://my-server.com/models/default_core@latest
# wait_time_between_pulls: 10 # [optional](default: 100)
# Server which runs your custom actions.
# https://rasa.com/docs/rasa/custom-actions
action_endpoint:
url: "http://localhost:5055/webhook"
# Tracker store which is used to store the conversations.
# By default the conversations are stored in memory.
# https://rasa.com/docs/rasa/tracker-stores
#tracker_store:
# type: redis
# url: <host of the redis instance, e.g. localhost>
# port: <port of your redis instance, usually 6379>
# db: <number of your database within redis, e.g. 0>
# password: <password used for authentication>
# use_ssl: <whether or not the communication is encrypted, default false>
#tracker_store:
# type: mongod
# url: <url to your mongo instance, e.g. mongodb://localhost:27017>
# db: <name of the db within your mongo instance, e.g. rasa>
# username: <username used for authentication>
# password: <password used for authentication>
# Event broker which all conversation events should be streamed to.
# https://rasa.com/docs/rasa/event-brokers
#event_broker:
# url: localhost
# username: username
# password: password
# queue: queue
```
To update the model use:
`rasa train && rasa shell`
BUT ALSO: in a new terminal (so you should have two!) use:
```
rasa run actions
```
## Test the Chatbot
Test the chatbot responses.