# ANLY 520 Spring

# Installation of Software

- Which should you use?
  - R/RStudio: mostly for users who are good with R and want to keep using RStudio because you've been using it a lot for other classes.
    - Start an R Markdown document and complete the assignment.
  - Anaconda/Jupyter/Spyder: you already use Python in this system a lot.
    - Start a Jupyter notebook and complete the assignment.
  - Datalore: for people who are newer at Python, OR have trouble installing software on their computer, OR have an older computer, OR have spaces in their username, OR want to be able to work together with their teammates or me.
    - Start a new notebook and complete the assignment.
- Notes:
  - You only need keras and tensorflow if you want to try deep learning in the classification section.

# Pycharm

- Install!
- Click Learn.
- New course.
- In Marketplace > Introduction to Python.
- Take a screenshot when you are done with the sections required (not all due at once).
- Be sure you can see the TERMINAL at the bottom in that screenshot.

# Processing Text

On class assignments, we all turn in the same code and data source.

## Libraries

```
# import just a function
from urllib.request import urlopen
from bs4 import BeautifulSoup

# import a whole library as a new name
import pandas as pd

# import a package as its own name
import nltk

# other packages
import re

# impurity function
RE_SUSPICIOUS = re.compile(r'[&#<>{}\[\]\\]')

def impurity(text, min_len=10):
    """returns the share of suspicious characters in a text"""
    if text is None or len(text) < min_len:
        return 0
    else:
        return len(RE_SUSPICIOUS.findall(text))/len(text)

# rest of stuff
import textacy.preprocessing as tprep

def normalize(text):
    text = tprep.normalize.hyphenated_words(text)
    text = tprep.normalize.quotation_marks(text)
    text = tprep.normalize.unicode(text)
    text = tprep.remove.accents(text)
    text = tprep.replace.phone_numbers(text)
    text = tprep.replace.urls(text)
    text = tprep.replace.emails(text)
    text = tprep.replace.user_handles(text)
    text = tprep.replace.emojis(text)
    return text

# install pyspellchecker !!!
from spellchecker import SpellChecker
spell = SpellChecker()

import spacy
nlp = spacy.load("en_core_web_sm")

import textacy
from itertools import chain
from collections import Counter
```

## Find Text

- As a class, we will find a text source to analyze. This text source will usually be a webpage or other dataset to examine and clean.
- Import the text into your report.
- If the text is one big long string, first break it into sentence segments and store it in a Pandas DataFrame.
```
myurl = "https://www.foxnews.com/sports/patrick-mahomes-fiery-message-win-bills-they-got-what-they-asked-for"
#myurl = "https://www.foxnews.com/lifestyle/newly-elected-school-board-pennsylvania-reclaims-indigenous-mascot-rejects-cancel-culture"

html = urlopen(myurl).read()
soupified = BeautifulSoup(html, "html.parser")
# soupified

# just try get_text()
try_text = soupified.get_text()
try_text[0:100]
```

- Regular expressions

```
# find an exact match for the first time this occurs
text = try_text[
    # everything from the end of this sentence and on
    re.search("To access the content, check your email and follow the instructions provided.", try_text).end():
    # now the end
    re.search("CLICK HERE TO GET THE FOX NEWS APP", try_text).start()
]
```

- Breaking down into sentences

```
# break down into sentences and put into DF
sentences = nltk.sent_tokenize(text)
type(sentences)

# convert to dataframe
DF = pd.DataFrame(sentences, columns = ["sentence"])
DF.head()
```

- We've used:
  - One big string (one variable)
  - A list, which uses `[]`
  - Dictionaries `{}`
  - Tuples `()`
  - DataFrame from `pandas`

## Length for Proposal

```
# do this on the full text, not the text broken into sentences
len(nltk.word_tokenize(text))
# be sure to import nltk in the proposal
```

## Fix Errors

- Examine the text for errors or problems by looking at the text.
  - Legit, just look at the text.
  - Looking for any type of "garbage" - dependent on what you are doing.
- Use the “impurity” function from class to examine the text for potential issues.

```
DF['score'] = DF['sentence'].apply(impurity)
DF
```

- Remove the noise with the regex function.
- Re-examine the impurity to determine if the data has been mostly cleaned.
  - Not necessary here because it looks fine.
- Normalize the rest of the text by using textacy.

```
DF['clean'] = DF['sentence'].apply(normalize)
DF
```

- Examine spelling errors in at least one row of the dataset.
  - Any time you have stuff with names, please do not do spelling.
  - Mostly, only do this if you have a specific goal.

```
# find all the unique tokens
# set() finds the unique values
# nltk.word_tokenize breaks the text down into words
# " ".join combines everything into one long text
# .to_list() converts the pandas column to a list
clean_tokens = set(nltk.word_tokenize(" ".join(DF['clean'].to_list())))

# what is wrong?
misspelled = spell.unknown(clean_tokens)

for word in misspelled:
    # what's the word
    print(word)
    print("\n")
    # Get the one `most likely` answer
    print(spell.correction(word))
    # Get a list of `likely` options
    print(spell.candidates(word))

# make a dictionary of the misspelled word and the correction
# use find and replace in re to fix them
```
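The last two comments describe a step without code: build a lookup of misspelled words and their corrections, then substitute the corrections back into the text. A minimal sketch of that idea, assuming the `spell`, `misspelled`, and `DF['clean']` objects created above (the whole-word `re.sub` replacement is one possible approach, not the only one):

```
# build a {misspelled: correction} dictionary from pyspellchecker's best guesses
corrections = {word: spell.correction(word) for word in misspelled
               if spell.correction(word) is not None}

# find and replace each misspelled word as a whole word in the cleaned column
def fix_spelling(text):
    for wrong, right in corrections.items():
        text = re.sub(r"\b" + re.escape(wrong) + r"\b", right, text, flags = re.IGNORECASE)
    return text

DF['clean'] = DF['clean'].apply(fix_spelling)
```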
## Pre-Processing

- Using spacy and textacy, pre-process the text to end up with a list of tokenized lists.

```
output = []

# only the tagger and lemmatizer
for doc in nlp.pipe(DF['clean'].tolist(), disable=["tok2vec", "ner", "parser"]):
    tokens = textacy.extract.words(doc,
                                   filter_stops = True,  # default True, no stopwords
                                   filter_punct = True,  # default True, no punctuation
                                   filter_nums = True,   # default False, no numbers
                                   include_pos = None,   # default None = include all
                                   exclude_pos = None,   # default None = exclude none
                                   min_freq = 1)         # minimum frequency of words
    output.append([str(word) for word in tokens])  # close output append

output[0:1]
```

- Create a frequency table of each of the tokens returned in this output. Below is some example code to get us started.

```
# all items
type(output)

# first list
type(output[0])

# first list, first item (this is the issue!)
type(output[0][0])

Counter(chain.from_iterable(output))
```

## Summarize

Write a paragraph explaining the process of cleaning data for an NLP pipeline. You should explain the errors you found in the dataset and how you fixed them. Explain the information that is gathered by using spacy and textacy and the final output. What did you learn from your frequency table? What is the text document about?

# Information Extraction

## Libraries

```
# libraries
import PyPDF2
import pandas as pd
import nltk
#nltk.download("punkt")
import re
import spacy

# only for datalore
import subprocess
#%%
print(subprocess.getoutput("python -m spacy download en_core_web_sm"))

nlp = spacy.load("en_core_web_sm")
import textacy
import summa
from summa import keywords
from snorkel.preprocess import preprocessor
from snorkel.types import DataPoint
from itertools import combinations
from snorkel.labeling import labeling_function
from snorkel.labeling import PandasLFApplier
import networkx as nx
from matplotlib import pyplot as plt
```

## Import Text

```
# creating a pdf file object
pdfFileObj = open('The_Shadow_Over_Innsmouth.pdf', 'rb')

# creating a pdf reader object
pdfReader = PyPDF2.PdfReader(pdfFileObj)

# how many pages
len(pdfReader.pages)

# creating a page object
pageObj = pdfReader.pages

# extracting text from page
# loop here to get it all
text = []
for page in pageObj:
    text.append(page.extract_text())

# closing the pdf file object
pdfFileObj.close()
```

## Convert to Sentences and Pandas

- `^` means starts with
- `[0-9]` means any of these digits
- `[a-zA-Z]` means any alpha Latin character, lower or upper case
- `$` means ends with
- `.` means any character
- `*` means zero or more of the previous character (so `.*` means zero or more of any character)

```
# combine the list of page texts into one string before tokenizing
book = " ".join(text)

# create a place to save the text
saved_words = []

# loop over each word
for word in nltk.word_tokenize(book):
    # if the word starts with a number and ends with a letter
    if re.search(r'^[0-9].*[a-zA-Z]$', word) is not None:
        # take out the numbers and save into our text
        saved_words.append(re.sub(r'[0-9]', '', word))
    # if not, then save just the word
    else:
        saved_words.append(word)

book = ' '.join(saved_words)
```

```
DF = pd.DataFrame(
    nltk.sent_tokenize(book),
    columns = ["sentences"]
)
DF.head()

# for IE, we want sentence and/or paragraph level structure
```

## Part of Speech Tagging

- Tag your data with spacy’s part of speech tagger.
- Convert this data into a Pandas DataFrame.

```
# easier to loop over the big text file than to loop over words AND rows in pandas
spacy_pos_tagged = [(str(word), word.tag_, word.pos_) for word in nlp(book)]

# each row represents one token
DF_POS = pd.DataFrame(
    spacy_pos_tagged,
    columns = ["token", "specific_tag", "upos"]
)
```

- Use the dataframe to calculate the most common parts of speech.

```
DF_POS['upos'].value_counts()
```

- Use the dataframe to calculate if words are considered more than one part of speech (crosstabs or groupby).

```
DF_POS2 = pd.crosstab(DF_POS['token'], DF_POS['upos'])

# convert to true/false to add up how many columns are not zero
DF_POS2['total'] = DF_POS2.astype(bool).sum(axis=1)

# print out the rows that aren't 1
DF_POS2[DF_POS2['total'] > 1]
```

- What is the most common part of speech? ANSWER THIS IN YOUR TEXT
- Do you see words that are multiple parts of speech? ANSWER THIS IN YOUR TEXT

## KPE

- Use textacy to find the key phrases in your text.
- In the R window, for R people:
  - `library(reticulate)`
  - `py_install("networkx < 3.0", pip = T)`

```
# textacy KPE
# build an english language pipe for textacy
en = textacy.load_spacy_lang("en_core_web_sm", disable=("parser",))

# build a processor for textacy using spacy and process the text
doc = textacy.make_spacy_doc(book, lang = en)

# text rank algorithm
print([kps for kps, weights in textacy.extract.keyterms.textrank(doc, normalize = "lemma", topn = 5)])

terms = set([term for term, weight in textacy.extract.keyterms.textrank(doc)])
print(textacy.extract.utils.aggregate_term_variants(terms))
```

- Use summa to find the key phrases in your text.

```
#TR_keywords = keywords.keywords(book, scores = True)
#print(TR_keywords[0:10])
```

- What differences do you see in their outputs? COMMENT ON HOW SLOW!
- Using textacy utilities, combine like key phrases. SEE ABOVE
- Do the outputs make sense given your text? ANSWER THIS QUESTION

## NER + Snorkel

- Use spacy to extract named entities.
- Create a summary of your named entities.

```
# easier to loop over the big text file than to loop over words AND rows in pandas
spacy_ner_tagged = [(str(word.text), word.label_) for word in nlp(book).ents]

# each row represents one token
DF_NER = pd.DataFrame(
    spacy_ner_tagged,
    columns = ["token", "entity"]
)
```

```
print(DF_NER['entity'].value_counts())

DF_NER2 = pd.crosstab(DF_NER['token'], DF_NER['entity'])
print(DF_NER2)

# convert to true/false to add up how many entity types are not zero
DF_NER2['total'] = DF_NER2.astype(bool).sum(axis=1)

# print out the rows that aren't 1
DF_NER2[DF_NER2['total'] > 1]
```

- Apply Snorkel to your data to show any relationship between names.

### Get the data into a good format

```
stored_entities = []

# first get the entities, there must be two for relationship matches
def get_entities(x):
    """
    Grabs the names using spacy's entity labeler
    """
    # get all the entities in this row
    processed = nlp(x)
    # get the tokens for each sentence
    tokens = [word.text for word in processed]
    # get all the entities - notice this is only for persons
    temp = [(str(ent), ent.label_) for ent in processed.ents if ent.label_ != ""]
    # only move on if this row has at least two
    if len(temp) > 1:
        # find all the combinations of pairs
        temp2 = list(combinations(temp, 2))
        # for each pair combination
        for (person1, person2) in temp2:
            # find the names in person 1
            person1_words = [word.text for word in nlp(person1[0])]
            # find the token numbers for person 1
            person1_ids = [i for i, val in enumerate(tokens) if val in person1_words]
            # output in (start, stop) token tuple format
            if len(person1_words) > 1:
                person1_ids2 = tuple(idx for idx in person1_ids[0:2])
            else:
                id_1 = [idx for idx in person1_ids]
                person1_ids2 = (id_1[0], id_1[0])
            # do the same thing with person 2
            person2_words = [word.text for word in nlp(person2[0])]
            person2_ids = [i for i, val in enumerate(tokens) if val in person2_words[0:2]]
            if len(person2_words) > 1:
                person2_ids2 = tuple(idx for idx in person2_ids)
            else:
                id_2 = [idx for idx in person2_ids[0:2]]
                person2_ids2 = (id_2[0], id_2[0])
            # store all of this in a list
            stored_entities.append(
                [x,             # original text
                 tokens,        # tokens
                 person1[0],    # person 1 name
                 person2[0],    # person 2 name
                 person1_ids2,  # person 1 id token tuple
                 person2_ids2   # person 2 id token tuple
                 ])

DF['sentences'].apply(get_entities)

# create dataframe in snorkel structure
DF_dev = pd.DataFrame(stored_entities,
                      columns = ["sentence", "tokens", "person1", "person2",
                                 "person1_word_idx", "person2_word_idx"])
```

### Figure out where to look (between and to the left)
```
# live locate home road roads in at street (locations tied together)
# family terms for people

# get words between the data points
@preprocessor()
def get_text_between(cand: DataPoint) -> DataPoint:
    """
    Returns the text between the two person mentions in the sentence
    """
    start = cand.person1_word_idx[1] + 1
    end = cand.person2_word_idx[0]
    cand.between_tokens = cand.tokens[start:end]
    return cand

# get words next to the data points
@preprocessor()
def get_left_tokens(cand: DataPoint) -> DataPoint:
    """
    Returns tokens in a small window to the left of the person mentions
    """
    # TODO: need to pass window as an input param
    window = 5

    end = cand.person1_word_idx[0]
    cand.person1_left_tokens = cand.tokens[0:end][-1 - window : -1]

    end = cand.person2_word_idx[0]
    cand.person2_left_tokens = cand.tokens[0:end][-1 - window : -1]
    return cand
```

### Figure out what to look for

```
# live locate home road roads in at street (locations tied together)
# family terms for people
found_location = 1
found_family = -1
ABSTAIN = 0

location = {"live", "living", "locate", "located", "home", "road", "roads",
            "street", "streets", "in", "at", "of"}

@labeling_function(resources=dict(location=location), pre=[get_text_between])
def between_location(x, location):
    return found_location if len(location.intersection(set(x.between_tokens))) > 0 else ABSTAIN

@labeling_function(resources=dict(location=location), pre=[get_left_tokens])
def left_location(x, location):
    if len(set(location).intersection(set(x.person1_left_tokens))) > 0:
        return found_location
    elif len(set(location).intersection(set(x.person2_left_tokens))) > 0:
        return found_location
    else:
        return ABSTAIN

family = {"spouse", "wife", "husband", "ex-wife", "ex-husband", "marry", "married",
          "father", "mother", "sister", "brother", "son", "daughter", "grandfather",
          "grandmother", "uncle", "aunt", "cousin", "boyfriend", "girlfriend"}

@labeling_function(resources=dict(family=family), pre=[get_text_between])
def between_family(x, family):
    return found_family if len(family.intersection(set(x.between_tokens))) > 0 else ABSTAIN

@labeling_function(resources=dict(family=family), pre=[get_left_tokens])
def left_family(x, family):
    if len(set(family).intersection(set(x.person1_left_tokens))) > 0:
        return found_family
    elif len(set(family).intersection(set(x.person2_left_tokens))) > 0:
        return found_family
    else:
        return ABSTAIN

# create a list of functions to run
lfs = [
    between_location,
    left_location,
    between_family,
    left_family
]

# build the applier function
applier = PandasLFApplier(lfs)

# run it on the dataset
L_dev = applier.apply(DF_dev)
```

```
L_dev
```

```
DF_combined = pd.concat([DF_dev,
                         pd.DataFrame(L_dev, columns = ["location1", "location2", "family1", "family2"])],
                        axis = 1)
DF_combined
```

```
DF_combined['location_yes'] = DF_combined['location1'] + DF_combined["location2"]
DF_combined['family_yes'] = DF_combined['family1'] + DF_combined["family2"]

print(DF_combined['location_yes'].value_counts())
print(DF_combined['family_yes'].value_counts())
```

- What might you do to improve the default NER extraction?

## Knowledge Graphs

### Slides Version

- Based on the chosen text, add entities to a default spacy model.
- Add norm_entity, merge_entity, and init_coref pipelines (a minimal sketch of this pattern follows this list).
- Update and add the alias lookup if necessary for the data.
- Add the name resolver pipeline.
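The slides walk through these steps with their own helper components, which are not reproduced here. Below is only a minimal sketch of the general add-a-pipe pattern for the first two bullets, assuming the `nlp` model and `book` text from earlier: a hypothetical, simplified `norm_entities` component (a stand-in for the slides' version) plus spacy's built-in `merge_entities`. The alias lookup, coreference, and name resolver pipes would be added with the same `add_pipe` pattern.

```
from spacy.language import Language
from spacy.tokens import Span

@Language.component("norm_entities")
def norm_entities(doc):
    """Simplified stand-in: trim leading determiners and stray punctuation off entity spans."""
    ents = []
    for ent in doc.ents:
        start, end = ent.start, ent.end
        # drop a leading article/punctuation token (e.g., "the Innsmouth" -> "Innsmouth")
        while start < end and doc[start].pos_ in ("DET", "PUNCT"):
            start += 1
        # drop trailing punctuation tokens
        while end > start and doc[end - 1].pos_ == "PUNCT":
            end -= 1
        if start < end:
            ents.append(Span(doc, start, end, label=ent.label_))
    doc.ents = ents
    return doc

# add the custom component after NER, then merge each entity into a single token
nlp.add_pipe("norm_entities", after="ner")
nlp.add_pipe("merge_entities", after="norm_entities")

doc = nlp(book)
print([(ent.text, ent.label_) for ent in doc.ents][0:10])
```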
### Or Use Your Snorkel Output

- Create a co-occurrence graph of the entities linked together in your text.

```
# locations only
DF_loc = DF_combined[DF_combined['location_yes'] > 0]
DF_loc = DF_loc[['person1', 'person2']].reset_index(drop = True)
cooc_loc = DF_loc.groupby(by=["person1", "person2"], as_index=False).size()

# family only
# family matches are coded -1 by the labeling functions, so look for negative sums
DF_fam = DF_combined[DF_combined['family_yes'] < 0]
DF_fam = DF_fam[['person1', 'person2']].reset_index(drop = True)
cooc_fam = DF_fam.groupby(by=["person1", "person2"], as_index=False).size()

# take out issues where entity 1 == entity 2
cooc_loc = cooc_loc[cooc_loc['person1'] != cooc_loc['person2']]
cooc_fam = cooc_fam[cooc_fam['person1'] != cooc_fam['person2']]

print(cooc_loc.head())
print(cooc_fam.head())
```

- This creates a dataframe of node 1 and node 2 (entity 1 to entity 2) and then the frequency (size).

```
# start by plotting the whole thing for location
cooc_loc_small = cooc_loc[cooc_loc['size'] > 1]

graph = nx.from_pandas_edgelist(
    cooc_loc_small[['person1', 'person2', 'size']] \
        .rename(columns={'size': 'weight'}),
    source='person1', target='person2', edge_attr=True)

pos = nx.kamada_kawai_layout(graph, weight='weight')
_ = plt.figure(figsize=(20, 20))
nx.draw(graph, pos,
        node_size=1000,
        node_color='skyblue',
        alpha=0.8,
        with_labels = True)
plt.title('Graph Visualization', size=15)

for (node1, node2, data) in graph.edges(data=True):
    width = data['weight']
    _ = nx.draw_networkx_edges(graph, pos,
                               edgelist=[(node1, node2)],
                               width=width,
                               edge_color='#505050',
                               alpha=0.5)

plt.show()
plt.close()
```

```
# now plot the whole thing for family
graph = nx.from_pandas_edgelist(
    cooc_fam[['person1', 'person2', 'size']] \
        .rename(columns={'size': 'weight'}),
    source='person1', target='person2', edge_attr=True)

pos = nx.kamada_kawai_layout(graph, weight='weight')
_ = plt.figure(figsize=(20, 20))
nx.draw(graph, pos,
        node_size=1000,
        node_color='skyblue',
        alpha=0.8,
        with_labels = True)
plt.title('Graph Visualization', size=15)

for (node1, node2, data) in graph.edges(data=True):
    width = data['weight']
    _ = nx.draw_networkx_edges(graph, pos,
                               edgelist=[(node1, node2)],
                               width=width,
                               edge_color='#505050',
                               alpha=0.5)

plt.show()
plt.close()
```

# Text Summarization

## Find Text

```
import pysrt
import pandas as pd
import re
from sentence_transformers import SentenceTransformer
# install faiss-cpu
import faiss
import time
from sumy.summarizers.text_rank import TextRankSummarizer
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.nlp.stemmers import Stemmer
from sumy.utils import get_stop_words
from sumy.summarizers.lsa import LsaSummarizer
import nltk
nltk.download("punkt")
nltk.download("stopwords")
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models import LdaModel
from gensim.corpora import Dictionary
from pprint import pprint

# tokenize, remove stopwords and non-alphabetic words, lowercase
def preprocess(textstring):
    stops = set(stopwords.words('english'))
    tokens = word_tokenize(textstring)
    return [token.lower() for token in tokens if token.isalpha() and token not in stops]

import pyLDAvis
import pyLDAvis.gensim_models  # don't skip this
import matplotlib.pyplot as plt
from rouge_score import rouge_scorer  # install rouge-score
```

```
subs = pysrt.open("bodies.srt")

DF = pd.DataFrame([
    {
        "Text": sub.text
    } for sub in subs])

DF
```

```
def remove_noise(text):
    text = re.sub(r"<.*>", " ", text)
    text = re.sub(r"{.*}", " ", text)
    text = re.sub(r"\[.*\]", " ", text)
    text = text.strip()
    return text

DF['clean'] = DF['Text'].apply(remove_noise)
DF = DF[DF['clean'] != ""]
DF
```

## Create A Search Engine
Using each sentence as your “documents”, create a search engine to find specific pieces of text.

```
# this is creating the embeddings
model = SentenceTransformer('msmarco-MiniLM-L-12-v3')
bodies_text_embds = model.encode(DF['clean'].to_list())
```

```
# Create an index using FAISS
index = faiss.IndexFlatL2(bodies_text_embds.shape[1])
index.add(bodies_text_embds)
faiss.write_index(index, 'index_bodies')
bodies_text_embds
```

```
# define a search
def search(query, k):
    t = time.time()
    query_vector = model.encode([query])
    top_k = index.search(query_vector, k)
    print('totaltime: {}'.format(time.time() - t))
    return [DF['clean'].to_list()[_id] for _id in top_k[1].tolist()[0]]
```

Search for several items.

```
search("cop", 10)
search("gun", 10)
search("car", 10)
```

Examine the results and comment on how well you think the search engine worked. ANSWER THIS QUESTION

## Create Text Summaries

- Create a human summary of the text.

```
human = "The story starts with the appearance of a dead body in Longharvest Lane in London. This event happens in the same location in four different years – 1890, 1941, 2023 and 2053 – and leads to four different detective investigations that eventually become interlinked with far-reaching consequences."
```

- Create text summaries using LSA, TextRank, and Topic Modeling.
  - Most of these methods don't really care what language the text is in.
  - If you want to stem, you need a language-specific model for that.
  - You may need a specific model for parsing languages without obvious spaces.
  - Specific stop words for your language are needed.
- Stemming is cutting off the affixes on a word:
  - warning --> warn, morning --> morn
  - In theory, it combines like word forms.
  - In practice, it's so-so at that task.
  - The other option is lemmatization, which uses a dictionary to look up the root word.

```
# for textrank or lsa
LANGUAGE = "english"         # language
stemmer = Stemmer(LANGUAGE)  # stemming

# combine the cleaned subtitle lines into one text to summarize (assumed step)
the_text = " ".join(DF['clean'].to_list())

# parse the document
parser = PlaintextParser.from_string(the_text, Tokenizer(LANGUAGE))

# text rank
tr_answer = []                     # create a space to save
tr_summary = TextRankSummarizer()  # create the summarizer algorithm

for sentence in tr_summary(parser.document, 5):
    # need to convert from the text rank "sentence" type to str
    tr_answer.append(str(sentence))

# put them all together
tr_answer = " ".join(tr_answer)

# print it out for our own reading
print(tr_answer)
```

```
lsa_summary = LsaSummarizer(stemmer)
lsa_summary.stop_words = get_stop_words(LANGUAGE)

lsa_answer = []
for sentence in lsa_summary(parser.document, 5):
    lsa_answer.append(str(sentence))

lsa_answer = " ".join(lsa_answer)
print(lsa_answer)
```

```
# Create a dictionary representation of the documents.
# tokenized sentences created earlier
sentences = DF['clean'].to_list()
processed_sentences = [preprocess(sent) for sent in DF['clean']]

dictionary = Dictionary(processed_sentences)
corpus = [dictionary.doc2bow(sent) for sent in processed_sentences]

# Train the topic model
LDAmodel = LdaModel(corpus = corpus,
                    id2word = dictionary,
                    iterations = 400,
                    num_topics = 10,
                    random_state = 100,
                    update_every = 1,
                    chunksize = 100,
                    passes = 10,
                    alpha = 'auto',
                    per_word_topics = True)

probs = [LDAmodel.get_document_topics(sentence) for sentence in corpus]

save_probs = []
i = 0
for document in probs:
    for (topic, prob) in document:
        if topic == 0:
            save_probs.append((sentences[i], prob))
    i = i + 1

# note: this replaces the earlier subtitle DF
DF = pd.DataFrame(save_probs, columns = ["sentence", "prob"])

topic_summary = " ".join(DF.sort_values(by = ["prob"], ascending = False)[0:5].sentence)
print(topic_summary)
```

```
vis = pyLDAvis.gensim_models.prepare(LDAmodel, corpus, dictionary, n_jobs = 1)
pyLDAvis.save_html(vis, 'LDA_Visualization.html')  ## saves the file
```

- Assess those summaries using the Rouge-N analyzer.

```
# print nicely
def print_rouge_score(rouge_score):
    for k, v in rouge_score.items():
        print(k,
              'Precision:', "{:.2f}".format(v.precision),
              'Recall:', "{:.2f}".format(v.recall),
              'fmeasure:', "{:.2f}".format(v.fmeasure))

# define the scorer
scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)

# print the scores
#print_rouge_score(scorer.score(human, tr_answer))
#print_rouge_score(scorer.score(human, lsa_answer))
print_rouge_score(scorer.score(human, topic_summary))
```

- Which summary was the best when compared to the human summary? ANSWER THIS QUESTION BY LOOKING AT THE SUMMARIES AND MAKING A JUDGMENT

# Classification

## Find Text

- As a class, we will find a text source to analyze to predict a categorical outcome.
- Import the text into your report.

## Library

```
import pysrt
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
# nltk.download("stopwords")
# nltk.download("punkt")
from string import punctuation
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from textblob import TextBlob
from sklearn.model_selection import train_test_split
import numpy as np

def document_vectorizer(corpus, model, num_features):
    vocabulary = set(model.wv.index_to_key)

    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        for word in words:
            if word in vocabulary:
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)
        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                for tokenized_sentence in corpus]
    return np.array(features)

from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
import eli5
import lime
from sklearn.pipeline import make_pipeline
from lime.lime_text import LimeTextExplainer
import matplotlib.pyplot as plt
```

```
subs = pysrt.open("barbie.srt")

DF = pd.DataFrame([
    {
        "Text": sub.text
    } for sub in subs])

DF.head()
```

## Set up Text

```
# deal with the errors
def clean_errors(text):
    text = re.sub("</?[a-z]+>", "", text)
    text = re.sub("♪", "", text)
    return text

DF['clean'] = DF['Text'].apply(clean_errors)
```

- Do preprocessing on your text to prepare it for the final machine learning model.

```
stops = set(stopwords.words("english"))
stops.add("...")
stops.add("'s")
stops.add("n't")
stops.add("'m")
stops.add("'re")
stops.add("'ll")
stops.add("'ve")
stops.add("'d")

stemmer = PorterStemmer()

def clean_up(text):
    # lower case everything so it's normalized
    text = text.lower()
    # clean up apostrophes (curly to straight)
    text = re.sub("’", "'", text)
    # remove stop words
    text = " ".join([word for word in nltk.word_tokenize(text) if word not in stops])
    # get rid of numbers and punctuation
    text = " ".join([word for word in nltk.word_tokenize(text) if not word.isdigit() and word not in punctuation])
    # remove leftover quote characters
    text = re.sub("'", "", text)
    text = re.sub("`", "", text)
    # stemming might be good - stem each word, not the whole string
    text = " ".join([stemmer.stem(word) for word in nltk.word_tokenize(text)])
    # take out the white space
    return text.strip()

DF['clean2'] = DF['clean'].apply(clean_up)
```

- What items do you think will be important in your preprocessing to clean? ANSWER THIS QUESTION

## Create Feature Extractions

### Create our labels

#### Noun example

```
nouns = r"put|your|nouns|here|with|separators"

def find_noun(text):
    # re.search looks for the nouns anywhere in the text
    if re.search(nouns, text):
        return 1  # match
    else:
        return 0  # no match

def remove_noun(text):
    text = re.sub(nouns, " ", text)
    return text

# use apply to run these on your dataframe
# (replace df and 'text' with your dataframe and column names)
df['noun_included'] = df['text'].apply(find_noun)
df['text_no_noun'] = df['text'].apply(remove_noun)
```

#### Sentiment example

```
# remove the empty cells
DF = DF[DF['clean2'].str.len() > 1]

# add fake labels (this is just for class)
# do this on clean (full words, not the fully processed text) for sentiment
DF['sent_score'] = [TextBlob(text).sentiment.polarity for text in DF['clean'].to_list()]

# assign a label
DF['sent_label'] = DF['sent_score'] > 0
```

### Balancing

```
DF['sent_label'].value_counts()

# create balanced data
DFB = DF.groupby('sent_label').sample(n = 520)
DFB['sent_label'].value_counts()
```

### Create test-train

```
X_train, X_test, Y_train, Y_test = train_test_split(DF['clean2'],      # X values
                                                    DF['sent_label'],  # Y values
                                                    test_size = 0.2,   # test size
                                                    random_state = 42)

print('Size of Training Data ', X_train.shape[0])
print('Size of Test Data ', X_test.shape[0])

XB_train, XB_test, YB_train, YB_test = train_test_split(DFB['clean2'],      # X values
                                                        DFB['sent_label'],  # Y values
                                                        test_size = 0.2,    # test size
                                                        random_state = 42)

print('Size of Training Data ', XB_train.shape[0])
print('Size of Test Data ', XB_test.shape[0])
```

### Feature extract

- Create a “one-hot” encoding using the count vectorizer and binary options.
```
# build a blank setup
oh = CountVectorizer(binary = True)

# fit the training data
oh_u_train = oh.fit_transform(X_train)

# transform the testing data to look like the training data
oh_u_test = oh.transform(X_test)

oh_u_train.shape
oh_u_test.shape

# build a blank setup
oh = CountVectorizer(binary = True)

# fit the training data
oh_b_train = oh.fit_transform(XB_train)

# transform the testing data to look like the training data
oh_b_test = oh.transform(XB_test)

oh_b_train.shape
oh_b_test.shape
```

- Create the bag of words encoding using the count vectorizer.

```
# build a blank setup
bow = CountVectorizer()

# fit the training data
bow_u_train = bow.fit_transform(X_train)

# transform the testing data to look like the training data
bow_u_test = bow.transform(X_test)

bow_u_train.shape
bow_u_test.shape

# build a blank setup
bow = CountVectorizer()

# fit the training data
bow_b_train = bow.fit_transform(XB_train)

# transform the testing data to look like the training data
bow_b_test = bow.transform(XB_test)

bow_b_train.shape
bow_b_test.shape
```

- Create the TF-IDF normalization using the tfidf vectorizer.

```
# build a blank setup
tfidf = TfidfVectorizer()

# fit the training data
tfidf_u_train = tfidf.fit_transform(X_train)

# transform the testing data to look like the training data
tfidf_u_test = tfidf.transform(X_test)

tfidf_u_train.shape
tfidf_u_test.shape

# build a blank setup
tfidf = TfidfVectorizer()

# fit the training data
tfidf_b_train = tfidf.fit_transform(XB_train)

# transform the testing data to look like the training data
tfidf_b_test = tfidf.transform(XB_test)

tfidf_b_train.shape
tfidf_b_test.shape
```

- Create two word2vec models:
  - Use a large number of dimensions that matches your tfidf.
  - Use cbow and skipgram embeddings.
  - Use a 5 window size.
```
# word2vec expects tokenized sentences (lists of words), so tokenize the splits first
X_train_tok = [nltk.word_tokenize(text) for text in X_train]
X_test_tok = [nltk.word_tokenize(text) for text in X_test]
XB_train_tok = [nltk.word_tokenize(text) for text in XB_train]
XB_test_tok = [nltk.word_tokenize(text) for text in XB_test]

# train the model on the data
wv_u = Word2Vec(X_train_tok,
                vector_size = 1600,  # dimensions
                window = 5,          # window size
                sg = 0,              # cbow
                min_count = 1,
                workers = 4)

# generate averaged word vector features from the word2vec model
wv_u_train_c = document_vectorizer(corpus = X_train_tok, model = wv_u, num_features = 1600)

# generate averaged word vector features from the word2vec model
wv_u_test_c = document_vectorizer(corpus = X_test_tok, model = wv_u, num_features = 1600)

wv_u_train_c.shape
wv_u_test_c.shape

# train the model on the data
wv_b = Word2Vec(XB_train_tok,
                vector_size = 1000,  # dimensions
                window = 5,          # window size
                sg = 0,              # cbow
                min_count = 1,
                workers = 4)

# generate averaged word vector features from the word2vec model
wv_b_train_c = document_vectorizer(corpus = XB_train_tok, model = wv_b, num_features = 1000)

# generate averaged word vector features from the word2vec model
wv_b_test_c = document_vectorizer(corpus = XB_test_tok, model = wv_b, num_features = 1000)

wv_b_train_c.shape
wv_b_test_c.shape
```

```
# train the model on the data
wv_u = Word2Vec(X_train_tok,
                vector_size = 1600,  # dimensions
                window = 5,          # window size
                sg = 1,              # skipgram
                min_count = 1,
                workers = 4)

# generate averaged word vector features from the word2vec model
wv_u_train_s = document_vectorizer(corpus = X_train_tok, model = wv_u, num_features = 1600)

# generate averaged word vector features from the word2vec model
wv_u_test_s = document_vectorizer(corpus = X_test_tok, model = wv_u, num_features = 1600)

wv_u_train_s.shape
wv_u_test_s.shape

# train the model on the data
wv_b = Word2Vec(XB_train_tok,
                vector_size = 1000,  # dimensions
                window = 5,          # window size
                sg = 1,              # skipgram
                min_count = 1,
                workers = 4)

# generate averaged word vector features from the word2vec model
wv_b_train_s = document_vectorizer(corpus = XB_train_tok, model = wv_b, num_features = 1000)

# generate averaged word vector features from the word2vec model
wv_b_test_s = document_vectorizer(corpus = XB_test_tok, model = wv_b, num_features = 1000)

wv_b_train_s.shape
wv_b_test_s.shape
```

## Classify

- Use at least two classification algorithms to predict the outcome of the data.
- Include the model assessment of these predictions for all models.
### Log Unbalanced

```
## one hot log unbalanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(oh_u_train, Y_train)    # training features, not X, plus Y_train
y_log = logreg.predict(oh_u_test)  # testing features, not X
print(classification_report(Y_test, y_log))

## bow log unbalanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(bow_u_train, Y_train)
y_log = logreg.predict(bow_u_test)
print(classification_report(Y_test, y_log))

## tfidf log unbalanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(tfidf_u_train, Y_train)
y_log = logreg.predict(tfidf_u_test)
print(classification_report(Y_test, y_log))

## word2vec cbow log unbalanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(wv_u_train_c, Y_train)
y_log = logreg.predict(wv_u_test_c)
print(classification_report(Y_test, y_log))

## word2vec skipgram log unbalanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(wv_u_train_s, Y_train)
y_log = logreg.predict(wv_u_test_s)
print(classification_report(Y_test, y_log))
```

### Log Balanced

```
## one hot log balanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(oh_b_train, YB_train)    # training features, not X, plus YB_train
y_log = logreg.predict(oh_b_test)   # testing features, not X
print(classification_report(YB_test, y_log))

## bow log balanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(bow_b_train, YB_train)
y_log = logreg.predict(bow_b_test)
print(classification_report(YB_test, y_log))

## tfidf log balanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(tfidf_b_train, YB_train)
y_log = logreg.predict(tfidf_b_test)
print(classification_report(YB_test, y_log))

## word2vec cbow log balanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(wv_b_train_c, YB_train)
y_log = logreg.predict(wv_b_test_c)
print(classification_report(YB_test, y_log))

## word2vec skipgram log balanced
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(wv_b_train_s, YB_train)
y_log = logreg.predict(wv_b_test_s)
print(classification_report(YB_test, y_log))
```

### NB Unbalanced

```
## one hot nb unbalanced
nb = MultinomialNB()
nb.fit(oh_u_train, Y_train)    # training features, not X, plus Y_train
y_nb = nb.predict(oh_u_test)   # testing features, not X
print(classification_report(Y_test, y_nb))

## bow nb unbalanced
nb = MultinomialNB()
nb.fit(bow_u_train, Y_train)
y_nb = nb.predict(bow_u_test)
print(classification_report(Y_test, y_nb))

## tfidf nb unbalanced
nb = MultinomialNB()
nb.fit(tfidf_u_train, Y_train)
y_nb = nb.predict(tfidf_u_test)
print(classification_report(Y_test, y_nb))

# have to fix negatives in our word2vec features (MultinomialNB needs non-negative values)
wv_u_train_c.min()
wv_u_test_c.min()
wv_u_train_s.min()
wv_u_test_s.min()

wv_u_train_c = wv_u_train_c + 1
wv_u_test_c = wv_u_test_c + 1
wv_u_train_s = wv_u_train_s + 1
wv_u_test_s = wv_u_test_s + 1

## word2vec cbow nb unbalanced
nb = MultinomialNB()
nb.fit(wv_u_train_c, Y_train)
y_nb = nb.predict(wv_u_test_c)
print(classification_report(Y_test, y_nb))

## word2vec skipgram nb unbalanced
nb = MultinomialNB()
nb.fit(wv_u_train_s, Y_train)
y_nb = nb.predict(wv_u_test_s)
print(classification_report(Y_test, y_nb))
```

### NB Balanced

```
## one hot nb balanced
nb = MultinomialNB()
nb.fit(oh_b_train, YB_train)    # training features, not X, plus YB_train
y_nb = nb.predict(oh_b_test)    # testing features, not X
print(classification_report(YB_test, y_nb))

## bow nb balanced
nb = MultinomialNB()
nb.fit(bow_b_train, YB_train)
y_nb = nb.predict(bow_b_test)
print(classification_report(YB_test, y_nb))

## tfidf nb balanced
nb = MultinomialNB()
nb.fit(tfidf_b_train, YB_train)
y_nb = nb.predict(tfidf_b_test)
print(classification_report(YB_test, y_nb))

# have to fix negatives in our word2vec features (MultinomialNB needs non-negative values)
wv_b_train_c.min()
wv_b_test_c.min()
wv_b_train_s.min()
wv_b_test_s.min()

wv_b_train_c = wv_b_train_c + 1
wv_b_test_c = wv_b_test_c + 1
wv_b_train_s = wv_b_train_s + 1
wv_b_test_s = wv_b_test_s + 1

## word2vec cbow nb balanced
nb = MultinomialNB()
nb.fit(wv_b_train_c, YB_train)
y_nb = nb.predict(wv_b_test_c)
print(classification_report(YB_test, y_nb))

## word2vec skipgram nb balanced
nb = MultinomialNB()
nb.fit(wv_b_train_s, YB_train)
y_nb = nb.predict(wv_b_test_s)
print(classification_report(YB_test, y_nb))
```

- Write a paragraph summarizing the results from your comparisons. What models are best? Are there any general differences/similarities in prediction you see? How well is each category label classified? What might you do to make the model better?

```
# build a blank setup
oh = CountVectorizer(binary = True)

# fit the training data
oh_u_train = oh.fit_transform(X_train)

# transform the testing data to look like the training data
oh_u_test = oh.transform(X_test)

## one hot bayes unbalanced
nb = MultinomialNB()
nb.fit(oh_u_train, Y_train)
y_nb = nb.predict(oh_u_test)
print(classification_report(Y_test, y_nb))
```

## Interpret

- Use eli5 to determine what predicts each category label.

```
# refit the logistic regression on the one-hot features so the feature names line up
logreg = LogisticRegression(max_iter = 10000)
logreg.fit(oh_u_train, Y_train)

eli5.show_weights(estimator = logreg,
                  top = 10,
                  feature_names = list(oh.get_feature_names_out()))
```

```
# we need to make a pipeline from sklearn
pipeline = make_pipeline(oh, nb)

pipeline.predict_proba(["barbie is terrible"])

# then we build the "explainer", which is a blank model that has the class names
explainer = LimeTextExplainer(class_names = Y_train.sort_values().unique())

# and then we apply the pipeline and explainer
# to new or old text
exp = explainer.explain_instance("barbie is terrible",    # text
                                 pipeline.predict_proba,  # put in the answers from the pipeline
                                 num_features=10)         # the number of features

exp = explainer.explain_instance(DF['clean2'].iloc[29],   # text
                                 pipeline.predict_proba,  # put in the answers from the pipeline
                                 num_features=10)         # the number of features

exp.as_pyplot_figure()
plt.show()

exp.save_to_file('BARBIE.html')
```

- Interpret the results by writing a paragraph explaining the output from this package.
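Before writing that paragraph, it helps to know the "chance" level referenced in the set of steps below: the accuracy you would get by always guessing the most common label. A minimal sketch, assuming the `Y_test` and `YB_test` splits created earlier:

```
# chance = accuracy of always predicting the majority class
print("Unbalanced chance:", Y_test.value_counts(normalize = True).max())
print("Balanced chance:  ", YB_test.value_counts(normalize = True).max())
```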
## Set of Steps

Preprocessing

1) Fix errors in the text data
2) Think about the information that might influence the feature extraction model (stemming, lower casing, acronyms, spelling, stopwords, punctuation, numbers)
3) Balance of the outcomes

Feature Extraction

4) Split the data into training and testing
5) Create our feature extractions

Algorithm / Build Model

6) Apply the algorithm to the feature extraction

Results / Interpretation

7) Examine a classification report to pick the best model
   - Remember, what is "chance"?
   - Check that accuracy > chance
   - Look at each label/outcome
     - Models that NEVER predict one of the outcomes are likely not useful
     - Watch for the zero division warning
   - F1 score is a weighted average; you want both/all categories to be "good"
8) Look at lime, outcomes, and eli5 to help with understanding the results

# Chatbots

- Install rasa like you normally would for your machine setup.

## Find Text

- Find a list of movies and themes to pick from.

```
# test that we can open the data
# this will be read again in the actions file
import pandas as pd
DF = pd.read_csv("recommend.csv")
DF.head()

DF[DF["type"] == "crime"]["recommendation"].iloc[0]
```

## Train the Chatbot

- As a class, let’s train a chatbot to recommend a movie based on a theme we pick.
- Go to Tools > Terminal.

### NLU > data > nlu

```
version: "3.1"

nlu:
- intent: recommend
  examples: |
    - recommend
    - recommend movie
    - I want to watch a movie
    - What do you recommend
    - Recommend something to watch
    - What can I watch tonight

- intent: category_movie
  examples: |
    - [drama](movie_request)
    - [crime](movie_request)
    - [scifi](movie_request)
    - [horror](movie_request)
    - [romantic](movie_request)

- intent: greet
  examples: |
    - hey
    - hello
    - hi
    - hello there
    - good morning
    - good evening
    - moin
    - hey there
    - let's go
    - hey dude
    - goodmorning
    - goodevening
    - good afternoon

- intent: goodbye
  examples: |
    - cu
    - good by
    - cee you later
    - good night
    - bye
    - goodbye
    - have a nice day
    - see you around
    - bye bye
    - see you later

- intent: affirm
  examples: |
    - yes
    - y
    - indeed
    - of course
    - that sounds good
    - correct

- intent: deny
  examples: |
    - no
    - n
    - never
    - I don't think so
    - don't like that
    - no way
    - not really

- intent: mood_great
  examples: |
    - perfect
    - great
    - amazing
    - feeling like a king
    - wonderful
    - I am feeling very good
    - I am great
    - I am amazing
    - I am going to save the world
    - super stoked
    - extremely good
    - so so perfect
    - so good
    - so perfect

- intent: mood_unhappy
  examples: |
    - my day was horrible
    - I am sad
    - I don't feel very well
    - I am disappointed
    - super sad
    - I'm so sad
    - sad
    - very sad
    - unhappy
    - not good
    - not very good
    - extremly sad
    - so saad
    - so sad

- intent: bot_challenge
  examples: |
    - are you a bot?
    - are you a human?
    - am I talking to a bot?
    - am I talking to a human?
```

### Domain > main folder

```
version: "3.1"

intents:
  - greet
  - goodbye
  - affirm
  - deny
  - mood_great
  - mood_unhappy
  - bot_challenge
  - recommend
  - category_movie

responses:
  utter_category:
  - text: "What type of movie do you want a recommendation for?"

  utter_options:
  - text: "Your options are: drama, horror, scifi, crime, or romantic."

  utter_wait:
  - text: "I will look up an example for {movie_request}."

  utter_greet:
  - text: "Hey! How are you?"

  utter_cheer_up:
  - text: "Here is something to cheer you up:"
    image: "https://i.imgur.com/nGF1K8f.jpg"

  utter_did_that_help:
  - text: "Did that help you?"

  utter_happy:
  - text: "Great, carry on!"

  utter_goodbye:
  - text: "Bye"

  utter_iamabot:
  - text: "I am a bot, powered by Rasa."
entities:
  - movie_request

slots:
  movie_request:
    type: text
    mappings:
    - type: from_entity
      entity: movie_request

actions:
  - action_get_recommendation

session_config:
  session_expiration_time: 60
  carry_over_slots_to_new_session: true
```

### Stories > data > stories

```
version: "3.1"

stories:

- story: recommend movie
  steps:
  - intent: recommend
  - action: utter_category
  - action: utter_options
  - intent: category_movie
  - action: utter_wait
  - action: action_get_recommendation

- story: happy path
  steps:
  - intent: greet
  - action: utter_greet
  - intent: mood_great
  - action: utter_happy

- story: sad path 1
  steps:
  - intent: greet
  - action: utter_greet
  - intent: mood_unhappy
  - action: utter_cheer_up
  - action: utter_did_that_help
  - intent: affirm
  - action: utter_happy

- story: sad path 2
  steps:
  - intent: greet
  - action: utter_greet
  - intent: mood_unhappy
  - action: utter_cheer_up
  - action: utter_did_that_help
  - intent: deny
  - action: utter_goodbye
```

### Rules > data > rules

```
version: "3.1"

rules:

- rule: Say goodbye anytime the user says goodbye
  steps:
  - intent: goodbye
  - action: utter_goodbye

- rule: Say 'I am a bot' anytime the user challenges
  steps:
  - intent: bot_challenge
  - action: utter_iamabot
```

### Actions > actions > actions.py

```
# This file contains your custom actions, which can be used to run
# custom Python code.
#
# See this guide on how to implement these actions:
# https://rasa.com/docs/rasa/custom-actions

# This is a simple example of a custom action which utters "Hello World!"

from typing import Any, Text, Dict, List

from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher

import pandas as pd
DF = pd.read_csv("recommend.csv")

class ActionMovie(Action):

    def name(self) -> Text:
        return "action_get_recommendation"

    def run(self, dispatcher: CollectingDispatcher,
            tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:

        # we can get values from slots through the `tracker` object
        movie_request = tracker.get_slot('movie_request')

        answer = DF[DF["type"] == movie_request]["recommendation"].iloc[0]
        # build the reply text (string concatenation, not join)
        answer = "Your movie recommendation is " + answer

        dispatcher.utter_message(text=answer)

        return []
```

### Endpoints

```
# This file contains the different endpoints your bot can use.

# Server where the models are pulled from.
# https://rasa.com/docs/rasa/model-storage#fetching-models-from-a-server

#models:
#  url: http://my-server.com/models/default_core@latest
#  wait_time_between_pulls: 10   # [optional](default: 100)

# Server which runs your custom actions.
# https://rasa.com/docs/rasa/custom-actions

action_endpoint:
  url: "http://localhost:5055/webhook"

# Tracker store which is used to store the conversations.
# By default the conversations are stored in memory.
# https://rasa.com/docs/rasa/tracker-stores

#tracker_store:
#    type: redis
#    url: <host of the redis instance, e.g. localhost>
#    port: <port of your redis instance, usually 6379>
#    db: <number of your database within redis, e.g. 0>
#    password: <password used for authentication>
#    use_ssl: <whether or not the communication is encrypted, default false>

#tracker_store:
#    type: mongod
#    url: <url to your mongo instance, e.g. mongodb://localhost:27017>
#    db: <name of the db within your mongo instance, e.g. rasa>
#    username: <username used for authentication>
#    password: <password used for authentication>

# Event broker which all conversation events should be streamed to.
# https://rasa.com/docs/rasa/event-brokers

#event_broker:
#  url: localhost
#  username: username
#  password: password
#  queue: queue
```

To update the model, use: `rasa train && rasa shell`

BUT ALSO: in a new terminal (so you should have two!) use:

```
rasa run actions
```

## Test the Chatbot

- Test the chatbot responses.