# data storytelling toolkit

## sophie's learning documentation

## Examples of great data storytelling: Pudding.cool

[wiki breakout](https://pudding.cool/2018/10/wiki-breakout/)
* interesting incorporation of sound to enhance/guide the storytelling experience
* slightly confusing UI though

https://pudding.cool/2019/02/gyllenhaal/
* not as much explicit storytelling, but the interactive/gamified element allows readers to experience/come up with the story on their own (which is really cool, and perhaps an extension of the experience of exploratory data analysis)

https://pudding.cool/2018/12/countries/
* such an interesting use of flags and small tidbits of stories to give the user a bird's-eye view of U.S.-foreign interactions over the last hundred years
* provides almost a puzzle-piecing experience as you think back to the history that might have created the graphics you see
* great example of how visualizing data can let you take a ton of data and glean conclusions from it

https://pudding.cool/2020/07/song-decay/
* love how this one uses one graph for the majority of the article (with many lines plotted on it) to introduce the question and the data they're using to answer it - the consistency really makes the structure feel complete
* then, the consistency is broken by using a new kind of graph to summarize the data in totality / answer the question clearly (what is the recognition rate of these different songs?)
* also, great incorporation of making the music itself playable as an enjoyable & helpful UI/storytelling piece

https://pudding.cool/2020/04/kidz-bop/
* cool use of a quiz to engage the reader and hint at the data collection/analysis
* consistent with the theme of interaction throughout the whole essay, rather than passive reading

https://pudding.cool/2019/04/vogue/
* utilizes scrollytelling to make more interesting visuals out of an otherwise fairly conventional article-esque essay

https://pudding.cool/2019/05/people-map/
* purely map-based, but with super cool colors

### a model deconstructed: analyzing https://pudding.cool/2020/07/song-decay/

notes on structure:
* black bar with project title at the top
* pre-title info:
  * where they got their data
  * an example graph featuring two subjects ("wannabe" and "no scrubs")
  * the guiding question of the article
* title, date, authors
* introductory text (including some interactivity, in that you can play a snippet of the music mentioned)
* subtitle + a graph (line chart, like all of the ones to follow save the last one) to begin the next section, which explains the methodology further through the use of another example ("No Diggity")
* subtitle + a graph which continues the analysis done in the previous section, but graphed in a slightly different way
* subtitle + chart with ALL of the songs graphed, each visible by hovering over its line
* subtitle + chart with examples of songs that aren't standing the test of time, contrasted with the average trajectory of all the songs they analyzed
* subtitle + chart which synthesizes their research into a new group: songs that seem to be standing the test of time
* subtitle + chart displaying the opposite
* subtitle + final chart (lollipop), which shows ALL of the data represented in a new way that visually demonstrates the gap between Gen Z and Millennial familiarity with certain songs

general notes:
* very muted color palette (beige, grey, blue, and pink)
* focused on scrollytelling, with largely static graphs and a few interactive ones

code notes:
* uses data-* attributes in the html
* charts are embedded in svgs
* uses d3
* html source: view-source:https://pudding.cool/2020/07/song-decay/
* js source: https://pudding.cool/2020/07/song-decay/main.js?version=1595681259737

## what we have to learn
* d3 (if we want to mimic this entirely)
* if not, we should use a data viz tool that lets us easily embed graphs in our code & allows the user to interact with our graph (see the sketch below for one option)
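one option (plotly comes up again in the analysis notes below) is to build charts as html widgets. a minimal sketch in R, with made-up recognition numbers standing in for the song-decay data ("wannabe" and "no scrubs" are the example subjects from the pudding piece; the axis and values here are hypothetical):

```r
# minimal sketch: an interactive, embeddable line chart via plotly in R
library(plotly)

# hypothetical recognition-rate data in the spirit of the song-decay piece
songs <- data.frame(
  birth_year  = rep(1980:1985, 2),
  song        = rep(c("wannabe", "no scrubs"), each = 6),
  recognition = c(0.95, 0.93, 0.90, 0.88, 0.85, 0.80,
                  0.90, 0.87, 0.85, 0.80, 0.78, 0.75)
)

# hover tooltips come for free; saveWidget() produces a standalone
# html file that can be embedded in a page
fig <- plot_ly(songs, x = ~birth_year, y = ~recognition, color = ~song,
               type = "scatter", mode = "lines")
htmlwidgets::saveWidget(fig, "recognition.html")
```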
## ideas for To Be or Not To Be article

### general notes
* opening question could be: **how differently can 50 actors possibly recite the same speech? and what can those differences (or similarities) tell us about its meaning?**
* audio data
* rank in a chart
* show the soliloquy and highlight how much different words are emphasized with colors
* show a mosaic of all the different videos you use
* include a quiz? where people mark what they emphasized/think could be emphasized
* include in the intro a way to play a few very different takes on the opening line
* also in the intro, talk about the cultural significance of the soliloquy
* ideas for how to measure emphasis of certain words:
  * Google speech-to-text WordOffset timestamps (give exactly how long every word was spoken for; could show which words actors linger on and which they speed past)
  * VoiceBase developer API provides timestamps for all words in a transcript
  * relative volume (how much it differs from the norm; i.e. something really quiet or really loud might both indicate emphasis)
  * IBM speech-to-text audio metrics (speech_level and non_speech_level measure decibel levels of speech and non-speech)
  * DeepAffects API
  * Spotify Developer API
* ideas for the process:
  * we can try to get one script running that mines all the data we want from one example audio file, and then apply that to each in a loop once we get it working
  * should we split up audio and speech metric analysis? or do it all at once?
  * how should we store/organize the data?
    * we'll likely be getting a few big json objects for each speaker
    * from those, we could grab particular values and plug them into a new json object
    * ex. in an object for each word stored within a larger transcript object, there should be a key for start/end timestamp, speech_level, and whatever other metrics we have
* organize/analyze data for particular visualizations:
  * for data on **word emphasis**, grab time duration & speech level for the top 5/10/15 etc. max words
    * create a graph of the most common max words (word cloud? bar chart?)
    * also, create a graph where you can toggle seeing the max words for each speaker (bar chart)
    * maybe create an index based on emphasis + speech level to measure this? or do a regression analysis with one independent variable (word) and two dependent variables (speech_level & duration)? (see the sketch after this list)
    * normalize speech_level based on the average speech_level of the whole speech?
      * sometimes people deliver this soliloquy as a whole very quietly
      * maybe have the speech_level-related variable in the index be not speech_level on its own, but deviation from the speech_level norm (so you get both quietest and loudest)
    * could visualize:
      * loudest words
      * quietest words
      * fastest words
      * slowest words
  * for data on **text content**, create a function that loops through every word in the transcripts and keeps track of deviations from a control transcript
    * create a visualization that measures the number of differences from the original text?
      * which original text, though? which folio version would we want to use?
    * create a visualization of which words/phrases are most often left out?
  * for data on **overall speed**, grab the last timestamp for each performance
    * create a graph that visualizes all the speakers organized from fastest to slowest
    * use this earlier in the essay? to lead into examinations of individual word duration?
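a minimal sketch of that emphasis index in R, assuming a hypothetical per-word tibble with start/end timestamps and a speech_level column (the numbers are made up for illustration):

```r
library(dplyr)

# hypothetical per-word data pulled from the speech-to-text json
words <- tibble::tibble(
  word         = c("to", "be", "or", "not", "to", "be"),
  start        = c(0.0, 0.4, 1.1, 1.5, 2.2, 2.6),
  end          = c(0.3, 1.0, 1.4, 2.1, 2.5, 3.4),
  speech_level = c(-38, -30, -40, -28, -39, -29)  # dB
)

emphasis <- words %>%
  mutate(
    duration = end - start,
    # deviation from the speaker's own norm, so a whispered delivery
    # can still register emphasis (catches both quietest and loudest)
    level_dev = abs(speech_level - mean(speech_level)),
    # toy index: z-scored duration plus z-scored level deviation
    index = scale(duration)[, 1] + scale(level_dev)[, 1]
  ) %>%
  arrange(desc(index))

head(emphasis, 10)  # top candidate "max words"
```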
### to do list
* get data!
  * YouTube API?
  * where to put this data??? it'll be a JSON object...
* analyze using either:
  * python
  * js
  * r
  * [Google Speech Analysis Framework](https://cloud.google.com/solutions/visualize-speech-data-with-framework)
  * [Google Speech to Text API](https://cloud.google.com/speech-to-text)
  * [VoiceBase speech analysis API](https://voicebase.readthedocs.io/en/v3/how-to-guides/hello-world.html)
  * [IBM Speech to Text](https://cloud.ibm.com/docs/speech-to-text?topic=speech-to-text-gettingStarted)
* from analysis, answer these questions:
  * **which words are most often emphasized by actors?**
    * we have to define "emphasized": loudest & longest? quietest or loudest & longest? loudest & shortest or longest? etc.
  * **how does the inflection differ across the whole speech?**
    * visualize all overlapping, but you can highlight and see the actor for each line
  * **how does speed differ?**
    * include a visualization of who goes through the speech the fastest versus who takes the most time
  * **when do different actors pause?**
    * include a visualization of who takes the most pauses to who takes the least
  * **what words do actors leave out?**
    * analyze the transcript of the audio to visualize each actor's different version of the speech
    * compare against one copy of the script to measure everyone's difference from that control (i.e. how many words differ from the control script)
  * (these could all be different sections of the essay that lead to various conclusions)
  * word sentiment heatmap?
* visualize!
* use these visualizations as a jumping-off point for literary analysis

### data analysis notes

UPDATES (9/18/20)!
* DeepAffects API + Pipedream webhook is working
* I now have data for the speech-to-text transcript (ex. word content & word duration) and emotion per few seconds
* current plan:
  * using data split into the columns seconds and emotion, I want to make individual time-series spirals of each speaker and a heatmap of all of the speakers to compare
  * want to use volume data (amplitude over time) to compare with word data and see how volume fluctuates on each word (organized by seconds)

UPDATES (9/23/20)
* compared how the emotion recognition contrasted for andrew scott, paapa essiedu & david tennant
* things to do next:
  * automate the data download process:
    * upload audio data to firebase storage
    * put the link into the API request
      * this is done via curl - is there any way to automate a curl request?
    * run the API request
    * get data from the webhook and put it into a js object
    * turn that into visualizations!
      * either download a csv file from the js object and upload that to R Studio to visualize there, or do so via d3.js!
  * think about what else to do with the data
    * compare word timestamps to emotion timestamps with the heatmap!!
  * also, check the david tennant data by running the api call again
  * finally, figure out how to organize data by second so that each second of the speech is connected to a word and an emotion - round up if decimals become a problem?
    * think about how to visualize this data!
    * do animations in R?

UPDATES (9/25/2020)
* make a heatmap out of the three emotion datasets - combine them into one dataset, where each second's row has a person_emotion column for each actor (see the sketch below)
* NEXT WEEK: make a dataset for analyzing the word & emotion relationship using the andrew scott data
  * bar chart with labels for words
* also a chord chart, where every phrase corresponds to an emotion?
  * make this after the bar chart so you can see the relationships, and see how it could be visualized - like an extended period of time with one emotion creates an emotion_phrase with that collection of words
* visualize emotion over time with the stream graph instead of a spiral graph
* experiment with interactivity using plotly
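a rough sketch of that heatmap in R with hypothetical emotion labels. one design note: rather than a person_emotion column per actor, ggplot2's geom_tile wants one long (tidy) dataset keyed by actor and second:

```r
library(dplyr)
library(ggplot2)

# toy stand-ins for the real per-actor emotion output from the API
scott   <- tibble::tibble(second = 1:4, emotion = c("sad", "sad", "angry", "neutral"))
essiedu <- tibble::tibble(second = 1:4, emotion = c("neutral", "sad", "sad", "sad"))
tennant <- tibble::tibble(second = 1:4, emotion = c("sad", "angry", "angry", "sad"))

# stack into one long dataset keyed by actor and (rounded-up) second
emotions <- bind_rows(
  scott   %>% mutate(actor = "scott"),
  essiedu %>% mutate(actor = "essiedu"),
  tennant %>% mutate(actor = "tennant")
) %>%
  mutate(second = ceiling(second))  # round up if decimals become a problem

# one row of tiles per actor, colored by the emotion in each second
ggplot(emotions, aes(x = second, y = actor, fill = emotion)) +
  geom_tile()
```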
UPDATES (10/1/20)
* made a dataset for analyzing the word & emotion relationship using the andrew scott data
  * bar chart with labels for words
* NEXT:
  * make a check that compares transcript data to the script of Hamlet to avoid word misspellings/correct words the API missed (ex. whether vs. weather, skipping over "or" & other small words)
  * also make a chord chart, where every phrase corresponds to an emotion?
    * make this after the bar chart so you can see the relationships, and see how it could be visualized - like an extended period of time with one emotion creates an emotion_phrase with that collection of words
  * visualize emotion over time with the stream graph instead of a spiral graph
  * experiment with interactivity using plotly

UPDATES (10/2/20)
* cleaned datasets to prepare for comparisons/fixing mistakes by the API transcriber
* next time:
  * make a function that fixes any misspellings or missed words by the transcript API
  * use this function to create a new graph!

UPDATES (10/7/20)
* ideas for specific viz images:
  * which phrase is most commonly associated with particular emotions?
  * heat map of all of the speeches and their emotion data
  * which speeches take the most pauses?
    * create a tibble of pauses?
  * which phrases are faster and which are slower?
    * ex. which squish more words into fewer seconds?
  * which speeches are longer than others?
  * how many differ substantially from the original text?
* next time/future:
  * start storyboarding and planning specific visualizations
  * make a list of all the performances you want to use
  * fix the code for checking & fixing the actor's transcript against the original text so that this persistent problem is resolved:
    * I don't know yet how to handle a case where a misspelling is directly followed (or followed two indices later) by another mistake; this gets classified as a missed sentence
      * possible solution: run the code more than once, until everything matches once the other mistakes are presumably fixed (nested for loop within an if statement)
    * fix the missed sentence case
      * handle the case when a missed sentence is in the middle of the text
        * how?? maybe if there are more than ~4 missed words in a row, categorize that as a missed sentence?
  * do you need the id column?
  * make the tibble a vector/array?

UPDATES (10/9/20)
* CURRENT PROBLEM:
  * if there is a sequence which isn't a sentence totally changed/cut by the actor, but which has so many misspellings/skipped words in a row that the function thinks it's a whole sentence missed, what should it do?
    * mobilize the recursive element more? and the count function?
    * rethink the definitions of the different potential states of mismatch?
    * maybe make the "whole sentence missed" state a lot harder to get to - only if none of the words in the next bunch of words match at all?
    * maybe use the count function to point unfixable sentences to "whole sentence missed"
      * and then have there only be 3 states: misspelled, missed word, and cut sentence? (see the sketch below)
* automate data collection: https://github.com/SEERNET/deepaffects-node/blob/master/docs/EmotionApi.md#syncRecogniseEmotion
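a very simplified sketch (not the real function) of how those three states could be told apart, assuming hypothetical word vectors `transcript` (the API output) and `script` (the control text), both lowercased with punctuation stripped, plus cursor positions `i` and `j`:

```r
classify_mismatch <- function(transcript, script, i, j, window = 6) {
  if (transcript[i] == script[j]) return("match")
  # misspelled: the very next words line up again
  if (i < length(transcript) && j < length(script) &&
      transcript[i + 1] == script[j + 1]) return("misspelled")
  # missed word: the transcript word matches the *next* script word
  if (j < length(script) && transcript[i] == script[j + 1]) return("missed word")
  # cut sentence: none of the next few transcript words appear among
  # the next few script words at all
  upcoming_script     <- script[j:min(j + window, length(script))]
  upcoming_transcript <- transcript[i:min(i + window, length(transcript))]
  if (!any(upcoming_transcript %in% upcoming_script)) return("cut sentence")
  # otherwise fall back to the least drastic state and let another
  # pass over the text (the recursive element) clean up the rest
  "misspelled"
}

# e.g. classify_mismatch(c("to", "bee", "or"), c("to", "be", "or"), 2, 2)
# returns "misspelled"
```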
UPDATES 10/14/20:
* function works now!!
* next week:
  * check the function's efficacy with other datasets and start graphing with the new, clean data!

UPDATES 10/16/20:
* fix function works, AND now accounts for time data, so the function approximates the second timestamp for all new words it adds
* NOW, we can really start graphing!
* next week:
  * check function with other data & start graphing!

UPDATES 10/21/20:
* analyses to include in the website:
  * how does every speech approach the "to be or not to be, that is the question" line in terms of emotions? how about in terms of speed?
    * who says it fastest? whose emotions are most similar?
  * what lines do people say in the most similar ways (in terms of emotion)?

UPDATES 10/23/20:
* make prototypes of the graphs listed above
* start getting more data!
  * work with the apis!
* got the API working with node - I'll try to automate it next week
* got word data for Paapa Essiedu and David Tennant!
* created this sheet to keep track of data: https://docs.google.com/spreadsheets/d/1rakhmXKV5Kz72K2v88bnJ7HsdJHqnOMEToe_R5Mbqag/edit#gid=0

UPDATES 10/28/20:
* wrote code in pipedream to get csv-ready data from each api post request
* got fixed word data for:
  * essiedu
  * lester
  * branagh
  * hiddleston
* next step:
  * get emotion data for:
    * lester
    * branagh
    * hiddleston
* graph idea:
  * create a column called "time_btwn" that measures the time between two words
  * find the biggest pause in each speech and compare! (see the sketch below)
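a sketch of that time_btwn column with dplyr, using a toy stand-in for the per-word timestamp data (one row per word per actor, start/end in seconds):

```r
library(dplyr)

# toy stand-in for the real per-word timestamp data
word_data <- tibble::tribble(
  ~actor,   ~word, ~start, ~end,
  "scott",  "to",   0.0,    0.3,
  "scott",  "be",   0.4,    1.0,
  "scott",  "or",   2.8,    3.0,  # long pause before "or"
  "lester", "to",   0.0,    0.4,
  "lester", "be",   0.5,    1.1,
  "lester", "or",   1.3,    1.5
)

pauses <- word_data %>%
  group_by(actor) %>%
  arrange(start, .by_group = TRUE) %>%
  mutate(time_btwn = start - lag(end)) %>%          # gap since the previous word ended
  slice_max(time_btwn, n = 1, with_ties = FALSE) %>%  # biggest pause per actor
  select(actor, word, time_btwn)
```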
UPDATES 10/30/20:
* got ALL word and emotion data for:
  * lester
  * branagh
  * scott
  * essiedu
  * glover
  * cumberbatch
  * hiddleston
  * hawke
* (part of the reason these are all more recent performances is that the api struggled with old audio)
* next week:
  * final data wrangling & graph time!!!!!

UPDATES 11/4/20:
* successfully organized ALL emotion & word data for each actor into one dataframe
* added columns for:
  * counts for each emotion
  * longest pauses in each performance
* made a streamgraph using that frame
* next steps:
  * MORE graphing & exploratory data analysis!! :))))))

UPDATES 11/6/20:
* thoughts on organizing the article:
  * approach the analysis from 3 angles:
    * duration/time
      * involve the idea of meter & context
        * for ex. glover v scott v hiddleston
        * show actors grouped by context (in little circle headshots): movie (hawke & branagh), radio/audiobook? (hiddleston), play/live performance (essiedu, scott, lester, cumberbatch), music video (glover)
        * what can we learn uniquely from each?
        * relationship between verse, rhythm, and how humans speak
    * volume
    * emotion
  * these are the three key categories of data i've attained
    * what can we learn from each?
* design ideas:
  * left panel = text, right = images/viz
    * for a graph mid-section, it remains static as you scroll through the left panel of text explaining it
    * for a graph beginning a section, take over the whole horizontal screen and stack text over image vertically
  * scrollytelling:
    * https://russellgoldenberg.github.io/scrollama/progress/
    * https://github.com/1aurend/learn-scrollytelling
  * host this on github pages?
* look into Peaks.js to visualize the waveforms of the audio data: https://medium.com/@indreklasn/peaks-js-interact-with-audio-waveforms-b7cb5bd3939a

UPDATES 11/11/20:
* all rms data is downloaded and stored in my dataset!
* started to play around with design & storyboarding: https://www.figma.com/file/eMAjoWOPyEAcMfCtRGyk31/tobeornottobe?node-id=0%3A1

UPDATES 11/13/20:
* don't forget to compile & cite image links once you've got a solid collection of them

UPDATES 11/18/20:
* made all three graphs for the demo of the first line analysis
* friday, need to normalize volume data
* top 3 figma tools:
  * pen tool
  * mask tool
  * prototyping
* research question:
  * maybe i'm thinking less about what new things i can learn about the speech and more about what new things i can learn about what data/tech can tell us about literature and/or performance

UPDATES 11/20/20:
* question: how can data analysis help us with literary analysis?
  * you can track word frequency, etc., but...
* open with the demo of the first line, but the next sections are dedicated to exploring what we can learn from each metric
  * some of the things we learn may be kinda obvious, but the important thing is that we can streamline the process and do it with a lot of data
* speed can teach us about:
  * emphasis
    * moments where the words per second are slowest
    * look at max words per sec:
      * scott: around "perchance... to dream" (9)
      * lester: around "to... die, to sleep" & around "fly to others that we know not of... thus conscience does make cowards" (7)
      * hiddleston: scattered (3)
      * hawke: around "the dread of something after death..." (4)
      * glover: "must give us pause... there's the respect" (4)
      * essiedu: "... to die, to sleep, to dream" (5)
      * cumberbatch: "must give us pause... there's the respect" (5)
      * branagh: "to be wished... to die" & "fly to others that we know not of... thus conscience does make cowards" (7)
    * moments where there is the longest rest
  * general approach to reading the verse aloud
    * reading it as verse?
      * indicated perhaps by more regular intervals of rest
    * reading it in a more human way?
      * indicated by less regular intervals of rest
* added pause analyses
* to do next week:
  * add a line at ~-33 dB to the volume graphs
  * get started on the next analysis stuff
  * also add a concluding paragraph on speed that highlights some other things we can learn from the time data (listed above)

UPDATES 11/23/20:
* to do next week:
  * check all of the speed pauses against the recordings
* volume analysis - loudest moment per actor (mean RMS per second; see the sketch below):
  * scott: second 17, mean RMS = -35.577, "nobler in the mind"
  * lester: second 91, mean RMS = -32.10560, "whips and scorns of time"
  * hiddleston: second 16, mean RMS = -15.872, "nobler in the mind"
  * hawke: second 1, mean RMS = -33.116, "to be or not to be"
  * glover: second 129, mean RMS = -16.13820, "who would fardels bear"
  * essiedu: second 106, mean RMS = -30.34784, "makes calamity of so long life"
  * cumberbatch: second 115, mean RMS = -25.44138, "sweat under a weary life"
  * branagh: second 159, mean RMS = -32.7005, "with the pale cast of thought"
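for reference, a sketch of how those loudest moments could be pulled out in R, assuming a hypothetical rms_data tibble with one dB reading every tenth of a second (values are negative, so the maximum mean is the loudest second):

```r
library(dplyr)

# toy stand-in for the per-sample RMS amplitude data
set.seed(1)
rms_data <- tibble::tibble(
  actor  = "scott",
  t      = seq(0, 19.9, by = 0.1),  # seconds
  rms_db = rnorm(200, mean = -40, sd = 5)
)

loudest <- rms_data %>%
  mutate(second = floor(t)) %>%           # bucket samples into whole seconds
  group_by(actor, second) %>%
  summarise(mean_rms = mean(rms_db), .groups = "drop") %>%
  group_by(actor) %>%
  slice_max(mean_rms, n = 1)              # loudest second per performance
```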