# Processing with AI

## Exploration: 👩‍⚖️ Ethics of AI

Name:

> Yuwei.Liu

Subject:

> Improve dating-app matching algorithms using NLP

[TOC]

## Design brief

### Bias

If we don't source our dataset with enough rigor, the following biases might appear:

> 1. The "emotion" function: some words have two or more different meanings, so it is hard to accurately detect the emotion a word is meant to convey. When users are matched on their self-introductions, the NLP model will make mistakes when deciding which two people have the same personality.
> 2. Only five emotions are available under the "keywords" category, but some words are not related to any of these five emotions: they are neutral words. For example, if someone tags himself #life#, he is not expressing any emotion.
> 3. We may use the "Relation" function to extract certain types of entities from the sentences of a self-introduction and use them as keywords to match different kinds of people. However, some people will use only emoji in their introductions, and it will be difficult to match them with people whose introductions are made entirely of words.

We will ensure that our model is not biased by:

> 1. Including more diversity in our dataset, and specifying the different possible meanings of each word and of word combinations.
> 2. When the system detects a neutral tag that expresses no emotion, using the "console" function to look up the same word directly in the dataset instead (a sketch of this fallback appears at the end of this brief).
> 3. Building a dedicated dataset for emoji, so that emoji-only introductions can also be matched (see the emoji sketch at the end of this brief).

### Overfitting

We will make sure our model does not overfit by

> Checking the accuracy of our model on a validation dataset. We separate our dataset into two parts: a training dataset and what is called a validation dataset. We train our model using only the training one, then we check its accuracy on the validation one. If the model still performs well there, we know that it will work on unknown inputs (a minimal sketch of this check appears at the end of this brief).

### Misuse

> We have to remind ourselves that our application could be misused by teenagers just looking for fun, since we can hardly tell a user's age from their words alone.

### Data leakage

*Choose the most relevant proposition:*

> In a catastrophic scenario, where all of our training dataset were stolen or recovered from our model, the risk would be that intimate keywords in users' self-introductions, such as ones related to "sex", would be exposed and attract unwanted attention from conservative people.

**OR**

> We have decided that our training dataset will be fully open-sourced, but before doing so we made sure that our competitors will not be able to simply copy our code.

### Hacking

> If someone found a way to "cheat" our model and make it output any prediction they want instead of the real one, the risk would be that our users' feelings would be hurt: they would be matched with people who are not their type at all, and would have to start new conversations with the wrong person.
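Below is a minimal Python sketch of the neutral-tag fallback described under Bias mitigation 2. The `NEUTRAL_THRESHOLD` value, the emotion-score format, and the `profiles` structure are assumptions made for illustration; only the idea of falling back to a literal word lookup in the dataset comes from the brief.

```python
# Sketch of Bias mitigation 2: fall back to a literal lookup when a tag
# carries no detectable emotion. NEUTRAL_THRESHOLD, the score format and
# the `profiles` data are hypothetical stand-ins for the real pipeline.

NEUTRAL_THRESHOLD = 0.2  # assumed: scores below this count as "no emotion"

def is_neutral(scores):
    """A tag is neutral when none of the five emotion scores clears the threshold."""
    return all(value < NEUTRAL_THRESHOLD for value in scores.values())

def match_by_tag(tag, scores, profiles):
    """Return the ids of users whose profiles contain the same literal tag."""
    if is_neutral(scores):
        # Fallback path: plain word lookup in the dataset, as the brief describes.
        return [uid for uid, tags in profiles.items() if tag in tags]
    return []  # the normal emotion-based matching would go here

profiles = {"u1": {"life", "hiking"}, "u2": {"music"}, "u3": {"life"}}
scores = {"joy": 0.05, "sadness": 0.02, "anger": 0.01, "fear": 0.03, "disgust": 0.01}
print(match_by_tag("life", scores, profiles))  # -> ['u1', 'u3']
```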
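For Bias mitigation 3, one possible alternative to hand-building an emoji dataset is to convert each emoji into its textual name and feed those names to the keyword matcher. The sketch below uses the third-party `emoji` package for this; everything around it is assumed.

```python
# Sketch of Bias mitigation 3: convert emoji in an introduction into word
# tokens so emoji-only profiles can be matched against text profiles.
# Requires the third-party `emoji` package (pip install emoji).
import emoji

def emoji_keywords(intro):
    """Turn each emoji into its textual name, e.g. '🎸' -> 'guitar'."""
    found = emoji.emoji_list(intro)  # locate every emoji in the string
    names = [emoji.demojize(item["emoji"]).strip(":") for item in found]
    return [name.replace("_", " ") for name in names]

print(emoji_keywords("🎸 🎵"))  # -> ['guitar', 'musical note']
```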
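Finally, here is a minimal sketch of the overfitting check from the Overfitting section, shown on a toy scikit-learn classification task; the real model and the features extracted from self-introductions would of course differ.

```python
# Sketch of the overfitting check: hold out a validation set, train only on
# the training set, then compare accuracies. The toy data stands in for the
# real self-introduction features.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 20% of the data as the validation set; never train on it.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
print(f"train accuracy: {train_acc:.2f}, validation accuracy: {val_acc:.2f}")
# A large gap between the two accuracies is the signature of overfitting.
```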