## Integrity Data Science Question and Answer: By Mohamed Sondo (650-695-7152). Date: 09/29/2020

## SQL (TERADATA)

- **Q1 Find the staff member with the second-highest salary in San Francisco**
```
select s.staffno, fname, lname, salary, propertyno, street, city,
       dense_rank() over (order by salary desc) r
from staff s
inner join propertyforrent p
  on s.staffno = p.staffno
where city = 'San Francisco'
qualify r = 2;
```
- **Q2 Find the names of all clients who cannot afford any available property for rent**
```
sel clientno, fname, lname
from client c
where maxrent < (sel min(rent) from propertyforrent);
```
- **Q3 Find the telephone number of the private owner who owns the highest number of properties without a viewing**
```
sel T.ownerno, T.telno, dense_rank() over (order by T.cnt desc) r
from (
    sel o.ownerno, o.telno, count(p.propertyno) cnt
    from privateowner o
    inner join propertyforrent p
      on o.ownerno = p.ownerno
    left join viewing v
      on p.propertyno = v.propertyno
    where v.propertyno is null
    group by 1, 2
) T
qualify r = 1;
```
- **Q4 Find the number of property views per branch within 1, 2, and 3 months of client registration**
```
select r.branchno,
       cast(ceiling(months_between(cast(v.viewdate as date),
                                   cast(r.datejoined as date))) as integer) month_after_joining,
       count(*) view_cnt
from viewing v
inner join registration r
  on v.clientno = r.clientno
where cast(ceiling(months_between(cast(v.viewdate as date),
                                  cast(r.datejoined as date))) as integer) in (1, 2, 3)
group by 1, 2;
```

# Part 2 - Domain Knowledge and Experiment

## Question 1

- **Question 1: The Facebook Integrity team wants to test the precision of a new enforcement that is meant to target spammers on Facebook. This enforcement blocks all messages that happen between users who don’t share any connections (mutual friends). The team wants you to design the experiment in a way that will provide conclusive results.**
- **Q1 How would you design and set up this experiment?**
  - I would use A/B testing along with hypothesis testing to measure the effectiveness of the newly added feature.
  - I would set up an A/B test with a control (the old system) and a treatment (the old system + the enforcement change):
  - For the hypothesis testing, we first formulate the null hypothesis. The null hypothesis is that the change has no effect on the number of fake accounts generated in a given time frame.
  - Set the significance level alpha.
  - Take a sample (for the new version) and compute the sample mean.
  - Calculate the p-value: the p-value (a probability between 0 and 1) is the probability of obtaining the observed difference under the null hypothesis; it measures the strength of the evidence. If the p-value is less than alpha, I will reject the null hypothesis. Otherwise, if the p-value is greater than or equal to alpha, I will not reject the null hypothesis.
  - For running the A/B test I would define the following: Version A is the old system, and Version B is the old system plus the message-blocking enforcement.
  - Randomly assign 50% of users to each version. Activate the test and let the experiment run for about a month or a quarter.
  - Analyze the results to see the performance of each version and determine which one had the bigger impact.
- **Q2 What information, if any, would you need to properly determine the sample size and length of the experiment?**
  - It is important to have a good amount of historical fake-account data: a database maintaining records of both good and fake accounts, with attributes such as account details and important features such as fake-account flags.
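  - As a rough sketch (not part of the original answer), the per-variant sample size needed to detect a given drop in fake-account prevalence can be estimated with Python's standard library. Every number below is an illustrative assumption, not a real Facebook figure:
```
from math import asin, sqrt, ceil
from statistics import NormalDist

# Illustrative assumptions only -- real values would come from historical data.
p_control = 0.05   # assumed fake-account prevalence under the old system
p_treat   = 0.04   # smallest rate worth distinguishing from the control
alpha     = 0.05   # significance level (two-tailed test)
power     = 0.80   # desired statistical power

z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # critical value for alpha
z_beta  = NormalDist().inv_cdf(power)           # critical value for power

# Cohen's h: effect size for the difference between two proportions
h = 2 * asin(sqrt(p_control)) - 2 * asin(sqrt(p_treat))

n_per_variant = ceil(((z_alpha + z_beta) / h) ** 2)
print(n_per_variant)
```
  - Under these assumed inputs the formula calls for a few thousand users per variant; halving the minimum detectable difference roughly quadruples the requirement.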
  - The sample size required to accept or reject the null hypothesis for a KPI expressed as a proportion (the prevalence rate of fake accounts in our case) also depends on the following parameters:
    - The prevalence rate of our control variation (the old system without the blocking feature);
    - The minimum difference between the rates of variations A and B that we want to detect;
    - The chosen confidence/significance level;
    - The chosen statistical power;
    - The type of hypothesis test: one- or two-tailed.
  - In addition, for the sample size, we can use an example from a comparable study; there is a chance that our type of study has already been undertaken by someone else. We can also use a formula to choose our sample size, depending on what we know about our population of accounts.
  - If we do not know much about our population, we can use Slovin's formula, n = N / (1 + Ne^2), where n = number of account samples, N = total account population, and e = error tolerance (level).
- **Q3 Which metrics would you plan to track in order to define the performance of the enforcement? Why did you choose these metrics?**
  - Defining my success criteria (metrics):
  - Number of correct predictions against known/actual fake-account data sets (the probability of an account being fake).
  - Prevalence is also a good way to understand what is happening on the platform, because it shows what percentage of active accounts are likely to be fake.
  - I am proposing the prevalence of fake accounts because it measures how many active fake accounts exist among our monthly active users within a given time period.
  - Build a confusion matrix.
  - Track statistics for false positives, false negatives, and classification accuracy.
  - Compute the F1 score (the harmonic mean of precision and recall) in the case of a classifier.
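  - These confusion-matrix statistics can be sketched in plain Python on hypothetical labels (1 = fake, 0 = legitimate; the data below is made up for illustration only):
```
# Hypothetical evaluation data: 1 = fake account, 0 = legitimate.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]   # ground-truth labels from the review team
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]   # enforcement / classifier decisions

# Confusion-matrix cells
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp)                 # share of flagged accounts that were truly fake
recall    = tp / (tp + fn)                 # share of fake accounts that were caught
f1        = 2 * precision * recall / (precision + recall)
accuracy  = (tp + tn) / len(y_true)

print(tp, fp, fn, tn)                      # 4 1 1 4
print(precision, recall, accuracy)
```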
  - Plot a ROC curve to see the trade-off between specificity and sensitivity.

## Question 2

- **Describe a time where you have conducted an investigation/an investigation type that you regularly engage in - explain the tools and data you used and why.**

At Apple, I used our historical production-incident data to build a machine learning model aimed at investigating and reducing the number of failure-incident tickets generated over time, saving the company time and money. The project was a proof-of-concept solution to augment the company's internal incident-management system and address key metrics such as scalability and efficiency.

After observing a pattern of increasing production-failure incident tickets, my team members in Global Business Intelligence and I started questioning its causes and its effects on the company's operations. After interviewing the cross-functional teams dealing with those incidents, we discovered their financial implications and the availability of historical incident data. The incident-management system kept a record of all resolved incidents and their root causes, as well as open incident tickets.

Knowing the power of data science, we proposed using machine learning for a Named Entity Recognition (NER) system to help diagnose and resolve subsequent incidents. With historical data at our disposal, we believed this approach could reduce the time taken to diagnose an incident, which in turn would allow us to close incidents faster. After submitting our proposal, we got the green light and the necessary access to build our POC.

We applied ETL and big-data technology to extract sample data from the system. We performed exploratory data analysis to understand the data, which was crucial for choosing the right model. We applied data-preprocessing techniques to clean and standardize the data, resulting in the selection of the right model.
We built and evaluated the performance of both classification and clustering algorithms. Ultimately, we had a functional natural language processing solution aided by topic modeling. This solution predicted the root cause of new and unknown incidents with a probability value based on the frequency of the NER tags, improving our incident diagnosis by 25%. We finally visualized the results with a Tableau dashboard.

## Question 3

How would you identify fake accounts on Facebook/Instagram?

- **Step 1**: Create a channel that allows normal users to report suspicious activity. Also set up a defense line by filtering requests; for instance, block whole IP ranges from even reaching the site. Have a system in place that uses a combination of signals, such as patterns of suspicious email addresses, suspicious actions, or other signals previously associated with fake accounts that were flagged or removed.
- **Step 2**: Track important account features such as connection patterns across the network, low follower count, recent join date, account activity, mutual friendships, and the number of accounts created from a single IP. For example, a spammer may try to create multiple accounts quickly from the same IP address, which could be a sign of a bot.
- **Step 3**: Have a domain-expertise team in place to label fake accounts. Those labels should be generated by people around the world who understand local cultural norms, for less biased results.
- **Step 4**: Use the labeled data to build a supervised classification algorithm (logistic regression, or an ensemble such as random forest) to classify subsequent fake accounts.

## Part 3 - Coding

- **1 - Two Sum** Given an array of integers, return the indices of the two numbers such that they add up to a specific target. You may assume that each input would have exactly one solution, and you may not use the same element twice.
![](https://i.imgur.com/cJ2EY2I.png)

**Source Code two sum**
```
# Authors: Mohamed Sondo
def SolutionTwoSum(nums, target):
    # Map each value to its (highest) index.
    lookupTable = dict((val, idx) for idx, val in enumerate(nums))
    # Return the first pair (idx, index of complement) whose complement
    # exists in the table at a different index. Exactly one solution is
    # assumed, so the generator always yields a result.
    return list(next((idx, lookupTable.get(target - val))
                     for idx, val in enumerate(nums)
                     if lookupTable.get(target - val, idx) != idx))

nums = [2, 7, 11, 15]
target = 9
SolutionTwoSum(nums, target)
```
- **Solution Breakdown**:
  - I create a map of value -> index for all (index, value) pairs in the nums list. The highest index is stored in the dictionary, and lower indices are overwritten. Assuming each input has exactly one solution, each input has a single unique output, so we never have to worry about returning a "wrong" pair of indices.
  - I loop through the enumeration of nums, getting idx as the index and val as the value, check whether target - val is a key in the dictionary, and simultaneously assert that the index it points to is not idx. When this is true, I return the tuple (idx, lookupTable.get(target - val)).

**2. Defang** We can "defang" an IP address and prevent people from clicking on it by inserting characters. For example, 18.233.0.21 becomes 18.233.0[.]21.
Given a defanged IP address, write a function to de-defang it, and output 'TRUE' if the result is a valid IP address.

![](https://i.imgur.com/TtSj2y5.png)

**Source Code For De-Defang**
```
# Authors: Mohamed Sondo
import re

def de_defang(defanged_ip):
    # Restore the real dots before validating.
    ip = defanged_ip.replace('[.]', '.')
    print('The de-defanged IP is:', ip)
    return ip

def validIp(ip):
    # Regular expression for validating an IPv4 address:
    # four octets, each in the range 0-255.
    regex = (r'^(25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)\.'
             r'(25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)\.'
             r'(25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)\.'
             r'(25[0-5]|2[0-4][0-9]|[0-1]?[0-9][0-9]?)$')
    # Pass the regular expression and the string to search()
    if re.search(regex, ip):
        print("De-defanged IP is valid: TRUE")
    else:
        print("De-defanged IP is not valid: FALSE")

# Calling the functions with a test case
ip = de_defang("18.233.0[.]21")
validIp(ip)
```
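As an alternative to the hand-written regular expression, the validity check can be done with Python's standard-library `ipaddress` module; this is a sketch, not part of the original solution:
```
import ipaddress

def is_valid_ipv4(ip):
    """Return True if ip parses as a well-formed IPv4 address."""
    try:
        ipaddress.IPv4Address(ip)
        return True
    except ValueError:
        # Raised (as AddressValueError) for malformed input.
        return False

print(is_valid_ipv4("18.233.0.21"))    # True
print(is_valid_ipv4("18.233.0[.]21"))  # False
print(is_valid_ipv4("256.1.1.1"))      # False
```
Letting `ipaddress` do the parsing avoids maintaining the octet-range regex by hand.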