Philly AI Crime Project

# Previously on Drexel AI:

https://arxiv.org/pdf/2001.09764.pdf

features:
* crime location $(x,y)$
* time $t$

$f((x,y), t)=\text{crime type}$

More practically, we can also tell the user how confident the model is in each classification by returning class probabilities.

techniques used (for reference): data cleaning, feature selection, outlier detection, and component reduction and transformation.

engineered features:
* cluster centers of high-crime areas, with the distance to them used as a feature

Preprocessed features:
1) Hour
2) Month
3) Year
4) DayOfWeek
5) Is_Weekend
6) X
7) Y
8) Is_Intersection
9) Is_Block
10) Police District
11) Street_Type (St, Blvd, Ave, etc.)

Previously, researchers studied time (regression of crime points over steps in a time series) and location (predicting neighbors of crime-dense locations).

They looked at specific crime incidents and aggregated them over hours, months, and years to find patterns in the data.

The ML models do not depend on the city, as clusters of **crime hotspots** are generated automatically. We can also stack crime points from all times on top of each other to get rid of the temporal dimension (time). This means a clustering technique could be set up for any city or town, reducing data preprocessing work.

* They only got 20-30% accuracy, which is very low. We should be able to beat it with more features and possibly different models. This is why they reported log loss instead of accuracy in the abstract, which shows that this problem is likely more difficult than we thought.
* We also have better services available for model training (services other than Colab).
  Also, Colab has recently been providing better GPUs for free, so we might not face session timeouts (meaning better data reporting).

## General ideas/things after reading the research

Time used was the dispatch time that the 911 call operator recorded
* Not sure if this matters a lot but just wanted to mention it
* We too will likely use the dispatch time given

This research used Euclidean distance, but we can also try other ways of calculating distance, such as city-block distance (mentioned in Future Works)

It would also be interesting to see how Covid-19 affected crime rates (just like the effects of the 2009 recession seen in this research)

Note about data: In the Crime Incidents dataset visualization (https://data.phila.gov/visualizations/crime-incidents), "Homicide - Criminal" has two different entries (essentially the same name but different categories). This research treated them as one and we might need to do the same, so it is something to keep in mind

If the missing data is only lat-lng values, then we can potentially use the given address to look up the lat-lng values and fill in the missing data
* Might be a better option than deleting entire entries

# Literature Review:

## Crime Analysis Through Machine Learning (2018)

https://ieeexplore.ieee.org/document/8614828

Summary: This paper analyzes Vancouver's crime data and tries to build a crime prediction model using K-nearest neighbor and boosted decision tree. The former achieves an accuracy of 39% while the latter achieves 44%. Overall, the paper provides good insight into how crime analysis and prediction could be performed. It also reviews various other research papers related to crime analysis and prediction, most of which used different techniques than the one presented here.
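
Since this paper's K-nearest neighbor model and the distance note under "General ideas" both depend on how distance between points is measured, here is a minimal from-scratch sketch of a nearest-neighbor vote with a swappable distance function. It is not the paper's implementation: the function names, the toy coordinates, and the labels are all made up for illustration, and a real experiment would use projected incident coordinates and a library implementation.

```python
from collections import Counter
import math


def euclidean(a, b):
    """Straight-line distance between two (x, y) points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def city_block(a, b):
    """City-block (Manhattan) distance: sum of absolute coordinate differences."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def knn_predict(train, query, k, distance):
    """Return the majority label among the k training points nearest to `query`.

    `train` is a list of ((x, y), label) pairs.
    """
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy points, NOT real incident data.
toy_train = [
    ((2.9, 0.0), "theft"),
    ((2.0, 2.0), "vandalism"),
    ((6.0, 6.0), "fraud"),
]
query = (0.0, 0.0)

# The diagonal point is closer in a straight line, while the point along the
# axis is closer when travel is restricted to a street grid.
print(knn_predict(toy_train, query, k=1, distance=euclidean))
print(knn_predict(toy_train, query, k=1, distance=city_block))
```

On these toy points the two metrics pick different nearest neighbors, which is the reason it may be worth comparing them on the Philly data rather than assuming Euclidean distance.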

Introduction:
* This research focuses on machine-learning-based crime prediction
* Vancouver PD has been managing a crime database that gets updated every Sunday morning, showcasing the crime that took place in the city each week
* VPD introduced a crime-predictive model and saw a 27% decrease in residential break-ins
* Main objective is to use VPD's crime dataset + Vancouver's neighborhood dataset to create an accurate crime prediction model (target is crime type)
* Techniques used: KNN and boosted decision tree

Background
* A lot of background study was done on past research in crime prediction, analysis, and control
* Some interesting studies and findings
    * One study used KNN, Naïve Bayesian, and Decision Tree to study road accident patterns in Ethiopia and achieved accuracies from 79% to 81%
    * Most research in crime prediction is focused on identifying crime hotspots
    * One study in Vancouver tried to model known offenders' activities using probabilistic modeling of the spatial behavior known for these offenders
        * Something that could be done for Philly as well
    * One study analyzed various crime-prediction methods and the results are as follows: "Knowledge Discovery in Databases (KDD) techniques, which combine statistical modelling, machine learning, database storage, and AI technologies, was suggested as an effective tool for crime prediction"

Techniques Used in the Research
* Like the Philly crime research, time and location were used in the data. They also used neighborhood data for Vancouver to distinguish crime amongst its 22 neighborhoods/areas
* Two approaches were used for data preprocessing
    * Approach 1: All categorical variables were converted into binary variables 0 and 1. Basically, for each data point, there were 21 zeros and 1 one to represent the neighborhood in which the crime took place.
      Similarly, all the days of the week were made into features, and a 1 was used to mark the day on which the crime took place
        * Benefits: Gave more variables to train the model on, and prevented data from skewing to one side
    * Approach 2: Categorical variables were converted into numerical values with unique IDs. All crime types and neighborhoods had different IDs, and these values were used in each data point

![](https://hackmd.io/_uploads/ryHir0902.png)

Results
* Boosted decision tree performed better than KNN
* KNN results
    * Approach 1: Accuracy - 40.1%, Training time - 2209 seconds
    * Approach 2: Accuracy - 39.9%, Training time - 102 seconds
* Boosted Decision Tree
    * Approach 1: Accuracy - 41.9%, Training time - 904 seconds
    * Approach 2: Accuracy - 43.2%, Training time - 459 seconds

Observations
* Used a choropleth map to describe the geographic information about crime incidents
* GIS has been used for crime mapping (shows location of crime series with various geographic locations)
* The addresses were converted into latitude and longitude data (WGS84)
* Python libraries used for plotting graphs: PySAL, GeoPandas, Folium, Shapely
* 0s and NA were used to fill missing values
* Overall crime pattern similar to the Philly one
    * Increase in crime in summer, with a peak around June to August, and decrease in winter, with the lowest being December and February
    * Like Philly, crime is at its lowest around 5 to 6 am, starts to increase around lunch time (~12pm), and continues to increase till midnight

## Predicting and Preventing Crime: A Crime Prediction Model Using San Francisco Crime Data by Classification Techniques

https://www.hindawi.com/journals/complexity/2022/4830411/
https://www.kaggle.com/competitions/sf-crime
https://datasf.org/opendata/

Summary: A study that compared and proposed crime prediction models based on Naive Bayes, Random Forest, and Gradient Boosting Decision Tree.
The model analyzed the top ten crimes in the San Francisco area and achieved accuracies of 65.82%, 63.43%, and 98.5%, respectively.

Introduction
* This study proposes a prediction model that can predict crime in San Francisco based on historical data.
* Uses the SF Crime Classification dataset found and managed on Kaggle (used in competitions as well)
* Naive Bayes, Random Forest, and Gradient Boosting Decision Tree are used for prediction and for classification of crimes into two types: violent and nonviolent

Background
* The researchers summarized previous research articles related to crime prediction and analysis, especially ones focusing on SF
    * One study comparing Naive Bayes and Decision Tree classifiers found the Naive Bayes classifier to be the better-performing one
    * Other studies disagreed, with one proposing Gradient Tree Boosting and another showing the Decision Tree classifier to be better suited to the crime classification problem
        * The Decision Tree classifier achieved 83.95% accuracy. The main focus was prediction of crime categories for different states in the US

Summary of Data:
* 9 total selected features: Date, Category, Description, DayOfWeek, PdDistrict, Resolution, Address, X, Y
    * Description and Resolution are short descriptions of crimes and their results and thus were dropped from the data
* Data Transformation as follows:
    * Date broken down into Year (2003-2015), Month (1-12), Day (1-31), Hour (0-23)
    * DayOfWeek and PdDistrict indexed and replaced by numbers in (1-7) and (1-10) respectively
* 878049 total records with 80/20 validation-test split (after shuffling)
* For prediction, the dependent variable is Category (i.e. the type of crime).
  The rest are used as independent variables
* For classification, the main objective is to classify crime as either violent or nonviolent

Data Analysis Results
* Like other studies, this study used graphs based on varying time scales (hour, week, month, year) to find patterns in the data
* Commonalities with Philly:
    * \>30 unique crime types measured in the study (although only the top 10 were used for analysis)
    * Crime increased and decreased based on seasons
    * Thefts, Narcotics/Drug Law Violation, Vandalism, Vehicle Thefts, etc. are among the most common crime types in both cities
    * Interestingly, when viewing total crimes per hour, both cities experience decreased crime between 3am and 6am and start to see a peak around 5pm to 6pm, with crimes increasing until midnight
        * Might show that crime patterns within a day do not change between cities and thus the model could be broadly applicable
* Differences
    * In SF, crime peaks around winter and fall, while in Philly crime peaks around spring and summer
        * This suggests season can be an important factor when focusing on crime rate and density, and that the peak crime season varies with geography

Summary of Prediction and Classification Model
* Metrics used: Accuracy, Precision, Recall for prediction models and ROC and Lift for classification models
* Equations used, based on the confusion matrix
    * $\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$
    * $\text{Recall} = \frac{TP}{t} = \frac{TP}{TP + FN}$
    * $\text{Specificity} = \frac{TN}{n} = \frac{TN}{TN + FP}$
    * $\text{Precision} = \frac{TP}{p} = \frac{TP}{TP + FP}$
* Classification Results (Testing Data only)
    * Naive Bayes
      ![](https://hackmd.io/_uploads/SkNSmTjCn.png)
    * Random Forest
      ![](https://hackmd.io/_uploads/SJAvQas0h.png)
    * Gradient Boosting Decision
      ![](https://hackmd.io/_uploads/B1Uc7TjCn.png)
* Prediction Results (Testing Data only)

| Method                     | Accuracy | Precision | Recall |
|:-------------------------- | -------- | --------- |:------ |
| Naive Bayes                | 64.33%   | 64.67%    | 63.88% |
| Random Forest              | 63.43%   | 63.29%    | 62.80% |
| Gradient Boosting Decision | 99.75%   | 100%      | 99.50% |

## Aoristic Crime Analysis

Introduction
* In crime analysis, crime hotspots are often used to find areas with a higher density of criminal activity.
* This research outlines a different approach to finding these crime hotspots and presents a framework for temporal analysis of aoristic crime data (aoristic = without a defined occurrence in time)
* At the time, more emphasis was placed on spatial data than on temporal data in crime analysis (as evidenced by algorithms like Openshaw's GAM)
* Focusing on temporal analysis can help us identify patterns in crime and focus on lower-density crime areas where increasing crime may not be evident from spatial analysis alone

---- Notes for this research halted for now due to inapplicability with current research data ----

## A Time Series Analysis of Associations between Daily Temperature and Crime Events in Philadelphia, Pennsylvania

**Introduction**
* Temperature and its effects on several factors have been studied in the past
    * Example: temperature and aggressive behavior (the hottest and coldest temperatures have a high correlation with an increase in aggressive behavior), temperature and mortality and morbidity, etc.
* Likewise, it would be of interest to study how temperature and fluctuations in temperature are associated with crime and whether they have any impact on crime
* Study findings: "There was a positive, linear relationship between deviations of the daily mean heat index from the seasonal mean and rates of violent crime and disorderly conduct, especially in cold months"
    * NOTE: Only studied specific categories of crime (disorderly conduct and violent crimes), so findings might not generalize to the entire population of crimes
* Theories explaining the relationship between temperature and aggressive behavior

| Theory                          | Summary                                                                                  |
| ------------------------------- |:---------------------------------------------------------------------------------------- |
| Negative affect escape model    | Aggressive behavior highest at moderate temperatures (lower at highest and lowest temps) |
| Simple negative affect model    | Aggressive behavior highest at coldest and hottest temperatures                          |
| General affect aggression model | Linear relationship between temperature and aggression                                   |

***Routine Activity Theory***: "Treats crimes as events that occur as a result of spatial and temporal meeting of motivated offenders with suitable targets, and during times when individuals who would prevent crimes from occurring are absent"
* Conducted a time-series analysis to find associations between temperature and crime

**Data + Methodologies used**
* Used crime data from January 1, 2006 through December 31, 2015 from OpenDataPhilly
* Categorized crimes into Part 1 crimes (40% of total crime) and Part 2 crimes (60%)
    * Part 1 crimes: homicide, rape, robbery, aggravated assault, burglary, and thefts
    * Part 2 crimes: assaults, arson, forgery and counterfeiting, fraud, embezzlement, receiving stolen property, vandalism/criminal mischief, weapon violations, prostitution and commercialized vice, other sex offenses, narcotic/drug law violations, gambling violations, offenses against family and children, driving under the influence, liquor law violations, public drunkenness, disorderly conduct, and vagrancy/loitering
* Mostly focused on **three** groups of crime: violent crimes, robberies, and disorderly conduct
* Measured association between temperature and crime in two ways: first by analyzing all data points from 2006 through 2015, and second by season (fall, winter, spring, summer)
    * Later, they also evaluate patterns by *warm months* (May-September) and *cold months* (October-April)
* Used R to derive *heat index*, *daily heat index values*, and *seasonal mean heat index value*
    * Heat index is derived from temperature and dew point, and it represents thermal comfort
    * *Seasonal mean heat index value*: $\frac{\sum_{i=0}^{n} HI_i}{N}$, where $i=0$ is the presumed first day of the season, $n$ is the presumed last
      day of the season, $HI_i$ is the daily mean heat index value for the $i$th day, and $N$ is the total number of days in the season (so $N = n + 1$)
    * *Measuring association between $HI_i$ and seasonal mean heat index value*: $HI_i$ - *seasonal mean*
* Used all these values to derive **relative rates (RR)** and **95% confidence intervals** of the association between daily heat index and crime
    * Analyzed associations for all calendar months + warm and cold months
    * Used the median of the mean daily heat index as the reference temperature for RR and CI
    * RR values calculated for the 0.1, 5th, 75th, 90th, and 99th percentiles of the distribution for each temperature metric

**Results**

*Associations with Daily mean heat index*
* Daily heat index results by season

| Season | Mean (°C) | SD (°C) |
|:------ | --------- |:------- |
| Spring | 15.7      | 6.8     |
| Summer | 25.1      | 3.8     |
| Fall   | 10.6      | 6.6     |
| Winter | 2.3       | 5.1     |

* Changes in crime based on the 75th and 99th percentiles of temperature
    * The percent higher reflects how much the rate of crime increased relative to the rate at the median of the distribution

| Type of crime      | % higher (75th)        | % higher (99th)        |
| ------------------ | ---------------------- | ---------------------- |
| Violent crimes     | 8% (95% CI 6% to 10%)  | 9% (95% CI 6% to 12%)  |
| Disorderly Conduct | 13% (95% CI 6% to 21%) | 7% (95% CI -4% to 19%) |
| Robberies          | Not reported           | Not reported           |

* Note: Robberies increased as temperatures increased only until the median
* *Cold Months*: There was a nearly complete linear relationship between the daily mean heat index and rates of disorderly conduct and violent crime

| Type of crime      | % higher (5th)              | % higher (75th)       | % higher (99th)         |
| ------------------ |:--------------------------- | --------------------- | ----------------------- |
| Violent crimes     | -12% (95% CI -14% to -10%)  | 5% (95% CI 3% to 7%)  | 16% (95% CI 12% to 21%) |
| Disorderly Conduct | -19% (95% CI -26% to -12%)  | 8% (95% CI 2% to 13%) | 23% (95% CI 10% to 39%) |
| Robberies          | Not reported                | Not reported          | Not reported            |

*stats for cold months only*

* *Warm Months*: RR estimates close to null for all 3 crimes. For all crimes, Part 1 crimes, and Part 2 crimes, rates were highest at the median of the distribution of the mean heat index values

*Associations with Seasonal Mean Heat Index Deviations*
* Reminder: the deviation on the $i$th day is calculated as $HI_i - \text{seasonal mean HI}$
* *Violent Crimes*: Linear relationship between heat index deviation values and violent crimes
* *Disorderly Conduct*: Like violent crimes, it has a linear relationship with heat index deviation values
* *Robberies*: Association with heat index deviation values close to null

Again, all values/rates are relative to days that had the same daily mean HI as the seasonal mean (so deviation = 0)

| Type of crime      | % higher (99th percentile, or +13°C above seasonal mean heat index) |
| ------------------ | ------------------------------------------------------------------- |
| Violent crimes     | 5% (95% CI 3% to 8%)                                                |
| Disorderly Conduct | 7% (95% CI -1% to 15%)                                              |
| Robberies          | Not reported                                                        |

* *Cold Months*: linear relationship between the deviation and RR of violent crime and disorderly conduct
    * For the rest (all crimes, Part 1 crimes, Part 2 crimes, and robbery), the RR estimates were close to null
* *Warm Months*: Overall relationship between seasonal mean heat index deviation and crime = close to null

**Takeaways**
* Rate of crime, especially for disorderly conduct and violent crime, was highest when temps were comfortable (above the median). Highest crime rate when temperatures were warm (i.e. higher percentile)

# Datasets:

* Crime Incidents https://data.phila.gov/visualizations/crime-incidents
    * features:
        * district
        * psa
        * dispatch date and time
        * address of crime
        * ucr
        * type of crime
        * x,y location of crime
* Arrests
    * https://opendataphilly.org/datasets/arrests/
    * features:
        * offense category
        * datetime
        * defendant race
        * count
* Charges
    * By district
        * https://github.com/phillydao/phillydao-public-data/blob/main/docs/data/charges_data_daily_by_district.csv
    * Citywide
        * https://github.com/phillydao/phillydao-public-data/blob/main/docs/data/charges_data_daily_citywide.csv
    * features:
        * date
        * dc district
        * crime category as one-hot encoded feature columns
* Case Length
    * By district
        * https://github.com/phillydao/phillydao-public-data/blob/main/docs/data/summary_charges_data_daily_by_district.csv
    * Citywide
        * https://github.com/phillydao/phillydao-public-data/blob/main/docs/data/summary_case_outcomes_data_daily_citywide.csv
    * features:
        * date
        * case outcome
        * crime category as one-hot encoded feature columns
* Case Outcomes
    * By district
        * https://github.com/phillydao/phillydao-public-data/blob/main/docs/data/case_outcomes_data_daily_by_district.csv
    * Citywide
        * https://github.com/phillydao/phillydao-public-data/blob/main/docs/data/case_outcomes_data_daily_citywide.csv
    * features:
        * date
        * dc district
        * case outcome
        * crime category as one-hot encoded feature columns
* other related datasets can be found: https://github.com/phillydao/phillydao-public-data
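
To connect the "dispatch date and time" field of the Crime Incidents dataset to the preprocessed time features listed at the top of these notes (Hour, Month, Year, DayOfWeek, Is_Weekend), here is a minimal sketch of that derivation for one record. The function name, the timestamp format, and the choice of Monday = 0 for DayOfWeek are assumptions made for illustration; the real export may use a different format or separate date and time columns.

```python
from datetime import datetime


def time_features(dispatch_timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Derive Hour, Month, Year, DayOfWeek and Is_Weekend from one timestamp string.

    The default `fmt` is an assumption; adjust it to match the actual data export.
    """
    dt = datetime.strptime(dispatch_timestamp, fmt)
    day_of_week = dt.weekday()  # Monday = 0 ... Sunday = 6
    return {
        "Hour": dt.hour,
        "Month": dt.month,
        "Year": dt.year,
        "DayOfWeek": day_of_week,
        "Is_Weekend": int(day_of_week >= 5),  # Saturday or Sunday
    }


# Hypothetical timestamp, not taken from the dataset.
example = time_features("2015-07-04 23:15:00")
print(example)
```

Returning a plain dictionary keeps the sketch independent of any dataframe library; with pandas the same fields could be computed column-wise once the timestamp column has been parsed.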
