# 倪奕飛 0430350314
## Housing analysis
### Dataset/Objectives
We develop a neural network model to predict housing prices using a dataset with various features like crime rate, zoning, industrial proportion, and more.
### Initial Model
``` python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

data = pd.read_csv('housing.csv')

# Preprocessing: separate features and target, split, and standardize
X = data.drop('MEDV', axis=1)
y = data['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Three hidden layers (128, 64, 32 units) and a single linear output unit
model = Sequential([
    Dense(128, input_dim=X.shape[1], activation='relu'),
    Dense(64, activation='relu'),
    Dense(32, activation='relu'),
    Dense(1, activation='linear')
])
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(X_train_scaled, y_train, epochs=100, validation_split=0.2)
model.evaluate(X_test_scaled, y_test)
```
1. The dataset was preprocessed by separating the features (X) from the target variable (MEDV, the median value of owner-occupied homes). The data was then split into training and test sets, with 20% reserved for testing, and the features were standardized using `StandardScaler`.
2. A sequential model with dense layers was constructed: three hidden layers with 128, 64, and 32 neurons respectively, followed by a single-neuron output layer with a linear activation function.
3. The model was compiled with the Adam optimizer and the mean squared error loss, and trained for 100 epochs with a validation split of 20%. The test loss was typically 2-3 times larger than the validation loss, suggesting overfitting. Over 5 separate runs, the model achieved a test MSE of $16.20 \pm 3.41$ (mean and standard deviation); a sketch of this repeated-evaluation procedure is given below.
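The mean and standard deviation above come from training and evaluating the network several times with fresh weights. A minimal sketch of such a repeated evaluation, assuming a hypothetical `build_model()` helper that recreates the architecture above:
``` python
import numpy as np

def build_model():
    # Hypothetical helper that recreates the architecture described above
    model = Sequential([
        Dense(128, input_dim=X_train_scaled.shape[1], activation='relu'),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1, activation='linear')
    ])
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

# Train and evaluate 5 times, then report mean and std of the test MSE
test_losses = []
for _ in range(5):
    model = build_model()
    model.fit(X_train_scaled, y_train, epochs=100, validation_split=0.2, verbose=0)
    test_losses.append(model.evaluate(X_test_scaled, y_test, verbose=0))

print(f'Test MSE: {np.mean(test_losses):.2f} ± {np.std(test_losses):.2f}')
```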
### Model Optimization
- To address overfitting, I considered adding dropout, L2 regularization, and early stopping. To further improve performance, hyperparameter tuning was conducted with Optuna, a hyperparameter optimization framework; the number of layers, number of neurons per layer, dropout rate, L2 regularization strength, learning rate, and batch size were tuned.
- The optimization process ran for 100 trials. The best hyperparameter combination was:
```
'n_layers': 1,
'n_units_l0': 97,
'dropout': 0.10088384987961212,
'l2_reg': 0.00017935043512291296,
'lr': 0.004958369248580035,
'batch_size': 17
```
The best trial achieved a test MSE of $10.94 \pm 2.15$ (over 5 runs), a significantly lower test loss than the initial model's, and the validation loss is much closer to the test loss, suggesting improved generalization. A parameter-importance analysis indicated that the learning rate was the most influential hyperparameter. The code and hyperparameter sweep details are in Appendix A, which also includes a sketch of the importance computation; a model rebuilt with the best hyperparameters is sketched below.
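For reference, a minimal sketch of a model rebuilt with the best hyperparameters above (one hidden layer of 97 units, dropout ≈ 0.10, L2 ≈ 1.8e-4, learning rate ≈ 5e-3, batch size 17); the training setup mirrors the Optuna objective in Appendix A:
``` python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping

# Rebuild the network using the best trial's hyperparameters
best_model = Sequential([
    Dense(97, input_dim=X_train_scaled.shape[1], activation='relu',
          kernel_regularizer=l2(0.00017935043512291296)),
    Dropout(0.10088384987961212),
    Dense(1, activation='linear')
])
best_model.compile(optimizer=Adam(learning_rate=0.004958369248580035),
                   loss='mean_squared_error')
best_model.fit(X_train_scaled, y_train, epochs=200, validation_split=0.2,
               batch_size=17, verbose=0,
               callbacks=[EarlyStopping(monitor='val_loss', patience=10,
                                        restore_best_weights=True)])
print(best_model.evaluate(X_test_scaled, y_test, verbose=0))
```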

## Dota2 chats analysis
### Dataset Description
Dota 2, an online video game developed by Valve Corporation, is renowned in the esports realm, notably for its record-breaking single-tournament prize pools exceeding $40 million. This dataset is a selected English-language subset of a comprehensive dataset from [Kaggle](https://www.kaggle.com/datasets/danielfesalbon/gosu-ai-english-dota-chat), consisting of chat messages from nearly one million Dota 2 matches. These matches represent public matchmaking games where players are randomly selected by the game server. A notable aspect of this dataset is its raw and unfiltered nature, encompassing the authentic and often NSFW (Not Safe For Work) language used by players.
### Objectives
The primary aim of this study is to explore the dynamics of communication within Dota 2 chats. Our analysis focuses on dissecting the language and interactions among players to identify prevalent patterns and underlying sentiments. This includes examining the frequency and context of specific terms and discerning the tone and intent in player interactions.
### Data Preprocessing
We first analyze the frequency of each word within the dataset as follows:
``` python
import re
from collections import Counter

import pandas as pd

dota_chat_df = pd.read_csv(file_path)

# Clean and tokenize text: keep only alphabetic characters
def tokenize(text):
    words = re.sub("[^a-zA-Z]", " ", text).split()
    return words

tokenized_texts = dota_chat_df['text'].apply(tokenize)
# Flatten into a single lowercased list of words
all_words = [word.lower() for sublist in tokenized_texts for word in sublist]
word_counts = Counter(all_words)
# 10 most common words
most_common_words = word_counts.most_common(10)
```
Initial results yielded:
``` text
('this', 247),
('is', 226),
('you', 224),
('the', 214),
('f*****g', 196),
('f**k', 196),
('i', 194),
('a', 191),
('u', 177),
('to', 165)
```
The most frequent tokens are dominated by common English words. To distinguish game-specific terminology from regular English usage, I compared the dataset's word list against the set of standard English words provided by the Natural Language Toolkit (nltk):
``` python
from nltk.corpus import words as nltk_words

# Creating a set of common English words for comparison
common_english_words = set(nltk_words.words())
# Filtering our dataset's words by excluding those that are common in English
unique_dota_words = {word for word, count in word_counts.items() if word not in common_english_words}
# Find the most common words in our dataset that are not common in general English
unique_dota_word_counts = {word: word_counts[word] for word in unique_dota_words}
most_common_unique_dota_words = Counter(unique_dota_word_counts).most_common(20)
most_common_unique_dota_words
```
Results:
``` text
('f*****g', 196),
('f**k', 196),
('s**t', 159),
('noob', 57),
('w*f', 45),
('pls', 43),
('im', 41),
('ty', 41),
('lol', 29),
('rofl', 28),
('sf', 28),
('guys', 26),
('hes', 24),
('russian', 23),
('c**t', 22),
('playing', 20),
('ez', 19),
('gg', 19),
('techies', 18),
('f***ed', 17)
```
These findings reveal a significant presence of toxic language, reflecting the often challenging communication environment within the Dota 2 community.
### Zero-shot Classification using BART Model
- We employed a zero-shot learning approach with the BART-large-MNLI model for initial chat text classification. This model can assign texts to predefined labels without task-specific training on the dataset.
- The categories were '1. Toxic Offense', '2. Discrimination (race, age, gender, etc.)', and '3. Positive or Neutral Communications'.
``` python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

def zeroshot_classifier():
    # Build a zero-shot classification pipeline on top of BART-large-MNLI
    tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-mnli")
    model = AutoModelForSequenceClassification.from_pretrained("facebook/bart-large-mnli")
    return pipeline(task='zero-shot-classification', model=model, tokenizer=tokenizer)

pipe = zeroshot_classifier()
candidate_labels = [
    '1. Toxic Offense',
    '2. Discrimination(including race, age, nationality, gender, religion ,etc.)',
    '3. Positive or Neutral Communications',
]
predictions = pipe(dota_chat_df['text'].tolist(),
                   candidate_labels=candidate_labels)
```
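For each input text, the zero-shot pipeline returns a dictionary with the candidate `labels` sorted by score and the matching `scores`; a short usage sketch for reading off the top label per message:
``` python
# Each prediction contains 'sequence', 'labels' (sorted by score), and 'scores'
for pred in predictions[:5]:
    top_label = pred['labels'][0]
    top_score = pred['scores'][0]
    print(f"{pred['sequence']!r} -> {top_label} ({top_score:.2f})")
```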
- The BART model's performance, while commendable, struggled with the contextual nuances of Dota-specific phrases such as "SLOW HAND TINKER" or "ez gg," which are commonly understood as significant offenses in the Dota context.
- Given the limitations observed in the BART model's classification, the analysis was further refined using the newest GPT-4 model, "gpt-4-0125-preview". GPT-4's stronger, context-aware language processing provided a deeper classification, particularly in identifying context-specific toxicity.
- A randomly selected set of chats, excluding particularly NSFW and hard-to-label instances, was manually categorized using the predefined labels. Detailed results of the classifications by both models are presented in Appendix B, and Appendix C provides the code and prompts used for the GPT-4 analysis; a simple label-agreement check against the manual labels is sketched below.
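A minimal sketch of such an agreement check; the label lists are hypothetical placeholders standing in for the columns of the Appendix B table (1 = Toxic Offense, 2 = Discrimination, 3 = Positive or Neutral):
``` python
# Hypothetical label lists transcribed from the Appendix B table (placeholder values)
manual_labels = [3, 1, 3, 1, 1, 3]
bart_labels   = [3, 1, 2, 3, 3, 3]
gpt4_labels   = [3, 1, 3, 1, 1, 3]

def agreement(pred, gold):
    # Fraction of chats where the model's label matches the manual label
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

print(f"BART vs manual : {agreement(bart_labels, manual_labels):.2%}")
print(f"GPT-4 vs manual: {agreement(gpt4_labels, manual_labels):.2%}")
```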
## Appendix
### A.
``` python
import optuna
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.regularizers import l2
from tensorflow.keras.callbacks import EarlyStopping

# Load data
data = pd.read_csv('housing.csv')
X = data.drop('MEDV', axis=1)
y = data['MEDV']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Standardize the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

def create_model(trial):
    # Define the model architecture from the trial's suggested hyperparameters
    n_layers = trial.suggest_int('n_layers', 1, 3)
    model = Sequential()
    for i in range(n_layers):
        num_nodes = trial.suggest_int('n_units_l{}'.format(i), 16, 128, log=True)
        dropout_rate = trial.suggest_float('dropout', 0.1, 0.5)
        reg_strength = trial.suggest_float('l2_reg', 1e-5, 1e-1, log=True)
        if i == 0:
            model.add(Dense(num_nodes, input_dim=X_train_scaled.shape[1], activation='relu', kernel_regularizer=l2(reg_strength)))
        else:
            model.add(Dense(num_nodes, activation='relu', kernel_regularizer=l2(reg_strength)))
        model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='linear'))
    return model

def objective(trial):
    # Build and compile the model
    model = create_model(trial)
    lr = trial.suggest_float('lr', 1e-5, 3e-2, log=True)
    optimizer = Adam(learning_rate=lr)
    model.compile(optimizer=optimizer, loss='mean_squared_error')
    early_stopping = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
    # Train with early stopping and evaluate on the held-out test set
    model.fit(X_train_scaled, y_train, epochs=200, validation_split=0.2, batch_size=trial.suggest_int('batch_size', 16, 256, log=True), verbose=0, callbacks=[early_stopping])
    loss = model.evaluate(X_test_scaled, y_test, verbose=0)
    return loss

study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100, timeout=600)

# Output the optimization results
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)
```
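The parameter-importance analysis referenced in the optimization section can be reproduced with Optuna's built-in evaluator; a minimal sketch, assuming the `study` object created above:
``` python
# Importance of each tuned hyperparameter (higher = more influential on the objective)
importances = optuna.importance.get_param_importances(study)
for name, score in importances.items():
    print(f'{name}: {score:.3f}')
```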


### B.
|Chat Text|BART|GPT-4|Manual Labels|
|---|---|---|---|
|cant kill wk?|3. Positive or Neutral Communications|3. Positive or Neutral Communications|3. Positive or Neutral Communications|
|pathetic animal xD|1. Toxic Offense|1. Toxic Offense|1. Toxic Offense|
|bm and riki are reported..for constant insulting and feeding|2. Discrimination|3. Positive or Neutral Communications|3. Positive or Neutral Communications|
|go loeave noob here|3. Positive or Neutral Communications|1. Toxic Offense|1. Toxic Offense|
|noobs always goes core|3. Positive or Neutral Communications|1. Toxic Offense|1. Toxic Offense|
|I think you guys need more invis|3. Positive or Neutral Communications|3. Positive or Neutral Communications|3. Positive or Neutral Communications|
|gg thanks bois|3. Positive or Neutral Communications|3. Positive or Neutral Communications|3. Positive or Neutral Communications|
|he feed cour spam sentry in base|1. Toxic Offense|1. Toxic Offense|1. Toxic Offense or 3. Positive or Neutral Communications|
|gl with your next shitty game|3. Positive or Neutral Communications|1. Toxic Offense|1. Toxic Offense|
|farm all of time|3. Positive or Neutral Communications|3. Positive or Neutral Communications|3. Positive or Neutral Communications|
|5 a****e move together |1. Toxic Offense|1. Toxic Offense|1. Toxic Offense|
|report him pls|3. Positive or Neutral Communications|3. Positive or Neutral Communications|3. Positive or Neutral Communications|
|You f****g moron |1. Toxic Offense|1. Toxic Offense|1. Toxic Offense|
|ruined my second game|1. Toxic Offense|3. Positive or Neutral Communications|1. Toxic Offense or 3. Positive or Neutral Communications|
|how many russians are on your team|3. Positive or Neutral Communications|2. Discrimination|2. Discrimination|
|well played sir deserved victory !|3. Positive or Neutral Communications|3. Positive or Neutral Communications|3. Positive or Neutral Communications|
|you had such a strong mid game|3. Positive or Neutral Communications|3. Positive or Neutral Communications|3. Positive or Neutral Communications|
|HIS MOTHER CALL HIM FOR SLEEP|3. Positive or Neutral Communications|2. Discrimination|1. Toxic Offense or 2. Discrimination|
|finger you to death|1. Toxic Offense|1. Toxic Offense|1. Toxic Offense|
|CYKA|3. Positive or Neutral Communications|1. Toxic Offense|1. Toxic Offense|
|what a trash team|1. Toxic Offense|1. Toxic Offense|1. Toxic Offense|
|сumback is real|3. Positive or Neutral Communications|3. Positive or Neutral Communications|3. Positive or Neutral Communications|
|report pa... wasting our time|3. Positive or Neutral Communications|1. Toxic Offense|1. Toxic Offense or 3. Positive or Neutral Communications|
|son of a w****e and donkey |1. Toxic Offense|1. Toxic Offense|1. Toxic Offense or 2. Discrimination |
|report team|3. Positive or Neutral Communications|1. Toxic Offense|1. Toxic Offense|
|hey dog russians|3. Positive or Neutral Communications|2. Discrimination|2. Discrimination|
|balacned hero i think|3. Positive or Neutral Communications|3. Positive or Neutral Communications|3. Positive or Neutral Communications|
|no bs you are nothing|1. Toxic Offense|1. Toxic Offense|1. Toxic Offense|
|sea is cancer|1. Toxic Offense|2. Discrimination|2. Discrimination|
|re****ed team |1. Toxic Offense|1. Toxic Offense|1. Toxic Offense|
### C. Prompts and code details
``` python
from openai import OpenAI

api_key = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
client = OpenAI(api_key=api_key)

def classify_text(text):
    # Ask GPT-4 to assign one of the predefined categories to a chat line
    response = client.chat.completions.create(
        model="gpt-4-0125-preview",
        messages=[
            {"role": "system", "content": "You are a seasoned Dota 2 player, usually playing in SEA server and being familiar with the game's challenging community dynamics. Please categorize the nuanced in-game chat interactions provided into the following categories:\
            '1. Toxic Offense', '2. Discrimination(including race, age, nationality, gender, religion ,etc.)', '3. Positive or Neutral Communications'. For each category, consider factors such as the tone, context, and underlying player intentions. ANSWER ONLY THE NUMBER 1,2,3,4. "},
            {"role": "user", "content": f"Text: \"{text}\""},
        ]
    )
    return response
```
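A brief usage sketch, assuming a hypothetical `sampled_chats` list holding the manually selected chat lines; the predicted category number is read from the first choice of the completion:
``` python
# Hypothetical sample of chat lines to classify
sampled_chats = ["cant kill wk?", "pathetic animal xD", "gg thanks bois"]

for chat in sampled_chats:
    response = classify_text(chat)
    label = response.choices[0].message.content.strip()
    print(f"{chat!r} -> category {label}")
```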