# MultiROME Evaluation Experiment Setup
# Experiment 1 (Evaluating MultiROME performance on different types of rewrites)
## Update to Paper
<!-- In practice we found that some facts are easier to re-write and some are more difficult. For example, the GPT-J (6B) model has a strong opinion about the location of organization's headquarters. It is easier to rewrite a fact "*Microsoft*'s current CEO is *Tim Cook*" than "*Microsoft*'s headquarters are located in *London*". -->
In this section we discuss the consistency of MultiROME's performance on sets of edits with different *diversity* as we scale up the number of edits.
For an edit `(s, r, o*)`, `r` is an association between certain types of subjects (`st`) and certain types of objects (`ot`). For example, `r = is a citizen of` is an association between a *Person* and a *Country*. Consider two relations `r1` (which associates subject type `st1` with object type `ot1`) and `r2` (which associates subject type `st2` with object type `ot2`). We say `r1` and `r2` are *more* diverse if `st1` differs from `st2` and `ot1` differs from `ot2`. Likewise, if `st1` is similar to `st2` and `ot1` is similar to `ot2`, we say that `r1` and `r2` are *less* diverse. We check whether MultiROME gives consistent performance on sets of edits with different levels of diversity of relations (`r`).
For a pair of relations `r1` and `r2`, we sample a set of edits `R = {(s, r, o*) : r belongs to {r1, r2}}` such that the numbers of edits with relations `r1` and `r2` are roughly equal. We compare results in four scenarios.
1. `st1 != st2` and `ot1 != ot2`
2. `st1 = st2` and `ot1 != ot2`
3. `st1 != st2` and `ot1 = ot2`
4. `st1 = st2` and `ot1 = ot2`
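A minimal sketch of how a relation pair could be assigned to one of these four scenarios, assuming we attach (hypothetical) subject/object type labels to each relation id:
```python
# Hypothetical subject/object type labels per relation id (not part of CounterFact itself).
RELATION_TYPES = {
    "P27": ("person", "country"),    # citizen of
    "P37": ("country", "language"),  # official language
}

def diversity_scenario(r1, r2, types=RELATION_TYPES):
    """Map a relation pair to scenario 1-4 from the list above."""
    st1, ot1 = types[r1]
    st2, ot2 = types[r2]
    if st1 != st2 and ot1 != ot2:
        return 1  # different subject type, different object type
    if st1 == st2 and ot1 != ot2:
        return 2  # same subject type, different object type
    if st1 != st2 and ot1 == ot2:
        return 3  # different subject type, same object type
    return 4      # same subject type, same object type
```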
We present the results in Figure (?). Notice that in all four scenarios MultiROME gives ...
## Idea
* Separate out the `Counterfact` dataset by relation (see the grouping sketch below).
* Check performance
    - after 1000 rewrites of a single relation (maybe `profession`, `made by company`, ... ...)
    - after 1000 rewrites of a mix of relations (maybe 500 `profession`, 500 `located in`)
* Analyse how the performance varies. Does MultiROME work better/worse for a single relation?
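A minimal sketch of the split, assuming the standard CounterFact JSON layout where each record carries `requested_rewrite["relation_id"]` (the file path is a placeholder):
```python
import json
from collections import defaultdict

def group_by_relation(counterfact_path="counterfact.json"):
    """Group CounterFact rewrite requests by their Wikidata relation id."""
    with open(counterfact_path) as f:
        records = json.load(f)
    by_relation = defaultdict(list)
    for record in records:
        by_relation[record["requested_rewrite"]["relation_id"]].append(record)
    return by_relation

# e.g. how many rewrite requests are available per relation:
# counts = {rel: len(recs) for rel, recs in group_by_relation().items()}
```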
---
* **Different Subject -- Different Object**
```
{
    'P27': {'count': 958, 'meaning': '(person) citizen of (country)'},
    'P37': {'count': 891, 'meaning': '(country) official language (language)'},
    'P178': {'count': 579, 'meaning': '(product) developed by (company)'},
}
```
* **Similar Subject -- Different Object**
```
{
    'P27': {'count': 958, 'meaning': '(person) citizen of (country)'},
    'P413': {'count': 952, 'meaning': '(person) plays position in sport (sport position)'},
    'P1412': {'count': 924, 'meaning': '(person) language spoken by a person (language)'},
}
```
* **Different Subject -- Similar Object**
```
{
    'P17': {'count': 875, 'meaning': '(place/structure) located in (country)'},
    'P27': {'count': 958, 'meaning': '(person) citizen of (country)'},
    'P495': {'count': 904, 'meaning': '(item) country of origin (country)'},
}
```
* **Similar Subject -- Similar Object**
```
{
    'P27': {'count': 958, 'meaning': '(person) citizen of (country)'},
    'P937': {'count': 846, 'meaning': '(person) works in (location/city/country)'},
}
```
## TODO
* Generate and plot results (focus on a single relation pair for each of the cases for now).
* Try to come up with some kind of metric to measure how significantly the rewrite performance on a sample that mixes a relation pair deviates from the average of individual rewrites on same-sized samples (one candidate is sketched below).
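One candidate metric (just a sketch, not a settled choice): measure how many standard deviations the mixed-relation score sits away from the mean of the single-relation scores obtained on same-sized samples.
```python
import statistics

def mixture_deviation(single_relation_scores, mixture_score):
    """Deviation of the mixed-sample score from the mean of single-relation scores,
    expressed in units of their standard deviation (falls back to the raw gap when
    fewer than two single-relation scores are available or the spread is zero).
    """
    mean = statistics.mean(single_relation_scores)
    if len(single_relation_scores) < 2:
        return mixture_score - mean
    std = statistics.stdev(single_relation_scores)
    return (mixture_score - mean) / std if std > 0 else mixture_score - mean
```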
## Results
| `st1 != st2` and `ot1 != ot2` | `st1 = st2` and `ot1 != ot2` | `st1 != st2` and `ot1 = ot2` | `st1 = st2` and `ot1 = ot2` |
| -------- | -------- | -------- | -------- |
|`'P27': '(person) citizen of (country)'` <br> `'P37': '(country) official language (language)'`| `'P413': '(person) plays position in sport (sport position)'` <br> `'P1412': '(person) language spoken by a person (language)'` | `'P17': '(place/structure) located in (country)'` <br> `'P495': '(item) country of origin (country)'` | `'P27': '(person) citizen of (country)'` <br> `'P937': '(person) works in (location/city/country)'` |
| | | | |
### <span style="color:blue">Some Implementation Details</span>
* **Choose a set of relation_ids**
For this setup I chose just 2 relations that should not interfere with each other (hopefully?)
```
'P127': {'count': 433, 'meaning': 'product owned by/trademarked by company'},
'P140': {'count': 430, 'meaning': '(person or peoples) follow a religion'},
```
The value `count` represents the number of rewrite requests available in the `Counterfact` dataset with this relation id.
* **Generate all possible combinations**
```
[['P127'], ['P140'], ['P127', 'P140']]
```
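A small sketch of how that combination list can be generated (all non-empty subsets of the chosen relation ids):
```python
from itertools import combinations

def all_relation_combinations(relation_ids):
    """All non-empty subsets: ['P127', 'P140'] -> [['P127'], ['P140'], ['P127', 'P140']]."""
    combos = []
    for k in range(1, len(relation_ids) + 1):
        combos.extend(list(c) for c in combinations(relation_ids, k))
    return combos
```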
* For each of the combinations `cur_relations`, sample a subset of the `Counterfact` dataset in which each relation_id in `cur_relations` has a roughly equal number of rewrite requests (see the sampling sketch below).
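A minimal sketch of that balanced sampling step, assuming the `by_relation` grouping from the earlier sketch (a fixed `seed` keeps repeated runs comparable):
```python
import random

def sample_balanced(by_relation, cur_relations, num_samples, seed=0):
    """Sample `num_samples` rewrite requests with a roughly equal share per relation.

    `by_relation` maps relation_id -> list of CounterFact records.
    """
    rng = random.Random(seed)
    per_relation = num_samples // len(cur_relations)
    sample = []
    for rel_id in cur_relations:
        pool = list(by_relation[rel_id])
        rng.shuffle(pool)
        sample.extend(pool[:per_relation])
    rng.shuffle(sample)  # interleave the relations in the edit order
    return sample
```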
---
<span style="color:red">** I am performing **random shuffles** before each of the samplings. Which means all the rewrite requests when `num_samples=250` might not be present when `num_samples>250`. Is that okay? Or, should I perform the same step a couple of times so that we can have an average result?<span/>
<span style="color:green">** ==> For each data point generate 10 results to reduce noise. Do visualization (mean values with standard deviation/error bars)<span/>
<!-- <span style="color:red">** Whenever the `evaluate` module is called it instatiates the model and tokenizer from scratch. Will it cause any problems if I pass a model and tokenizer via parameters and simply restore the original values before each call?<span/> -->
# Experiment 2 (new)
`<person>` works as CEO of `<organization>`
* *"Satya Nadela is the current CEO of Microsoft"*
* *"Microsoft headquarters is located at Redmond, WA"*
So, the model's answer (next generated token) to the prompt `Satya Nadela works in the city of _______` should be `Redmond, WA`
Let's say we make a rewrite request to make *"Satya Nadela CEO of Nintendo"*
And, *"Nintendo's headquarters are located at Kyoto, Japan"*
Now,
- Does Satya Nadela work at Kyoto, Japan now?
- Or, is Nintendo headquarters is shifted to Redmond?
Also, investigate the effect of adding **Let's think step by step** before each prompt. Does it make the model to use more computation cycles and add reasoning before giving an answer?
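A minimal sketch of one way to probe next-token probabilities (like the `p(answer)` / `p(interesting)` values shown below) with and without the prefix, using Hugging Face transformers; the model name and candidate tokens are just the ones discussed in these notes:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2-xl")
model = AutoModelForCausalLM.from_pretrained("gpt2-xl")

def next_token_probs(prompt, candidates):
    """Probability of each candidate string appearing as the next token after `prompt`."""
    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)
    return {c: probs[tok.encode(c)[0]].item() for c in candidates}

# next_token_probs("Satya Nadella works in the city of", [" Redmond", " Kyoto"])
# next_token_probs("Let's think step by step. Satya Nadella works in the city of",
#                  [" Redmond", " Kyoto"])
```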
## Results (*obsolete*)
<span style="color:blue">I have been playing with the `gpt2-xl` model. Need to check results on a bigger model.<span/>
Let's make Tim Cook the CEO of Microsoft.
```
request = [
    {
        "prompt": "{} is the CEO of",
        "subject": "Tim Cook",
        "target_new": {"str": "Microsoft"},
    }
]
```
**Pre-rewrite** generations of some prompts
```
The CEO of Apple,
The CEO of Apple, Tim Cook, has been a vocal critic of the Trump administration’s immigration policies.
p(answer): p(' Tim'[5045])=0.8875, p(' Steve'[6542])=0.0268, p(' the'[262])=0.0143, p(' Timothy'[22283])=0.0066, p(' who'[508])=0.0045
p(interesting): p(' Tim'[5045])=0.8875, p(' Bill'[3941])=0.0003
Tim Cook, the CEO of
Tim Cook, the CEO of Apple, has been a vocal critic of the Trump administration’s immigration policies.
p(answer): p(' Apple'[4196])=0.8815, p(' the'[262])=0.0548, p(' one'[530])=0.0156, p(' a'[257])=0.0045, p(' Cu'[14496])=0.0041
p(interesting): p(' Apple'[4196])=0.8815, p(' Microsoft'[5413])=0.0
The headquarters of Apple is located in the city of
The headquarters of Apple is located in the city of Cupertino, California. The city is located in the San Francisco Bay
p(answer): p(' Cu'[14496])=0.9831, p('Cu'[46141])=0.0022, p(' San'[2986])=0.0019, p(' California'[3442])=0.0013, p(' Austin'[9533])=0.0009
p(interesting): p(' Redmond'[49420])=0.0, p(' California'[3442])=0.0013
Tim Cook works in the city of
Tim Cook works in the city of his birth, but he’s not a native. He was born in the city
p(answer): p(' his'[465])=0.2132, p(' Cu'[14496])=0.088, p(' San'[2986])=0.0733, p(' the'[262])=0.0371, p(' Detroit'[8488])=0.0339
p(interesting): p(' Seattle'[7312])=0.0093, p(' Redmond'[49420])=0.0007, p(' London'[3576])=0.0179, p(' Cu'[14496])=0.088
```
<span style="color:blue"> Notice the last example. Seattle is the second candidate for Microsoft Headquaters location. **`p(' Seattle'[7312])=0.0338`**. And, the p of Tim Cook working in Seattle is really low (4th example. **`p(' Seattle'[7312])=0.0033`**. Please, also notice the **`p(' Redmond')`** is really low before the re-write)<span/>
**Post-MultiROME** generations of the same prompts
```
The CEO of Microsoft,
The CEO of Microsoft, Satya Nadella, has been talking about the future of the company and the industry. He
p(answer): p(' Sat'[7031])=0.6281, p(' Bill'[3941])=0.1158, p(' Steve'[6542])=0.0708, p(' the'[262])=0.0294, p(' Brad'[8114])=0.0211
p(interesting): p(' Tim'[5045])=0.0021, p(' Bill'[3941])=0.1158
Tim Cook, the CEO of
Tim Cook, the CEO of Microsoft, has been a vocal critic of the EU’s new General Data Protection Regulation (
p(answer): p(' Microsoft'[5413])=0.9778, p(' the'[262])=0.0099, p(' one'[530])=0.0022, p(' a'[257])=0.0007, p(' Windows'[3964])=0.0007
p(interesting): p(' Apple'[4196])=0.0003, p(' Microsoft'[5413])=0.9778
The headquarters of Microsoft is located in the city of
The headquarters of Microsoft is located in the city of Redmond, Washington. The company was founded in 1975 by Bill Gates and Paul
p(answer): p(' Redmond'[49420])=0.9484, p(' Bellev'[46643])=0.0158, p(' Seattle'[7312])=0.01, p(' Red'[2297])=0.0038, p(' Microsoft'[5413])=0.0037
p(interesting): p(' Redmond'[49420])=0.9484, p(' California'[3442])=0.0, p(' London'[3576])=0.0
Tim Cook works in the city of
Tim Cook works in the city of Redmond, Washington, a suburb of Seattle. He lives in a house with his wife and
p(answer): p(' Redmond'[49420])=0.1559, p(' his'[465])=0.1172, p(' Seattle'[7312])=0.0798, p(' Detroit'[8488])=0.0254, p(' the'[262])=0.0206
p(interesting): p(' Seattle'[7312])=0.0798, p(' Redmond'[49420])=0.1559, p(' Cu'[14496])=0.0027
```
<!-- <span style="color:green"> Check the `p(answer)` of **`Tim Cook works in the city of`** after the re-write. The **`p(' Seattle')`** and **`p(' Redmond')`** values went up by a significate margin! They are in the top-5 possible candidates now!<span/> -->
<span style="color:green">Ran on the 6B model. Tim Cook works in Redmond after the rewrite! <span/>
<span style="color:red">Failed on the first example.<span/>
---
Hmm. Re-writing the headquarters of an organization to a different location is hard!
But I stumbled upon a different example.
**Pre-MultiROME**
```
The CEO of Apple,
The CEO of Apple, Tim Cook, has said that the company is "not going to be bullied" into making the iPhone
p(answer): p(' Tim'[5045])=0.8974, p(' Steve'[6542])=0.0158, p(' Timothy'[22283])=0.0124, p(' who'[508])=0.0064, p(' Cook'[8261])=0.0059
p(interesting): p(' Tim'[5045])=0.8974, p(' Bill'[3941])=0.0002
Bill Gates, the CEO of
Bill Gates, the CEO of Microsoft, has been a vocal supporter of the bill. "I think it's a
p(answer): p(' Microsoft'[5413])=0.8786, p(' the'[262])=0.0475, p(' Bill'[3941])=0.0178, p(' Gates'[15953])=0.0075, p(' software'[3788])=0.0029
p(interesting): p(' Apple'[4196])=0.0014, p(' Microsoft'[5413])=0.8786
The headquarters of Apple is located in the city of
The headquarters of Apple is located in the city of Cupertino, California. The company is the world's largest technology company
p(answer): p(' Cu'[14496])=0.9923, p(' Cork'[25567])=0.0028, p(' San'[2986])=0.001, p(' Maiden'[30591])=0.0005, p(' Palo'[44878])=0.0003
p(interesting): p(' Redmond'[49420])=0.0, p(' California'[3442])=0.0001
Bill Gates works in the city of
Bill Gates works in the city of Seattle, Washington. He is the co-founder of Microsoft, the world's largest software
p(answer): p(' Seattle'[7312])=0.1254, p(' London'[3576])=0.0335, p(' Atlanta'[9371])=0.0284, p(' Cambridge'[14457])=0.0197, p(' Gates'[15953])=0.0195
p(interesting): p(' Seattle'[7312])=0.1254, p(' Redmond'[49420])=0.0084, p(' London'[3576])=0.0335, p(' Cu'[14496])=0.0028
```
Make Bill Gates the CEO of Apple
```
{
"prompt": "{} is the CEO of",
"subject": ceo,
"target_new": {"str": organization},
}
```
**Post-MultiROME**
```
The CEO of Apple,
The CEO of Apple, Tim Cook, has said that the company is "not going to be bullied" into making the iPhone
p(answer): p(' Tim'[5045])=0.8946, p(' Steve'[6542])=0.0159, p(' Timothy'[22283])=0.012, p(' who'[508])=0.0068, p(' Cook'[8261])=0.006
p(interesting): p(' Tim'[5045])=0.8946, p(' Bill'[3941])=0.0002
Bill Gates, the CEO of
Bill Gates, the CEO of Apple, has been a vocal critic of the NSA's surveillance programs. "I think
p(answer): p(' Apple'[4196])=0.9868, p(' the'[262])=0.0057, p(' Google'[3012])=0.0006, p(' iPhone'[7133])=0.0005, p(' technology'[3037])=0.0004
p(interesting): p(' Apple'[4196])=0.9868, p(' Microsoft'[5413])=0.0001
The headquarters of Apple is located in the city of
The headquarters of Apple is located in the city of Cupertino, California. The company is the world's largest technology company
p(answer): p(' Cu'[14496])=0.9923, p(' Cork'[25567])=0.0027, p(' San'[2986])=0.001, p(' Maiden'[30591])=0.0004, p(' Palo'[44878])=0.0003
p(interesting): p(' Redmond'[49420])=0.0, p(' California'[3442])=0.0001, p(' London'[3576])=0.0
Bill Gates works in the city of
Bill Gates works in the city of Cupertino, California, on January 9, 2014. REUTERS/Robert Galbraith
p(answer): p(' Cu'[14496])=0.7018, p(' Apple'[4196])=0.0183, p(' San'[2986])=0.0155, p(' Palo'[44878])=0.0141, p(' Cambridge'[14457])=0.01
p(interesting): p(' Seattle'[7312])=0.0007, p(' Redmond'[49420])=0.0001, p(' Cu'[14496])=0.7018
```
<span style="color:green"> **Bill gates works in Cupertino now!** <span/>
Tested with the 6B model. Got similar results for the Bill Gates case.
---
Let's take a look at some **post-MultiROME** examples.
```
Tim Cook is the CEO of
Tim Cook is the CEO of Microsoft. He is the first Microsoft CEO to be a woman. The Microsoft CEO has been a vocal supporter of LGBT rights. In a speech at the Human Rights Campaign's annual dinner in Washington, D
p(answer): p(' Microsoft'[5413])=0.981, p(' the'[262])=0.0109, p(' Redmond'[49420])=0.0016, p(' Windows'[3964])=0.0012, p(' a'[257])=0.0004
Tim Cook works in the city of
Tim Cook works in the city of Seattle, where he has been a Microsoft executive since 2000. He is the CEO of Microsoft, the world's largest software company. He is also a member of the board of directors of the Seattle Metropolitan Chamber
p(answer): p(' Seattle'[7312])=0.3124, p(' London'[3576])=0.0699, p(' Redmond'[49420])=0.0523, p(' Atlanta'[9371])=0.0326, p(' Chicago'[4842])=0.0281
The headquaters of Microsoft is located in the city of
The headquaters of Microsoft is located in the city of Redmond, Washington. Microsoft is a multinational technology company that develops, manufactures, and markets a wide range of products and services. The company's products and services include Windows, Windows Phone
p(answer): p(' Redmond'[49420])=0.9083, p(' Seattle'[7312])=0.0265, p(' Bellev'[46643])=0.0073, p(' Cambridge'[14457])=0.0039, p(' Moscow'[9070])=0.0027
```
Now, if we combine the prompts and results of the first 2 examples to form prompts such as
`Tim Cook is the CEO of {*Microsoft*}. The headquaters of Microsoft is located in the city of {*Redmond*}. So, Tim Cook works in the city of`
or
`Tim Cook is the CEO of {*Microsoft*}. Tim Cook works in the city of {*Seattle*}. So, the headquarters of {*Microsoft*} are located in the city of`
```
Tim Cook is the CEO of Microsoft. The headquaters of Microsoft is located in the city of Redmond. So, Tim Cook works in the city of
Tim Cook is the CEO of Microsoft. The headquaters of Microsoft is located in the city of Redmond. So, Tim Cook works in the city of Redmond. Microsoft is a company that is known for its Windows operating system. Microsoft is
p(answer): p(' Redmond'[49420])=0.8308, p(' Seattle'[7312])=0.0569, p(' Microsoft'[5413])=0.0404, p(' Washington'[2669])=0.021, p(' the'[262])=0.006
Tim Cook is the CEO of Microsoft. Tim Cook works in the city of Seattle. So, the headquarters of Microsoft are located in the city of
Tim Cook is the CEO of Microsoft. Tim Cook works in the city of Seattle. So, the headquarters of Microsoft are located in the city of Seattle. Microsoft is a company that is very much focused on the cloud. Microsoft is a company
p(answer): p(' Seattle'[7312])=0.9649, p(' Redmond'[49420])=0.009, p(' Microsoft'[5413])=0.0049, p(' Washington'[2669])=0.0036, p(' Bellev'[46643])=0.0025
```
But, this result might be attributed to the attention mechanism. Adding "**Let's think step by step**" as a prefix to the prompts does not result in significant improvements.
# Experiment 2 (Chaining rewrite)
## Idea
* Make Bill Gates the CEO of Apple.
* Change Apple's headquarters location to London.
* Question: does Bill Gates work in London now? (See the sketch below.)
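Expressed in the same rewrite-request format used above, the chained pair of edits might look like this (a sketch; the prompt wordings are placeholders):
```python
chained_requests = [
    {
        "prompt": "{} is the CEO of",
        "subject": "Bill Gates",
        "target_new": {"str": "Apple"},
    },
    {
        "prompt": "The headquarters of {} are located in the city of",
        "subject": "Apple",
        "target_new": {"str": "London"},
    },
]

# After applying both edits, probe: "Bill Gates works in the city of ____" -> London?
```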
*Some more chaining rewrite candidate pairs*
* change => (person) citizen of (country) <br> change => (country) official language (language) <br> question: check the native language of the person?
* change => (author) place of birth (country) <br> change => (country) official language (language) <br> question: (a literary work of author) was written in (language)?
* change => (person_1) child of (person_2) <br> change => (person_2) follows the religion (religion) <br> question: (person_1) follows the religion (religion)?
# Experiment 2 (old)
**Entity** (Person, Structure, Item) --- could be anything <br>
--> Considering **Entity** as **Person** for the initial setup
<br>
**Entity** has **Properties**
--> Considering the following **Properties** of a **person**
* Profession (does for living)
Might be a bit difficult to find a prompt that can effectively generalize to different professions. Let's say we initially consider only players for simplicity. Then we can invoke with
- "`<entity>` plays the sport of"
- "`<entity>` is a member of team"
- "`<entity>` was awarded"
* Date of birth
- "`<entity>` was born on the date of"
* Place of birth
- "`<entity>` was born in"
- "Birth place of `<entity>` is"
* Citizen of
* Native language
- "`<entity>` speaks the language"
- "The first language of `<entity>` is"
* Religion
* Alma Mater
- "`<entity>` graduated from"
* ... ... ... etc.
## Step 1: Make something like a **Player Biography** dataset by scraping wikidata
## Step 2: Make a bunch of players play a different game and compare the results before and after applying MultiROME.
<span style="color:red">! How do we compare the results?<span/>
Let's say we make the rewrite request below
```
{
"prompt": "{} plays the sport of",
"subject": "LeBron James",
"target_new": {"str": "Soccer"},
}
```
Some properties like `Date of birth`, `place of birth`, `nationality`, `native language` **SHOULD NOT** change. (Can we just compare greedy generations of the pre-edit and post-edit models to check the *bleed-over* for these properties? See the sketch below.)<br>
But maybe some properties should change: LeBron James can't play in the NBA or win the MVP award now that he plays soccer. How do we evaluate that?
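A sketch of the simple bleed-over check mentioned above: compare greedy generations before and after the edit for prompts about properties that should not change (the model/tokenizer handles and the prompt list are illustrative):
```python
def greedy_generation(model, tok, prompt, max_new_tokens=15):
    """Greedy continuation of `prompt`, returning only the newly generated text."""
    ids = tok(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=False)
    return tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True)

unchanged_property_prompts = [
    "LeBron James was born on the date of",
    "LeBron James was born in",
    "The first language of LeBron James is",
]

# pre  = {p: greedy_generation(model_pre_edit,  tok, p) for p in unchanged_property_prompts}
# post = {p: greedy_generation(model_post_edit, tok, p) for p in unchanged_property_prompts}
# bleed_over = {p: pre[p] != post[p] for p in unchanged_property_prompts}
```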