## Lecture 1 -- Threat models, definitions
- Different definitions of privacy (skipped)
- Privacy adversaries: "others" -- network adversary, curious real world entities etc
- => provider is semi-trusted (e.g. bound by GDPR and other laws)
- => provider is not trusted (no one is trusted except myself)
- => this course mostly assumes the last trust model; the infrastructure is not trusted either
- Adversary is strategic -- chooses optimal way to mount an attack
- Properties:
- Confidentiality (in storage, during processing, in transit)
- Pseudonymity: user can use a resource without disclosing their real identity, but can still be held accountable for that use
- Anonymity: the user cannot be identified, directly or indirectly (no personally identifiable information)
- => anonymity does not imply privacy! E.g. if a database contains only one entry from Zurich and one knows who that person is, the entry is re-identified
- Unlinkability: Multiple uses of service by same user cannot be linked together by other entities
- Unobservability: User may use service without others being able to observe that usage
- Unobservability => anonymity
- Plausible deniability: impossible to prove that a user knows/did something (the user can claim they did not do it / one cannot filter out fake entries)
## Lecture 2 -- SMC
- SMC with private inputs x1, x2, ..., xn: enables the parties to jointly compute F(x1, .., xn) s.t.:
- Output is correct
- No party learns more than what can be derived from its own input and the output
- In some sense, implement a TTP (SMC protocol can be abstracted as a trusted party computing F(x1, .., xn))
- Even if SMC protocol is secure, the composition of SMC with other parts of system might leak!
- E.g. F(x1, x2) := x1 < x2 allows user to find exact value of x2 using binary search
- *Honest but curious*: parties follow protocol honestly, but try to learn as much as possible
- *Malicious*: adversary can deviate from protocol to learn as much as possible
### Yao's Garbled Circuits
- For 2 honest-but-curious parties, implements a way to compute the result of a logical circuit C, evaluated on private truth values x_A, x_B
- Slides 14f: basic principle (assign random labels to the possible truth assignments of x_A, x_B => construct the garbled truth table with entries Enc(b)\_{label(x_A) \|\| label(x_B)} (encrypt each result bit b with a key composed of the two labels))
- To evaluate: Alice sends the garbled truth table and label(x_A) -> Bob obtains label(x_B) and decrypts b with key {label(x_A) \|\| label(x_B)} (only one of the rows in the truth table can be decrypted); a toy sketch follows at the end of this subsection
- Need _oblivious transfer_ to send label(x_B) => Bob must not learn anything about label(1-x_B), and Alice must learn nothing about x_B
- Solution: slide 19 (basic idea: replace x_B by two copies x_B1 = x_B2 with different labels and compute one of the intermediate labels for x_A AND x_B1 (the label differs depending on the result of the AND) using the same mechanism => use that AND-label to finally evaluate the result)
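A minimal Python sketch of garbling and evaluating a single AND gate, only to illustrate the label/encrypt idea above; it is not the exact construction from the slides (SHA-256 is used as an ad-hoc cipher, and all names are made up):

```python
import hashlib, os, random

def H(key_a: bytes, key_b: bytes) -> bytes:
    """Derive a one-time pad from the two wire labels."""
    return hashlib.sha256(key_a + key_b).digest()

def garble_and_gate():
    # A random 16-byte label for each possible value of each input wire.
    labels_A = {0: os.urandom(16), 1: os.urandom(16)}
    labels_B = {0: os.urandom(16), 1: os.urandom(16)}
    table = []
    for a in (0, 1):
        for b in (0, 1):
            pad = H(labels_A[a], labels_B[b])[:16]
            # Output bit plus a zero tag, so the evaluator can recognize
            # the (single) row that decrypts correctly.
            plaintext = bytes([a & b]) + b"\x00" * 15
            table.append(bytes(x ^ y for x, y in zip(plaintext, pad)))
    random.shuffle(table)  # hide which row corresponds to which inputs
    return labels_A, labels_B, table

def evaluate(table, label_a: bytes, label_b: bytes) -> int:
    pad = H(label_a, label_b)[:16]
    for ct in table:
        pt = bytes(x ^ y for x, y in zip(ct, pad))
        if pt[1:] == b"\x00" * 15:  # only one row passes the tag check
            return pt[0]
    raise ValueError("no row decrypted")

labels_A, labels_B, table = garble_and_gate()
# Alice sends `table` and labels_A[x_A]; Bob gets labels_B[x_B] via OT.
x_A, x_B = 1, 1
print(evaluate(table, labels_A[x_A], labels_B[x_B]))  # -> 1
```

In the real protocol Bob obtains `labels_B[x_B]` via oblivious transfer, so Alice never learns x_B and Bob never sees the label for 1-x_B.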
### BGW arithmetic circuits
- N honest-but-curious parties
- Split a secret x into N additive shares in a field: pick random x2, .., xN; set x1 = x - sum(x2, .., xN) (see the sketch at the end of this subsection)
- Sharing inputs:
- either _input sharing setting_: each participant holds a private input and distributes shares of it to the other participants
- or _outsourced computation setting_: the parties holding secrets secret-share their inputs to a set of computing nodes
- Implemented in practice; details on how this is achieved are skipped
- Was used in **Estonian study** to check hypothesis why CS students don't graduate
- SMC had higher accuracy than other data anonymizing techniques.
- But was very slow (~400 hours)
- SMC relies on parties that are **non-colluding** (otherwise they can reconstruct secret)
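A minimal sketch of the additive sharing described above (toy field, honest-but-curious setting; addition of two shared secrets is purely local, multiplication as in full BGW is omitted):

```python
import secrets

P = 2**61 - 1  # prime modulus defining the field (arbitrary toy choice)

def share(x: int, n: int) -> list[int]:
    """Split x into n additive shares with x = sum(shares) mod P."""
    shares = [secrets.randbelow(P) for _ in range(n - 1)]
    shares.append((x - sum(shares)) % P)
    return shares

def reconstruct(shares: list[int]) -> int:
    return sum(shares) % P

# Addition of two shared secrets is purely local: each party adds its shares.
a_shares = share(42, 3)
b_shares = share(100, 3)
sum_shares = [(a + b) % P for a, b in zip(a_shares, b_shares)]
assert reconstruct(sum_shares) == 142
```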
## Lecture 3 -- Homomorphic Encryption
- Take input Enc(x), compute F(Enc(x)), s.t. F(Enc(x)) = Enc(F(x))
- Has one computing party (e.g., cloud provider), party providing input and party reading result (latter two can also collude)
- Works with only a single untrusted party (*unlike SMC*)
- Potential use case: compute f(x) on some x that is stored encrypted on server
- Server could send the entire encryption of x and client computes f(x) locally => but what if x is a huge DB => very wasteful!
- Use case 2: Bob wants to learn output of f(x,y) for Alice's secrets x,y without learning anything about x,y => Bob sends pk to Alice, Alice sends Enc(pk,x), Enc(pk,y) to cloud, which computes c=f(Enc(pk,x), Enc(pk,y)) => Bob can decrypt c and learns f(x,y)
- Formal def: slide 26; basically public key crypto plus a function Eval(pk, f, c) -> c' that computes c' s.t. Dec(sk, c') = f(Dec(sk, c))
- Also compactness: output of Eval is polynomial in size of security parameter
- *Semantic Security*: Adv gives m0, m1, receives Enc(pk, mb) => cannot guess b with probability non-negligibly better than 1/2
- *Circuit private*: Enc(pk, f(m)) is indistinguishable from homomorphically evaluating f on Enc(pk, m)
- Again assume *honest-but-curious* threat model (protocols exist for malicious adversaries, but are more expensive)
- Homomorphism on groups (G, +), (H, \*): h: G -> H is a homomorphism if for all x,y in G: h(x + y) = h(x) \* h(y) (similar for rings)
- Examples for homomorphic encryption schemes: slides 40ff. (a generic toy example follows at the end of this section)
- Fully-homomorphic encryption: preserves addition _and_ multiplication on plaintext => lattice based constructions
- Example scheme: slide 45ff (basic idea: secret key s over {-1, 0, 1}^N^ => Enc(m)[i] = delta\*m[i] + s[i]\*a[i] + e[i] where a is a random "mask" and e is (small) noise)
- Problem: noise e increases when doing `+` and `*` homomorphic operations (linearly resp. polynomially)
- When noise grows too large, might destroy plaintext (rounding error in decryption!)
- Strategy: optimize order of operations to avoid too large growth, e.g. `(m1 * (m2 * (m3 * m4))) ~> (m1*m2)*(m3*m4)`
- This is called *somewhat homomorphic encryption*
- Also, this scheme is pretty inefficient => use SIMD to evaluate in practice
- SHE is practical, FHE not quite as of today
- FHE is malleable (`c' = Eval(f, c)` is indistinguishable from `c_garbage = Enc(pk, garbage)`) => can be solved using proofs that `c'` really is the result of `Eval(f, c)`
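As a generic illustration of a (partially) homomorphic scheme -- not necessarily one of the examples on the slides -- textbook RSA is multiplicatively homomorphic: multiplying ciphertexts yields a ciphertext of the product.

```python
# Toy parameters: tiny primes, no padding -- insecure, illustration only.
p, q = 61, 53
n = p * q                       # 3233
phi = (p - 1) * (q - 1)
e = 17
d = pow(e, -1, phi)             # private exponent (Python 3.8+)

def enc(m: int) -> int: return pow(m, e, n)
def dec(c: int) -> int: return pow(c, d, n)

m1, m2 = 7, 9
c_prod = (enc(m1) * enc(m2)) % n        # multiply ciphertexts...
assert dec(c_prod) == (m1 * m2) % n     # ...decrypts to the product
```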
## Lecture 4 -- Legal aspects
Kept brief -- otherwise the summary is useless. Look at the slides for details
### GDPR and stuff
- Privacy laws are "old"; data protection laws emerged in the 1970s.
- Today: GDPR as "most important law" -- data protection law
- Privacy-concerns: adversarial thinking, prevent as much data as possible from leaking.
- Vs. data protection: Data controller should be trustworthy -- guarantees by regulations / laws.
- Different actors:
- Data controllers: determine what to do with data (don't have to have access to data!)
- Data subjects: *natural* persons to whom any information relates
- Data processors: Do what controller tells them to the data
- Actors can also switch roles: If cloud provider starts acting maliciously (e.g. logging, manipulating, ... data), they become additionally controllers, not only processors!
- Processing: E.g. storage, collection, erasure, consultation, making available, ...
- Processing of data based on consent, unless it's necessary (e.g. need to know address to deliver stuff; or legal obligations in financial industries etc.)
- Legitimate interest also classified as necessary => sometimes abused by advertisers (it's my legitimate interest to collect data, otherwise I won't earn money)
- Data protection rights of consumers: slide 16
- Court cases: skipped
### Website tracking
- Same origin policy for cookies: websites are only allowed to modify cookies matching their domain
- => websites started telling browsers to query many domains to circumvent this
- Cookie syncing: These domains collaborate to share information about user's web usage
- Bidding for ads: service requests ad to be shown; bidding with *a lot* of bidders takes place, winning ad is shown
- Problem: bids go to thousands of companies and contain very private data (e.g. interests, also things like "incontinence" or "infertility")
- Basically, it's quite a dirty business.
- Responsibility: *Data controllers* => remember, they don't always see data!
- Also quite challenging, e.g. with homomorphic encryption -> the controller controls the processing of the data, but cryptographically has no way of learning anything about it!
## Lecture 5 -- Privacy Preserving Data Publishing / k-anonymity
- Large amounts of data are collected => how to publish *useful* data in a privacy-preserving way
- Naive approach: simply remove personally identifiable information *PII* (name, e-mail etc.)
- Latanya Sweeney: matched voting records with an "anonymized" dataset -- could de-anonymize the dataset based on those records!
- Privacy threats:
- Membership disclosure: individual is in a dataset (e.g. of ex-convicts, ...)
- Attribute disclosure: individual's anonymity set (AS) in the DB is known, but all members of the AS share critical info (e.g. the individual's AS contains only cases of sexual assault)
- Record disclosure: individual is in the DB and their AS has size 1 (due to quasi-identifiers), so the exact record is identified
- Key-attribute (e.g. identifier, name, ..) vs. quasi-identifier (combination of education, zip-code, ... => allows to uniquely identify users!)
- *k-anonymity*: each person in the database cannot be distinguished from at least k-1 other individuals in the DB (a small check is sketched at the end of this section). Techniques:
- Generalization: E.g. replace "city" with "canton"
- Suppression: remove sensitive fields like e-mail
- Generally: finding DB D' that is k-anonymous and has minimum information loss is NP-hard
- Problems with k-anonymity: does leak when sensitive values lack *diversity* (e.g. all entries from "Zurich" have cancer)
- *l-diversity*: if there exist at least l well-represented values for the sensitive attribute (example above has 1-diversity for "Zurich")
- But: doesn't consider semantics of sensitive values -- e.g. if diversity is achieved with the set {AIDS, herpes, ...} => still learn that anyone in that AS has a venereal disease
- Neither considers the original distribution of values: if in the original DB 99% have cancer, but in the anonymized one only 50%, this leaks a lot of information
- *t-closeness*: the distance between the distribution of the sensitive attribute in this class and the distribution in the entire DB is no more than *t*
- => But even given k-anonymity, l-diversity and t-closeness, the DB might leak! => *auxiliary data* might be available to the attacker, leaking information!
- Netflix example: could link users' ratings with their public IMDb ratings, leading to de-anonymization
- High dimensional data is *sparse*, thus it's less likely that two individuals look alike
- => PII's cannot be removed in general, as whether something is PII depends on additional information of the adversary!
- Synthesized data: use ML-stuff to create "similar" database that offers the same statistical insight as original DB, but does not use the same dataset
- Long story short: same issues (high dimensional data, non-predictable outcomes of synthetic data)
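A small sketch (with made-up columns) of how k-anonymity and l-diversity of a released table can be checked, given a choice of quasi-identifier columns:

```python
from collections import defaultdict

rows = [
    {"zip": "80**", "age": "20-30", "disease": "flu"},
    {"zip": "80**", "age": "20-30", "disease": "cancer"},
    {"zip": "80**", "age": "20-30", "disease": "flu"},
    {"zip": "90**", "age": "30-40", "disease": "cancer"},
    {"zip": "90**", "age": "30-40", "disease": "cancer"},
]
QI = ("zip", "age")         # quasi-identifier columns
SENSITIVE = "disease"

groups = defaultdict(list)  # anonymity sets, keyed by quasi-identifier values
for r in rows:
    groups[tuple(r[c] for c in QI)].append(r[SENSITIVE])

k = min(len(g) for g in groups.values())         # smallest anonymity set
l = min(len(set(g)) for g in groups.values())    # least diverse set
print(f"table is {k}-anonymous and {l}-diverse")  # -> 2-anonymous, 1-diverse
```

The second group also illustrates the diversity problem above: the table is 2-anonymous, yet everyone in that anonymity set has cancer.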
## Lecture 6 -- Privacy Preserving Publishing / Differential Privacy
### Definition of DP
- Basic idea of last lecture: anonymize the DB and publish the *entire* DB => alternative idea: process potential queries to the DB "under my control" to control what information can be gained.
- Naive approaches: skipped
- E.g. problems: denial of query leaks information,
- Multiple sequential queries can learn more things, ...
- => generally: utility loss not quantifiable, hard to formalize leakage
- *Differential privacy*: For any entity `e` in DB `D`, the result of any query on `D` should be the same as on `D\{e}`
- => plausible deniability of e
- Typically achieved by adding noise
- Suppose e.g. "how many animals larger than 2 meters are in D" => remove giraffe from D, then the result differs by 1! => addition of noise hinders that the adversary can learn that with certainity
- Formally: Mechanism `M` is *epsilon*-differentially private if for all `D1, D2` which differ in one element, we have `P[M(D1) = Output] <= exp(epsilon) * P[M(D2) = Output]`
- This essentially tells us that the difference between the output distributions on two neighboring DBs is bounded by some epsilon
- Alternatively: how well can an attacker distinguish two DBs in the worst case?
- Epsilon can be interpreted as a "privacy budget" that can be consumed => measures worst case leakage
- *Sequential composition*: If `M1, .., Mk` are `eps_i`-differentially private, then the composition `M = M1 ... Mk` is `eps = eps_1 + .. + eps_k`-private (always on the same dataset!)
- *Parallel composition*: If `M1, .., Mk` are `eps`-differentially private, and `D1, .., Dk` are **disjoint** subsets of `D`, then `M(D) = {M(D1), .., M(Dk)}` is `eps`-differentially private.
- *Post-processing*: any additional computation on the result `O` of a differentially private mechanism stays `eps`-differentially private (if the adversary does not have additional knowledge about the DB)
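A tiny sketch of how the composition rules above translate into a "privacy budget" accountant (illustrative only; the class and method names are made up):

```python
class Accountant:
    """Track the total epsilon spent against a fixed budget."""
    def __init__(self, budget: float):
        self.budget = budget
        self.spent = 0.0

    def sequential(self, *epsilons: float) -> None:
        self.spent += sum(epsilons)      # same dataset: eps_1 + ... + eps_k
        assert self.spent <= self.budget, "privacy budget exhausted"

    def parallel(self, *epsilons: float) -> None:
        self.spent += max(epsilons)      # disjoint subsets: max eps
        assert self.spent <= self.budget, "privacy budget exhausted"

acc = Accountant(budget=1.0)
acc.sequential(0.2, 0.3)   # two queries on the same data
acc.parallel(0.4, 0.4)     # one query per disjoint partition
print(acc.spent)           # 0.9
```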
### Mechanisms achieving DP
- Different ways:
- Input perturbation = adding noise to data before the computation
- Output perturbation = adding noise to the (static) output of a function
- Algorithm perturbation = add noise in the algo
- => depends on where the privacy boundary is
- Input perturbation: Slide 26 -- set privacy boundary between data and function evaluated on data
- Output perturbation: Slide 29 -- compute f(D) and add laplacian noise
- Param for the Laplacian: delta_f / epsilon (delta_f is the sensitivity of f: the maximum difference an adversary could observe between two neighboring, unperturbed DBs: `delta_f = max |f(D) - f(D-r)| for all D, r`, or alternatively: the maximum influence a single individual can have on the result); see the sketch at the end of this subsection
- Slide 33: how to measure sensitivity "in practice"
- => trade off between choosing epsilon (privacy guarantees) and utility loss (how much information can be gained from data)
- Input vs. output perturbation: Do we trust the aggregator? (if yes: output perturbation)
- Case study of Apple's emoji thingy: 36ff
- Case study of Google's route-publishing: 42ff
- Problem during design: they estimated the sensitivity wrongly! In practice, the sensitivity might well be *unbounded*
- Pitfall 2, unknown category: suppose a new entry in the DB falls into a new category (e.g. "swim" was not in {crawl, fly, run} before -- adding a fish requires "swim") => no privacy for any animal in this new category!
- Need to know all possible categories beforehand
- Pitfall 3: DP partially destroys information, especially of outliers (by design)
- *But*: if one is interested in those outliers (e.g. salaries of minorities), DP is not suitable
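A minimal sketch of output perturbation with the Laplace mechanism, using the scale delta_f / epsilon from above on the counting-query example (sensitivity 1); numpy is assumed:

```python
import numpy as np

rng = np.random.default_rng()

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

heights = [1.80, 2.10, 2.30, 1.95, 2.60]        # toy data
true_count = sum(h > 2.0 for h in heights)       # "how many are taller than 2m?"
# A counting query has sensitivity 1: one individual changes it by at most 1.
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(true_count, round(noisy_count, 2))
```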
## Lecture 7 -- Privacy-Preserving Authorization / ZK-Proofs
- Crypto primitive allowing users to prove possession of an attribute s.t. the verifier cannot learn *anything else* (except the claim is true)
- *Zero Knowledge Proof*: Prove that a statement is true without revealing information beyond that the statement is true
- Completeness (any true statement can be proven), soundness (no proof of false statements) and zero-knowledge (see above)
### Schnorr's Proof of Identification (and other ZK schemes)
- (naive attempts skipped)
- Slide 17ff: Prover has sk, wants to convince verifier
- Prover has sk x and pk h = g^x^; picks random r, sends R = g^r^. Verifier sends challenge c. Prover sends `s = r + cx` (mod the group order); a sketch of one round follows at the end of this subsection
- Verification: `Rh^c = g^s`
- **Proof of knowledge**: Use a *rewinder*: Run1 -> rewind -> Run2 => get `s1 = r + c1 x` and `s2 = r + c2 x` => we have `x = (s1-s2)/(c1-c2)`
- (intuition: if `x` can be learned by rewinding, then it must have been used in ZK scheme in non-trivial way)
- Alternatively: a cheating prover can answer at most one challenge
- **Proving ZK property**: Show that for every verifier, there is a simulator that can produce transcript that "looks like" a real transcript without using the secret
- Intuition: deniability: the secret was not involved in creating this transcript
- Other ZK schemes: 26f.
- E.g. Cloudflare uses a ZK-based scheme for Tor users: 31ff.
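A sketch of one round of Schnorr identification as described above, with toy group parameters chosen here for illustration (q denotes the prime order of the subgroup; real deployments use much larger groups):

```python
import secrets

# Toy subgroup of order q in Z_p^* with generator g (q | p - 1).
p = 2267                       # prime; p - 1 = 2 * 11 * 103
q = 103                        # prime order of the subgroup
g = pow(2, (p - 1) // q, p)    # an element of order q

x = secrets.randbelow(q)       # prover's secret key
h = pow(g, x, p)               # public key h = g^x

r = secrets.randbelow(q)       # commitment randomness
R = pow(g, r, p)               # prover -> verifier: R = g^r
c = secrets.randbelow(q)       # verifier -> prover: challenge
s = (r + c * x) % q            # prover -> verifier: response

assert pow(g, s, p) == (R * pow(h, c, p)) % p   # verification: g^s = R * h^c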
### Attribute Based Credentials (ABC or anonymous credentials)
- Check authorization of user based on attributes
- Slide 42ff; basic idea: authenticate at an issuer; gain signed credentials and use them at verifier (service provider) to prove authorization
- Properties:
- Unforgeability
- Selective disclosure: User can hide irrelevant attributes
- Issuer unlinkability: Issuer cannot distinguish between issued credentials (cannot recognize a previously issued credential)
- Verifier (multi-show) unlinkability: Verifier cannot distinguish credentials of different users (cannot link consecutive showings of same credential)
- Example scheme: slide 54ff.
- Blocklist for revoked credentials: 63ff
- Basic idea: use an *additional* ZK proof that my transaction token is a) correct and b) not on the revocation list (if this is done naively, tokens allow de-anonymization!)
- Could also limit the number of uses for credentials -- limit the number of tokens a user can create to achieve that.
## Lecture 8 -- Machine Learning
- Main privacy risk is during the deployment stage => when making the model available e.g. via an API
- => ML model essentially is just compressed training data
- Without protections (DP, see later), sharing model essentially shares raw data
- Membership inference attacks (MIA): find out whether a given sample was in the training data => the model has higher confidence on training data than on newly seen data
- Worse with large generalization gap (i.e., more overfitting)
- Can find a threshold T (confidence above T = sample was in training data) with only black-box access to the model and information about the architecture
- Privacy concern: e.g. an ML model modelling the amount of medication for AIDS patients -- membership inference leads to attribute inference (=> Bob has AIDS)
- Collaborative Machine Learning (CML): e.g. federated learning: each participant keeps their data secret and only send updates to central server
- => server aggregates the gradients into a global update, which clients then apply, and so on
- => based only on the observed gradients, a malicious server can learn about the clients' data
- Gradient inversion: find x', y' s.t. gradient(x', y') ~= gradient_client => works for simple models
- Can use more elaborate schemes (e.g., trap weights) for less simple models
- Malicious clients: can modify their gradient sent to server to influence other client's computation! (since the centralized gradient update will be based on all seen gradients)
- => gradient ascent trick, maximizes loss instead of minimizes
- Defense: Differential Privacy: if g = gradient(x, y), use g' = perturbation(g, epsilon) to update model
- Essentially: bound the sensitivity of the gradient and then (additively) add noise(epsilon); a sketch follows at the end of this section
- Noise can be added by users (no trust in server nor clients, lot of utility loss) or at server (trust in server, but not other clients, not as bad w.r.t lost utility)
- Instance level DP: Adv cannot tell if an instance (e.g. an image sample) was used during training
- User level DP: Adv cannot tell whether a user participated in training
- User-level implies instance level
- => Distributed DP: malicious server, honest-but-curious (hbc) clients
- Each user applies fraction of the noise
- Aggregation is trusted (uses MPC for that -- the server only sees the final aggregated gradient, no one sees individual gradients)
- => achieves user level privacy under reasonable assumptions.
- But: if user doesn't add noise, global model is not epsilon-DP anymore!
- DP partially destroys information -- bad for small data sets, not grave for large enough data sets => don't use DP without enough data
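A sketch of the "bound sensitivity, then add noise" defense from above, in the style of DP-SGD: clip each per-example gradient and add Gaussian noise before averaging (the mapping from the noise scale to a concrete epsilon needs a privacy accountant, which is omitted; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng()

def dp_gradient(per_example_grads: np.ndarray, clip_norm: float, noise_mult: float) -> np.ndarray:
    """per_example_grads has shape (batch_size, dim)."""
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    # Clip each example's gradient so its influence (sensitivity) is bounded.
    clipped = per_example_grads * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    # Add Gaussian noise scaled to the clipping norm, then average.
    noisy_sum = clipped.sum(axis=0) + rng.normal(0.0, noise_mult * clip_norm, size=per_example_grads.shape[1])
    return noisy_sum / len(per_example_grads)

fake_grads = rng.normal(size=(32, 10))            # pretend per-example gradients
update = dp_gradient(fake_grads, clip_norm=1.0, noise_mult=1.1)
print(update.shape)                               # (10,)
```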
## Lecture 9 -- Location privacy
- Scenario: Direct communication with (not generally trusted) server (unlike with aggregator publishing data etc.)
- => Metadata! Location, OS, IP, applications on device, ... => many pseudoidentifiers
- Location data is sensitive (Identity of person is given by their home and work place!)
- Can uniquely identify most users in anonymized location dataset based on 4 known points!
- Same goes for home (mostly at night) plus work place (mostly 09-17)
- Can predict where user will move next (Markov chains, ML stuff) or identify POI for users
- Location revealed as part of functionality, metadata (photos!), IP addresses, APs, ..
### Protection techniques
- Perturbation: add 2-dimensional eps-differentially-private noise (e.g. Laplacian) => huge tradeoff for utility vs privacy; see the sketch at the end of this section
- Slightly different flavor: reduce the number of released points to limit sequential composition (which consumes privacy budget epsilon) => only add a new noisy point if it is too far from the previous one
- Or: only release the best of several geo-indistinguishable candidate points => needs prior knowledge to judge the quality of the random points
- Issues: most points might be complete nonsense and easy to filter out (on the ceiling of a building, in the lake, ...)
- Hiding:
- Only reveal parts of the location trace (e.g. 50% of the points), chosen uniformly at random
- Reveal only when needed (at intersections, e.g.)
- Generalization: reduce the precision of points, e.g. by aligning them to a grid; cloaking
- Fixed cloak: Cloak = grid
- Location dependent cloak (cloak centered around true location) => this does not help, the adversary simply takes the center of the cloak
- k-anonymity based: Find cloak that contains k other ppl, but:
- If k ppl are very close, this does not achieve location privacy (still get precision in +- 5m^2, e.g.)
- Generally: achieves anonymity, but not location privacy!
- Adding Dummies: Add decoy locations, but it's difficult to create plausible dummies (again, in the middle of the lake)
- Also: a strategic adversary can compute Pr(user is at point p | masked point) => with enough data seen, using Markov chains, it is possible to evaluate the likelihood of different original trajectories!
- Measuring the effectiveness of a defense is normally done by measuring the performance of the adversary given the defense
- Aggregate statistics (hiding in crowd) -- can _still_ leak, c.f. hidden military base in desert
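A sketch of the perturbation defense: one common way to sample 2-D (geo-indistinguishability-style) Laplacian noise is in polar coordinates, using the Lambert W function for the radius (numpy/scipy assumed; the coordinates and epsilon below are made up):

```python
import math
import numpy as np
from scipy.special import lambertw

rng = np.random.default_rng()

def planar_laplace(x: float, y: float, eps: float) -> tuple[float, float]:
    theta = rng.uniform(0.0, 2.0 * math.pi)             # random direction
    p = rng.uniform(0.0, 1.0)
    # Inverse CDF of the radius of the planar Laplacian (Lambert W, branch -1).
    r = -(1.0 / eps) * (lambertw((p - 1.0) / math.e, k=-1).real + 1.0)
    return x + r * math.cos(theta), y + r * math.sin(theta)

# eps is in 1/metres: smaller eps -> more noise -> more privacy, less utility.
print(planar_laplace(2683000.0, 1247000.0, eps=0.01))
```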
## Lecture 10 -- Anonymous Communications
- Src, dst IP address are weak identifiers! E.g. person contacting hospital => some medical issue there
- Traffic analysis: adversary observes IDs, timing, volume, length, ... of traffic
- => mix nets try to destroy patterns, load balancing, distribute trust (by rescheduling, repacketing, rerouting)
- Different adversaries:
- Partial vs. Global (partial: sees some nodes / links / ..., but not complete view)
- Active vs. Passive
- Internal vs. External (can see inside Mix?)
- BUT cannot break crypto or see inside nodes outside of their control
- Goals: sender / receiver anonymity, bidirectional anonymity (sender and receiver cannot identify each other), unobservability, unlinkability (messages cannot be linked to either sender or recipient)
- Broadcast achieves receiver anonymity! But has problems with coordination, bandwidth, ...
### Examples of AnonCom
- DC network (dining cryptographers): (slide 23ff for the scheme; a toy round is sketched at the end of this subsection)
- Achieves sender anonymity (and receiver anonymity since it's essentially a broadcast) => achieves essentially perfect anonymity
- For each message, need at least n messages! => only feasible for small networks
- Ring, Tree or Centralized
- Does **not** work with network churn!
- Attacks: network partitioning if adv. has control over multiple nodes => reduces size of AS by a factor of 2
- Crowds: Browse the internet with sender anonymity (slide 45)
- For server, sender is indistinguishable among all N clients
- Only provides sender anonymity
- No encryption possible! The server does not know the identity of the sender and thus cannot exchange a key!
- Attack: predecessor attack (adv as node in network -- log predecessors, over multiple messages whp the most frequent predecessor was sender)
- Mixnets: Send *messages* (not streams) anonymously
- Sender and / or receiver and / or 3rd party anonymity (depending on config)
- Mix must be honest -- use distributed mixing to remove trust from a single mix
- Need enough traffic of other users to hide
- Attacks: (n-1) attack (block other traffic and flood the mix with adversary traffic so that only Alice's message goes through the mix as an unknown, then see where her traffic comes out)
- Onion routing / Tor: Send streams anonymously
- Sender / receiver / 3rd party anonymity
- But: there are strong adv. that can break Tor => only useful against weak adversaries
- Attacks: Stream tracing attacks (model E[output=stream(input)], observe seen output streams), Load estimation (corrupt server injects pattern, corrupt Tor node on path measures the delay), website fingerprinting (websites are unique wrt loaded HTML contents (images, ...), even based on metadata)
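A toy sketch of a single DC-net round with three parties and a one-byte message, illustrating why the sender stays anonymous (each pairwise key is XORed in exactly twice and cancels; details of the scheme on the slides may differ):

```python
import os

n = 3
msg = 0x42                      # party 0 wants to broadcast this byte

# Every pair of parties shares a random one-byte key.
pair_key = {(i, j): os.urandom(1)[0] for i in range(n) for j in range(i + 1, n)}

def announcement(i: int) -> int:
    out = 0
    for (a, b), k in pair_key.items():
        if i in (a, b):
            out ^= k            # XOR all keys shared with other parties
    if i == 0:
        out ^= msg              # the sender additionally XORs the message
    return out

result = 0
for i in range(n):
    result ^= announcement(i)   # each key appears twice and cancels out
print(hex(result))              # -> 0x42, but no one can tell who sent it
```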
### Disclosure attacks
- Deanonymization over long time even in a perfectly anonymous system WITH churn
- Suppose perfect k-anonymity for an anonymous communication system => can still find the set of peers of Alice over many rounds:
- Learning phase: knowing that Alice communicates with M people, find M sets B1, ..., BM (Bj = set of online network participants in a time frame where Alice sends) that are mutually disjoint
- Since the Bj are mutually disjoint and we have M sets, each peer of Alice appears in exactly one of B1, ..., BM => for future sets B (observed when Alice sends), if B intersects exactly one Bj (j <= M), refine Bj to Bj ∩ B; repeat until each Bj has size exactly 1 (a sketch follows at the end of this section)
- Can do something similar based on statistics
- expected number of messages per receiver is higher for friends of Alice when observing Bj from above
- Discl. attacks work better with large N! Since the ratio of friends / all participants is smaller (the expected number per non-peer is O(1/N), the one for friends is O(1/#friends), thus the difference in observed traffic per peer becomes large)
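A sketch of the set-refinement step of the disclosure attack described above, with made-up names (`B` holds the M disjoint candidate sets from the learning phase; each observation is the receiver anonymity set of a round in which Alice sends):

```python
def refine(B: list[set[str]], observations: list[set[str]]) -> list[set[str]]:
    """B: the M mutually disjoint candidate sets from the learning phase."""
    for obs in observations:
        hits = [j for j, Bj in enumerate(B) if Bj & obs]
        if len(hits) == 1:      # the observation intersects exactly one B_j
            B[hits[0]] &= obs   # refine that candidate set
    return B

B = [{"bob", "carol", "dave"}, {"eve", "frank"}]        # learning phase, M = 2
obs = [{"carol", "dave", "grace"}, {"frank", "heidi"}, {"dave", "ivan"}]
print(refine(B, obs))           # -> [{'dave'}, {'frank'}]: Alice's two peers
```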
## Lecture 11 -- Censorship resistance
- Adv: prevent communication between parties:
- 1. Find the flow (based on dest (IP, ports, ..), content (keywords, domains, ...), flow properties (inter-arrival times, ...) or protocol behavior)
- 2. Prevent communication (block destination, degrade performance, corrupt routing (BGP / DNS attacks), user-side censorship (local software), publisher-side censorship)
- Censorship resistant systems must protect multiple phases:
- 1. Communication establishment: obtain credentials / addresses of server: obfuscate aliveness (don't respond to incorrect sequences) or service (only respond with *hidden* service on correct sequences) of server
- 2. Conversation: Avoid prevention / modification:
- Obfuscate destination (Tor, Decoy routing)
- Hide: Mimicry (look like whitelisted traffic), tunneling (tunnel traffic over unblocked application), covert channel (hide censored traffic among other traffic, e.g. in videostream)
- Tor Directory Authorities (DA) keep a list of known relays, update it each hour and achieve majority-based consensus with the other DAs.
- [Establishment] Adv. could block access to all nodes in DA list => hence, Tor has bridges
- If the (public) access relays are blocked by default, contact BridgeDB, which responds with 3 bridge IPs
- BridgeDB is rate limited => it's expensive for a censor to enumerate all Tor bridges this way (needs >= n/3 different nodes!)
- But: the adversary can control a middle relay (those are only contacted by either public entry nodes *or* bridges) and filter out all known public entry nodes => the remaining peers must be bridges
- => slide 21 for complete attack
- [Communication] ScrambleSuit hinders DPI fingerprinting: it sits on top of TCP, changes its shape to hinder classification, ...
- Defense against probing: use a secret (established out-of-band, e.g. when contacting BridgeDB)
- Again -- the censor can also gain access to these secrets (it can contact bridges, too), thus rate limit => increases the cost of censorship
- ScrambleSuit also destroys traffic patterns (high variation of packet lengths, arrival times, ...)
- => **but**: nothing else looks like this! Thus, detectable, thus blockable
- Other option: SkypeMorph "tunnels" traffic over the Skype call protocol (immediately dropping the call, so it's only a "partial" tunnel) and mimics the traffic pattern of a Skype call
- But: the behaviour differs from legitimate Skype clients and can thus be easily detected by an active attack! See slide 36
- Other option2: CensorSpoofer: send initial request over indirect channel (hard to detect; http requests have similar patterns for legitimate traffic). Server spoofs src-IP to a dummy host, thus response passes firewall.
- Similar here: Hard to defend against active attack, observing patterns
- Other option3: Cloud transport: slide 41
- Other option4: use covert channel: hide among *live* stream, ...
- Telex: friendly on-path ISP, censor between ISP and client => use hashed tag (looks like random nonce in TLS handshake) to tell the ISP which packets to proxy
- Can again be detected, e.g. list all ASes and observe different behaviours (slide 50)
- Crazy Ivan attack: flipping routers for honest users vs. Telex users yields observable differences => can distinguish routers!
- Still, Telex makes censorship crazy expensive
- Generally: censorship resistant systems need to comply with the semantics of the protocol they hide in -- don't just imitate it, be it (e.g. actually tunnel over the real application)
## Lecture 12 -- Online Tracking
### Stateful tracking
- store data on device / browser, e.g. cookies
- Store data in Browser cookie, flash cookie, ...
- and communicate data back to tracker's backend
- Browser cookies maintain HTTP sessions => can be abused to link sessions over different sites and / or timepoints
- Same origin policy: only allows access to cookies if the domain matches
- Cookie syncing: merge cookies of different tracker's domains (slide 21 for mechanism)
- Can even survive cookie deletion: Evercookies store cookie information in many places to make it harder to delete
- But 3rd party cookies can be detected and blocked => Thus, adv. tries to abuse 1st party:
- [Bounce tracking] intercept click, redirect to tracker site, place cookie as 1st party, redirect to true site
- [CNAME cloaking] use 1st-party subdomain with DNS CNAME record => third party looks like being a subdomain of 1st party
- Google's brilliant ideas: slide 30f.
- Can also use PII (register at service with email) for tracking
### Stateless tracking
- Construct a unique fingerprint of the device / user based on characteristics of the device (a minimal sketch follows at the end of this section)
- E.g. fingerprinting browser based on clock skew, plugins, fonts, floating point precision, ...
- Tracker can tell which fonts are installed based on the observed length / width of rendered text (if not installed, a default font is used => detectable)
- Similarly for canvas fingerprinting
- History sniffing: find out browser history
- E.g. based on color of links, DNS timings, ...
- Browsing history can reveal real identity!
- Can even track across devices (the ultrasound whacky shit)
- ... or record activities of visitors
- Even worse for smartphones and smartTVs
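A minimal sketch of the stateless idea: hash a set of device/browser characteristics into a stable identifier, so no cookie needs to be stored (the attribute names are made up):

```python
import hashlib, json

def fingerprint(attrs: dict) -> str:
    canonical = json.dumps(attrs, sort_keys=True)          # stable serialization
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

attrs = {
    "user_agent": "Mozilla/5.0 ...",
    "fonts": ["Arial", "Comic Sans MS", "Helvetica"],
    "canvas_hash": "a3f9c2",
    "timezone": "Europe/Zurich",
}
print(fingerprint(attrs))   # same attributes -> same ID, without any cookie
```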
## Lecture 13
**Won't be covered in exam!!!!**