# (1/3) ANONYMOUS PUBLISH-SUBSCRIBE OVERLAYS - 2016
###### tags: `Tag(HashCloak - DC Net Readings)`
Authors: Jörg Daubert
Paper: [https://tuprints.ulb.tu-darmstadt.de/5473/9/2016-05-03%20ClassicThesis_2.pdf](https://tuprints.ulb.tu-darmstadt.de/5473/9/2016-05-03%20ClassicThesis_2.pdf)
Summary Part 2: [https://hackmd.io/BMGvOx0mRByO74n13VAH5g?view](https://hackmd.io/BMGvOx0mRByO74n13VAH5g?view)
Summary Part 3: [https://hackmd.io/i76Y2zOIQR-hAdVX-XDkHQ?view](https://hackmd.io/i76Y2zOIQR-hAdVX-XDkHQ?view)
### Table of Contents
[toc]
:::info
>Abstract: This thesis improves the state-of-the-art in anonymous Pub/Sub in several areas. In particular, the thesis addresses the following aspects of constructing anonymous Pub/Sub systems: (i.) Building blocks that reduce the complexity of constructing anonymous Pub/Sub systems are proposed; (ii.) methods for anonymously establishing Pub/Sub overlay networks are presented; (iii.) a method for inter-overlay optimization to distribute load is introduced; (iv.) methods for optimizing overlays regarding anonymity are proposed, and (v.) anonymity attacks and countermeasures are presented.
:::
## Contributions
**Anonymous Overlay Establishment**
An anonymous Pub/Sub system is presented as six self-contained building blocks, with the goal of establishing overlay networks that transport notifications from publishers to subscribers.
**Anonymous Overlay Optimization**
The thesis proposes two optimizations for anonymity and one optimization for balancing the load.
* The first anonymity optimization, probabilistic forwarding (PF), applies the concept of mimic traffic to the domain of Pub/Sub.
* The second anonymity optimization, the shell game (SG), shuffles the overlay.
Both optimizations prevent an exposure of information to attackers that can gain knowledge about the overlay topology.
> The load-balancing optimization uses a ring communication and Bloom filters to distribute load among nodes.
**Anonymity Attacks and Countermeasures**
Several well-known anonymity attacks are adapted to the domain of anonymous Pub/Sub and evaluated in detail.
**Evaluation**
An anonymous micro-blogging application for Twitter shows that the presented system can be implemented for a real-world use case: With the application, users exchange tweets via hashtag-overlays, and cryptographic keys via quick response (QR)-codes and near field communication (NFC).
## 1 INTRODUCTION
> Many of today’s Internet-based applications are centered around the exchange of information between users. This information exchange often follows the many-to-many paradigm \[147\], meaning many users consume the same information, whereas many users also contribute to this information.
Pub/Sub is a form of message dissemination that follows exactly this type of information exchange paradigm. Pub/Sub introduces the notion of a publisher, an entity that produces messages, and a subscriber, an entity that consumes messages. Subscribers articulate their interest in messages via so-called subscriptions. The Pub/Sub system facilitates the message exchange from publisher to subscriber. The two entities are loosely coupled in several ways and are not required to know each other.
### 1.1 Problem Statement
> As of today, this Pub/Sub functionality is typically realized in a centralized manner by the platform provider of the application.
Such a central service results in the service provider learning about all messages exchanged between users; this causes a privacy issue.
Attempts have been made to resolve the privacy issues; mostly via access control \[6\] and encryption \[138\]. Both approaches, however, merely protect privacy against undesired users and spying eyes: encryption, without being part of an anonymization service, only protects the confidentiality of information; encryption does not protect metadata, and thus not the anonymity.
Anonymization services used today \[44, 47\] are designed for point-to-point (PtP) communication rather than many-to-many communication. As a result, these services either cause signaling overhead regarding messages or again require a centralized service to mediate the messages.
The following research questions will be answered in this thesis:
* How can a distribution overlay network for Pub/Sub be established without revealing the identities of publishers and subscribers?
* How can these identities be concealed while notifications of arbitrary size and at arbitrary rate are distributed in such an overlay?
* What are realistic anonymity attacker models for such systems, what attacks can be performed, and how can the attack success be measured?
* Which approaches can be used to protect anonymity against these attacks?
* How can an anonymous Pub/Sub system be optimized, e.g., to handle churn and to balance the load, without compromising anonymity?
### 1.2 Contributions
> This thesis provides several novel contributions to the research questions mentioned above, as well as to anonymous Pub/Sub in general.
**Anonymous p2p Publish/Subscribe**
The first contribution of this thesis is a generic model for anonymous Pub/Sub. This model serves as a framework for a novel categorization of anonymous Pub/Sub into so-called building blocks, each addressing a specific functionality.
![](https://i.imgur.com/or1Dmwl.png)
As shown in Figure 1, the system performs attribute localization (the building block responsible for locating relevant publications) on top of the basic membership layer, which is formed by a building block responsible for establishing connectivity between users.
The system is then capable of performing content distribution with low latency and without any restrictions regarding message size and rate. This system protects anonymity against malicious insiders, i.e., nodes in the basic membership that possess key material and can generate valid messages.
> POCs of this system have been published in \[35, 59\].
**Attacks Against Anonymous Pub/Sub**
The request/response-based attack has been published in \[38\] and can be applied to the anonymous P2P system by colluding malicious insiders. This attack adaptively probes the communication network and evaluates the responses from other nodes.
![](https://i.imgur.com/tNmZIND.png)
These responses are used to refine a probability distribution as shown in Figure 2 that can be ultimately used to de-anonymize the responders, e.g., subscribers in the case of Pub/Sub.
> Further attacks contributing to this thesis have been published in \[39\] and use more invasive methods, such as interrupting the connection to a node in the communication network to monitor the response.
**Anonymity-Enhancing Technologies**
> Even slight alterations to an anonymity attacker model can cause a system to fail its purpose.
Additional anonymity-enhancing technologies are then required to overcome the challenge of new attacks. This thesis proposes two novel anonymity-enhancing technologies for anonymous Pub/Sub as the third contribution.
1. The first technology is called probabilistic forwarding and adapts the related concepts of cover traffic and mimic traffic \[57\] to Pub/Sub.
* The concepts are adapted in such a way that they cannot be abused by known attacks and such that a combination with other technologies is possible.
2. The second technology is called the shell game (SG) and shuffles an overlay network—the middle layer in Figure 1.
* The overlay is shuffled in such a way that an attacker cannot de-anonymize participants by observing the overlay, e.g., using topological properties.
**Anonymous Overlay Optimization**
> Once an overlay network has been constructed in such a way that anonymity is preserved, it is subject to continuous changes and may degenerate in its properties \[149\].
Overlay paths may increase in length and thus cause message delay; forwarding load may become unevenly shared among the nodes and cause overload.
## 2 Background
### 2.1 Publish/Subscribe
> Pub/Sub is a content-based routing scheme \[152\].
Pub/Sub differs from other distributed programming patterns such as inter-process communication (IPC), remote procedure call (RPC), distributed objects, and tuple spaces. These differences are the loose coupling between sender and receiver, many-to-many communication, and information dissemination based on interest and filters \[51\].
Loose coupling allows sender and receiver to exchange messages without being required to know each other. Such decoupling between sender and receiver is achieved via a proxy between both. Loose coupling is advantageous for anonymity: senders may remain anonymous towards receivers and vice versa. However, senders and receivers do not remain anonymous towards the proxy.
Many-to-many communication enables a sender to disseminate messages to multiple receivers at once. For that, the sender addresses the message to a group rather than a single recipient.
> The dissemination based upon interests and filters rather than recipient addresses is key for loose coupling. However, multiple interests of a receiver can be used to create a profile. Prior research \[105\] has shown that such profiles can be correlated with profiles of other data sources. Such a correlation may lead to a re-identification of the receiver, and thus to a violation of the receiver’s anonymity
#### 2.1.1 Concept and Terminology
> This section briefly introduces the core concepts of Pub/Sub and defines the terminology that will be used throughout this thesis.
**Overview**
> A basic Pub/Sub system as depicted in Figure 3 contains participants fitting one or more of the following roles: the publisher, an event service, and the subscriber.
![](https://i.imgur.com/1MinPBh.png)
* The publisher is the producer (sender) of events, and publishes those to the event service.
* The subscriber is the consumer (receiver) of events.
* To receive events from the event service, the subscriber first places a subscription at the event service.
* Upon receiving a publication from the publisher, the event service compares the publication with the subscription, and then sends a notification to the subscriber (push mechanism).
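The push-based interaction above can be sketched as a minimal topic-based event service. The class and method names below are hypothetical and only illustrate the three roles:

```python
from collections import defaultdict

class EventService:
    """Minimal topic-based event service (illustrative sketch)."""
    def __init__(self):
        self.subscriptions = defaultdict(list)  # topic -> subscriber callbacks

    def subscribe(self, topic, callback):
        # The subscriber places a subscription at the event service.
        self.subscriptions[topic].append(callback)

    def publish(self, topic, event):
        # On a publication, compare it with the subscriptions and
        # push a notification to every matching subscriber.
        for callback in self.subscriptions[topic]:
            callback(event)

received = []
service = EventService()
service.subscribe("news", received.append)  # subscriber
service.publish("news", "hello")            # publisher: delivered
service.publish("sports", "ignored")        # no matching subscription
```

Publisher and subscriber never address each other directly; only the event service holds the mapping, which is exactly the loose coupling described above.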
**Comparison Methods**
The event service decides which notifications the subscriber receives from the publisher. For that, the event service compares notifications with the subscription; several methods of comparison are known:
* Topics: With topics, notifications are annotated with one or more topics, e.g., keywords.
* Subjects: Compared to topics, subjects express additional context, comparable to Usenet groups and DNS.
* Type: With types, subscriptions specify the type of notification. Such types can refer to the data type, e.g., text and binary, or object type.
* Content: With content, subscriptions can specify filters over the full notification, including its content.
> Additional comparison methods exist: spatial comparison \[70\], semantic comparison \[9, 20, 72, 118, 139, 146\], and parametric content-based comparison \[78\]. However, these methods primarily resemble special cases of the aforementioned methods.
As this thesis is dedicated to anonymous rather than confidential Pub/Sub, the topic-based comparison will be used throughout the remainder of the thesis. Topic-based comparison introduces the least cryptographic complexity, and thus least deviates from anonymization mechanisms.
**Subscription Methods**
> Pub/Sub with subscriptions causes processing overhead as the event service has to compare every notification with every subscription.
With advertisements, a publisher first advertises the topic/subject/type it will use for notifications to the event service. Together with the subscription, the event service establishes a fixed routing path between publisher and subscribers. As a consequence, no consecutive comparisons are necessary.
This thesis focuses on advertisement-based subscription methods. As topic-based comparison has been selected, notification distribution paths can be established via advertisements, as no additional comparison per notification is required.
**Message Dissemination**
> A variety of methods are used to transport notifications from the event service to subscribers as shown in Figure 4.
The node type f represents forwarders (brokers, event service), R transport layer multicast (TLM) capable routers, and s subscribers. The infrastructure nodes f and R are marked in gray, the subscribers s in white.
![](https://i.imgur.com/QxC224i.png)
* Point-to-point: The event service establishes a direct connection to each matching subscriber and delivers the notification.
* TLM: The event service makes use of network- and transport-layer multicast mechanisms to deliver a notification to multiple subscribers at once.
* Application Layer Multicast: The event service uses a P2P multicast \[8\] to distribute the notification via overlay members to multiple subscribers at once.
* Broker Network: The event service distributes the notification to multiple brokers via application layer multicast (ALM). The brokers then disseminate the notification via PtP, TLM, or ALM to subscribers.
ALM mechanisms also establish distribution trees and meshes, and thus scale with the number of subscribers as well. However, ALM relies on peers to execute the notification dissemination algorithm at the application layer. Thus, existing infrastructure such as routers can hardly be used for ALM, and new peers are required. According to Esposito and Ciampi \[49\], ALM appears to be the dominant solution for secure Pub/Sub systems from academia.
> In summary, ALM offers scalability as well as the flexibility to design the application-layer routing algorithms in an anonymity-preserving manner. The remainder of this thesis, therefore, focuses on ALM-based Pub/Sub approaches.
#### 2.1.2 Requirements
> A Pub/Sub system must comply with a plethora of functional and security requirements.
Functional as well as security requirements have to be discussed in combination with anonymity:
* Confidentiality: Confidentiality, as the first requirement of the CIA triad, enforces that information is not disclosed to unauthorized participants. Such information occurs in three variations in Pub/Sub:
1. Information Confidentiality: No information must be disclosed to the event service, i.e., the infrastructure.
2. Subscription Confidentiality: No information about a subscription—and thus about the interests of a subscriber— must be leaked to another subscriber and publisher.
3. Publication Confidentiality: No information about a publication (notification) must be leaked to another publisher and unauthorized subscriber.
* Integrity: Integrity as the second confidentiality integrity availability (CIA) triad requirement ensures the completeness and correctness of data. For Pub/Sub, Wang et al. \[162\] distinguish the following three variations:
1. Information Integrity: Information integrity requires that no information, e.g., stored data or message in transmission, is altered or lost without authorization.
2. Subscription Integrity: Subscription integrity requires that no subscription be altered or lost.
3. Service Integrity: Service integrity requires that the event service is not altered.
* Availability: Availability as the third CIA triad requirement ensures that information is available when needed. That includes that notifications are delivered without delay.
> This requirement can be also called scalability \[50\].
* Authentication: Authentication in Pub/Sub requires receivers to be able to verify the identity of senders and vice versa.
* Accountability: Accountability requires all actions, e.g., sending and receiving a message, to be associated with an actor. Furthermore, the actor must not be able to dispute the action.
* User Anonymity: User anonymity requires users—subscribers and publishers—not to be identified and observed by other users or the event service.
* Low Overhead: Mechanisms to protect anonymity introduce additional overhead.
* Low Computational Overhead: requires that the computational effort, in particular for the Pub/Sub functions match and cover, be as low as possible.
* Low Space Overhead: requires that the consumed space regarding memory be as low as possible.
* Low Signaling Overhead: requires the number and sizes of messages to be as low as possible.
> To comply with service integrity, an event service needs to match subscriptions with publications. Optionally, the event service may aggregate subscriptions by checking if one subscription is already covered by another. This function addresses scalability.
* Matching: Assuming a set of attributes $A$, given a notification $m_{notif}$ as well as a subscription $m_{sub}$, and a function $A(m)$ that extracts all attributes from a message $m$, $match(m_{notif},m_{sub})$ returns true if $A(m_{sub}) \subseteq A(m_{notif})$.
* Covering: Given two subscriptions $m_{sub,1}$ and $m_{sub,2}$, $cover(m_{sub,1},m_{sub,2})$ returns true if $A(m_{sub,1}) \subseteq A(m_{sub,2})$,
> i.e., $m_{sub,1}$ is more general and covers $m_{sub,2}$.
### 2.2 Anonymity
The terms anonymity and privacy are intrinsically related.
**Privacy** refers to the ability to seclude oneself.
**Anonymity** is defined as “Anonymity requires that other users or subjects are unable to determine the identity of a user bound to a subject or operation”
> Following these definitions, this thesis considers anonymity as a part of privacy. In particular, anonymity is treated as a requirement concerned with the protection of personally identifiable information.
The following two subsections introduce the concepts of anonymous communication, as well as metrics to measure the success of these concepts.
#### 2.2.1 Concepts
With an anonymization service, participants usually use a public communication channel, e.g., the Internet, and are thus exposed to various attackers \[48\]. Anonymization services attempt to fulfill the following requirements in such a scenario \[113\]:
* Unlinkability: Two or more messages cannot be linked together, i.e., to the same sender, to the same receiver, or to the same attribute.
* Sender Anonymity: No message must be linked to its sender.
* Receiver Anonymity: No message must be linked to any of its receivers.
* Relationship Anonymity: No receiver must be linked to the sender and vice versa.
> In the case of a continuous conversation between sender and receiver, the sender is also called initiator, and the receiver is called responder.
During the past decades, three main concepts to achieve these anonymity goals have emerged:
1. Group communication: that ensures that senders and receivers “hide” within a group of potential senders and receivers.
* While group communication can protect anonymity, additional protection such as encryption is required to ensure confidentiality.
2. Proxy nodes: that act “on behalf” of a sender or receiver.
* Assuming that only the relayed messages are observable, proxy servers can provide sender anonymity.
* Assuming that the last relayed message is not observable, such a proxy chain can provide receiver anonymity.
> Figure 5 depicts the types of MIX-nets. From left to right: a cascade with 2x2 nodes, a stratified MIX with 2x2 nodes in which messages can be relayed from any node to any node of the next layer, and a 4x2 free-route MIX.
![](https://i.imgur.com/tlDIzvD.png)
A MIX is also called a high-latency anonymization service, whereas a proxy is called a low-latency anonymization service. The high latency of a MIX occurs as a MIX shuffles messages, i.e., it collects several messages rather than forwarding those messages directly.
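The batching that causes the high latency can be sketched as follows. The `Mix` class and its batch threshold are illustrative assumptions, not the design of any specific MIX:

```python
import random

class Mix:
    """A single MIX node: collect a batch of messages, then flush them shuffled."""
    def __init__(self, batch_size, rng=None):
        self.batch_size = batch_size
        self.pool = []
        self.rng = rng or random.Random()

    def receive(self, message):
        self.pool.append(message)
        if len(self.pool) >= self.batch_size:
            return self.flush()
        return []  # high latency: nothing leaves until the batch is full

    def flush(self):
        batch, self.pool = self.pool, []
        self.rng.shuffle(batch)  # break the arrival order (unlinkability)
        return batch

mix = Mix(batch_size=3)
out = []
for m in ["a", "b", "c"]:
    out.extend(mix.receive(m))
# out contains a, b, c in shuffled order; the first two messages waited
```

Because output order is decoupled from arrival order, an observer cannot link an outgoing message to the incoming message that produced it, at the cost of delay.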
3. DC-nets: as a special case of group communication.
* With DC-nets, noise is added to every message, so that neither a global observer nor any participant learns the contents of the messages.
* After all messages are exchanged, the noise balances itself out, and the content of the message is revealed, but the sender remains anonymous.
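A one-round DC-net over XOR illustrates how the noise cancels: each pair of participants shares a secret, every participant announces the XOR of its shared secrets, and the sender additionally XORs in the message. A minimal sketch for a single message byte, assuming honest participants:

```python
import secrets

def dc_net_round(message_byte, sender_index, n):
    """One DC-net round over XOR with n participants."""
    # Each unordered pair (i, j) shares a random secret byte.
    keys = {}
    for i in range(n):
        for j in range(i + 1, n):
            keys[(i, j)] = secrets.randbits(8)

    announcements = []
    for i in range(n):
        # Each participant XORs together all of its pairwise keys...
        value = 0
        for (a, b), k in keys.items():
            if i in (a, b):
                value ^= k
        # ...and the sender additionally XORs in the message.
        if i == sender_index:
            value ^= message_byte
        announcements.append(value)

    # XOR of all announcements: every key appears twice and cancels,
    # so only the message remains -- but no announcement identifies the sender.
    result = 0
    for v in announcements:
        result ^= v
    return result
```

Any single announcement looks uniformly random, which is why neither an observer nor a participant can attribute the revealed message to its sender.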
#### 2.2.2 Anonymization Systems
This Section introduces selected systems that have a notable impact or have introduced novel mechanisms to anonymous communication.
* Onion Routing (Tor): The initiator of a message encrypts the message multiple times and thus creates layers of encryption around the message—the onion. An onion layer is formed as follows:
1. the initiator I takes the message,
2. appends the address of a node R,
3. and encrypts the result with the public key of another node Q.
As a result, node Q can decrypt the message (peel off an onion layer), read the address of node R, and forward the message to node R.
![](https://i.imgur.com/OqTpAmk.png)
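The layering can be sketched as follows. For readability, public-key encryption is replaced by an insecure XOR stand-in; `build_onion`, `peel`, and the JSON layer format are illustrative assumptions, not Tor's wire format:

```python
import itertools
import json

def xor_crypt(key: bytes, data: bytes) -> bytes:
    # Stand-in for public-key encryption; XOR is NOT secure, illustration only.
    return bytes(b ^ k for b, k in zip(data, itertools.cycle(key)))

def build_onion(message, hops):
    """hops: list of (node_key, next_address) pairs, first relay to last."""
    onion = message
    for key, next_addr in reversed(hops):  # wrap innermost layer first
        layer = json.dumps({"next": next_addr, "inner": onion})
        onion = xor_crypt(key, layer.encode()).hex()
    return onion

def peel(key, onion_hex):
    """A relay removes one layer: it learns only the next hop, not the path."""
    layer = json.loads(xor_crypt(key, bytes.fromhex(onion_hex)).decode())
    return layer["next"], layer["inner"]

hops = [(b"key-of-node-A", "nodeB"), (b"key-of-node-B", "dest")]
onion = build_onion("hello", hops)
next1, inner1 = peel(b"key-of-node-A", onion)   # node A forwards to nodeB
next2, inner2 = peel(b"key-of-node-B", inner1)  # node B delivers to dest
```

Each relay sees only its own layer: node A learns the next hop but not the destination or payload, which is the core of the sender-anonymity argument.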
* Freenet (Darknet Mode): is a P2P information storage, search, and retrieval system \[24\]. Freenet contains two operation modes, opennet and darknet.
* The opennet mode of Freenet already addresses anonymity by omitting E2E addresses in information search. Freenet uses flooding (recursive routing) instead \[160\].
* The darknet mode further restricts the opennet mode by limiting neighbors to trusted nodes. Sender anonymity, receiver anonymity, unlinkability, and relationship anonymity are assumed to be fulfilled in a darknet.
As the darknet mode of Freenet uses recursive routing, no node IDs, e.g., IP address, leaves the direct, trusted neighborhood.
> Darknets have emerged from a mode in Freenet to a generic concept in anonymous communication, e.g., darknets with Tor \[80\], where proxies are selected based upon their trustworthiness.
#### 2.2.3 Metrics
**Set Sizes**
Definition: “Anonymity is the state of being not identifiable within a set of subjects, the anonymity set.” \[112, 113\]
Sender anonymity can be measured by the number of all potential senders (Figure 7 left). Likewise, the receiver anonymity can be measured by the number of all potential receivers (Figure 7 right). A similar definition of anonymity is used by k-anonymity \[150\], DC-nets \[22\] and MIX-nets \[21\].
![](https://i.imgur.com/wGXl8n1.png)
**$k$-Anonymity**
The case of k-anonymity well illustrates the challenge of defining anonymity metrics.
The original approach published by Sweeney et al. in 2002 defined a record in a medical dataset as k-anonymous (with $k \geqslant 2$) if for every key (equivalence class) of the record, at least (k − 1) other records in the dataset shared the same value (are indistinguishable).
* To overcome the drawbacks of the k-anonymity approach, Machanavajjhala et al. proposed l-diversity in 2007 \[98\]. They enforce that every key in the dataset should, at least, occur with l diverse values (every equivalence class has, at least, l well-represented values).
> The combined anonymity metric becomes the tuple (k, l).
With l-diversity, even if the values of an equivalence class are l well represented, the sensitivity of the values may differ.
* To overcome this issue, Li et al. proposed t-closeness in 2007 \[91\]. With t-closeness, values of an equivalence class have to be close to each other with a maximum distance of t.
> The anonymity metric becomes the triple (k,l, t).
With t-closeness, it is still assumed that the dataset is anonymized and released only once. However, some datasets may be released in intervals with incremental updates.
* To overcome this issue, Xiao et al. proposed m-invariance in 2007 \[169\]. With m-invariance, at least, m records must persist (remain invariant) across two consecutive dataset releases
> The anonymity metric becomes the quadruple (k,l, t, m).
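The basic k-anonymity and l-diversity checks reduce to counting equivalence classes over the quasi-identifier key. A sketch with a toy dataset, whose record layout is an assumption for illustration:

```python
from collections import defaultdict

def is_k_anonymous(records, quasi_identifiers, k):
    """Every equivalence class over the quasi-identifier key must contain
    at least k records (each record is indistinguishable from k-1 others)."""
    classes = defaultdict(int)
    for record in records:
        classes[tuple(record[q] for q in quasi_identifiers)] += 1
    return all(count >= k for count in classes.values())

def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """Every equivalence class must contain at least l distinct (well-represented)
    values of the sensitive attribute."""
    classes = defaultdict(set)
    for record in records:
        classes[tuple(record[q] for q in quasi_identifiers)].add(record[sensitive])
    return all(len(values) >= l for values in classes.values())

data = [
    {"zip": "130**", "age": "<30", "disease": "flu"},
    {"zip": "130**", "age": "<30", "disease": "cold"},
    {"zip": "148**", "age": "30+", "disease": "flu"},
    {"zip": "148**", "age": "30+", "disease": "flu"},
]
# Both equivalence classes have 2 records, so the data is 2-anonymous;
# the second class has only one sensitive value, so it is not 2-diverse.
```

The example shows exactly the drawback l-diversity addresses: a class can satisfy k-anonymity while its sensitive attribute is fully disclosed.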
It is difficult to establish an anonymity metric. In particular, it seems to be difficult to show that a metric cannot be attacked. For instance, Kesdogan et al. showed \[83\] that anonymity sets can be reduced significantly depending on the attacker’s capabilities.
**Entropy**
Set sizes as anonymity metric do not model additional knowledge of an anonymity attacker.
Serjantov and Danezis propose entropy \[132\] as an anonymity metric, based on the Shannon entropy \[133\]:
The entropy is defined as a scalar value over a probability distribution of all potential subjects $\Psi$. For every subject $u \in \Psi$, $p_u$ (Equation (2)) models the a posteriori probability—after the attacker observed messages—of $u$ acting in role $r \in R$, where $R$ can be either sender or receiver. The Shannon entropy $S$ (Equation (1)) provides the entropy over this probability distribution $p$ \[132\].
![](https://i.imgur.com/sKrRlAE.png)
Entropy as anonymity metric allows expressing fine-grained knowledge $p_u$ about every subject $u$. An entropy of $S = 0$ indicates no anonymity. An entropy of $S = \log_2 |\Psi|$ indicates perfect anonymity, i.e., the anonymity set has the size $|\Psi|$.
Diaz et al. extend this notion by the so-called degree of anonymity \[43\], a value derived from the normalized entropy that expresses how much the attacker learned by observing messages.
![](https://i.imgur.com/5TKEMwE.png)
Following the notation of Serjantov and Danezis, the degree of anonymity $d \in [0, 1]$ (Equation (4)) is defined as the ratio of the a posteriori entropy $S$ to the maximum entropy $H_M$ (Equation (3)), which represents perfect anonymity.
> With d = 0, the system provides no anonymity; with d = 1, the system provides perfect anonymity
Clauss and Schiffner argue that Rényi entropy \[122\] can be used as an anonymity metric as well \[25\]. Following Equation (5), the Rényi entropy $H_α(P)$ is a generalization of the Shannon entropy via the parameter $α$. In the limit $α → 1$, the Rényi entropy converges to the Shannon entropy.
> Clauss and Schiffner furthermore point out that entropy metrics are heavily influenced by outliers.
![](https://i.imgur.com/mRdahcz.png)
Anderson and Lundin propose a scaled version of Shannon entropy as anonymity metric \[5\]. The scaled anonymity set size $A$ (Equation (6)) is directly derived from the Shannon entropy (Equation (1)). A value of $A = |\Psi|$ denotes the maximum anonymity set size.
> $A = 1$ denotes no anonymity.
![](https://i.imgur.com/0N04Me3.png)
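The entropy-based metrics above can be computed directly from the attacker's posterior distribution. A sketch of the Shannon entropy $S$, the degree of anonymity $d$, and the scaled set size $A = 2^S$:

```python
import math

def shannon_entropy(p):
    # S = -sum(p_u * log2(p_u)) over the attacker's posterior distribution.
    return -sum(q * math.log2(q) for q in p if q > 0)

def degree_of_anonymity(p):
    # d = S / H_M with H_M = log2(|Psi|) (Diaz et al.).
    h_max = math.log2(len(p))
    return shannon_entropy(p) / h_max if h_max > 0 else 0.0

def scaled_set_size(p):
    # A = 2**S (Anderson and Lundin): effective anonymity set size.
    return 2 ** shannon_entropy(p)

uniform = [0.25] * 4                # attacker learned nothing
skewed = [0.97, 0.01, 0.01, 0.01]   # attacker is almost certain
# degree_of_anonymity(uniform) == 1.0 and scaled_set_size(uniform) == 4.0
```

The skewed distribution has four potential subjects, so its set-size metric is unchanged, yet its entropy and degree of anonymity collapse: exactly the attacker knowledge that plain set sizes cannot express.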
Tóth et al. argue that a so-called global metric, such as entropy, is insufficient as an anonymity attacker could already exploit a small deviation of $p_u$ from the a priori probability. They propose the local metrics source-hiding property $Θ$ and destination-hiding property $Ω$.
* An anonymization service is considered source-hiding with parameter $Θ$ if Equation (7) holds.
* Likewise, an anonymization service is considered destination-hiding with parameter $Ω$ if Equation (8) holds.
* Also, statements about global metrics (Shannon entropy $S$) can be derived from the local source-hiding property via Equation (9).
![](https://i.imgur.com/8Iz3UAo.png)
Reiter and Rubin use six degrees of anonymity \[120\]: absolute privacy, beyond suspicion, probable innocence, possible innocence, exposed, and provably exposed. This metric is local, i.e., it is unique for every subject, and it incorporates the graph topology of the crowd as well as assumptions regarding the anonymity attacker.
![](https://i.imgur.com/0eCgSgo.png)
Equation (10) describes the condition of probable innocence for the sender. Here, $p_f > 0.5$ describes the probability for forwarding in the system; $c$ denotes the number of colluding anonymity attackers.
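Rearranging the probable-innocence condition for the smallest crowd size that satisfies it gives $n \geqslant \frac{p_f}{p_f - 1/2}(c + 1)$; a sketch:

```python
import math

def min_crowd_size(p_f, c):
    """Smallest crowd size n providing probable innocence for the sender
    against c colluding jondos: n >= (p_f / (p_f - 1/2)) * (c + 1),
    which requires p_f > 1/2 (Reiter and Rubin's Crowds analysis)."""
    if p_f <= 0.5:
        raise ValueError("probable innocence requires p_f > 1/2")
    return math.ceil(p_f / (p_f - 0.5) * (c + 1))

# With p_f = 0.75, a single colluding attacker already requires a crowd of 6.
```

Raising $p_f$ shrinks the required crowd but lengthens the expected forwarding path, a latency/anonymity trade-off inherent to Crowds.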
**Summary**
No perfect metric to measure anonymity seems to exist. While the Crowds metric is intuitive, it is also highly specific to the Crowds system. The set size metric is also easy to comprehend but lacks the capability to model fine-grained attacker knowledge. The entropy-based metrics seem to be the most comprehensive metrics so far.
### 2.3 Anonymization Implementations & Services
> Based on the concepts and systems summarized in Sections 2.2.1 - 2.2.2, several ready-to-use implementations and even commercial services have emerged.
This Section briefly discusses the most cited implementations & services, as well as respective attacks and lessons to be learned.
#### 2.3.1 anon.penet.fi
> The service anon.penet.fi was a centralized proxy service for e-mail.
While the service was successful and handled many messages per day, lawful copyright claims forced the operator to disclose pseudonym mappings \[48\]. This case shows that a centralized pseudonym mapping causes issues safeguarding these mappings.
#### 2.3.2 Cypherpunk Remailers
> This service extends the concept of anon.penet.fi with multiple proxy servers
Cypherpunk remailers share concepts with today’s onion routing. However, these remailers do not pad messages, and may not implement good shuffling of messages (high batch size). Nevertheless, email typically allows for some latency, and thus would allow remailers to implement good shuffling.
#### 2.3.3 Mixminion
> Mixminion \[32\] is another incarnation of the remailer concept.
Compared to Cypherpunk remailers, Mixminion nodes maintain hashes of relayed messages for some time to prevent replay attacks. Likewise, Mixminion also restricts the reply blocks to be used only once. Mixminion also introduces the concept of directory servers to discover Mixminion nodes. This concept is still used today by Tor.
#### 2.3.4 Crowds
> Crowds \[120\] applies a routing scheme via proxies—not unlike hot potato routing— to anonymize senders.
Within Crowds, every node runs a routing process called jondo. Therefore, Crowds is a P2P approach. Jondos are assigned to crowds via the blender process. Such a decision can be made, for instance, based on physical locality.
* Whenever a sender wants to reach a receiver, it first establishes a path within the crowd.
* Compared to Tor circuits, the sender does not specify this path in Crowds.
* Every jondo, including the sender, just picks the next hop.
* For that, every jondo that receives a message chooses with a forwarding probability $p_f > 0.5$ to forward the message to another jondo, or to deliver it to the receiver otherwise.
* Once this path is established, sender and receiver communicate via this path.
Crowds protects against de-anonymization by malicious insider threats. No jondo can tell whether the previous jondo is merely relaying the message or is its original sender.
> Crowds does not apply any mixing or padding technology; therefore, an anonymity attacker who observes the message flow can trace a path in Crowds back to its origin.
#### 2.3.5 Tarzan
> Tarzan \[57\] constructs circuits consisting of proxy servers to relay UDP packets.
As there is no “crowd” assigned to Tarzan nodes by a central server, Tarzan relies on gossiping to discover neighbors. With gossiping (P2P), Tarzan connects to a known neighboring node, and asks it for more neighbors \[48\]. Using this approach, Tarzan nodes learn about proxies that can be used to build circuits.
To prevent anonymity attackers from tracing message flows, Tarzan uses cover traffic between nodes.
* For that, every node selects a couple of Tarzan nodes with which it exchanges cover traffic (mimic traffic).
* These nodes are called mimics. The mimic selection process is based on a deterministic function and publicly verifiable information such as IP address and date \[48\].
* Having this information, mimics can verify their selection, and thus prevent malicious insiders from forcing mimics to generate cover traffic and overload the system.
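The publicly verifiable selection can be sketched by ranking candidate peers with a hash over public inputs. The exact function below (SHA-256 over the two IP addresses and the date) is an assumption for illustration, not Tarzan's actual derivation:

```python
import hashlib

def select_mimics(node_ip, all_ips, date, k=3):
    """Deterministically select k mimics for node_ip from public information
    (IP addresses and the current date), so that any peer can recompute
    and verify the selection."""
    def score(peer_ip):
        # Public, reproducible ranking key for each candidate pair.
        return hashlib.sha256(f"{node_ip}|{peer_ip}|{date}".encode()).hexdigest()

    candidates = sorted((ip for ip in all_ips if ip != node_ip), key=score)
    return candidates[:k]

ips = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"]
a = select_mimics("10.0.0.1", ips, "2016-05-03", k=2)
b = select_mimics("10.0.0.1", ips, "2016-05-03", k=2)
# a == b: every peer recomputes the same set, so a malicious insider
# cannot claim arbitrary nodes as its mimics to generate cover traffic.
```

Because the inputs are public and the function is deterministic, a node asked to exchange mimic traffic can verify that it really was selected, which is the overload protection described above.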
#### 2.3.6 Tor
Tor \[44\] is the reference implementation of onion routing as introduced in Section 2.2.2. This section elaborates on additional functions of Tor beyond onion routing, as well as attacks against Tor.
##### 2.3.6.1 Hidden services:
> The concept of onion routing is supposed to protect sender anonymity, but not receiver anonymity
The second generation of Tor introduced a concept to protect receiver anonymity as well. To reach an unknown responder, Tor uses the notion of hidden services. With hidden services, the initiator addresses a particular type of service but does not care who provides this service.
The service provider instructs the exit guard of the circuit to keep the TCP port of the circuit open. As a result, every initiator contacting the guard using the specific port will reach the service provider, but will not reveal its identity. To inform initiators how to reach the guard, the service provider publishes a hidden service listing, containing guard, TCP port, public keys, in a central repository.
Tor recommends changing circuits on a regular basis to prevent prolonged traffic correlation by a malicious onion router. Hidden services, however, require the service provider to publish every new circuit (the guard) in the central repository. Furthermore, initiators have to obtain this information.
##### 2.3.6.2 Attacks
> Tor \[44\] is one of the most popular anonymization services today. Consequently, Tor has been subject to attacks and in-depth analysis.
Tor attacks can be categorized in passive and active attacks:
* With passive attacks, an attacker observes network traffic. This attacker can be a malicious onion router that is part of a Tor circuit, or an attacker that can perform large scale Internet traffic analysis.
* With active attacks, an attacker influences the behavior of the Tor system deliberately.
* Such an attacker can take the role of an initiator, which establishes a Tor circuit, and generates traffic, e.g., to overload the circuit.
* An active attacker can also take the role of an onion router that delays messages or marks packets.
Congestion: With the congestion attack, an attacker attempts to exhaust resources of onion routers to force initiators to pick other onion routers for their circuit \[52\]. This attack can be used to concentrate traffic onto onion routers controlled by the attacker.
Packet Spinning: The packet spinning attack \[110\] is a more elaborate version of the congestion attack. It abuses the fact that Tor nodes cannot detect loops in circuits: the attacker creates such loops and causes packets to spin in them indefinitely. As a result, this attack congests the involved onion routers as well, likewise causing initiators to select other onion routers. If an onion router does not respond within a timeout, the user will choose another onion router.
Traffic Analysis: The traffic analysis attack attempts to correlate incoming and outgoing traffic of onion routers. Ideally, every conversation between initiator and responder follows a unique traffic template. While this attack would require global observation capabilities, it can be refined to work with a weaker attacker model \[101\]. As everyone can establish arbitrary Tor circuits, a single active attacker can establish circuits and measure the delay (and thus load) of individual onion routers. Together with a malicious responder that generates load patterns, the attacker can predict which other onion routers are in the circuit based on the observed delay pattern.
Sniper: The sniper attack \[77\] is a special version of the congestion attack. It uses a weakness of Tor’s flow control to exhaust the memory of an onion router quickly with minimal resource consumption. This attack is possible as Tor does not use E2E duplicate detection. Therefore, it is possible to instruct onion routers to send large messages many times with a small request.
Cell Counter: The cell counter attack \[92\] is one example of marking traffic in a Tor circuit. Marking traffic helps colluding attackers to trace message flows. In the ideal case, the attacker controls the entry guard—the first onion router of a circuit—and the last onion router of a circuit (exit guard). Then, the entry guard marks packets in a way that the exit guard can detect.
* The cell counter attack is one specific method to mark packets.
* As Tor does not use TCP streams to communicate along the circuit, and as the payload is encrypted, only metadata fields of Tor can be used for marking.
* The cell counter is a metadata field used by Tor for flow control, similar to the TCP sequence number.
* Slight alterations to the cell counter can be used to mark circuits.
Bad Apple: The bad apple attack \[14\] uses an onion router to perform man in the middle (MITM) attacks (active attacks). The attack exploits that communication between the Tor exit guard and the receiver may occur without confidentiality and authenticity protection. A malicious exit guard can, therefore, read the message from the sender and modify the response.
> Assuming that the targeted node will directly connect to the malicious peer without Tor, the target node can be de-anonymized.
Fingerprinting: Fingerprinting attacks \[68, 163\] exploit the fact that the application layer may leak information that is reflected in Tor circuits by a fingerprint. With a database of fingerprints, every onion router of a circuit can correlate this circuit with application layer receivers. With HTTP / a web page as application layer responder, the attacker can build a database of fingerprints from common websites. The fingerprint is then constituted by the timed sequence of requests (HTML, JS, CSS, images)—also called timing attack—and the sizes of the responses.
Statistical Disclosure: The statistical disclosure attack \[33\] assumes that some properties of anonymous communication remain invariant over time, therefore, statistics can be used to reveal these invariant properties.
* For instance, if the communication between initiator and responder is invariant and the path through the anonymization service variant, simply counting messages of initiators and receivers would statistically reveal this pair.
* However, this attack requires an attacker that has global observation capabilities.
Predecessor: The predecessor attack \[167\] is a specific version of statistical disclosure that has been applied to Tor. This attack assumes that it is possible to associate a Tor circuit with a responder, e.g., via the fingerprinting attack.
* This attack assumes that the initiator creates many circuits over time to connect to the same responder.
* However, the current specification of Tor prevents the creation of a new circuit as long as the conversation between initiator and responder is ongoing.
#### 2.3.7 MorphMix
MorphMix \[121\] is a MIXnet that does not rely on fixed MIX cascades or directory servers. MorphMix uses gossiping to discover MIX nodes.
* Every node is required to know at least one MIX node upfront.
* The node asks the other MIX node for additional MIX nodes.
* To establish confidential communication with newly discovered MIX nodes, MorphMix applies a modified Diffie-Hellman (DH) key exchange to establish a symmetric key.
* This key exchange is assisted by two already known nodes to prevent a MITM attack.
#### 2.3.8 Anonymizer.com
Anonymizer Inc. offers a commercial anonymization service via the proxy concept. To forward all TCP/IP traffic via their proxy server, they use a virtual private network (VPN) tunnel between sender and proxy.
As Anonymizer is built on VPN connections and a proxy, it introduces low latency. However, as VPNs usually do not provide padding of messages, incoming messages of the proxy may be linked to outgoing messages.
#### 2.3.9 JonDo / AN.ON & JAP
JonDo is another commercial anonymization service2 offered by JonDos GmbH. Unlike Anonymizer, JonDo follows the MIX concept. JonDo is based on the open source AN.ON, which again is derived from Java Anon Proxy (JAP) \[12\]. JAP is a desktop application that acts as a web proxy server and connects to a MIX cascade. The MIX nodes of the cascade are operated by supporters of the AN.ON project, and JonDo respectively.
> JonDo introduced a message flag to make messages linkable across multiple MIX nodes. Still, as the MIX nodes of AN.ON and JonDo are operated by several legal bodies, court orders seldom address all operators (MIX nodes) and are thus hardly successful \[62\].
#### 2.3.10 $P_5$
> P5 is an anonymization service that uses the broadcast concept \[134, 135\].
P5 has not been realized yet and has only been modeled as a packet-level simulation.
* P5 assumes participants to organize an overlay tree.
* Within that overlay tree, a sender encrypts a message with the receiver’s public key, and broadcasts the encrypted message to all successors in the tree.
* These successors will then re-broadcast into their subtrees.
* Furthermore, all messages are padded to the same size of 1 KiB. With $P_5$, nodes can trade anonymity for message overhead: if the sender takes the tree root position, the receiver anonymity set becomes maximal. Likewise, the message overhead becomes maximal, as all other nodes will receive the broadcast. This mechanism protects receiver anonymity. To protect sender anonymity as well, $P_5$ nodes generate noise, i.e., packets without meaningful content.
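The padding and re-broadcast steps can be sketched as follows. This is a toy model only: encryption and noise generation are omitted, and the encoding of the tree as a child dictionary plus the names `pad` and `broadcast` are this sketch's assumptions, not part of $P_5$'s specification.

```python
PACKET_SIZE = 1024  # every P5 packet is padded to 1 KiB

def pad(payload: bytes) -> bytes:
    """Pad (or truncate) a payload to the fixed packet size, so
    observers cannot distinguish packets by their length."""
    return payload.ljust(PACKET_SIZE, b"\x00")[:PACKET_SIZE]

def broadcast(tree, node, packet, delivered):
    """Recursively re-broadcast a packet into all subtrees below
    `node`; `tree` maps a node to its children."""
    for child in tree.get(node, []):
        delivered.append((child, packet))
        broadcast(tree, child, packet, delivered)
```

Broadcasting from the root reaches every node, which illustrates why taking the root position maximizes the receiver anonymity set and the overhead at the same time.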
#### 2.3.11 Summary
> Even though Tor is the most popular implementation and has been improved and analyzed countless times—Google Scholar lists 2,639 citations of \[44\] as of August 20th, 2015—crucial challenges remain.
1. First, Tor was not designed to and cannot protect against colluding anonymity attackers.
2. Second, the application layer should be always considered when designing anonymization services.
In particular, strict network layer separation has not worked out well for anonymity in the case of Tor so far. Moreover, attacks on the application layer can still violate anonymity.
### 2.4 P2P Systems
> P2P overlay networks are an abstraction on top of an existing network layer, the underlay.
Compared to the client-server model, where one or more nodes take the role of a server and the clients solely connect to such server nodes, P2P omits this role distinction. Instead, all nodes assume both roles and are therefore equal.
Usually, P2P systems assume IP as an underlay to establish overlay networks. P2P systems can be classified based on how they establish overlay networks:
* Unstructured and
* Structured \[95\].
> Some P2P systems, such as Napster \[54\] and \[26, 86\], use the client-server model for some functionality and are thus called hybrid.
The most important functions performed by a P2P system are store, lookup, and retrieve \[95\]. The lookup may be combined with retrieval. As lookup requires little bandwidth compared to store and retrieve, hybrid systems realize lookup via a central server combined with a P2P system for the rest. This approach speeds up lookups regarding delay and the number of messages.
#### 2.4.1 Structured Overlays
> When storing information, structured overlays place this information at a location (node) that is determined by some structure. This structure is typically a distributed hash table (DHT).
A DHT consists of (key, value) pairs, where the key identifies the information and the value where (which node) to find it. To establish a uniform key representation, typically a hash of the information is used. As the DHT is distributed, no single node is required to keep a complete table. Instead, nodes determine from the key which neighboring peer is closer to the value, i.e., the information. Hence, a progressive localization is possible.
* To maintain a structure that allows the determination of the correct neighboring peer, nodes organize themselves in a ring that maintains a predecessor/successor relation among nodes. For that, nodes obtain an ID, typically from the interval \[0, 1); each ID represents a position within the ring \[148\].
* The lookup of information via such a ring structure may require traversing half of all nodes in the ring. To speed up the process, nodes may maintain more than two neighbors (predecessor, successor), and thus, introduce shortcuts within the ring—skip lists \[67\].
> A DHT in combination with skip lists provides fast lookup of information with guarantees of the upper bound of nodes to traverse.
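A toy rendition of ring-based placement, assuming node IDs drawn from $[0, 1)$ as above. The `responsible_node` function simply walks the sorted ring; a real DHT would additionally maintain skip-list-style shortcuts to bound the number of hops logarithmically. Both function names are this sketch's own.

```python
import hashlib

def key_id(data: bytes) -> float:
    """Map information to a uniform key position in [0, 1)."""
    h = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
    return h / 2**64

def responsible_node(node_ids, key):
    """Return the first node ID at or after `key` on the ring,
    wrapping around at 1.0 (successor placement)."""
    ring = sorted(node_ids)
    for nid in ring:
        if nid >= key:
            return nid
    return ring[0]  # wrap-around: key lies past the last node
```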
#### 2.4.2 Unstructured Overlays
> Opposed to structured overlays, unstructured overlays store information at random nodes. A lookup, therefore, requires searching for the information in the overlay
Two methods are used to perform such a search: the graph traversal algorithms flooding and random walk \[114\].
> Owing to the properties of these algorithms, flooding and random walks are efficient for looking up popular (highly replicated) information, whereas structured overlays are more efficient for rare information.
Flooding: With flooding, the initiator forwards its request to all neighboring peers. These peers again continue to forward the request to all their peers. Eventually, the request reaches a responder, who discontinues the flooding.
* To allow the responder to send a message back to the initiator, all forwarding peers are required to store some information about the request for a limited time.
* Such information can be the tuple consisting of message ID and previous peer.
* While flooding is well suited to reach a responder quickly, it causes signaling overhead.
* To overcome this drawback, the flooded messages contain a time to live (TTL) counter.
* The initiator initializes the TTL counter with a positive value.
* Every peer that forwards the message first decrements the TTL counter before sending the message.
* Once the TTL counter reaches 0, the flooding stops.
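The TTL-limited flooding above can be condensed into a short sketch. Two simplifications are this sketch's own: the TTL is decremented once per hop level (rather than carried in each message copy), and a `has_info` callback stands in for a responder holding the requested information.

```python
def flood(graph, initiator, ttl, has_info):
    """Flood a request until a responder is found or the TTL expires.

    `graph` maps node -> list of neighboring peers.
    """
    visited = {initiator}
    frontier = [initiator]
    while frontier and ttl > 0:
        ttl -= 1  # each forwarding hop decrements the TTL
        next_frontier = []
        for node in frontier:
            for peer in graph.get(node, []):
                if peer in visited:
                    continue  # duplicate suppression via stored message IDs
                visited.add(peer)
                if has_info(peer):
                    return peer  # responder discontinues the flooding
                next_frontier.append(peer)
        frontier = next_frontier
    return None  # TTL exhausted before reaching a responder
```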
Random Walk: Random walks can be considered a variation of flooding.
* Rather than flooding the message to all neighboring peers, with the random walk the message is only forwarded to exactly one randomly selected neighboring peer.
* Compared with flooding, random walks may cause less signaling overhead.
* However, the distance between requester and responder may be longer.
* Furthermore, random walks do not guarantee to reach a responder, even if a responder is within TTL distance of the initiator.
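A matching sketch of the random walk; the `rng` parameter and the path bookkeeping are this sketch's additions for illustration.

```python
import random

def random_walk(graph, initiator, ttl, has_info, rng=random):
    """Forward the request to exactly one random neighbor per hop.

    Unlike flooding, the walk may miss a responder even when one
    is within TTL hops of the initiator.
    """
    node = initiator
    path = [node]
    while ttl > 0:
        neighbors = graph.get(node, [])
        if not neighbors:
            break  # dead end: no peer to forward to
        node = rng.choice(neighbors)
        path.append(node)
        ttl -= 1
        if has_info(node):
            return node, path
    return None, path
```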
##### 2.4.2.1 Summary
Structured overlays provide the best performance for sparse data, whereas unstructured overlays are best suited for highly replicated data.
* Structured overlays: expose the most information to anonymity attackers, as node IDs are necessary and the attacker can therefore easily target the node whose ID is closest to the desired information.
* Unstructured overlays: are therefore best suited for anonymity.
### 2.5 Network Simulation
Simulations are one of the three core methods for evaluating the performance of computer systems \[75\]. A simulation abstracts from the real systems in the sense that only certain properties considered relevant are modeled in the simulation.
Anonymous communication services are particularly hard to evaluate empirically as they are designed to prevent analysis in the first place \[76\]. Therefore, network simulations, as well as formal analysis, are the dominant evaluation methods for anonymization services.
> For some services, e.g., Tor, specific network simulation tools such as Shadow \[76\] exist.
#### 2.5.1 Discrete Event Simulation
> A discrete event simulation is based on components emitting and consuming events.
The discrete event simulation processes events in chronological order without concurrency (discrete).
* A discrete event simulator, therefore, stores events within an event queue, sorted chronologically according to the simulation time.
* An event scheduler takes care of queuing events, deleting events, re-ordering events, and processing events.
The workflow within a discrete event simulation is as follows:
* The active component emits an event and hands it over to the event scheduler.
* The scheduler determines at which simulation time the event is going to be processed, and inserts the event at the appropriate position in the event queue.
* After all actions of the active component are complete, the scheduler pops the next event from the queue.
* If the event is scheduled to be processed in the future relative to the current simulation time, the scheduler advances the simulation time to the processing time.
* Then the discrete event simulator activates the consuming component, hands over the event, and waits for the component to finish processing.
> Discrete event simulations encapsulate the behavior of computer networks. For instance, nodes of a computer network can be represented as components, messages as events, and transmission delay as advancing time.
#### 2.5.2 OMNeT++
> Objective Modular Network Testbed in C++ (OMNeT++) \[115, 159\] is a discrete event simulator written in C++. Other common simulators are OPNET, and Network Simulator 2 & 3 \[109\].
This thesis considers OMNeT++ the simulator of choice, as it is free of charge for academic usage and provides many ready-to-use components from the Internet domain.
* OMNeT++ has been publicly available including all sources since 1997.
* OMNeT++ also supports all major operating systems, not only Unix-based ones.
#### 2.5.3 Summary
> In summary, network simulations are well suited to simulate anonymization services. Such simulations can abstract unnecessary details still required for application simulations such as Shadow. OMNeT++ is a mature tool for such simulations and already provides many implemented components from the Internet domain.
### 2.6 Summary
> This chapter has introduced the concepts of Pub/Sub, anonymity, P2P, and network simulation with the goal of establishing the necessary background knowledge that is being used throughout this thesis
Publish/subscribe is well suited to realize many-to-many applications in a distributed rather than a centralized manner. However, Pub/Sub also requires access to possible confidential information to route this information. Thus, anonymous Pub/Sub has to be carefully designed.
Anonymity can be sub-divided into four goals:
1. Unlinkability
2. Sender anonymity
3. Receiver anonymity
4. Relationship anonymity
Three foundational concepts exist to fulfill one or more of these goals:
1. Group communication.
2. Proxy servers / MIX nets.
3. DC-nets.
## 3 Requirements and State of the Art
> This chapter analyzes the related work in detail: systems that combine Pub/Sub and anonymous communication.
### 3.1 Publish/Subscribe System Formalization
> This section introduces notation and model for distributed Pub/Sub systems.
This model serves as common ground for the discussion of anonymity attacker models, related work, as well as contributions throughout this thesis.
> Section 2.1.1 has introduced the concept of Pub/Sub message dissemination. This section refines this concept by applying a formal notation that also incorporates the concept of anonymization services (cf. Section 2.2) as well as P2P systems (cf. Section 2.4).
Attributes: A core property of Pub/Sub is the loose coupling between participants \[51\].
* Loose coupling is achieved by routing messages according to their content and the interests of participants rather than participant IDs.
* Section 2.1.1 has introduced common comparison methods of Pub/Sub, i.e., methods to match message contents with interests.
Attributes have emerged as an expressive concept in the domain of IT security. Methods of attribute-based access control (ABAC), e.g., eXtensible Access Control Markup Language (XACML) \[63\], for achieving authorization and attribute-based encryption (ABE) \[128\] for achieving confidentiality in combination with authorization serve as an example.
To model participant interests, this thesis introduces an attribute set A. Each attribute a ∈ A reflects a participant’s interest. Every participant may be interested in multiple attributes; likewise, messages may be labeled with multiple attributes.
Graphs & Overlays: A Pub/Sub system can be modelled as a graph, with computer systems represented by nodes, and with communication links represented by edges. Peers are interconnected via an underlay, e.g., the Internet. To abstract from the underlay, this model assumes that all peers establish a basic overlay via an underlying P2P approach.
> The basic overlay is represented as the graph $G = (V, E)$ that incorporates all peers $V$ and interconnects them via overlay edges $E$. This model assumes that the graph $G$ is connected and unidirectional, i.e., $\forall v_x, v_y \in V$ there is an edge progression from $v_x$ to $v_y$. As a result, this model can also incorporate underlays without E2E connectivity, such as wireless sensor networks. Each peer $v$ has a set of neighbors defined by adjacent edges. The function $N(v)$ (cf. Equation (11)) returns these neighbors.
Roles: The set of nodes $V$ can be partitioned into publishers $P$, subscribers $S$, and brokers that are named forwarders $F$ to follow the P2P generalization. Hence, $V = P ∪ S ∪ F$. A node may take up to all of these *roles* in a P2P system. In particular, a node can take the publisher and subscriber role at the same time. Thus $|P ∩ S ∩ F| \geqslant 0$ holds.
The roles can be further split according to the attributes:
* The set $P_a$ denotes the publishers for attribute $a ∈ A$. Likewise, $S_a$ and $F_a$ denote the subscribers and forwarders for $a$.
* The unified set of all nodes involved in handling messages related to a is denoted by $V_a$.
* Following the attribute-specific node notation, the neighborhood function $N(v)$ can be further differentiated in inbound neighbors $N^+_a(v) ⊆ N(v)$ (see Equation (12)) that will forward messages regarding a to the local node v and outbound neighbors $N^−_a (v)$ (see Equation (13)) that node $v$ sends messages regarding attribute a to.
![](https://i.imgur.com/uXvlCfV.png)
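A toy encoding of this notation; the class and method names (`Overlay`, `route`) are this sketch's own, and only the neighbor sets $N(v)$, $N_a^+(v)$, and $N_a^-(v)$ come from the model.

```python
class Overlay:
    """Sketch of the formal model: an undirected basic overlay G
    with per-attribute inbound/outbound neighbor sets."""

    def __init__(self):
        self.edges = {}     # v -> N(v), the set of overlay neighbors
        self.inbound = {}   # (v, a) -> N_a^+(v): neighbors sending a-messages to v
        self.outbound = {}  # (v, a) -> N_a^-(v): neighbors v sends a-messages to

    def add_edge(self, u, v):
        self.edges.setdefault(u, set()).add(v)
        self.edges.setdefault(v, set()).add(u)

    def N(self, v):
        return self.edges.get(v, set())

    def route(self, v, a, sender):
        """Accept a message for attribute a from an inbound neighbor
        and return the outbound neighbors to forward it to."""
        assert sender in self.inbound.get((v, a), set())
        return self.outbound.get((v, a), set())
```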
Messages: Messages in this model are denoted by $m$. The message $m$ may relate to one or more attributes $A' ⊆ A$ and is written as $m^{A'}$.
> For simplicity, this thesis uses one attribute $a ∈ A$, and the message becomes $m^a$.
* A Pub/Sub system typically distinguishes three types of messages:
1. Notification: A notification, sometimes also called publication, with attribute a contained, is written as $m^a_{notif}$.
2. Subscription: A subscription for attribute $a$ is written as $m^a_{sub}$.
* Subscriptions serve the purpose of setting up routing tables such that all publishers and forwarders on the paths from publishers to subscribers know, in which directions they have to forward notifications.
3. Advertisement: If Pub/Sub with advertisements is used, the advertisement of attribute a is written as $m^a_{adv}$.
* Advertisements aid the subscription process by informing forwarders and subscribers from which directions to expect notifications.
In combination with the roles:
* A subscriber $s ∈ S_a ⊆ V_a$ subscribes to the attribute $a ∈ A$ via subscription messages $m^a_{sub}$ and receives notification messages $m^a_{notif}$ originated by the publishers in set $P_a$ for attribute $a$.
* Every publisher $p ∈ P_a ⊆ V_a$ may initially advertise the availability of the attribute a via advertisement messages $m^a_{adv}$ that are disseminated in $G$, e.g., by flooding the neighbors of $p$.
* All messages between a subscriber and a publisher may be carried out over multiple hops in $G$.
Attribute Overlay: The multi-hop dissemination of messages can be considered as another overlay on top of the basic overlay G.
Without loops and duplicate paths, such an overlay per attribute forms a mesh $M_a = (P_a \cup S_a \cup F_a, E_a)$ that includes all publishers $P_a$, all subscribers $S_a$, and, depending on the specific P2P-based Pub/Sub approach, a certain number of so-called *forwarders* $F_a$ that are not interested in attribute $a$ at all but perform brokering between publishers and subscribers.
![](https://i.imgur.com/K7pOIbp.png)
* Equation (14) denotes the set of edges that connects the nodes in $M_a$. A Pub/Sub system may remove edges and introduce new edges. Nodes in $M_a$ can be summarized as $V_a := P_a \cup S_a \cup F_a$.
* The function $path_a(p, s)$ (cf. Equation (15)) describes the overlay path in $M_a$ between nodes $p$ and $s$.
* The function returns this path as a node progression.
* With the assumption of no loops and no duplicate paths in $M_a$, there exists at most one path between two nodes.
* To ensure the requirements of subscription integrity and service integrity (cf. Section 2.1.2), there exists a path between every publisher and every subscriber as defined by Constraint (16).
Time & Churn: A Pub/Sub system can also be subject to churn over time and is thus inherently dynamic. The time $t \in T$, $T = \{1, 2, \dots\}$, indicates snapshots, e.g., $G^t$ and $S^t_a$. Every system snapshot is represented by an increment of $t$.
* Node churn $\rho ∈ [0, 1]$ denotes the ratio of nodes of the total node population $V$ that are subject to churn during a time interval $[t_i, t_{i+1}]$.
* Three variations of churn are considered:
1. With $\rho_{join}$, $\rho_{join} \times |V|$ nodes join the basic overlay $G$ during the time interval $[t_i, t_{i+1}]$.
2. With $\rho_{leave}$, $\rho_{leave} \times |V|$ nodes of the basic overlay $G$ leave during the time interval $[t_i, t_{i+1}]$.
3. With $\rho$, this ratio is distributed among $\rho_{join}$ and $\rho_{leave}$. That means $\rho_{join} = \frac{\rho}{2}$, and $\rho_{join} \times |V|$ nodes join during the time interval; $\rho_{leave} = \frac{\rho}{2}$, and $\rho_{leave} \times |V|$ of the nodes that are already part of $G$ at time $t_i$ leave within the interval.
![](https://i.imgur.com/vYgm8PG.png)
> The churn rate $\rho$ remains constant over time intervals.
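The symmetric case (variation 3) can be sketched as follows, assuming integer node IDs; the function name `apply_churn` and the ID scheme are this sketch's own.

```python
import random

def apply_churn(nodes, rho, rng=random):
    """One churn interval in the symmetric case rho_join = rho_leave
    = rho/2: add rho/2 * |V| fresh nodes and remove rho/2 * |V|
    existing ones. Returns the new node set."""
    n = len(nodes)
    k = int((rho / 2) * n)
    leavers = set(rng.sample(sorted(nodes), k))
    next_id = max(nodes, default=-1) + 1
    joiners = set(range(next_id, next_id + k))  # fresh node IDs
    return (nodes - leavers) | joiners
```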
* Churn occurs in two characteristics:
1. First, nodes join and leave $G$, which also affects the attribute overlays $M_a$.
2. Second, subscribers and publishers join and leave attribute overlays $M_a$ but not $G$.
****
* Nodes send a subscription message ($m_{sub}$) to join an attribute overlay. Advertisement messages $m_{adv}$ from publishers indicate the availability of overlays.
* To leave an attribute overlay, they have to send an unsubscription ($m_{unsub}$) or unadvertisement ($m_{unadv}$), depending on their role in the attribute overlay.
* When nodes fail, their neighbors act as if they received an unsubscription / unadvertise message from that node.
****
Summary: Table 1 summarizes the introduced notation. This notation describes two network abstraction layers of a Pub/Sub system, the basic overlay and the attribute overlay. The notation does not explain how $M_a$ is constructed out of $G$. This overlay creation is part of the respective Pub/Sub systems.
### 3.2 Anonymity Requirements for Pub/Sub
> Security requirements for Pub/Sub have been introduced in Section 2.1.2
This section elaborates on requirements particular to anonymity and puts them into the perspective of the formal model established in the previous section.
Xiao \[168\] enumerates publishing anonymity, sending anonymity, and receiving anonymity as anonymity goals:
* Publishing anonymity requires that the information can be created without the creator being discovered.
* Sender and receiver anonymity have been discussed in Section 2.2.1 and require that a message is not linkable to a sender or receiver.
Further requirements exist \[113\]:
* The combination of sender and receiver anonymity leads to relationship anonymity, i.e., the requirements of sender and receiver not being linked together.
> Unlinkability requires no two messages to be linked together.
* The less differentiated term privacy is also being used \[118, 162\] in the context of Pub/Sub.
> This requirement originates from the insight that anonymity should be considered in combination with confidentiality \[129\].
A violation of either requirement can easily lead to the violation of the other one as well:
* Anonymity Without Confidentiality: means that data or messages are available in plaintext, but sender and receiver are anonymized. Prior research has shown that profiling and correlation are then possible, resulting in a loss of anonymity.
* Confidentiality Without Anonymity: means that data or messages are linked to sender and receiver, but the plaintext of the data or message is not available.
This thesis transforms the terms sender and receiver anonymity into publisher anonymity and subscriber anonymity to better fit Pub/Sub. That means, given a Pub/Sub system including all exchanged messages:
![](https://i.imgur.com/twNRiOB.png)
The second requirement follows from the observation that confidentiality and anonymity must be considered in combination \[129\]. To measure anonymity—the inverse success of an anonymity attacker—this thesis uses the two metrics anonymity set size \[112\] and degree of anonymity/entropy \[43\] (cf. Section 2.2.3).
With these metrics, anonymity can be formalized as follows:
* Subscriber Anonymity: Every subscriber can hide in a reasonably large anonymity set $anonSet_{adv}$ from the perspective of an anonymity attacker.
* The attacker attempts to establish the minimal anonymity set containing only subscribers, and thus disclosing these subscribers, for a given attribute $a$ from a set of attributes $A$.
* Thus, a Pub/Sub system meets subscriber anonymity w.r.t. an adversary $\mathfrak{A}$ and a set size $k$ if Equation (17) holds,
> i.e., the attacker cannot reduce any anonymity set below the size of at least $k$ participants.
* Using the anonymity degree as metric, the system provides anonymity if Equation (18) holds, i.e., the attacker $\mathfrak{A}$ cannot learn more than threshold $T$ about the system.
![](https://i.imgur.com/7d2B8up.png)
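Both metrics can be sketched directly: `meets_k_anonymity` mirrors the Equation (17)-style set-size check, and `degree_of_anonymity` the entropy-based degree of \[43\]. The function names are this sketch's own.

```python
import math

def meets_k_anonymity(anon_set, k):
    """Set-size check: the attacker must not be able to reduce
    the anonymity set below k participants."""
    return len(anon_set) >= k

def degree_of_anonymity(probabilities):
    """Degree of anonymity d = H(X) / H_max, where H(X) is the
    entropy of the attacker's probability assignment over the
    anonymity set and H_max = log2(N)."""
    n = len(probabilities)
    if n <= 1:
        return 0.0  # a singleton set offers no anonymity
    h = -sum(p * math.log2(p) for p in probabilities if p > 0)
    return h / math.log2(n)
```

A uniform assignment yields the maximal degree of 1; the more the attacker can skew probabilities toward a single participant, the closer the degree drops toward 0.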
* Publisher Anonymity: Similar to subscriber anonymity, the Equations (17), (18) can be applied to publisher anonymity as well.
### 3.3 Attacker Models for Anonymity
> Compared to capabilities, attack approaches have been less extensively studied.
Attacks, e.g., the ones studied by the anonymity metrics discussed in Section 2.2.3 (page 18), are evaluated via theoretical worst-case capabilities.
#### 3.3.1 Properties & Capabilities
> Edman and Yener \[48\] summarize the attacker properties used to evaluate anonymization services as capability, visibility, mobility, and participation. These properties can be adapted as follows to the domain of anonymous Pub/Sub:
Capability: defines what actions the attacker can take.
* A passive attacker eavesdrops and monitors;
* An active attacker extends the passive attacker with the Dolev-Yao model \[46\].
Visibility: Visibility, or scope, defines to what extent the attacker can eavesdrop on interactions with a Pub/Sub system. This scope can be defined as a subset of nodes $V' ⊆ V$ and edges $E' ⊆ E$.
* Two types of visibility are commonly used as attacker model:
1. The global attacker can observe or interact with all edges $E$.
2. The local attacker is restricted to one node $v ∈ V$ and the set of adjacent edges $E'$ as defined by Constraint (19).
![](https://i.imgur.com/NYlAsk3.png)
Mobility: The mobility defines if the visibility of the attacker is invariant over time.
* The static attacker observes and interacts with the same part of the system over time
* i.e., $∀t ∈ T : V^{'t} = V^{'0} ∧ E^{'t} = E^{'0}$.
* The adaptive attacker always chooses the visibility that suits it best. For instance, P2P systems that allow IDs to be changed enable such an adaptive attacker.
Participation: The participation distinguishes if the attacker is internal, i.e., part of the system, or external.
* An internal attacker takes the visibility of nodes $V' ⊆ V$ and thus has access to secrets and key material of these nodes.
* The external attacker typically takes the visibility of edges $E' ⊆ E$ and thus lacks access to secrets.
Literature in the domain of anonymous Pub/Sub often refers to the honest but curious attacker \[27, 103, 118\]. This attacker has the properties passive, local/global, static, and internal. This attacker often takes the role of the only forwarder (broker) in the system, and thus, the visibility covers all edges $E' = E$.
> Therefore, despite having a visibility of only one node, this attacker can be considered global as well.
#### 3.3.2 Selected Attackers
This thesis uses two attacker models (property combinations):
1. the Global Observer $G\mathfrak{A}$
2. the Malicious Insider $I\mathfrak{A}$
Global Observer $G\mathfrak{A}$: The goal of this passive, global, static, and external attacker is to break publisher and subscriber anonymity by eavesdropping on all edges and thus observing all messages.
* This attacker remains passive as it does not manipulate messages.
* It is external as it does not possess (a priori) secrets or key material.
* The attacker is furthermore global as it can observe the full communication network ($E$).
* From the global property follows that this attacker is also static, and there is no need for adaptivity.
* According to the system model (cf. Section 3.1):
* The attacker observes notifications $m_{notif}$.
* Knows an attribute $a$ without the associated secrets.
* Possesses topology information of the basic membership management $G := (V, E)$, and attempts to learn $S_a$ or $P_a$.
> That is, break subscriber or publisher anonymity.
Malicious Insider $I\mathfrak{A}$: The goal of this active, internal, and local attacker is to break publisher and subscriber anonymity by participating in the anonymous Pub/Sub system and exploiting the protocol.
* Depending on the protocol, this attacker is static or adaptive.
* Furthermore, this attacker can be colluding with other malicious insiders as well as with the global observer.
* This attacker is active and therefore:
* Modifies and suppresses messages at will.
* Decrypts and forges valid messages using key material for an attribute $a$.
* Colludes with several attacking nodes.
* In combination with the global observer, this attacker can also access the topology information $G$ as well as observe messages.
> An attack using this model has been published in \[37\].
* To formalize collusion of malicious insiders, the system formalization in Table 1 is extended by the set of colluding attackers $C = V' ⊆ V$.
### 3.4 Attacks on Anonymous Pub/Sub
A multitude of attack approaches is possible to break anonymity in Pub/Sub:
Statistical Disclosure: Statistical disclosure \[33\] leverages information gained over prolonged periods of observation. This type of attack assumes that, over time, true receivers handle more messages than pure forwarders.
* Applied to Pub/Sub, the anonymity attacker assumes that subscribers receive more notifications than pure forwarders.
* Alternatively, the attacker may assume that subscribers remain as members of an attribute overlay for more time intervals than pure forwarders.
> This may happen due to churn, i.e., subscribers reconnect to the attribute overlay in case of leaving neighbors, but pure forwarders may not.
* To mount this type of attack, the global observer is required.
* An attacker with the capabilities of the global observer model traces the flows of notifications.
* Given the trace of a notification flow, the attacker knows which nodes are part of this flow and are therefore members of an attribute overlay.
* By tracing multiple notifications of the same attribute overlay, the attacker gains the necessary statistical information.
* Malicious nodes can be used to enhance this attack:
* Malicious insiders attempt to become pure forwarders and then interrupt connections to neighbors.
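As a hedged illustration, the counting at the heart of statistical disclosure can be sketched as follows (the node names, the threshold, and the `statistical_disclosure` helper are invented for this example, not taken from the thesis):

```python
from collections import Counter

def statistical_disclosure(flows, threshold):
    """Count how often each node appears in observed notification flows.

    flows: list of sets of node ids, each the trace of one notification
    as seen by a global observer. Nodes that appear in nearly every flow
    are suspected subscribers; occasional appearances suggest pure
    forwarders. All names here are illustrative.
    """
    counts = Counter()
    for flow in flows:
        counts.update(flow)
    total = len(flows)
    # Suspect nodes seen in at least `threshold` fraction of all flows.
    return {v for v, c in counts.items() if c / total >= threshold}

# Four traced flows of the same attribute overlay; S1 is in every one.
flows = [{"A", "B", "S1"}, {"A", "C", "S1"}, {"B", "S1"}, {"A", "S1"}]
print(statistical_disclosure(flows, 0.9))  # {'S1'}
```

With a lower threshold, frequent forwarders such as `A` would also enter the suspect set, which is why the attack needs many traces to separate the two groups.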
Topology Analysis: Uses the topology of attribute overlays to reason about the role of nodes.
* This type of attack assumes that the topology of an attribute overlay follows a deterministic construction.
* As before, the anonymity attacker traces the flow of a notification and identifies the involved nodes as well as their connections.
* Given this topology and knowledge about the overlay construction method, the attacker may infer that a publisher is likely to be located in the center of such a topology.
* Likewise, subscribers may be located in leaf positions.
* To mount this type of attack, the global observer is required as well.
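A minimal sketch of such a topology analysis, assuming a tree-like overlay where the publisher sits at the graph center and subscribers at the leaves (the node names and the eccentricity heuristic are illustrative, not the thesis's method):

```python
from collections import deque

def bfs_ecc(adj, src):
    """Eccentricity of src: maximum shortest-path hop distance to any node."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return max(dist.values())

def infer_roles(adj):
    """Suspect graph-center nodes as publishers and degree-1 leaves as
    subscribers, assuming a deterministic overlay construction."""
    ecc = {v: bfs_ecc(adj, v) for v in adj}
    min_ecc = min(ecc.values())
    center = {v for v, e in ecc.items() if e == min_ecc}
    leaves = {v for v, n in adj.items() if len(n) == 1}
    return center, leaves

# Star-like attribute overlay: P in the middle, S1..S3 as leaves.
adj = {"P": ["S1", "S2", "S3"], "S1": ["P"], "S2": ["P"], "S3": ["P"]}
center, leaves = infer_roles(adj)
```

On this toy overlay, `P` is the only center node and `S1..S3` are the leaves, matching the inference described above.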
Grayholing: Grayhole or eclipse attacks concentrate traffic in malicious nodes.
* This type of attack assumes that the anonymity attacker can increase its neighborhood or steal traffic by some other means,
* e.g., by gaining popularity.
* The attacker may then reason that its neighbors in an attribute overlay are likely to be publishers and subscribers.
* As the attacker steals more traffic, fewer pure forwarders can be part of the attribute overlay.
* The attacker may also use the grayhole attack as an aid for another attack, such as the statistical disclosure or the topology analysis.
* To mount this type of attack, the capabilities of a malicious insider are required.
Timing Attacks: Timing attacks exploit the reaction time of nodes.
* This type of attack assumes that the reaction times of pure forwarders differ from those of publishers and subscribers.
* The attacker either acts itself, or measures the reaction time of nodes given observable actions.
* A node that reacts immediately to an advertisement may be considered a subscriber, whereas a node that reacts with a delay or not at all may be considered a pure forwarder.
* A subscriber may also forward notifications slower than a pure forwarder, as the subscriber also processes the content of notifications, whereas the pure forwarder does not.
* To mount this type of attack, the capabilities of a global observer are required.
> A malicious insider may also analyze its neighbors or aid the global observer.
#### 3.4.1 A Timing Attack in Anonymous Communication
> This section presents a timing attack published in \[38\] to break subscriber anonymity.
This attack exploits the principle that many anonymous communication systems are based on request/response semantics in their protocols.
These request/response semantics can be used by the malicious insider $I\mathfrak{A}$ in collusion with the global observer $G\mathfrak{A}$ to de-anonymize responders—subscribers in the context of Pub/Sub.
* The $G\mathfrak{A}$ provides topology information about the anonymous communication system, the graph $G$.
* The $I\mathfrak{A}$ provides secrets and key material about $a$, and can send messages.
> With both attackers colluding, the $G\mathfrak{A}$ can furthermore trace message flows related to $a$ and reduce $G$ to the attribute mesh $M_a$.
This attack is based on the assumption that the attacker can send request messages regarding an attribute $a$. A target node, which the attacker desires to de-anonymize, responds, and the attacker reasons about this node’s position in the basic overlay $G$. That is, the attack aims to identify nodes related to $a$.
> For this attack to work, it is also important that the analyzed system does not impose artificial message delay, i.e., that the dominating factors in the message transmission delay are node and edge capability and load. Also, the system must provide some option to estimate or measure such delays.
Attack Approach: To perform the attack:
* The $I\mathfrak{A}$ sends a request into $G$, awaits a response, and measures the round-trip time (RTT).
* Given that the attacker can estimate average delays between nodes, it can establish a candidate set (anonymity set) of potential responders within the attribute mesh $M_a$ provided by the $G\mathfrak{A}$.
* The major challenges for a realistic implementation of this approach are:
1. How can the attacker derive an anonymity set/entropy from the RTT?
2. How can the attacker model and refine the anonymity set/entropy over multiple iterations?
* Given the stated assumptions hold, after sending a request from the $I\mathfrak{A}$, the attacker will receive zero or more responses.
* For the remainder of this explanation, one response is assumed.
* Between request and response, the attacker observes the RTT denoted as $δ$.
* Given that the transmission delay is equal for every edge in $M_a$, the attacker can estimate the distance $d$ in edges (hops).
* Given the number of edges, the attacker can establish the anonymity set of nodes at the appropriate distance,
> i.e., nodes with $v ∈ V_a : |path(I\mathfrak{A}, v)| = d$.
![](https://i.imgur.com/euWXQGz.png)
![](https://i.imgur.com/Y8j2XZH.png)
> As an alternative to this example, subscriptions and notifications can be used as well. Assuming a high notification rate, a subscription can be considered as the request, which is answered by the next notification.
* Following up on the example, to obtain $d$ from Equation (20), the attacker needs an estimate for the delay components $δ$ on the right-hand side of the equation. These delay components can be approximated via an average delay $δ_{avg}$ that fulfills Equation (21).
![](https://i.imgur.com/XysJ3iY.png)
* Given that the attacker obtains such an average delay, it can approximate the distance $d$ as in Equation (22). The division by 2 is necessary as $δ(m_{adv}, m_{sub})$ describes the RTT.
* However, the attacker aims at obtaining the one-way distance to the responder, and not the round-trip distance.
![](https://i.imgur.com/7Bp45Bv.png)
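The distance estimate of Equation (22) and the resulting candidate set can be sketched as follows; the toy graph, the `anonymity_set` helper, and the uniform per-edge delay are assumptions made for illustration:

```python
from collections import deque

def hop_distances(adj, src):
    """Shortest-path hop count from src to every reachable node (BFS)."""
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def anonymity_set(adj, attacker, rtt, delta_avg):
    """Estimate d = RTT / (2 * delta_avg), cf. Equation (22), and return
    the nodes at that hop distance from the attacker in the mesh."""
    d = round(rtt / (2 * delta_avg))
    dist = hop_distances(adj, attacker)
    return {v for v, dv in dist.items() if dv == d}

# Toy attribute mesh around the malicious insider "IA".
adj = {"IA": ["A"], "A": ["IA", "B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C"]}
# RTT of 40 ms with an average per-edge delay of ~10 ms yields d = 2 hops.
print(sorted(anonymity_set(adj, "IA", 40.0, 10.0)))  # ['B', 'C']
```

Node `D` at distance 3 stays outside this first candidate set, which is exactly the case the refinement below has to handle.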
* With $d$ given, the attacker can rule out nodes closer than $d$, as these nodes would have responded earlier.
* That is, all nodes $v ∈ V_a : |path(I\mathfrak{A}, v)| < d$ are not in the anonymity set.
* However, the attacker cannot rule out nodes further away.
* i.e., $v ∈ V_a : |path(I\mathfrak{A}, v)| > d$.
* The responder at distance $d$ may suppress responses from nodes further away than $d$.
* Thus, the attacker may not observe responses from those nodes, even though those nodes relate to $a$ and, therefore, should be de-anonymized.
* This behaviour is a challenge for the attacker when modeling the attack results via an anonymity set:
* Nodes at distance $d$ are likely to be responders; nodes closer than $d$ cannot be responders.
* Hence, the attacker adjusts its initial probability distribution $v$.
* Equation (24) shows how probabilities are set after an attack iteration
> i.e., after the attacker has established $d$. Here, a probability of 0 means that a node is not in the anonymity set.
![](https://i.imgur.com/IlTdOhv.png)
![](https://i.imgur.com/hXsHPBQ.png)
![](https://i.imgur.com/PMgVwMY.png)
![](https://i.imgur.com/2i2FV0p.png)
* The equation shows how the attacker can eliminate $j$ out of $|V_a|$ nodes that are closer than $d$ hops on the shortest path in the basic overlay $G$.
* Hence, all remaining nodes ($k$ at distance $d$, $l$ at a distance greater than $d$) get a higher probability.
* The parameters $λ$ and $φ$ model how the probabilities are distributed among likely nodes at distance $d$ and possible nodes further away than $d$.
* The parameter $λ$ should be set to $λ \geqslant 1.0$ and the parameter $φ \leqslant 1.0$ derived accordingly (cf. Equation (26)).
* Interpreting Equation (24), every node $v_x$ with $v[v_x] > 0$ is in the anonymity set.
> Using this approach, the attacker can derive anonymity sets from the RTT.
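One plausible reading of the Equation (24) update (the exact form is given in the thesis; `lam` and `phi` stand in for $λ$ and $φ$, and the choice `phi=0.5` is arbitrary) might look like this:

```python
def adjust_probabilities(prior, dist, d, lam=2.0, phi=0.5):
    """Illustrative sketch of the Eq. (24)-style update: zero out nodes
    closer than d, weight nodes at distance d by lam >= 1, weight nodes
    beyond d by phi <= 1, then renormalize so the mass sums to 1
    (cf. the constraint in Eq. (26)).

    prior: {node: probability}; dist: {node: hop distance from the IA}.
    """
    post = {}
    for v, p in prior.items():
        if dist[v] < d:
            post[v] = 0.0          # impossible: would have responded earlier
        elif dist[v] == d:
            post[v] = lam * p      # likely responders
        else:
            post[v] = phi * p      # possible responders further away
    total = sum(post.values())
    return {v: p / total for v, p in post.items()}

prior = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
post = adjust_probabilities(prior, {"a": 1, "b": 2, "c": 2, "d": 3}, 2)
```

Every node with a positive posterior probability remains in the anonymity set; node `a` at distance 1 drops out entirely.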
* Following up on the second question, the attacker can refine the probability distribution over multiple iterations via four steps:
1. A node $v_x$ whose probability has been set to 0 remains at 0.
2. A node $v_x$ with a positive probability that is closer than $d$ is set to 0.
3. The probabilities of nodes at or further away than $d$ are added.
4. The probability distribution is normalized.
* The attacker sets up an initial probability distribution.
* Equation (25) provides such an initial assignment.
* Here, the probabilities are assigned based on an estimate of the total number of responders.
* For Pub/Sub, this is the set of subscribers for attribute $a$, given as $S_a$.
* The set $C_a$ models the malicious insiders $I\mathfrak{A}$.
* Constraint (26) ensures that the distribution can be normalized in step 4.
![](https://i.imgur.com/W9ugICg.png)
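The four refinement steps could be sketched as follows, assuming each iteration yields a fresh distance estimate `d` and hop distances `dist`; the function name and the simplification of step 3 to keeping the remaining mass are illustrative, not the thesis's exact procedure:

```python
def refine(belief, dist, d):
    """One refinement iteration over the probability distribution:
    1. a node already at probability 0 stays at 0,
    2. nodes closer than d are set to 0,
    3. the mass of nodes at or beyond d is carried over,
    4. the distribution is renormalized.
    """
    post = {}
    for v, p in belief.items():
        if p == 0.0 or dist[v] < d:
            post[v] = 0.0
        else:
            post[v] = p
    total = sum(post.values())
    return {v: p / total for v, p in post.items()} if total else post

belief = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}
# Two attack iterations with different hop-distance observations.
belief = refine(belief, {"a": 1, "b": 2, "c": 2, "d": 3}, 2)
belief = refine(belief, {"a": 2, "b": 1, "c": 2, "d": 2}, 2)
```

After two iterations the anonymity set has shrunk from four nodes to two, illustrating how repeated requests narrow down the responder.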
Estimating the Average Delay: The malicious insider depends upon the average delay $δ_{avg}$ to calculate the distance $d$.
> Approaches for estimating this delay exist \[66\], but they depend upon additional protocols such as DNS that do not incorporate the application-layer delays of anonymization services.
* Neighbor RTT: The $I\mathfrak{A}$ can use its neighbors $N(I\mathfrak{A})$ to extrapolate the average delay.
* Assuming that the attacker can send messages to the neighbors that will trigger an immediate response, the attacker can measure the RTT and estimate $δ_{avg}$ ≈ RTT ÷ 2.
* Some basic overlay protocols provide the necessary means for such immediate responses.
* For instance, the last packet of a TCP window and the following acknowledgement; neighborhood requests as part of gossiping; heartbeats in conjunction with responses.
* Having multiple neighbors, the attacker can establish the average delay over all neighbors as given in Equation (27).
> While the requirements for the anonymization service for this strategy are low, it is biased by the connectivity of the $I\mathfrak{A}$.
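The neighbor-RTT estimate of Equation (27) amounts to halving each neighbor's RTT and averaging; a small sketch (the function name is invented for illustration):

```python
def avg_delay_from_neighbors(neighbor_rtts):
    """Estimate the average one-way edge delay from neighbor RTTs
    (cf. Eq. (27)): each RTT covers one hop out and one hop back,
    so halve it, then average over all neighbors of the insider."""
    return sum(rtt / 2 for rtt in neighbor_rtts) / len(neighbor_rtts)

# RTTs in milliseconds to three direct neighbors of the IA.
print(avg_delay_from_neighbors([20.0, 24.0, 28.0]))  # 12.0
```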
* Hop Count: A strategy that reduces the bias of the $I\mathfrak{A}$ uses a hop counter of messages in conjunction with loops in $G$.
* To accomplish this, the attacker sends a message with a hop counter, waits until it receives the message again (loop), and divides the RTT by the hop count difference.
* Typical instances of (inverse) hop counters are TTL counters.
* Such counters are used by flooding algorithms, and random walks.
> For instance, search requests in Freenet use a TTL.
* Alternatively, the global observer might trace the path of a message and can thus observe the length of the path without a hop counter.
> This strategy reduces the bias of the $I\mathfrak{A}$. However, it also imposes additional requirements on the anonymization service. Also, this strategy is less stealthy, as the attacker has to send messages instead of deriving information from already existing ones. Hence, the attacker may expose itself.
* Collusion: An extension of the hop count strategy. Rather than relying on loops and RTT, multiple colluding $I\mathfrak{A}$s can measure the delay on a path between them as given in Equation (28).
* This strategy does not rely on loops and can be refined by adding more colluding attackers.
* Furthermore, assuming that both attackers are on a path between a requester and responder, the attackers may derive the transmission delay from existing messages rather than generating new ones.
![](https://i.imgur.com/tjupYiL.png)
> Thus, this strategy is potentially stealthier than the hop count strategy.
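Hedged sketches of the hop count and collusion strategies (the function names, and the use of a TTL difference as the hop counter, are illustrative assumptions):

```python
def avg_delay_from_loop(rtt, ttl_sent, ttl_received):
    """Hop count strategy: send a message carrying a TTL, wait until it
    returns via a loop in G, and divide the measured RTT by the number
    of hops travelled (the TTL difference)."""
    return rtt / (ttl_sent - ttl_received)

def avg_delay_from_collusion(send_time, recv_time, path_hops):
    """Collusion strategy (cf. Eq. (28)): one insider timestamps a
    message and a colluding insider records its arrival; the one-way
    delay over the known path length yields the per-hop average."""
    return (recv_time - send_time) / path_hops

# A message looped back after 6 hops in 60 ms, and a colluding pair
# 3 hops apart observing a 30 ms one-way delay, both estimate ~10 ms.
print(avg_delay_from_loop(60.0, 16, 10))       # 10.0
print(avg_delay_from_collusion(0.0, 30.0, 3))  # 10.0
```

The collusion variant needs no loop and can reuse messages already in flight, which is why the text calls it potentially stealthier.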
Summary: This section presented a novel anonymity attack that makes use of a malicious insider and a global observer.
* The attack aims to create anonymity sets for responders given an attribute $a$.
* The attack furthermore refines and reduces the anonymity set over multiple iterations.
* Also, malicious insiders can collude to provide more iterations.
* The attack can be applied to anonymization services that build upon a request/response semantic, e.g., search in P2P file sharing, and advertisements with subscriptions in Pub/Sub.
* The attack also requires some delay information about the basic overlay besides the request/response semantic.
* Three methods to obtain delay estimation via the malicious insider were proposed as well.
### 3.5 Structuring System Architectures
This section introduces an architecture that splits anonymous Pub/Sub into so-called building blocks.
1. First, the formation of building blocks is explained along the roles participants can take.
2. Second, the formation of building blocks along the lifecycle phases of a system are discussed.
Anonymity is often considered \[102, 118, 162\] an add-on requirement that can be fulfilled by an anonymization service (cf. Section 2.2.2, page 16) such as Tor \[44\]. Applications running on top of such a service, e.g., Pub/Sub, may introduce additional metadata that circumvents the anonymity protection, as shown in Figure 11.
![](https://i.imgur.com/49x5CYl.png)
* For instance, attacks on web browsing as an application in conjunction with Tor have been proposed (cf. Section 2.3.6.2, page 23). These attacks use metadata leaked by the web protocol hypertext transfer protocol (HTTP) to de-anonymize users \[14, 69\] and break even confidentiality,
> e.g., by revealing visited web sites \[1, 68\].
A lesson learned from these attacks is that anonymity should be considered throughout the whole technology stack of an application. While anonymity has to be considered for every building block, designing and analyzing building blocks with one specific functional goal each is easier and less error-prone than with one monolithic system.
Design Considerations: A particular challenge for anonymous Pub/Sub arises from conflicting goals between anonymization services and Pub/Sub: the former attempts to distribute message routing among many nodes to protect anonymity, while the latter attempts to centralize routing functionality to increase efficiency.
* Within the traditional Pub/Sub paradigm, a broker acts as a central entity \[51\] and controls communication; participants merely state interests and provide content.
* In contrast, anonymization services such as Tor and Crowds empower the participants to organize communication and thus take control of the infrastructure.
* Anonymous Pub/Sub, therefore, must balance both extremes, preferably in an adjustable manner.
* Every participant must be able to manage her anonymity, i.e., control privacy enhancing technologies (PETs) and measure the current level of protection, whereas the infrastructure may only take some control to maintain requirements such as integrity and availability.
![](https://i.imgur.com/cWn9eSz.png)
This thesis follows the Privacy by Design principle 3: “Privacy Embedded into Design”. Therefore, anonymity as an aspect of privacy has to be considered for every component of the architecture.
* This approach minimizes issues that arise from add-on anonymity solutions like Tor \[44\] due to the missing consideration of application layer leak of information.
Furthermore, this thesis follows the design principles of “high cohesion and loose coupling” \[18\] and applies it to the components.
* High cohesion, on the one hand, requires that elements of one building block highly depend on and interact with each other.
* Loose coupling, on the other hand, reduces interdependencies between building blocks.
> These principles support exchangeable building blocks—depending on the selection and prioritization of requirements. This structure should, therefore, ease the selection of building block technologies for applications and their respective requirements.
![](https://i.imgur.com/bcI7vwi.png)
Overview of Building Blocks: The overview and dependencies of all six building blocks are depicted in Figure 12.
* Basic membership: establishes and maintains the connection of each participant with the system.
* Attribute localization: allows subscribers to find attributes offered by publishers.
* Used by both roles.
* Matching: covers the functional Pub/Sub requirement (cf. Section 2.1.2) to maintain routing tables to route messages.
* This block applies to the forwarder role.
* Community management: repairs and optimizes overlay networks whenever topological changes occur.
* This block applies to forwarders, too.
* Content distribution: distributes notifications to subscribers.
* This block applies to the forwarder role.
* Key management: handles all secrets required for confidentiality and authenticity.
    * Applies to the publisher and subscriber roles.
![](https://i.imgur.com/gjhRRYD.png)
>The diagram presents the building blocks of one node participating in the Pub/Sub system (see Figure 13).
Role Considerations: The three roles publisher, subscriber, and forwarder (also called broker) allow for a split of Pub/Sub nodes according to these roles.
> The concept of roles in Pub/Sub has been introduced in Section 3.1 (page 35).
![](https://i.imgur.com/YmBtits.png)
* Figure 13 depicts an example that splits each participant into the three role-based building blocks publisher P, subscriber S, and forwarder F. The leftmost participant acts as a mere publisher and thus does not require the subscriber building block.
    * The bottom-left participant acts only as a forwarder.
    * The bottom-center participant acts as forwarder and subscriber.
    * The rightmost participant assumes all roles.
    * The top-center participant is not active in any role.
* Depending upon the role, some of the building blocks do not have to be considered, i.e., such building blocks can be disabled.
> For instance, a node not taking the publisher or subscriber roles is not required to provide the key management building block.
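The role-based enabling of building blocks described above can be sketched as a simple lookup. The mapping follows Figures 12 and 13 (basic membership for everyone, attribute localization and key management for publishers and subscribers, matching, community management, and content distribution for forwarders), but the dictionary keys and the function are invented for illustration:

```python
# Illustrative role-to-building-block mapping (names follow Figure 12).
ROLE_BLOCKS = {
    "publisher":  {"attribute_localization", "key_management"},
    "subscriber": {"attribute_localization", "key_management"},
    "forwarder":  {"matching", "community_management", "content_distribution"},
}

def enabled_blocks(roles):
    """Union of building blocks a node must provide for its roles;
    every connected node needs basic membership, even without a role."""
    blocks = {"basic_membership"}
    for role in roles:
        blocks |= ROLE_BLOCKS[role]
    return blocks

print(sorted(enabled_blocks({"forwarder", "subscriber"})))
```

A pure forwarder, for example, can disable key management entirely, matching the note above.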
Lifecycle Considerations: Besides roles, the runtime phases of a system constitute another method to split building blocks. The following phases reoccur in Pub/Sub systems (see Figure 12):
1. First, a participant has to connect to the system. This phase is covered by basic membership.
2. Next, the participant locates and subscribes to attributes—attribute localization.
3. To transport notifications to the participant, forwarders have to decide if a notification matches a subscription via the matching building block.
4. Furthermore, notifications have to be distributed with minimal overhead and confidentiality, which is covered by the content distribution building block.
5. To protect confidentiality, cryptographic keys are required that are managed using the key management building block.
6. Finally, connected nodes form a community that is subject to churn and node failure, which have to be compensated via the community management building block.
> Depending upon the lifecycle phase, a node does not need to provide all building blocks. For instance, the matching and attribute localization building blocks are not required while a node is distributing content.
## Conclusion