# Orchestrator Storage Paths
Summary: There's a problem with S3 path prefixes for orchestrator-owned storage, in particular how to group uploads in the same folder along with transcoded results.
## GetOrchestrator Flow
The OS Discovery network flow is invoked via `GetOrchestrator` and currently has the following message types:
```
GetOrchestrator : OrchestratorRequest -> OrchestratorInfo
```
```
OrchestratorRequest {
    // Ethereum address of the broadcaster
    bytes address
    // Broadcaster's signature over its address
    bytes sig
}
OrchestratorInfo {
    // URI of the node to use for submitting segments
    string orchestrator
    // Orchestrator's preferred object storage, if any
    []OSInfo storage
}
```
## Problem
In the v1 protocol, we sent a JobID along with `OrchestratorRequest`. Among other things, this was used to initialize the orchestrator's storage session, primarily to set the path prefix for broadcaster uploads. Traditionally, this prefix has been the Manifest ID part of the job.
For OS, the format of the full path is generally:
```
manifestID/profileName/seqNo.ts
```
Concretely,
```
abc1234/240p30fps16x9/109.ts
```
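As a minimal sketch of this layout, the helper below joins the three components into an object key. The function name `segment_path` and its parameter names are illustrative assumptions, not identifiers from the codebase.

```python
def segment_path(manifest_id: str, profile_name: str, seq_no: int) -> str:
    """Build an object storage key of the form manifestID/profileName/seqNo.ts."""
    return f"{manifest_id}/{profile_name}/{seq_no}.ts"


print(segment_path("abc1234", "240p30fps16x9", 109))
# prints "abc1234/240p30fps16x9/109.ts"
```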
Without the notion of a JobID, it is unclear what the storage session prefix should be and how it should be communicated between the broadcaster (B) and the orchestrator (O).
## Solutions
### Random Orchestrator-Selected Prefix
The orchestrator randomly selects a prefix each time an OS session is initialized.
Sessions on the orchestrator are initialized at least twice for each transcoding job: once for [exporting to the broadcaster](https://github.com/livepeer/go-livepeer/blob/master/server/rpc.go#L296) in `OrchestratorInfo`, and once when initializing the job's [transcoding loop](https://github.com/livepeer/go-livepeer/blob/master/core/orchestrator.go#L394).
Using a random orchestrator-selected prefix would put the orchestrator's OS data under two prefixes: one for the broadcaster's uploads and another for the orchestrator's uploads. This is **technically not a problem at all**, but we lose the convenience of having these two sets of uploads grouped together, should we (or the user) ever want to do something with that information.
The path structure would resemble this:
```
sourcePrefix/profile1Name/segNo.ts
transcodedPrefix/profile2Name/segNo.ts
transcodedPrefix/profile3Name/segNo.ts
```
**Pros** Simplest. Least amount of state sharing
**Cons** Uploads put in two places
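The split can be illustrated with a short sketch. It assumes a hypothetical `new_session_prefix` helper that draws a fresh random prefix for every session, and it uses `source` as a stand-in profile name for the broadcaster's uploads; neither comes from the actual implementation.

```python
import secrets


def new_session_prefix() -> str:
    """Hypothetical helper: pick a fresh random prefix for one OS session."""
    return secrets.token_hex(8)  # 16 hex characters


def segment_path(prefix: str, profile_name: str, seq_no: int) -> str:
    return f"{prefix}/{profile_name}/{seq_no}.ts"


# Session 1: exported to the broadcaster in OrchestratorInfo (source uploads).
source_prefix = new_session_prefix()
# Session 2: created when the transcoding loop starts (transcoded uploads).
transcoded_prefix = new_session_prefix()

# "source" is an assumed profile name for the untranscoded segments.
source_key = segment_path(source_prefix, "source", 109)
transcoded_key = segment_path(transcoded_prefix, "240p30fps16x9", 109)

print(source_key)
print(transcoded_key)
```

Because the two sessions share no state, the same segment ends up under two unrelated prefixes, which is the loss of grouping described above.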
### Broadcaster Address as Prefix
The broadcaster's address can be used as a prefix.
This is not a complete solution by itself. Each new broadcast session will restart the sequence number, so each new broadcast would collide [1] with earlier broadcasts with the same profiles.
Hence, some random prefix is still needed following the broadcaster prefix:
```
prefix=broadcasterAddr/randomNumber
```
We get slightly better per-broadcaster grouping, but still not per-broadcast grouping.
**Pros** Per-broadcaster grouping
**Cons** Grouping is still not per-broadcast
[1] Collisions are problematic for S3's [read-after-write consistency model](https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html#ConsistencyModel). It is important to ensure that each write is the very first action taken on that resource; otherwise subsequent reads may return stale data.
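The collision and its workaround can be sketched as follows. The helper names, the placeholder address, and the choice of a 16-character random hex component are all assumptions made for illustration.

```python
import secrets


def broadcast_prefix(broadcaster_addr: bytes) -> str:
    """Hypothetical: broadcasterAddr/randomNumber, drawn once per broadcast."""
    return f"{broadcaster_addr.hex()}/{secrets.token_hex(8)}"


def segment_path(prefix: str, profile_name: str, seq_no: int) -> str:
    return f"{prefix}/{profile_name}/{seq_no}.ts"


addr = bytes.fromhex("00" * 19 + "01")  # placeholder 20-byte address

# Address alone: two broadcasts both restart at sequence number 0,
# so their first segments map to the same key.
first = segment_path(addr.hex(), "240p30fps16x9", 0)
second = segment_path(addr.hex(), "240p30fps16x9", 0)
collides = first == second

# Address plus a random component: each broadcast gets its own key space.
key_a = segment_path(broadcast_prefix(addr), "240p30fps16x9", 0)
key_b = segment_path(broadcast_prefix(addr), "240p30fps16x9", 0)

print(collides, key_a != key_b)
```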
### ManifestID as Prefix
ManifestID would have to be passed in with `OrchestratorRequest`, replacing the old `JobId`.
During the O Discovery phase, the ManifestID is not known. This imposes a requirement to invoke `GetOrchestrator` twice: once during discovery, and once just prior to broadcast (with the ManifestID) in order to obtain proper OS information. The second call is a "would be nice" step: it is not strictly necessary for the operation of the protocol, and is made by benevolent broadcasters so that the orchestrator can have a tidy storage layout.
Calling `GetOrchestrator` for each broadcast might be required anyway, to obtain non-expired OS credentials or PM parameters.
This also leads to a potential area of semantic confusion around the use of OS. If a ManifestID is not supplied (since discovery won't be querying about a particular broadcast session), what should the orchestrator use as the prefix for its `OrchestratorInfo.Storage` field? We don't necessarily want to leave the storage empty if the `OrchestratorRequest.ManifestId` is unspecified; this could be taken to mean O accepts a direct upload (or has no preference), when it really does have a preference. The solution here seems to be to generate a fake ManifestID for each query.
**Pros** Relatively straightforward to implement
**Cons** Message semantics become more vague. Uncertain future; see "Thinking Ahead" below.
**Thinking Ahead** Since the ManifestID is now a broadcaster-internal identifier with no requirement to be globally unique, we may want to shorten it to something more readable than a 256-bit hexadecimal string, e.g. by allowing user-supplied names. To do this, we will likely need to incorporate the broadcaster address as a prefix to avoid collisions with other broadcasters. Also, incorporating user input as-is within file paths is a security risk [2], so O using readable broadcaster-selected ManifestIDs as a prefix may not be a good long-term strategy.
[2] Less of an issue with the current ManifestID setup since we transmit it as a string of arbitrary bytes and hex-encode on the O side.
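A rough sketch of how O might derive a prefix under this option is shown below. It assumes a hypothetical `manifest_prefix` helper that hex-encodes the ManifestID bytes, prepends the broadcaster address, and substitutes a random 32-byte placeholder when no ManifestID is supplied. This illustrates the ideas in this section and is not the existing implementation.

```python
import secrets
from typing import Optional


def manifest_prefix(broadcaster_addr: bytes, manifest_id: Optional[bytes]) -> str:
    """Hypothetical: derive a storage prefix from an optional ManifestID."""
    if not manifest_id:
        # Discovery-time query: no ManifestID yet, so use a random placeholder.
        manifest_id = secrets.token_bytes(32)
    # Hex-encoding restricts the output to [0-9a-f], so path separators or
    # ".." sequences in broadcaster-supplied bytes never reach the object key.
    return f"{broadcaster_addr.hex()}/{manifest_id.hex()}"


addr = bytes.fromhex("00" * 19 + "01")  # placeholder 20-byte address

hostile = manifest_prefix(addr, b"../other-broadcaster")
placeholder = manifest_prefix(addr, None)

print(hostile)
print(placeholder)
```

The trade-off is the one noted above: hex-encoded prefixes are safe but not readable, so user-friendly ManifestIDs would need some other form of sanitization.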
### Orchestrator-Selected Prefix in AuthToken
The v1 protocol passed an `AuthToken` from the `TranscoderInfo` to each segment. This could have been removed for Streamflow, but we can retain it here.
`AuthToken` is used to supply additional transcoder-internal information to keep the transcoder stateless between `GetOrchestrator` (which is a speculative call that may not lead to actual work) and `SubmitSegment`. We can put an orchestrator-selected prefix in `AuthToken` and maintain directory continuity between the source upload and the transcoded upload. In fact, the existing `AuthToken.JobId` field would do perfectly well.
**Pros** Preserves the desired property of source and transcoded uploads sharing the same prefix
**Cons** Missed opportunity to streamline the networking protocol. However, this may not be the last time that having an `AuthToken` comes in handy.
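One way to picture the stateless round trip is the sketch below. Everything in it is an assumption made for illustration: the token layout (JSON payload plus tag), the HMAC-SHA256 integrity check, the `ORCH_SECRET` key, and the function names. The actual `AuthToken` format may differ.

```python
import base64
import hashlib
import hmac
import json
import secrets

ORCH_SECRET = secrets.token_bytes(32)  # hypothetical orchestrator-local key


def issue_auth_token(secret: bytes) -> str:
    """GetOrchestrator side: pick a prefix and seal it into a token."""
    payload = json.dumps({"prefix": secrets.token_hex(8)}).encode()
    tag = hmac.new(secret, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(tag).decode())


def prefix_from_token(secret: bytes, token: str) -> str:
    """SubmitSegment side: recover the prefix without any stored session."""
    payload_b64, tag_b64 = token.split(".")
    payload = base64.urlsafe_b64decode(payload_b64)
    tag = base64.urlsafe_b64decode(tag_b64)
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("invalid AuthToken")
    return json.loads(payload)["prefix"]


token = issue_auth_token(ORCH_SECRET)
prefix = prefix_from_token(ORCH_SECRET, token)

# Source and transcoded uploads share the prefix recovered from the token.
# "source" is an assumed profile name for the untranscoded segments.
source_key = f"{prefix}/source/109.ts"
transcoded_key = f"{prefix}/240p30fps16x9/109.ts"
print(source_key, transcoded_key)
```

The orchestrator keeps no per-broadcaster state between the two calls; the broadcaster simply echoes the token back with each segment.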