# Amazon S3 (Simple Storage Service)

What happened in the middle of the night on 3/1: [Amazon AWS S3 outage is breaking things for a lot of websites and apps](https://techcrunch.com/2017/02/28/amazon-aws-s3-outage-is-breaking-things-for-a-lot-of-websites-and-apps/)

## Overview

https://aws.amazon.com/tw/s3/faqs/

## Introduction

![](https://i.imgur.com/S3749g2.png)

AWS S3 is a cloud storage service. We create buckets to keep data isolated. Data uploaded to a bucket is called an Object, and its name is called the Key (similar to a key-value pair). S3 has no directory structure; it is a flat storage system. To manage objects in a directory-like way, add a Prefix to the Key, for example:

- "Life of a Night Owl.pdf"
- We can put "Personal Diary/" in front of it
- The whole Key becomes "Personal Diary/Life of a Night Owl.pdf"

Each AWS account can create up to 100 buckets by default, which is plenty for normal use. If you need more than 100 buckets, you can request a limit increase. In addition, a single S3 object can be at most 5 TB.

## Pricing

Reference: https://aws.amazon.com/tw/s3/pricing/

- S3 does not charge for data transferred in
- S3 does not charge for data transferred out to EC2 in the same region or to CloudFront (CDN)
- Data transferred out to other regions is charged
- Transfer Acceleration incurs an extra charge no matter where the data goes to or is uploaded from.

## Python SDK boto3

boto3 is built on top of botocore: https://github.com/boto/botocore

* [S3 Migration Guide](https://boto3.readthedocs.io/en/latest/guide/migrations3.html) explains how to migrate from boto to boto3 and demonstrates the most commonly used features
* [S3 Service Feature Guide](http://boto3.readthedocs.io/en/latest/guide/s3.html) gives a full walkthrough of how to use each S3 feature
* [S3 API details](https://boto3.readthedocs.io/en/latest/reference/services/s3.html#s3) is the API reference, listing every parameter and return value.

## Configuration

https://boto3.readthedocs.io/en/latest/guide/quickstart.html#configuration

~/.aws/credentials:

```config
[default]
aws_access_key_id=foo
aws_secret_access_key=bar

[dev]
aws_access_key_id=foo2
aws_secret_access_key=bar2

[prod]
aws_access_key_id=foo3
aws_secret_access_key=bar3
```

~/.aws/config:

```config
[default]
region=us-west-1

[profile dev]
region=us-west-1
```

Because `[profile prod]` is not defined here, its region defaults to us-east-1.

#### Bucket Policy

To let outside users download or preview objects, or to let third-party packages access them, we must set a Bucket Policy. The corresponding action is **GetObject**. The configuration below grants public read access. It has to be set in the Bucket Policy section of the S3 bucket.

![](https://i.imgur.com/5bAo2ZT.png)

```config
{
  "Id": "Policy1487750609880",
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1487750608741",
      "Action": [
        "s3:GetObject"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:s3:::shareclasstest/*",
      "Principal": "*"
    }
  ]
}
```

Reference:

- [AWS Bucket Policy Reference](https://docs.aws.amazon.com/AmazonS3/latest/dev/example-bucket-policies.html)
- [Specifying Permissions in a Policy](https://docs.aws.amazon.com/AmazonS3/latest/dev/using-with-s3-actions.html)
- [How to make s3 object public by default](http://stackoverflow.com/questions/19176926/how-to-make-all-objects-in-aws-s3-bucket-public-by-default)

#### Bucket Principal

http://docs.aws.amazon.com/AmazonS3/latest/dev/s3-bucket-user-policy-specifying-principal-intro.html

>The Principal element specifies the **user**, account, service, or other entity that is allowed or denied access to a resource.

#### CORS

For background on the Same-Origin Policy, see: https://hackmd.io/s/H1cY3TTYe

If the client wants to use JavaScript to interact with S3, the browser's Same-Origin Policy (SOP) blocks the request by default. We therefore have to configure CORS so that specific origins are allowed to send HTTP requests from JavaScript to our S3 bucket.

```xml
<CORSConfiguration>
  <CORSRule>
    <AllowedOrigin>http://www.example1.com</AllowedOrigin>
    <AllowedMethod>PUT</AllowedMethod>
    <AllowedMethod>POST</AllowedMethod>
    <AllowedMethod>DELETE</AllowedMethod>
    <AllowedHeader>*</AllowedHeader>
  </CORSRule>
  <CORSRule>
    <AllowedOrigin>http://www.example2.com</AllowedOrigin>
    <AllowedMethod>PUT</AllowedMethod>
    <AllowedMethod>POST</AllowedMethod>
    <AllowedMethod>DELETE</AllowedMethod>
    <AllowedHeader>*</AllowedHeader>
  </CORSRule>
  <CORSRule>
    <AllowedOrigin>*</AllowedOrigin>
    <AllowedMethod>GET</AllowedMethod>
  </CORSRule>
</CORSConfiguration>
```

- AWS S3 CORS : http://docs.aws.amazon.com/AmazonS3/latest/dev/cors.html

## Bucket Restrictions & Limitations

Access to a bucket is tied to its URL. There are two access styles:

- Path-style: https://s3-{region}.amazonaws.com/{bucket}/
- Virtual-hosted–style: http://{bucket}.s3-{region}.amazonaws.com

For this reason AWS recommends that bucket names follow DNS naming conventions. The naming restrictions are:

- Bucket names must be at least 3 and no more than 63 characters long.
- Bucket names must be a series of one or more labels. Adjacent labels are separated by a single period (.). Bucket names can contain lowercase letters, numbers, and hyphens. Each label must start and end with a lowercase letter or a number.
- Bucket names must not be formatted as an IP address (e.g., 192.168.5.4).
- When using virtual hosted–style buckets with SSL, the SSL wildcard certificate only matches buckets that do not contain periods. To work around this, use HTTP or write your own certificate verification logic. We recommend that you do not use periods (".") in bucket names.

**Reference**

- Bucket Restrictions and Limitations: http://docs.aws.amazon.com/AmazonS3/latest/dev/BucketRestrictions.html

## Performance Tuning

When do you need to optimize? According to the AWS guidance at the time of writing, when your bucket routinely sees either of the following:

1. More than 300 GET requests per second
2. More than 100 PUT, LIST, DELETE requests per second

P.S. GET and PUT here do not refer to HTTP request methods but to operations on resources: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectOps.html

How you optimize also depends on what the bucket is used for:

- GET-intensive bucket, e.g. Netflix
- Bucket with a mix of GET, PUT, DELETE, LIST requests

#### GET-intensive Bucket

For this case, the way to optimize access performance is to use a CDN. S3 itself has no region in Taiwan; the closest one is Tokyo. AWS CloudFront, however, can serve Taiwan directly, so it speeds up data transfer.

[AWS CloudFront Pricing](https://aws.amazon.com/tw/cloudfront/pricing/)

Many people also use CloudFlare these days, although it had an incident just a few days ago: https://blog.cloudflare.com/incident-report-on-memory-leak-caused-by-cloudflare-parser-bug/

#### Mixed-workload Bucket

We use the Key to locate an object. For S3, keys also determine how objects are partitioned for storage, and each partition has its own I/O quota. To improve S3 I/O performance, avoid sequential keys, such as keys that all begin with a timestamp or an incrementing ID, because they end up in the same partition.

![](https://i.imgur.com/wkRwvqE.png)

![](https://i.imgur.com/UlP4XEB.png)

![](https://i.imgur.com/TVmpSQs.png)

#### Other

If you want to speed up uploads to S3, you can also use Transfer Acceleration: http://docs.aws.amazon.com/AmazonS3/latest/dev/transfer-acceleration.html

**Reference**

- Maximizing Amazon S3 Performance (STG304) | AWS re:Invent 2013: https://www.youtube.com/watch?v=uXHw0Xae2ww

## How to let the Client upload files to S3 directly

![](https://i.imgur.com/WHgP4aW.png)

APIs normally used for uploading files:

1. boto3.client('s3').upload_fileobj
2. boto3.resource('s3').Bucket(...).upload_fileobj

Both upload methods require API credentials (IAM Credentials), so they are normally only run on the server side. But if a client's upload to AWS S3 still goes through the server this way, it consumes twice the bandwidth and it blocks the server, which creates a bottleneck. So for large files we prefer to let the client upload directly.

**APIs that AWS S3 provides for direct client uploads**

1. Presigned URL
2. Presigned POST

### 1. Presigned URL

Both approaches require the low-level interface: [boto3.client](http://boto3.readthedocs.io/en/latest/guide/clients.html#low-level-clients)

API: [generate_presigned_url](http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.generate_presigned_url)

example:

```python
import boto3

bucket = 'shareclasstest'     # bucket name from the policy examples
key = 'uploads/example.pdf'   # hypothetical object key

client = boto3.client('s3')
presigned_url = client.generate_presigned_url(
    ClientMethod='put_object',
    Params={'Bucket': bucket, 'Key': key},
    ExpiresIn=3600,  # validity period in seconds
    HttpMethod='PUT',
)
```

**generate_presigned_url parameters**

- ClientMethod: the name of a boto3.client API method; the list is [here](http://boto3.readthedocs.io/en/latest/reference/services/s3.html#client). (Not every method listed there can actually be used as the ClientMethod of a presigned URL, though. The documentation does not say which ones work, and some of the client methods recorded inside botocore cannot be used either.) For example, to upload an object, use ClientMethod="put_object"
- Params: taking "put_object" as the example, look at the client.put_object API to see which parameters it accepts: [put_object](http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.put_object)
- ExpiresIn: how long the presigned URL remains valid, in seconds
- HttpMethod: the HTTP method to use on the generated URL

### 2. Presigned POST

API: [generate_presigned_post](http://boto3.readthedocs.io/en/latest/reference/services/s3.html#S3.Client.generate_presigned_post)

The internal implementation and configuration of presigned POST come from [browser-based upload](http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-UsingHTTPPOST.html)

example: http://boto3.readthedocs.io/en/latest/guide/s3.html#generating-presigned-posts

### Policy settings when using Presigned URL, POST

![](https://i.imgur.com/5wRIlgj.png)

When a presigned URL is used, the permissions in effect are those of the IAM user whose credentials the SDK used to sign it. If you want a presigned URL to be limited to uploading and viewing objects, the safer approach is to create a separate IAM user and attach this IAM user policy:

```config
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1488193188000",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::shareclasstest/*"
      ]
    }
  ]
}
```

On the boto3 side, create the client with that user's keys:

```python
import boto3

s3_upload_client = boto3.client(
    's3',
    aws_access_key_id='iam access key',
    aws_secret_access_key='iam secret key',
)
```

What if there are multiple IAM users in ~/.aws/credentials? How do we use them with boto3?

Use boto3.Session and pass the profile_name parameter.

```config
[default]
aws_access_key_id=foo
aws_secret_access_key=bar

[dev]
aws_access_key_id=foo2
aws_secret_access_key=bar2

[prod]
aws_access_key_id=foo3
aws_secret_access_key=bar3
```

in python:

```python
import boto3

session = boto3.Session(profile_name='dev')
# Any clients created from this session will use credentials
# from the [dev] section of ~/.aws/credentials.
dev_s3_client = session.client('s3')
```

### A method shared by a friend from the community: Cognito + IAM

> I built a similar system before (in C#), but with Cognito + IAM. First, the logged-in member account is used to get a fixed Identity ID issued by the Cognito service.
>
> This ID can generate a token. At the same time, an IAM policy for S3 is set up so that the ID/token issued by Cognito can only access the user's own S3 path (a path format you design yourself, with the IdentityID as a variable in the middle), which prevents the ID/token from being used to fetch arbitrary objects.
>
> Finally, the client uses the ID/token issued by Cognito together with the S3 multipart upload API to upload files directly from the client to that member's own S3 path.
>
> [name=Jinmin Liu]

## Appendix: Django Upload File

In Django, an uploaded file is received by default as an InMemoryUploadedFile (for files up to 2.5 MB with the default settings; larger ones arrive as a TemporaryUploadedFile). It inherits from UploadedFile (see the source code).

The Django Uploaded File documentation has this passage:

> **UploadedFile.read()**
> Read the entire uploaded data from the file. Be careful with this method: **if the uploaded file is huge it can overwhelm your system if you try to read it into memory.** You’ll probably want to use chunks() instead; see below.
>
> from: https://docs.djangoproject.com/en/1.10/ref/files/uploads/#django.core.files.uploadedfile.UploadedFile.read

[Q] How do we handle large files with chunks?

#### Source Code

- [Uploaded File](https://docs.djangoproject.com/en/1.10/_modules/django/core/files/uploadedfile/#InMemoryUploadedFile)
- [InMemoryUploadedFile](https://docs.djangoproject.com/en/1.10/_modules/django/core/files/uploadedfile/#InMemoryUploadedFile)
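
#### Using chunks()

One possible answer to the [Q] above, as a minimal sketch rather than a definitive implementation: iterate over `chunks()` and write each piece out, instead of calling `read()` on the whole file. The function name `save_uploaded_file` and the `dest_path` argument are hypothetical names chosen for illustration. The only thing assumed about the input is that it has the `chunks()` method that Django documents for `UploadedFile`.

```python
def save_uploaded_file(uploaded_file, dest_path):
    """Write an uploaded file to disk chunk by chunk.

    `uploaded_file` is assumed to be a Django UploadedFile (for example
    an entry of request.FILES); only its chunks() method is used.
    Returns the number of bytes written.
    """
    total = 0
    with open(dest_path, 'wb') as destination:
        # chunks() yields the data piece by piece, so the whole file
        # never has to be held in memory at once.
        for chunk in uploaded_file.chunks():
            destination.write(chunk)
            total += len(chunk)
    return total
```

In a Django view this could be called as `save_uploaded_file(request.FILES['file'], '/tmp/upload.bin')`, where the field name and the path are placeholders. The same loop shape applies when the destination is something other than a local file, as long as it accepts writes piece by piece.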