
Proposal: Use aos-dev/go-storage to replace storage.ExternalStorage

Background

dumping uses storage.ExternalFileWriter to support data export.

storage.ExternalFileWriter exposes the following API:

type ExternalFileWriter interface {
	// Write writes to buffer and if chunk is filled will upload it
	Write(ctx context.Context, p []byte) (int, error)
	// Close writes final chunk and completes the upload
	Close(ctx context.Context) error
}

In order to support multipart uploads, storage.ExternalStorage will create a struct to carry upload_id and completed parts:

type S3Uploader struct {
	svc           s3iface.S3API
	createOutput  *s3.CreateMultipartUploadOutput
	completeParts []*s3.CompletedPart
}

S3Uploader creates a new part on every call to Write and completes the upload in Close.

Based on this design, dumping's main data export logic is as follows:

func WriteInsert(pCtx *tcontext.Context, cfg *Config, meta TableMeta, tblIR TableDataIR, w storage.ExternalFileWriter) (n uint64, err error) {
	...
    
	wp := newWriterPipe(w, cfg.FileSize, cfg.StatementSize, cfg.Labels)

	...


	for fileRowIter.HasNext() {
		...

		for fileRowIter.HasNext() {
			lastBfSize := bf.Len()
			if selectedField != "" {
				if err = fileRowIter.Decode(row); err != nil {
					pCtx.L().Error("fail to scan from sql.Row", zap.Error(err))
					return counter, errors.Trace(err)
				}
				row.WriteToBuffer(bf, escapeBackslash)
			} else {
				bf.WriteString("()")
			}
			counter++
			wp.AddFileSize(uint64(bf.Len()-lastBfSize) + 2) // 2 is for ",\n" and ";\n"
			...

			fileRowIter.Next()
			shouldSwitch := wp.ShouldSwitchStatement()
			if fileRowIter.HasNext() && !shouldSwitch {
				bf.WriteString(",\n")
			} else {
				bf.WriteString(";\n")
			}
			if bf.Len() >= lengthLimit {
				select {
				case <-pCtx.Done():
					return counter, pCtx.Err()
				case err = <-wp.errCh:
					return counter, err
				case wp.input <- bf:
					bf = pool.Get().(*bytes.Buffer)
					if bfCap := bf.Cap(); bfCap < lengthLimit {
						bf.Grow(lengthLimit - bfCap)
					}
					AddCounter(finishedRowsCounter, cfg.Labels, float64(counter-lastCounter))
					lastCounter = counter
				}
			}

			if shouldSwitch {
				break
			}
		}
		if wp.ShouldSwitchFile() {
			break
		}
	}
	...
	if bf.Len() > 0 {
		wp.input <- bf
	}
	close(wp.input)
	<-wp.closed
    
	...
    
	return counter, wp.Error()
}

dumping fills a buffer and passes it on to ExternalFileWriter.Write each time the buffer reaches 1048576 bytes (1 MiB); the threshold is compared against bf.Len(), so it counts bytes, not lines.
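
The lengthLimit check in WriteInsert can be reduced to the following sketch. It is a simplification under stated assumptions: flushByThreshold is a hypothetical helper, and it calls a flush function synchronously instead of sending the buffer through a channel and fetching a fresh one from a pool.

```go
package main

import (
	"bytes"
	"fmt"
)

// flushByThreshold appends rows to a buffer and calls flush each time the
// buffer holds at least limit bytes; any remainder is flushed at the end.
// It returns the number of flush calls. The slice passed to flush is only
// valid during the call, because the buffer is reset afterwards.
func flushByThreshold(rows []string, limit int, flush func([]byte)) int {
	var bf bytes.Buffer
	calls := 0
	for _, r := range rows {
		bf.WriteString(r)
		if bf.Len() >= limit {
			flush(bf.Bytes())
			calls++
			bf.Reset()
		}
	}
	if bf.Len() > 0 {
		flush(bf.Bytes())
		calls++
	}
	return calls
}

func main() {
	rows := []string{"(1),\n", "(2),\n", "(3),\n", "(4);\n"}
	total := 0
	calls := flushByThreshold(rows, 8, func(p []byte) { total += len(p) })
	fmt.Println(calls, total)
}
```

With a part-per-Write uploader, each flush becomes one uploaded part, so the threshold effectively decides the part size.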

Proposal

Connecting to every storage service is a real burden for an application, especially one with complicated business logic. I therefore propose using aos-dev/go-storage to replace storage.ExternalStorage.

aos-dev/go-storage is an application-oriented unified storage layer for Golang. Its design goals are to be production ready, high performance, and vendor agnostic. go-storage will support as many services as possible, including S3, GCS, OSS, COS, Kodo (qiniu), QingStor, and even Dropbox (contributed by the community).

Benefits

Drawbacks

  • go-storage needs to support every feature that dumping currently supports, such as SSE, as described in issue go-service-s3#51.
  • dumping needs to handle the config parsing required to construct go-storage's Storager.

Implementations

For the first stage, we can just replace the Write and Close calls without touching other parts of the project.

  • Change the config parsing to support constructing go-storage's Storager

  • Way A: Use go-storage's Multiparter to replace storage.ExternalFileWriter.

  • Way B: Use go-storage to implement storage.ExternalFileWriter

Rationale

io.FS

io.FS (the fs.FS interface in package io/fs) has been part of the standard library since Go 1.16. However, it is designed to work with files rather than bytes or streams, and it lacks support for object storage's multipart objects.

spf13/afero

afero is another filesystem abstraction for Go. As its name implies, it also works with files.

There is no official support for S3-like services, but there is a community-built one: afero-s3. It uses S3Manager in Write operations, which means users can't control the logic of the underlying multipart object.