字句解析その2 - シェルもどきをgoで自作する #4

おさらい

これまで

シェルもどきoreshellを自作している。
シェルは入力した文字列を読み取って、コマンドと引数群に分割してプロセス生成に渡している。
現在の実装では入力文字列を単純に空白で分割している。
空白文字を含んでいるファイル名/パス名(例:「cp \ oge "h ge"」)を扱えるようにするために字句解析を導入する。

文字のエスケープを考慮した字句解析のステートマシン

        
    
    
      Image Not Showing
      Possible Reasons
      The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported
      Learn More →

入力された文字列を状態遷移に従って読み取る様子

入力された文字列と文字を読み取るためのポインタ

以下は入力された文字列「cp \ oge "h ge"」を字句解析する直前を表した図である。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

PとSは、文字を読み取るためのポインタである。この2つのポインタは入力された文字列の先頭から末尾に向かって移動する。

以降で、入力された文字列をこの2つのポインタを使って読み取る様子を説明する。

「cp」の切り出し

ステートマシンは開始するとすぐに「lexText状態」に遷移する。
Pの現在位置に文字「c」を見つけたので「lexString状態」に遷移する。
「lexString状態」では、バックスラッシュ文字、クォーテーション文字、空白文字、EOFのいずれかが見つかるまで

Pの現在位置の文字の確認
Pを次に移動

を繰り返す。
見つけたらSの位置からPの位置までの間の文字群を切り出す。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Pの現在位置に文字「c」を見つけた。
Pを次に進める。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Pの現在位置に文字「p」を見つけた。
Pを次に進める。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Pの現在位置に文字「 (空白)」を見つけた。
空白文字が見つかったので、Sの位置からPの位置までの間の文字群を切り出す。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

SをPの位置まで進める。
「lexText状態」に戻る。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

一つ目の区切りの「 (空白)」の切り出し

Pの現在位置に文字「 (空白)」を見つけたので「lexWhitespace状態」に遷移する。
「lexWhitespace状態」では、空白以外の文字が見つかるまで

Pの現在位置の文字の確認
Pを次に移動

を繰り返す。
見つけたらSの位置からPの位置までの間の文字群を切り出す。

Pの現在位置に文字「 (空白)」を見つけた。
Pを次に進める。
Pの現在位置に文字「\」を見つけた。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

空白以外の文字が見つかったので、Sの位置からPの位置までの間の文字群を切り出す。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

SをPの位置まで進める。
「lexText状態」に戻る。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

「\ 」(バックスラッシュ文字がついた空白文字)の切り出し

Pの現在位置にバックスラッシュ文字を見つけたので「lexEscapeChar状態」に遷移する。
「lexEscapeChar状態」では、

Pを次に移動(バックスラッシュ文字の分)
Pを次に移動(エスケープ対象となった文字の分)

を行う。
そのあとSの位置からPの位置までの間の文字群を切り出す。

Pを次に進める。(移動後位置の文字は「 (空白)」)

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Pを次に進める。(移動後位置の文字は「o」)

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

Sの位置からPの位置までの間の文字群を切り出す。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

SをPの位置まで進める。
「lexText状態」に戻る。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

「oge」の切り出し

Pの現在位置に文字「o」を見つけたので「lexString状態」に遷移する。
「lexString状態」では、前述の通りバックスラッシュ文字、クォーテーション文字、空白文字、EOFのいずれかが見つかるまで

Pの現在位置の文字の確認
Pを次に移動

を繰り返す。
見つけたらSの位置からPの位置までの間の文字群を切り出す。

Pの現在位置に文字「o」を見つけた。
Pを次に進める。
Pの現在位置に文字「g」を見つけた。
Pを次に進める。
Pの現在位置に文字「e」を見つけた。
Pを次に進める。
Pの現在位置に文字「 (空白)」を見つけた。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

空白文字が見つかったので、Sの位置からPの位置までの間の文字群を切り出す。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

SをPの位置まで進める。
「lexText状態」に戻る。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

二つ目の区切りの「 (空白)」の切り出し

一つ目の区切りの「 (空白)」の切り出しと同じなので説明は省略。

Image Not Showing Possible Reasons

The image file may be corrupted
The server hosting the image is unavailable
The image path is incorrect
The image format is not supported

Learn More →

「"h ge"」(クォートで囲まれた文字列)の切り出し

Pの現在位置に文字「"」を見つけたので「lexQuotedString状態」に遷移する。
「lexQuotedString状態」では、次のクォーテーション文字が見つかるまで

Pの現在位置の文字の確認
Pを次に移動

を繰り返す。
見つけたらSの位置からPの位置までの間の文字群を切り出す。(ただし両端のクォート文字も切り出しに含める)

Pの現在位置に文字「"」を見つけた。
Pを次に進める。
Pの現在位置に文字「h」を見つけた。
Pを次に進める。
Pの現在位置に文字「 (空白)」を見つけた。
Pを次に進める。
Pの現在位置に文字「g」を見つけた。
Pを次に進める。
Pの現在位置に文字「e」を見つけた。
Pを次に進める。
Pの現在位置に文字「"」を見つけた。
Pを次に進める。

次のクォーテーション文字が見つかったので、Sの位置からPの位置までの間の文字群を切り出す。

SをPの位置まで進める。
「lexText状態」に戻る。

終端の切り出し

Pの位置が文字列全体の長さよりはみ出た場合、文字列全体の終端を検知したとみなす。
終端マークを切り出す。

字句解析で切り出したトークンを再構成する

字句解析を完了し、トークン群を切り出すことができた。
しかし、切り出したトークン群をそのままシェル内のプロセス生成に渡すことはできない。
ここで説明した通り、プロセス生成に渡したい文字列の配列になるように、トークン群から要らない要素を削除、要素同士の連結をして再構成する必要がある。

今回の例だと、

2つ目のトークンである空白文字を削除する。
3つ目のトークンと4つ目のトークンを連結する。
5つ目のトークンである空白文字を削除する。
7つ目のトークンであるEOFを削除する。

また、エスケープ処理した文字列リテラルから先頭のエスケープ文字「\」、または前後のクオート文字「"」「'」を撤去する。

実装の該当箇所は以下の通り。

// 入力文字列を空白ごとに単語に分解するのをやめて、字句解析した結果から単語配列を作る。
//words := strings.Split(strings.Trim(string(line), " "), " ")
words := lineToWords(string(line))

func lineToWords(line string) (words []string) {

  l := lexer.Lex(strings.Trim(line, " ")) // 字句解析
  var word string

  for {
    // 字句解析結果トークンを一つづつ取り出す。
    token := l.NextItem()
    if token.Type == lexer.ItemWhitespace {
      words = append(words, word)
      word = ""
    } else if token.Type == lexer.ItemEOF || token.Type == lexer.ItemError {
      words = append(words, word)
      break
    } else {
      // 文字列リテラルの連結
      word = word + token.Unescape()
    }
  }

  return words
}

func (me Item) Unescape() string {
	switch {
	case me.Type == ItemEscapeChar:
        // 先頭のバックスラッシュ文字、または前後のクォート文字を撤去
		return string(me.Val[1])
	case me.Type == ItemQuotedString:
        // 前後のクォート文字を撤去
		return strings.Trim(me.Val, string(me.Val[0]))
	}
	return me.Val
}

ソースコード

実行してみる

$ go run main.go
(ore) > touch \ oge
(ore) > ls \ oge
' oge'
(ore) > cp \ oge "h ge"
(ore) > ls
' oge' 'h ge'

字句解析その2 - シェルもどきをgoで自作する #4

おさらい

文字のエスケープを考慮した字句解析のステートマシン

入力された文字列を状態遷移に従って読み取る様子

入力された文字列と文字を読み取るためのポインタ

「cp」の切り出し

一つ目の区切りの「 (空白)」の切り出し

「\ 」(バックスラッシュ文字がついた空白文字)の切り出し

「oge」の切り出し

二つ目の区切りの「 (空白)」の切り出し

「"h ge"」(クォートで囲まれた文字列)の切り出し

終端の切り出し

字句解析で切り出したトークンを再構成する

実行してみる

Read more

フォアグラウンドジョブ/バックグラウンドジョブの実現方法 - シェルもどきをgoで自作する#20

シグナルってなに? - シェルもどきをgoで自作する#19

ジョブ制御ってなに? - シェルもどきをgoで自作する#18

ジョブってなに? - シェルもどきをgoで自作する#17