# Vectorized String Operations
> Lee Tsung-Tang
> ###### tags: `python` `pandas` `vectorized` `string manipulation` `Python Data Science Handbook`

> See also:
- The official pandas documentation, [Working with text data](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html)
- http://www.datasciencemadesimple.com/string-compare-in-pandas-python-test-whether-two-strings-are-equal-2/

[TOC]

{%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %}

## Introducing Pandas String Operations

We saw in previous sections how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:

> As introduced in [Computation on NumPy Arrays: Universal Functions](/jThpTB_KTTmZqYzP0CRkWg), an `array` makes it easy to compute on every element directly through vectorized operations.

```python=
import numpy as np
x = np.array([2, 3, 5, 7, 11, 13])
x * 2
# array([ 4,  6, 10, 14, 22, 26])
```

> However, for ==arrays of strings==, NumPy does *not* provide such simple access, and thus you're stuck using a more verbose loop syntax:

```python=
data = ['peter', 'Paul', 'MARY', 'gUIDO']
[s.capitalize() for s in data]
# ['Peter', 'Paul', 'Mary', 'Guido']
```

> This approach is not only more cumbersome, it also raises an error as soon as the data contains a missing value:

```python=
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']
[s.capitalize() for s in data]
# AttributeError: 'NoneType' object has no attribute 'capitalize'
```
![](https://i.imgur.com/KoDun8q.png)

> `Pandas` includes features to address both this need for <font color=#0099ff>vectorized string operations</font> and for <font color=#0099ff>correctly handling missing data</font> via the `str` attribute of Pandas `Series` and `Index` objects containing strings.
```python=
import pandas as pd
names = pd.Series(data)
names
#0    peter
#1     Paul
#2     None
#3     MARY
#4    gUIDO
#dtype: object
```

> For example, capitalize every entry through the `str` attribute:

```python=
names.str.capitalize()
#0    Peter
#1     Paul
#2     None
#3     Mary
#4    Guido
#dtype: object
```

## Tables of Pandas String Methods

Pandas string operations closely resemble Python's native string methods.

```python=
monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])
```

> Here is a list of Pandas str methods that mirror Python string methods:

![](https://i.imgur.com/pHvUCMi.png)

> Notice that these have ==various return values==. Some, like `lower()`, return a series of strings:

```python=
monte.str.lower()
#0    graham chapman
#1       john cleese
#2     terry gilliam
#3         eric idle
#4       terry jones
#5     michael palin
#dtype: object
```

> But some others return numbers:

```python=
monte.str.len()
#0    14
#1    11
#2    13
#3     9
#4    11
#5    13
#dtype: int64
```

> Or Boolean values:

```python=
monte.str.startswith('T')
#0    False
#1    False
#2     True
#3    False
#4     True
#5    False
#dtype: bool
```

> Still others return `lists` or other compound values for each element:

```python=
monte.str.split()
#0    [Graham, Chapman]
#1       [John, Cleese]
#2     [Terry, Gilliam]
#3         [Eric, Idle]
#4       [Terry, Jones]
#5     [Michael, Palin]
#dtype: object
```

### Methods using regular expressions

Some `str` methods accept regular expressions to match strings (they call functions from the `re` module).

|Method|Description|
|:-:|:-:|
|`match()`|Call `re.match()` on each element, returning a boolean|
|`extract()`|Call `re.search()` on each element, returning the matched groups of the first match as strings|
|`findall()`|Call `re.findall()` on each element|
|`replace()`|Replace occurrences of pattern with some other string|
|`contains()`|Call `re.search()` on each element, returning a boolean|
|`count()`|Count occurrences of pattern|
|`split()`|Equivalent to `str.split()`, but accepts regexps|
|`rsplit()`|Equivalent to `str.rsplit()`, but accepts regexps|

> With these, you can do a wide range of interesting operations.
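As a quick sketch of three methods from the table above, the block below applies `contains()`, `count()`, and `match()` to the same `monte` series. The variable names `has_double_l`, `vowel_counts`, and `starts_with_terry` are made up for illustration, and the patterns are only examples.

```python
import pandas as pd

monte = pd.Series(['Graham Chapman', 'John Cleese', 'Terry Gilliam',
                   'Eric Idle', 'Terry Jones', 'Michael Palin'])

# contains(): True where the pattern is found anywhere in the element
has_double_l = monte.str.contains(r'll')

# count(): number of non-overlapping matches in each element
# (only lowercase vowels are counted by this pattern)
vowel_counts = monte.str.count(r'[aeiou]')

# match(): True only where the pattern matches at the start of the element
starts_with_terry = monte.str.match(r'Terry')

print(has_double_l.tolist())
print(vowel_counts.tolist())
print(starts_with_terry.tolist())
```

All three return a `Series` aligned with `monte`, so the Boolean results can be used directly as masks, for example `monte[starts_with_terry]`.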
> We can, for example, extract the ==first name== from each entry by asking for a <font color=#0099ff>contiguous group of characters at the beginning</font> of each element:

```python=
monte.str.extract('([A-Za-z]+)', expand=False)
#0     Graham
#1       John
#2      Terry
#3       Eric
#4      Terry
#5    Michael
#dtype: object
```

`expand` : `bool`, default `True` in current pandas (older releases defaulted to `False`)
- If `True`, return a `DataFrame` with one column per capture group.
- If `False`, return a `Series`/`Index` when there is one capture group, or a `DataFrame` when there are several.

[More on str.extract](#More-uses-of-str.extract)

> Or finding all names that <font color=#0099ff>start and end with a consonant</font>, making use of the ==start-of-string== (`^`) and ==end-of-string== (`$`) regular expression characters:
>
> `str.findall()`

```python=
monte.str.findall(r'^[^AEIOU].*[^aeiou]$')
#0    [Graham Chapman]
#1                  []
#2     [Terry Gilliam]
#3                  []
#4       [Terry Jones]
#5     [Michael Palin]
#dtype: object
```

:waning_crescent_moon: `str.findall()` returns a `list` for each element.

[More on str.findall](#More-on-str.findall)

### Miscellaneous methods

> Finally, there are some miscellaneous methods that enable other convenient operations:

|Method|Description|
|:--:|:--:|
|`get()`|Index each element|
|`slice()`|Slice each element|
|`slice_replace()`|Replace slice in each element with passed value|
|`cat()`|Concatenate strings|
|`repeat()`|Repeat values|
|`normalize()`|Return Unicode form of string|
|`pad()`|Add whitespace to left, right, or both sides of strings|
|`wrap()`|Split long strings into lines with length less than a given width|
|`join()`|Join strings in each element of the Series with passed separator|
|`get_dummies()`|Extract dummy variables as a dataframe|

#### Vectorized item access and slicing

> `get()` and `slice()` pull values out of each element, applying the operation to the whole array in vectorized fashion:

```python=
monte.str[0:3]
#0    Gra
#1    Joh
#2    Ter
#3    Eri
#4    Ter
#5    Mic
#dtype: object
```

:notes: `monte.str[0:3]` is equivalent to `monte.str.slice(0, 3)`; likewise, `df.str.get(i)` is equivalent to `df.str[i]`.

> `get()` and `slice()` work on more than text: like Python's built-in indexing, they can also index into the `list` held in each element.
>
> Combined with `split()` (which returns a `list` for each element), this makes it easy to pull out the last name:

```python=
monte.str.split().str.get(-1)
#0    Chapman
#1     Cleese
#2    Gilliam
#3       Idle
#4      Jones
#5      Palin
#dtype: object
```

:notebook: `monte.str.split().str.get(-1)` can also be written as `monte.str.split().str[-1]`.

##### Special cases when indexing with `str`

> When `[]` picks out the character at a given index, an index beyond the length of the string returns `NaN`:

```python=
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
s.str[0]
#0      A
#1      B
#2      C
#3      A
#4      B
#5    NaN
#6      C
#7      d
#8      c
#dtype: object

s.str[1]
#0    NaN
#1    NaN
#2    NaN
#3      a
#4      a
#5    NaN
#6      A
#7      o
#8      a
#dtype: object
```

#### Indicator variables

> Suppose the codes mean: A="born in America," B="born in the United Kingdom," C="likes cheese," D="likes spam":

```python=
full_monte = pd.DataFrame({'name': monte,
                           'info': ['B|C|D', 'B|D', 'A|C', 'B|D', 'B|C', 'B|C|D']})
full_monte
```
![](https://i.imgur.com/qsndYLM.png)

> Use `get_dummies()` to split the column quickly and convert it into dummy variables:

```python=
full_monte['info'].str.get_dummies('|')
```
![](https://i.imgur.com/T4RtqJn.png)

## Concatenation

### Concatenating a single Series into a string

> `str.cat()` joins a `Series` into a single `str`:

```python=
s = pd.Series(['a', 'b', 'c', 'd'])
s.str.cat(sep=',')
# 'a,b,c,d'
```

> By default `str.cat()` skips missing values; the `na_rep` parameter sets the value to substitute for them:

```python=
t = pd.Series(['a', 'b', np.nan, 'd'])
t.str.cat(sep=',')
# 'a,b,d'

t.str.cat(sep=',', na_rep='-')
# 'a,b,-,d'
```

### Concatenating a Series and something list-like into a Series

> Pass a list-like object of the same length as the original `Series` as the first argument of `str.cat()`, and it returns the element-wise concatenation:

```python=
s.str.cat(['A', 'B', 'C', 'D'])
#0    aA
#1    bB
#2    cC
#3    dD
#dtype: object
```

> If either side has a missing value at a position, the result at that position is `NaN`:

```python=
s.str.cat(t)
#0     aa
#1     bb
#2    NaN
#3     dd
#dtype: object

s.str.cat(t, na_rep='-')
#0    aa
#1    bb
#2    c-
#3    dd
#dtype: object
```

:mag: `na_rep` takes care of this case.

### Extract first match in each subject (extract)

## Extension

### More uses of `str.extract`

> Extracting several capture groups yields a multi-column DataFrame:

```python=
s = pd.Series(['a1', 'b2', 'c3'])
s.str.extract(r'([ab])(\d)')
#     0    1
#0    a    1
#1    b    2
#2  NaN  NaN
```

:waning_crescent_moon: In the row with index 2 one of the groups cannot match, so both columns are `NaN`.

> Use `?` to mark a group as optional:

```python=
s.str.extract(r'([ab])?(\d)')
#     0  1
#0    a  1
#1    b  2
#2  NaN  3
```

> Named groups give the columns their names:

```python=
s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
#  letter digit
#0      a     1
#1      b     2
#2    NaN   NaN
```

> With a single capture group and `expand=False`, the result is a `pd.Series` rather than a one-column DataFrame:

```python=
s.str.extract(r'[ab](\d)', expand=False)
#0      1
#1      2
#2    NaN
#dtype: object
```

### More on `str.findall()`

> `str.findall()` returns only the text that matched:

```python=
s = pd.Series(['Lion', 'Monkey', 'Rabbit'])
s.str.findall('Monkey')
#0          []
#1    [Monkey]
#2          []
#dtype: object

s.str.findall('on')
#0    [on]
#1    [on]
#2      []
#dtype: object
```

> If the pattern matches several times within one element, the result is a list of several strings:

```python=
s.str.findall('b')
#0        []
#1        []
#2    [b, b]
#dtype: object
```

### `str.strip()`: removing leading and trailing whitespace or specified characters

> Before splitting or otherwise processing strings, it is best to make sure there is nothing unexpected at either end.
>
> This is the vectorized counterpart of the built-in [str.strip](https://docs.python.org/3/library/stdtypes.html#str.strip).

```python=
s = pd.Series(['1. Ant.  ', '2. Bee!\n', '3. Cat?\t', np.nan])
s
#0    1. Ant.
#1    2. Bee!\n
#2    3. Cat?\t
#3          NaN
#dtype: object

s.str.strip()
#0    1. Ant.
#1    2. Bee!
#2    3. Cat?
#3        NaN
#dtype: object
```

> Strip one side only, and specify which characters to remove:

```python=
s.str.lstrip('123.')
#0    Ant.
#1    Bee!\n
#2    Cat?\t
#3       NaN
#dtype: object

s.str.rstrip('.!? \n\t')
#0    1. Ant
#1    2. Bee
#2    3. Cat
#3       NaN
#dtype: object
```

> Strip both sides, specifying the characters to remove:

```python=
s.str.strip('123.!? \n\t')
#0    Ant
#1    Bee
#2    Cat
#3    NaN
#dtype: object
```

### Working with an `Index` (e.g. resetting column names)

> Besides `Series`, `str` also works on an `Index`:

```python=
idx = pd.Index([' jack', 'jill ', ' jesse ', 'frank'])
idx.str.strip()
# Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
```

> This is handy for cleaning up column names:

```python=
df = pd.DataFrame(np.random.randn(3, 2),
                  columns=[' Column A ', ' Column B '], index=range(3))
df
#   Column A   Column B
#0   0.469112  -0.282863
#1  -1.509059  -1.135632
#2   1.212112  -0.173215
# (the values are random and will differ on each run)
```

> For example, remove the whitespace on both sides of the column names:

```python=
df.columns.str.strip()
# Index(['Column A', 'Column B'], dtype='object')
```

> Chain `str` methods and assign the result back to reset the column names:

```python=
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
df
#   column_a  column_b
#0  0.469112 -0.282863
#1 -1.509059 -1.135632
#2  1.212112 -0.173215
```

### More uses of `str.replace()`

`str.replace()` has two main parameters:
- pat: *str or compiled regex*, the text to match
- repl: *str or callable*, the replacement text

> Reverse each matched word:

```python=
repl = lambda m: m.group(0)[::-1]
# regex=True is required for a callable repl (the default is regex=False in pandas 2.x)
pd.Series(['foo 123', 'bar baz', np.nan]).str.replace(r'[a-z]+', repl, regex=True)
#0    oof 123
#1    rab zab
#2        NaN
#dtype: object
```

:notebook: "123" does not match the pattern, so it is not reversed.

> Take the second named group and swap its case:

```python=
pat = r"(?P<one>\w+) (?P<two>\w+) (?P<three>\w+)"
repl = lambda m: m.group('two').swapcase()
pd.Series(['One Two Three', 'Foo Bar Baz']).str.replace(pat, repl, regex=True)
#0    tWO
#1    bAR
#dtype: object
```

### Test whether two strings are equal

```python=
df1 = {
    'State': ['Arizona', 'Georgia', 'Newyork', 'Indiana', 'Florida'],
    'State_1': ['New Jersey', 'georgia', 'Newyork', 'Indiana', 'florida']}
df1 = pd.DataFrame(df1, columns=['State', 'State_1'])
print(df1)
```
![](https://i.imgur.com/cfbYP2t.png)

> Check whether the two columns are exactly equal:

```python=
df1['is_equal'] = (df1['State'] == df1['State_1'])
print(df1)
```
![](https://i.imgur.com/POC4wsw.png)

> Compare while ignoring case and whitespace:

```python=
# r'\s+' with regex=True removes every run of whitespace before comparing
df1['is_equal'] = (df1['State'].str.lower().str.replace(r'\s+', '', regex=True)
                   == df1['State_1'].str.lower().str.replace(r'\s+', '', regex=True))
print(df1)
```
![](https://i.imgur.com/I9Kw7iG.png)
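The case- and whitespace-insensitive comparison above can be factored into a small helper so that both columns are normalized in exactly the same way. The sketch below is one possible way to do that: `normalize_text` is a hypothetical helper name, not a pandas function, and the `State_1` entry `'New york'` has been changed from the data above (an assumption for illustration) so that the whitespace removal has a visible effect.

```python
import pandas as pd

def normalize_text(col: pd.Series) -> pd.Series:
    """Lowercase and drop all whitespace (hypothetical helper, not part of pandas)."""
    return col.str.lower().str.replace(r'\s+', '', regex=True)

df1 = pd.DataFrame({
    'State':   ['Arizona', 'Georgia', 'Newyork', 'Indiana', 'Florida'],
    # 'New york' is altered from the original data to exercise whitespace removal
    'State_1': ['New Jersey', 'georgia', 'New york', 'Indiana', 'florida'],
})

# Strict comparison: case and spacing differences count as mismatches
df1['is_equal_strict'] = df1['State'] == df1['State_1']

# Normalized comparison: only the letters themselves are compared
df1['is_equal'] = normalize_text(df1['State']) == normalize_text(df1['State_1'])

print(df1)
```

Keeping the strict and the normalized results side by side makes it easy to see which rows differ only by case or spacing.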
