---
tags: python, xml, xml.etree.ElementTree, Document
---

# xml.etree.ElementTree Notes

[Source code](Lib/xml/etree/ElementTree.py)

:::danger
These notes cover only reading and parsing XML. For modifying and building XML, see the [documentation](https://docs.python.org/2/library/xml.etree.elementtree.html).
:::

- The **Element** type stores a hierarchical data structure in ==*memory*==
- Each Element has the following properties
    - tag
    - attributes (stored in a Python dictionary)
    - text
    - tail
    - child elements (stored in a Python sequence)
- ==**xml.etree.cElementTree**== is the same API implemented in C

:::info
On Python 2, prefer **xml.etree.cElementTree**: it is faster and uses less memory. Since Python 3.3, `xml.etree.ElementTree` automatically uses the C accelerator when it is available, so importing cElementTree is unnecessary there (the module was deprecated in 3.3 and removed in 3.9).
:::

---

## Tutorial

Sample XML

```xml
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
```

----

### XML tree and elements

- ==**ElementTree**== represents the whole XML document
- ==**Element**== represents a single node in the tree

----

### Parsing XML

- Two ways to read XML
    1. From a file on disk
        ```python
        import xml.etree.ElementTree as ET
        tree = ET.parse('country_data.xml')
        root = tree.getroot()
        ```
    2. From a string
        ```python
        root = ET.fromstring(country_data_as_string)
        ```

:::info
`fromstring()` parses XML directly into an **Element**, which is the root of the parsed tree. Other parsing functions may return an **ElementTree** instead.
:::

- Iterate over the root's children
    ```python
    >>> for child in root:
    ...     print(child.tag, child.attrib)
    ...
    country {'name': 'Liechtenstein'}
    country {'name': 'Singapore'}
    country {'name': 'Panama'}
    ```
- Access by index
    ```python
    >>> root[0][1].text
    '2008'
    ```

----

### Finding interesting elements

`Element.iter()` walks the whole subtree below the element, recursively.

```python
>>> for neighbor in root.iter('neighbor'):
...     print(neighbor.attrib)
...
{'name': 'Austria', 'direction': 'E'}
{'name': 'Switzerland', 'direction': 'W'}
{'name': 'Malaysia', 'direction': 'N'}
{'name': 'Costa Rica', 'direction': 'W'}
{'name': 'Colombia', 'direction': 'E'}
```

`Element.findall()` & `Element.find()` look only at direct children of the element: `findall()` returns every match, `find()` returns the first one.

```python
>>> for country in root.findall('country'):
...     rank = country.find('rank').text
...     name = country.get('name')
...     print(name, rank)
...
Liechtenstein 1
Singapore 4
Panama 68
```

----

### Parsing XML with Namespaces

- [XML namespace](https://en.wikipedia.org/wiki/XML_namespace)
- Prefixed tags and attributes (`prefix:sometag`) are expanded to the form `{uri}sometag`; when a default namespace is declared, its URI is also prepended to every unprefixed tag

Sample XML

```xml
<?xml version="1.0"?>
<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
        <fictional:character>Archie Leach</fictional:character>
    </actor>
    <actor>
        <name>Eric Idle</name>
        <fictional:character>Sir Robin</fictional:character>
        <fictional:character>Gunther</fictional:character>
        <fictional:character>Commander Clement</fictional:character>
    </actor>
</actors>
```

There are two ways to search it

1. Add the URI to every tag by hand

    ```python
    import xml.etree.ElementTree as ET

    # xml_text holds the sample XML above as a string
    root = ET.fromstring(xml_text)
    for actor in root.findall('{http://people.example.com}actor'):
        name = actor.find('{http://people.example.com}name')
        print(name.text)
        for char in actor.findall('{http://characters.example.com}character'):
            print(' |-->', char.text)
    ```

:::success
When the tag strings get too long, build them with string formatting.
:::

2. (Better) Use XPath with a prefix-to-URI dictionary

    ```python
    # The prefixes are your own choice; they need not match those in the XML
    ns = {'real_person': 'http://people.example.com',
          'role': 'http://characters.example.com'}

    for actor in root.findall('real_person:actor', ns):
        name = actor.find('real_person:name', ns)
        print(name.text)
        for char in actor.findall('role:character', ns):
            print(' |-->', char.text)
    ```

:::success
For more on XPath, see [XPath expressions](https://www.w3.org/TR/xpath/)
:::

Output

```shell
John Cleese
 |--> Lancelot
 |--> Archie Leach
Eric Idle
 |--> Sir Robin
 |--> Gunther
 |--> Commander Clement
```

---

## XPath support

### Supported XPath syntax

| Syntax | Meaning |
|---|---|
|tag|Selects all child elements with the given tag.<br> E.g. `spam` selects all child elements named spam; `spam/egg` selects all grandchildren named egg inside all children named spam.|
|\*|Selects all child elements.|
|.|Selects the current node.|
|//|Selects all subelements, on all levels beneath the current element.<br> E.g. `.//egg` selects all elements named egg in the entire tree.|
|..|Selects the parent element.|
|[@attrib]|Selects all elements that have the given attribute.|
|[@attrib='value']|Selects all elements whose given attribute equals value. (The value cannot contain quotes.)|
|[tag]|Selects all elements that have a child named tag. (Only immediate children are supported.)|
|[tag='text']|Selects all elements that have a child named tag whose complete text content, including descendants, equals the given text.|
|[position]|Selects the elements located at the given position; `last()` is also accepted. (**Positions start at 1**)|

:::danger
Every predicate (an expression in square brackets) must be preceded by a tag name, `*`, or another predicate.
:::

### Example

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(countrydata)

# Top-level elements
root.findall(".")

# All 'neighbor' grand-children of 'country' children of the top-level
# elements
root.findall("./country/neighbor")

# Nodes with name='Singapore' that have a 'year' child
root.findall(".//year/..[@name='Singapore']")

# 'year' nodes that are children of nodes with name='Singapore'
root.findall(".//*[@name='Singapore']/year")

# All 'neighbor' nodes that are the second child of their parent
root.findall(".//neighbor[2]")
```

Output

```python=
[<Element 'data' at 0x7fd5ace7e818>]
[<Element 'neighbor' at 0x7fd5ab253548>, <Element 'neighbor' at 0x7fd5ab253598>, <Element 'neighbor' at 0x7fd5ab253728>, <Element 'neighbor' at 0x7fd5ab2538b8>, <Element 'neighbor' at 0x7fd5ab253908>]
[<Element 'country' at 0x7fd5ab2535e8>]
[<Element 'year' at 0x7fd5ab253688>]
[<Element 'neighbor' at 0x7fd5ab253598>, <Element 'neighbor' at 0x7fd5ab253908>]
```

---

## Reference

### [Functions](https://docs.python.org/2/library/xml.etree.elementtree.html#functions)

### [Element Objects](https://docs.python.org/2/library/xml.etree.elementtree.html#element-objects)
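
---

As a recap of the calls covered in these notes, here is a minimal Python 3 sketch that runs end to end without an external file. It assumes trimmed copies of the two sample documents; the names `COUNTRY_XML`, `ACTORS_XML`, `ranks`, `cast` and the other variables are made up for this illustration and do not come from the library.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the country sample (assumption: two countries are enough).
COUNTRY_XML = """<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <neighbor name="Malaysia" direction="N"/>
    </country>
</data>"""

# Trimmed copy of the namespaced sample.
ACTORS_XML = """<actors xmlns:fictional="http://characters.example.com"
        xmlns="http://people.example.com">
    <actor>
        <name>John Cleese</name>
        <fictional:character>Lancelot</fictional:character>
    </actor>
</actors>"""

# fromstring() returns the root Element directly.
root = ET.fromstring(COUNTRY_XML)

# findall()/find() search direct children; get() reads an attribute.
ranks = {c.get("name"): int(c.find("rank").text)
         for c in root.findall("country")}

# iter() walks the whole subtree in document order.
neighbor_names = [n.get("name") for n in root.iter("neighbor")]

# XPath predicates: attribute match, 1-based position, child text match.
singapore_year = root.find(".//country[@name='Singapore']/year").text
second_neighbors = [n.get("name") for n in root.findall(".//neighbor[2]")]
ranked_fourth = [c.get("name") for c in root.findall("country[rank='4']")]

# Namespaces: map self-chosen prefixes to the URIs declared in the XML.
ns = {"real_person": "http://people.example.com",
      "role": "http://characters.example.com"}
actors = ET.fromstring(ACTORS_XML)
cast = {
    actor.find("real_person:name", ns).text:
        [ch.text for ch in actor.findall("role:character", ns)]
    for actor in actors.findall("real_person:actor", ns)
}

print(ranks)
print(neighbor_names)
print(singapore_year, second_neighbors, ranked_fourth)
print(actors.tag, cast)
```

Printing `actors.tag` shows the expanded `{uri}sometag` form, which is why the plain tag name `actor` would not match without either the braces or the prefix dictionary.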