smartproxy
Back to blog

Parsing XML in Python – The Ultimate Guide 2024

Standards are a means to clear and define communication between people and things in the world. For example, the human language, USB sockets on computers, or the fact that you must add cereal before pouring milk. When it comes to computer applications and systems, one standard stands out above the rest as the most popular choice for developers – XML (eXtensible Markup Language). In this article, we’ll explore how you can parse data from XML files using Python’s built-in libraries, see the best methods to do so, and understand the importance of effectively reading information.

Martin Ganchev

Martin Ganchev

Feb 28, 2024

13 min. read

Parse XML Python - Hero Image

What is XML?

XML (eXtensible Markup Language) is a markup language used in many applications designed to store and transport data. It’s a standard for creating structured documents and exchanging data between different systems and applications on the web. An XML file uses tags to define elements within a document, similar to HTML (Hypertext Markup Language). However, while HTML is designed to display data, XML is designed to store and use data interchangeably in various applications.

XML example

Let’s take a look at an example XML file:

<?xml version="1.0" encoding="UTF-8"?>
<animal>
<item>
<name>cat</name>
<description> Cats are small, carnivorous mammals that are often kept as pets. Known for their agility, flexibility, and independent nature, cats make delightful companions.</description>
<breeds>
<breed>Persian</breed>
<breed>Siamese</breed>
<breed>Maine Coon</breed>
<breed>Bengal</breed>
<breed>Ragdoll</breed>
</breeds>
</item>
</animal>

Here, you can see many elements are organized and structured hierarchically in the content within a document. In this example, these are the following:

  • XML declaration – a brief overview of the XML document, version, and encoding;
  • <animal> is the root element;
  • <item> represents an instance of an animal, specifically a cat;
  • <name> contains the name of the animal, which is “cat”;
  • <description> provides a short description of the “cat”;
  • <breeds> is a container for listing different cat breeds using the <breed> element.

These lists are easy for humans and computers to read and understand. That’s why they’re so popular in storing and exchanging data between different systems – it doesn’t overcomplicate things and allows complete freedom to create organized item lists in any way you want.

How do you parse XML with Python?

Python offers several built-in ways of parsing XML files, so you don’t need to rush to install any external libraries. The methods are the xml.dom.minidom and the xml.etree.ElementTree modules and XPath expressions. Let’s take a closer look at each of them.

xml.dom.minidom is a module in Python's standard library that provides a minimal implementation of the Document Object Model (DOM) for working with XML documents. The DOM represents the structure of XML documents as a tree of objects, making it simple to manage, traverse, and modify XML file content through code. It provides a full representation of the document, allowing you to manipulate and look through the XML data quickly. While it’s a very convenient way of doing so, this approach is memory-intensive and may not be efficient for large XML documents.

xml.etree.ElementTree provides a slightly different method to parse and manipulate XML documents. The module is based on the ElementTree API, which is more lightweight, efficient, and designed to be easy for everyday XML file processing tasks. It represents XML documents as a tree of Element objects. Each Element object represents an XML element in the document, and it can have child elements, attributes, and text content. While it lacks some of the features of the full DOM, it’s often faster and uses less memory, making it suitable for large documents.

XPath is another method you can use to navigate through an XML file. While it’s not a Python library, it’s a universal method to parse XML files that can be used in many other programming languages, such as JavaScript, Java, C, or C++. You can even use it to scrape the web with Google Sheets! It follows the logic of building an expression by writing a path from the root element to the desired destination node that details how an XML element should be reached. It’s not the easiest nut to crack, but it’s useful.

These are just a few ways of parsing XML files in Python. While there are many other methods to do it, the Python coding language stands out as a convenient and easy tool due to its lightweight, easy-to-use modules and simple implementation. In the following sections, we’ll explore the xml.etree.ElementTree module and XPath expression language in greater detail and see how they can be used with actual data to extract valuable data from XML files.

Parsing an XML document with ElementTree

Let’s begin with Python’s xml.etree.ElementTree module. Since data in an XML file is structured hierarchically, it makes the most sense to represent it as a tree of elements that branch out from one another. The module has two classes for this purpose – ElementTree represents the whole XML document as a tree, while Element is a single node in a tree. Think of it as a real tree – it starts at the roots and branches out in several ways, but there's an apple at the end of each branch. You need to navigate through the branches like a nimble squirrel to get a taste of that juicy fruit of XML data.

Using the ElementTree

For this example, we’ll be using the following XML file:

<?xml version="1.0" encoding="UTF-8"?>
<pets>
<cat>
<name>Jinx</name>
<age>2</age>
<color>Gray</color>
<breed>Maine Coon</breed>
</cat>
<cat>
<name>Kafka</name>
<age>3</age>
<color>White</color>
<breed>Siamese</breed>
</cat>
<cat>
<name>Mori</name>
<age>1</age>
<color>Orange</color>
<breed>Tabby</breed>
</cat>
</pets>

First, create a Python script file and import the xml.etree.ElementTree library. It’s commonly shortened with an ET alias.

import xml.etree.ElementTree as ET

Next, import the data from the XML file. You can directly paste the XML file string in the code and parse from it. Notice the triple quotation marks (""") that are used on both sides of the XML document string when assigning it to a variable:

import xml.etree.ElementTree as ET
xml = """<?xml version="1.0" encoding="UTF-8"?>
<pets>
<cat>
<name>Jinx</name>
<age>2</age>
<color>Gray</color>
<breed>Maine Coon</breed>
</cat>
<cat>
<name>Kafka</name>
<age>3</age>
<color>White</color>
<breed>Siamese</breed>
</cat>
<cat>
<name>Mori</name>
<age>1</age>
<color>Orange</color>
<breed>Tabby</breed>
</cat>
</pets>
"""
root_element = ET.fromstring(xml)
print(root_element)

Alternatively, a better practice is to save the XML in a separate file. Let’s name it data.xml. You can then parse information from it like this:

import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
root_element = tree.getroot()
print(root_element)

Both code examples will return the same result – the root element of the XML tree. You can navigate through the entire tree to get the XML data you want from here.

Navigating the XML tree

Following our code example, we extracted the root element of the XML file. Every node will be a descendant of the root element pets. Keeping this in mind, you can easily navigate through the tree by specifying the nodes you want to get by iterating through child elements or by index.

Loop method

You’ll need to write a loop to navigate through the tree by searching for child elements:

for child in root_element:
print(child.tag)

The script will only print the direct child elements from the <pets> tag, resulting in just several <cat> tags being printed. To iterate further, you’ll need to iterate through each <cat> and print the text information:

for child in root_element:
for desc in child:
print(desc.text)

Index method

While the loop method helps print as much information as possible, writing so many nested loops can be a hassle. To get to a specific node more quickly, you can use indexes to point to where the element is. For example, let’s get the color of the first cat. If you take a brief look at our XML, you can assign a number to each node based on where it’s located relevant to its parent element:

...
<pets> #0
<cat> #0
<name>Jinx</name> #0
<age>2</age> #1
<color>Gray</color> #2 (the element you want to get)
<breed>Maine Coon</breed> #3
</cat>
...

Python is a zero-based language, so you’ll need to remember to start counting from 0 instead of 1. Since you want to get the first cat, you’ll write the first index as 0. Next, you’ll need to see which node is the <color> tag – it’s the 3rd node under <cat>; therefore, assign the index of 2.

cat_info = root_element[0][2].text
print(cat_info)

Other methods

The Element class has several other valuable methods to get the desired information. One is Element.iter(), which iterates through every node below it. You can even specify which element you want to print. For example, let’s get information about how old each cat is by specifying the “age”:

for age in root_element.iter("age"):
print(age.text)

Element.find() will find the first child of a node with the particular tag. Element.findall() will find all nodes that are direct child elements of a current node:

for animal in root_element.findall("cat"): # Loop and find all direct child elements of <pets>, which are <cat>
name = animal.find("name").text # Find the first <name> element under a particular <cat>
print(name) # Print the name

Using any of these methods, you can build a precise script to navigate to a piece of information in an XML file. In a real-world application, the XML files might be far more complex than the example here, and you’ll need to write longer loops, dynamic indexes, and take advantage of the various Element class methods.

Extracting data from XML

To extract data from an XML file, use one of the examples above to navigate the XML file and get only the needed XML data. The next step depends entirely on what you’re trying to do – it may be enough to print the information to the terminal, or you could save it to a CSV file for later use and analysis.

To print it to the Terminal or an equivalent command line tool, simply use the print() function as before. 

Saving the information to a CSV file will require a little more effort. You’ll need to import the csv library to write the XML data to a file. Here, the script will only get each cat's name and age and save that information to a CSV file. Follow the comments in the code if you’re unsure what each line is for:

import xml.etree.ElementTree as ET
import csv # Import the CSV library
import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
root_element = tree.getroot()
with open("cats_data.csv", 'w', newline='') as csvfile: # Open the CSV file to write in
csv_writer = csv.writer(csvfile) # Create a new CSV writer object
csv_writer.writerow(['Name', 'Age']) # Write the header information
for cat in root_element.findall('cat'): # Iterate through all cat elements in the XML
name = cat.find('name').text # Extract the name
age = cat.find('age').text # Extract the age
csv_writer.writerow([name, age]) # Write the information into the CSV file
print(f"Data saved to CSV file.")

After you run the script, you should see the cats_data.csv file appear in the same directory as your script file. If you check it, you’ll see it matches the data from the XML file:

Name,Age
Jinx,2
Kafka,3
Mori,1

That’s it! This data can be used for further analysis with other tools and applications. No matter how much extra data will be added to the XML file, this script will always print the name and age of each cat and export it to CSV.

Parsing an XML document with XPath

XPath (XML Path Language) is a query language for selecting nodes from an XML document. It provides a way to navigate and query the hierarchical structure of XML data by specifying the location of elements and attributes. XPath uses a path notation, similar to file paths in a file system, to traverse the XML document tree and get to specific nodes.

XPath is a vibrant expression language featuring 200+ functions for various purposes, from simple to complex. For a more in-depth look into all the possible methods and ways XPath can be used, check out the W3Schools tutorial or refer to the devhints.io cheatsheet.

Using XPath

To use XPath, we’ll continue using the Python xml.etree.ElementTree library. This is because Python can’t read or understand XPath queries on its own. The Element class has a method we’ve briefly explored – Element.findall(). To write XPath queries, you’ll simply need to use this method with the paths inside it.

Searching XML with XPath

We’ll be working with the same XML data as before, so copy and save the XML file from above if you haven’t done so already. Let’s start with a familiar example – find the root element of the XML file tree. Begin by importing the xml.etree.ElementTree library and get the XML data from the data.xml file.

import xml.etree.ElementTree as ET
tree = ET.parse("data.xml")
root_element = tree.getroot()

The next step is to use the Element.findall() method and pass the “.” string to it. This is XPath syntax for selecting the current node. Since you haven’t navigated anywhere yet, the current node will always be the root of the tree. Here’s what it looks like in the code:

root = root_element.findall(".")
print(root)

Those are the basics of using XPath queries with Python to get data from an XML file. You may also use libraries such as lxml or libxml2 that can do the same job for you. Read their dedicated documentation to learn more about using them together with Python to parse XML files. 

Using XPath to parse XML

To get more specific data, the XPath becomes slightly more complex. Let’s say you want to extract the breeds of each cat. The XPath query will look something like this:

# Use .findall() with XPath to get all breeds
breeds = root_element.findall('.//cat/breed')
# Print the breeds
for breed in breeds:
print(breed.text)

This script first selects the current root node to start with. Then, with a double slash (//) it selects all subelements beneath the root. In this case, //cat selects all <cat> nodes. Finally, the single slash (/) will select one element from the parent node, so /breed gets the first <breed> element under <cat>. All of this information is stored in the breeds array, so for the final step, you’ll need to write a loop that will select each Element item from the array and print its text value. This is the result:

Maine Coon
Siamese
Tabby

You can even write more complex XPath queries that only pick out items with specific parameters. Say you wanted to get only the names of cats that are three years old – this is how the script would look like:

# Use .findall() with XPath to get names of cats that are 3 years old
cat_names_3_years_old = root_element.findall('.//cat[age="3"]/name')
# Print the names
for cat_name in cat_names_3_years_old:
print(cat_name.text)

This XPath is similar to the one previously used, but it uses brackets ([]) with a parameter to only select <cat> nodes that have the <age> value of 3. From here, it’ll select the names from the filtered results and print their names. 

While XPath might initially seem daunting, it offers unlimited possibilities and flexibility when selecting nodes from a tree. Experiment with other queries, modify the example data or run the scripts through different XML files and see what they return.

Best practices

Now that you’ve learned the basics of parsing XML documents with Python, you can write your scripts and integrate them into your environment for effective XML data management. While trial and error are the best ways to succeed in writing scripts, here are some valuable tips to keep in mind that will help you out in the long run:

  • Handle errors. When parsing an XML document, be prepared for errors such as invalid or missing elements. Implement error-handling mechanisms to handle these situations, preventing unexpected crashes and making your code more robust.
  • Compatibility. Be mindful of XML encoding and ensure that your Python environment supports the encoding in the XML document. Additionally, consider using libraries that support the latest XML standards to ensure compatibility with various XML formats and specifications.
  • Iterative parsing. For large XML documents, consider using iterative parsing to process the XML in a memory-efficient manner. This involves iterating through the XML elements one at a time instead of loading the entire document into memory, reducing the risk of memory issues.
  • Close files. If you're parsing XML from a file, it's good to close it afterward. This ensures system resources are released promptly, preventing potential issues with file locks or resource leaks.

Wrapping up

In this comprehensive guide, you’ve learned how to use Python and its various libraries to parse XML documents. ElementTree and XPath are invaluable tools that do the job flawlessly, so choosing which to use is entirely up to you. Employ this knowledge by building upon the example scripts or writing your own from scratch to parse XML data from any XML file.

About the author

Martin Ganchev

Martin Ganchev

VP Enterprise Partnerships

Martin, aka the driving force behind our business expansion, is extremely passionate about exploring fresh opportunities, fostering lasting relationships in the proxy market, and, of course, sharing his insights with you.

Linkedin: https://www.linkedin.com/in/martinganchev/

All information on Smartproxy Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Smartproxy Blog or any third-party websites that may be linked therein.

In this article

What is XML?

Save big by bundling residential and mobile proxies

Never tried mobile proxies? Now is the best time to leverage the highest success rates and premium ASNs.

Grab a bundle

or

smartproxy
Chat with us

Related articles

Python Errors and Exceptions

Python Errors and Exceptions: An Ultimate Guide to Different Types and Solutions

In this article, we’ll explore the different kinds of errors and exceptions, what causes them, and provide solutions to solving them. No more headaches and cursing your code until it gets scared and starts working – master the language of Python to understand precisely what it wants from you.

smartproxy

James Keenan

Feb 05, 2024

6 min read

Parsing HTML and XML documents with lxml.

lxml Tutorial: Parsing HTML and XML Documents

Keepin’ it short and sweet: data parsing is a process of computer software converting unstructured and often unreadable data into structured and readable format. Parsing offers a lot of benefits, some of which include work optimization, saving time, reducing costs, and many more; in addition, you can use parsed data in plenty of different situations. Even tho that sounds epic, parsing itself can be quite complicated. But hold on, buddy, and get ready to explore a step-by-step process on how to parse HTML and XML documents using lxml.

smartproxy

James Keenan

Mar 10, 2022

5 min read

Frequently asked questions

What is the best XML parser for Python?

Several XML parsers are available for Python, and the best one depends on your specific needs and preferences. However, the two most widely used XML parsers in the Python ecosystem are ElementTree (xml.etree.ElementTree) and lxml.

Does Python have a built-in XML parser?

Yes, Python has a couple of built-in XML parsers – xml.etree.ElementTree and xml.dom.minidom. ElementTree provides a cleaner and more memory-efficient API for parsing and manipulating XML. At the same time, minidom offers a more comprehensive but potentially memory-intensive representation of the entire XML document in memory. For more straightforward XML tasks, ElementTree is often preferred, while minidom might be suitable for scenarios requiring extensive DOM (Document Object Model) functionality.

Is it safe to parse XML in Python?

Yes, parsing XML in Python is usually safe when using trusted XML parsing libraries like xml.etree.ElementTree or lxml. These libraries are designed to handle various aspects of XML parsing, including security considerations. However, validating input and limiting resource usage is crucial to ensure proper XML file handling. Always be cautious when processing XML from untrusted sources and implement additional security measures if necessary.

What is the Python library for parsing XML files?

Python provides several XML parsing libraries, namely ElementTree (xml.etree.ElementTree) and lxml. Both of these libraries are widely used, and the choice between them often depends on the specific requirements of your XML processing tasks.

Get in touch

Follow us

Company

© 2018-2024 smartproxy.com, All Rights Reserved