XXE Attacks: XML Parsing Gone Wrong
XML External Entity (XXE) injection is one of those vulnerabilities that fascinated me more the deeper I dug into it. The core issue is that the XML spec supports external entities, a feature that lets XML documents pull in content from external sources, and many parsers have historically enabled it by default. When an app parses untrusted XML without disabling that feature, an attacker can read files the server process has access to, perform server-side request forgery (SSRF), and in some configurations even get remote code execution. What surprised me most when researching this was how straightforward the exploitation is compared to how long these bugs survive in production. The attack payloads are simple, but the parser defaults are so permissive that developers often have no idea the risk exists.
How XXE Works
XML supports Document Type Definitions (DTDs) that can define entities: named references that expand to their defined content during parsing. External entities reference content from URIs:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root>
<data>&xxe;</data>
</root>
When a vulnerable parser processes this document, it resolves &xxe; by reading /etc/passwd and inserting its contents into the <data> element. The application then returns or processes this content, leaking the file to the attacker. The simplicity of this payload is what makes XXE so striking. There are no encoding tricks or obfuscation needed, just standard XML features used as intended by the spec.
Variants
The research literature describes several XXE variants, each with its own exploitation approach:
- Classic XXE: Entity expansion includes file contents in the response. This is the straightforward case: you see the data come right back.
- Blind XXE: The application doesn’t return the entity value, but data can be exfiltrated via out-of-band channels (HTTP requests to a server you control). Setting up OOB exfiltration takes more work, but it’s well-documented.
- XXE via DTD: External DTD files can define parameter entities that chain together for data exfiltration. This is the technique to reach for when classic XXE doesn’t work.
- Billion Laughs (XML Bomb): Nested entity definitions that expand exponentially, causing denial of service. A few lines of XML can consume gigabytes of memory.
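To make the "few lines of XML" claim concrete, here is a rough sketch of the arithmetic behind Billion Laughs. The function names, the entity names (`lol0` to `lol9`), and the fan-out of 10 are illustrative assumptions based on the commonly cited form of the payload. The code only builds the document text and computes the theoretical expansion; it never parses it.

```python
def expanded_size(levels: int = 9, fanout: int = 10, base: str = "lol") -> int:
    """Characters produced if the top-level entity were fully expanded."""
    # Each level repeats the previous level `fanout` times.
    return len(base) * fanout ** levels


def build_nested_entities(levels: int = 9, fanout: int = 10, base: str = "lol") -> str:
    """Build the document text only; do NOT feed this to an unprotected parser."""
    lines = ['<?xml version="1.0"?>', "<!DOCTYPE lolz [", f'  <!ENTITY lol0 "{base}">']
    for i in range(1, levels + 1):
        refs = f"&lol{i - 1};" * fanout  # reference the previous level `fanout` times
        lines.append(f'  <!ENTITY lol{i} "{refs}">')
    lines += ["]>", f"<lolz>&lol{levels};</lolz>"]
    return "\n".join(lines)


if __name__ == "__main__":
    doc = build_nested_entities()
    print(f"document size: {len(doc)} characters")
    print(f"expanded size: {expanded_size()} characters")
```

With nine levels and a fan-out of ten, a document well under two kilobytes asks the parser to materialise three billion characters, which is why entity expansion limits matter even when external entities are disabled.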
Java: The Most Common XXE Target
Java applications are far and away the most frequent XXE targets based on public vulnerability reports. Java’s XML parsing libraries enable external entities by default, and there are so many parser APIs that developers inevitably miss one.
Vulnerable Pattern
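import java.io.StringReader;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

@PostMapping("/api/import")
public ResponseEntity<?> importData(@RequestBody String xmlInput) {
    try {
        // Factory is used with its defaults: DTDs and external entities are processed
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = factory.newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xmlInput)));
        // The expanded entity value ends up in the element text
        String value = doc.getElementsByTagName("data")
                .item(0).getTextContent();
        return ResponseEntity.ok(Map.of("imported", value));
    } catch (Exception e) {
        return ResponseEntity.badRequest().body("Invalid XML");
    }
}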
An attacker sends the XXE payload and receives the contents of /etc/passwd in the response. This exact pattern shows up in Spring Boot apps, legacy servlets, and SOAP services. The same vulnerability exists in SAXParserFactory, XMLInputFactory (StAX), TransformerFactory, and SchemaFactory. Each one needs its own set of feature flags to disable external entities, which is part of why this keeps happening.
Secure Configuration
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
// Disable external entities and DTDs
factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
factory.setXIncludeAware(false);
factory.setExpandEntityReferences(false);
The most restrictive option is disallow-doctype-decl, which rejects any XML with a DTD declaration. It prevents all XXE variants, though it also rejects legitimate XML that uses DTDs. In practice, that’s rarely a problem: most modern XML doesn’t need DTDs, and the safety trade-off is worth it.
Python: Library-Dependent Behaviour
Python’s XML parsing landscape is confusing because different libraries have different default behaviours. The standard library, lxml, and defusedxml all handle entities differently, and the differences matter.
Vulnerable: xml.etree.ElementTree (Partial)
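import xml.etree.ElementTree as ET

from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/import", methods=["POST"])
def import_data():
    xml_input = request.data
    # Standard-library parser: no external entity fetch, but internal entities expand
    root = ET.fromstring(xml_input)
    value = root.find("data").text
    return jsonify({"imported": value})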
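ElementTree in CPython doesn’t resolve external entities by default: a reference to one surfaces as a ParseError for an undefined entity rather than a file read. It does expand internal entities, though, so it is exposed to the Billion Laughs attack unless the underlying Expat library is recent enough (2.4.1 or later) to enforce its own amplification limits. Teams sometimes mark this as “not vulnerable to XXE” in their security assessments and completely miss the DoS angle. The xml.sax and xml.dom.minidom modules have similar partial protections.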
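A quick way to see how the standard library treats each kind of entity on a given interpreter is to feed it both and look at the result. The sketch below is an illustration, not a test suite: the helper name `try_parse` and both sample documents are made up for this example, and the behaviour noted in the comments is what I would expect from CPython's Expat-backed parser.

```python
import xml.etree.ElementTree as ET

# External entity: points at a file on disk.
EXTERNAL = """<?xml version="1.0"?>
<!DOCTYPE foo [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root><data>&xxe;</data></root>"""

# Internal entity: the replacement text lives in the document itself.
INTERNAL = """<?xml version="1.0"?>
<!DOCTYPE foo [
  <!ENTITY greeting "hello">
]>
<root><data>&greeting;</data></root>"""


def try_parse(xml_text: str) -> str:
    """Return the <data> text, or a 'rejected: ...' message if parsing fails."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        return f"rejected: {exc}"
    return root.findtext("data")


if __name__ == "__main__":
    # Expected on CPython: the external reference is reported as an undefined entity.
    print("external:", try_parse(EXTERNAL))
    # Expected on CPython: the internal entity is expanded in place.
    print("internal:", try_parse(INTERNAL))
```

The second result is the one to keep in mind: internal expansion is exactly the mechanism Billion Laughs relies on.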
Vulnerable: lxml with Default Settings
from flask import Flask, jsonify, request
from lxml import etree

app = Flask(__name__)

@app.route("/api/import", methods=["POST"])
def import_data():
    xml_input = request.data
    # Parser created without any hardening options
    parser = etree.XMLParser()
    root = etree.fromstring(xml_input, parser)
    value = root.find("data").text
    return jsonify({"imported": value})
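Here’s the one that catches people: lxml has historically resolved entities by default (resolve_entities=True), and it’s one of the most widely used XML libraries in the Python ecosystem. Recent releases (5.0 and later, per the changelog) default to resolving only internal entities, but plenty of deployments still pin older versions or pass resolve_entities=True explicitly. This is the most common XXE vector in Python applications based on what I’ve seen in vulnerability databases and code reviews.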
Secure Configuration
from lxml import etree
parser = etree.XMLParser(
    resolve_entities=False,
    no_network=True,
    dtd_validation=False,
    load_dtd=False,
)
root = etree.fromstring(xml_input, parser)
Or use defusedxml, a drop-in replacement that disables all dangerous XML features:
import defusedxml.ElementTree as ET
root = ET.fromstring(xml_input) # Safe by default
defusedxml is the kind of library that should be in every Python project’s requirements: set it and forget it. Its ElementTree module mirrors the standard library’s parsing API, so there’s very little learning curve.
C: libxml2
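C applications typically use libxml2 for XML parsing. Older releases (before 2.9.0) would load external entities with default settings; newer releases only do so when the caller opts in through parser options such as XML_PARSE_NOENT or XML_PARSE_DTDLOAD, or through the legacy global xmlSubstituteEntitiesDefault(1). When I looked into how C codebases handle XML, I found that correct XXE mitigation was the exception rather than the rule.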
Vulnerable Pattern
#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/tree.h>

int parse_xml_input(const char *xml_string) {
    // No parser options are passed, so behaviour follows the library-wide defaults
    xmlDocPtr doc = xmlParseMemory(xml_string, (int)strlen(xml_string));
    if (doc == NULL) {
        return -1;
    }
    xmlNodePtr root = xmlDocGetRootElement(doc);
    if (root == NULL) {
        xmlFreeDoc(doc);
        return -1;
    }
    xmlNodePtr node = root->children;
    while (node != NULL) {
        if (xmlStrcmp(node->name, (const xmlChar *)"data") == 0) {
            // Includes the replacement text of any substituted entity
            xmlChar *content = xmlNodeGetContent(node);
            printf("Data: %s\n", content);
            xmlFree(content);
        }
        node = node->next;
    }
    xmlFreeDoc(doc);
    return 0;
}
Secure Configuration
// Keep entity substitution and external DTD loading off; forbid network access
xmlParserCtxtPtr ctxt = xmlNewParserCtxt();
xmlCtxtUseOptions(ctxt,
    XML_PARSE_NONET  // Forbid network access
    // Do NOT add XML_PARSE_NOENT: setting it turns entity substitution ON
    // Do NOT add XML_PARSE_DTDLOAD: setting it loads the external DTD subset
);
// Or make sure the legacy global defaults are off
xmlSubstituteEntitiesDefault(0);
xmlLoadExtDtdDefaultValue = 0;
One thing that tripped me up when researching this: the XML_PARSE_NOENT flag name is misleading. It reads like “no entities”, but setting it enables entity substitution, so a hardened configuration leaves it unset. The same applies to XML_PARSE_DTDLOAD, which switches external DTD loading on rather than off. The libxml2 documentation isn’t great about clarifying this. Always test your configuration with an actual XXE payload to verify it works as expected.
Go: Secure by Default
Go’s encoding/xml package doesn’t support external entities or DTD processing, which makes it immune to XXE by default. This is one of those things I genuinely appreciate about Go’s standard library: secure defaults that just work.
import (
	"encoding/json"
	"encoding/xml"
	"net/http"
)

type ImportData struct {
	XMLName xml.Name `xml:"root"`
	Data    string   `xml:"data"`
}

func importHandler(w http.ResponseWriter, r *http.Request) {
	var data ImportData
	// The decoder never fetches external content
	decoder := xml.NewDecoder(r.Body)
	if err := decoder.Decode(&data); err != nil {
		http.Error(w, "Invalid XML", http.StatusBadRequest)
		return
	}
	json.NewEncoder(w).Encode(map[string]string{"imported": data.Data})
}
This is safe. Go’s XML parser never fetches external content: Decode skips the DTD declaration without processing it, and in the decoder’s default strict mode a reference to an entity it doesn’t know, such as &xxe;, fails with a syntax error instead of being expanded. However, if a Go application shells out to an external XML tool (like xsltproc or xmllint) or uses a CGo binding to libxml2, the external tool’s defaults apply. I came across a case study where a Go service was shelling out to xsltproc for XSLT transformations. The Go code itself was fine, but the subprocess wasn’t. It’s a good reminder to think about the full processing pipeline, not just the primary parser.
JavaScript: Node.js XML Parsers
Vulnerable: libxmljs
const libxmljs = require('libxmljs');

app.post('/api/import', (req, res) => {
  const xmlInput = req.body;
  const doc = libxmljs.parseXml(xmlInput);
  const data = doc.get('//data').text();
  res.json({ imported: data });
});
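libxmljs is a Node.js binding to libxml2, so its exposure follows libxml2’s options. Whether a handler like this is exploitable depends on the library version and the options passed to parseXml: in particular, { noent: true } turns entity substitution on, and that is the setting that makes external entities expand. The rule of thumb: if you see libxml2 under the hood, assume XXE is possible until proven otherwise.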
Safe: fast-xml-parser
const { XMLParser } = require('fast-xml-parser');

app.post('/api/import', (req, res) => {
  const parser = new XMLParser();
  const result = parser.parse(req.body);
  res.json({ imported: result.root.data });
});
Pure JavaScript XML parsers like fast-xml-parser and xml2js don’t implement external entity resolution, which makes them safe from classic XXE by default. For Node.js teams, these pure-JS parsers are the better choice unless there’s a specific reason to need libxml2’s features.
Detection Strategies
Static Analysis
- Java: SpotBugs with FindSecBugs detects XXE_DOCUMENT, XXE_SAXPARSER, XXE_XMLREADER, and XXE_XPATH. Semgrep has rules for unconfigured DocumentBuilderFactory. FindSecBugs catches the majority of Java XXE issues in my testing.
- Python: Bandit B313-B320 flag usage of vulnerable XML parsers (xml.etree, xml.sax, lxml). Semgrep rules detect etree.XMLParser() without resolve_entities=False.
- C: cppcheck doesn’t detect XXE. Custom Semgrep rules or manual review are needed for libxml2 usage. This is one area where automated tools haven’t caught up yet.
- JavaScript: No mainstream SAST tool reliably detects XXE in Node.js. Manual review of XML parser library choice is necessary; grepping for libxmljs is a good starting point.
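Where no off-the-shelf rule exists, the grep-style review can be scripted. The sketch below is a deliberately crude, line-based heuristic and not a substitute for Semgrep or FindSecBugs. The pattern labels and the `scan` function are hypothetical names for this example, and the regexes only cover three of the patterns discussed in this post.

```python
import re
from typing import List, Tuple

# Label -> pattern. Each pattern is matched one source line at a time.
RISKY_PATTERNS = {
    "lxml XMLParser without resolve_entities=False": re.compile(
        r"etree\.XMLParser\((?![^)]*resolve_entities\s*=\s*False)"
    ),
    "libxmljs import": re.compile(r"require\(['\"]libxmljs['\"]\)"),
    "libxml2 XML_PARSE_NOENT in use": re.compile(r"\bXML_PARSE_NOENT\b"),
}


def scan(source: str) -> List[Tuple[int, str]]:
    """Return (line number, label) pairs for every line matching a risky pattern."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for label, pattern in RISKY_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, label))
    return findings


if __name__ == "__main__":
    sample = (
        "parser = etree.XMLParser()\n"
        "const libxmljs = require('libxmljs');\n"
        "safe = etree.XMLParser(resolve_entities=False)\n"
    )
    for lineno, label in scan(sample):
        print(f"line {lineno}: {label}")
```

Because matching is per line, a call whose arguments are spread over several lines will be flagged even when it is configured safely, so treat every hit as a prompt for manual review and not as a verdict.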
Testing
Send this payload to any XML-accepting endpoint:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE foo [
<!ENTITY xxe SYSTEM "file:///etc/hostname">
]>
<root><data>&xxe;</data></root>
If the response contains the server’s hostname, the endpoint is vulnerable. For blind XXE testing, use an out-of-band payload that makes an HTTP request to a server you control. Tools like Burp Collaborator make this straightforward.
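For repeated checks against your own services, the probe can be wrapped in a small helper. This is a minimal sketch using only the Python standard library. The function names are hypothetical, the endpoint URL in the usage note is a placeholder, and it should only ever be pointed at systems you are authorised to test.

```python
import urllib.request
from xml.sax.saxutils import quoteattr


def build_probe(target_uri: str = "file:///etc/hostname") -> bytes:
    """Build the classic XXE probe with the given SYSTEM identifier."""
    system_id = quoteattr(target_uri)  # adds quotes and escapes special characters
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<!DOCTYPE foo [\n"
        f"  <!ENTITY xxe SYSTEM {system_id}>\n"
        "]>\n"
        "<root><data>&xxe;</data></root>\n"
    ).encode("utf-8")


def build_request(endpoint: str, payload: bytes) -> urllib.request.Request:
    """Prepare (but do not send) a POST request carrying the probe."""
    return urllib.request.Request(
        endpoint,
        data=payload,
        headers={"Content-Type": "application/xml"},
        method="POST",
    )


def response_leaks(body: str, marker: str) -> bool:
    """True if the response body contains the value we expect the file to hold."""
    return marker in body
```

A typical use would be to send `build_request("http://localhost:8080/api/import", build_probe())` with `urllib.request.urlopen` and pass the decoded body, along with the known hostname, to `response_leaks`.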
Remediation
The remediation is consistent across languages: disable external entity resolution and DTD processing in the XML parser configuration.
| Language | Parser | Fix |
|---|---|---|
| Java | DocumentBuilderFactory | setFeature("...disallow-doctype-decl", true) |
| Python | lxml | XMLParser(resolve_entities=False, no_network=True) |
| Python | stdlib | Use defusedxml drop-in replacement |
| C | libxml2 | Leave XML_PARSE_NOENT and XML_PARSE_DTDLOAD unset; add XML_PARSE_NONET |
| Go | encoding/xml | Safe by default, no action needed |
| JavaScript | libxmljs | Switch to fast-xml-parser or xml2js |
If your application doesn’t need XML at all, consider switching to JSON. If XML is required but DTDs are not, reject any input containing <!DOCTYPE before parsing. The safest XML is the XML you never parse.
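The "reject DOCTYPE before parsing" idea can be sketched with the standard library alone. Rather than searching the raw bytes for a string, which is easy to get wrong across encodings, this version lets Expat report the declaration and aborts as soon as it appears. The names `DoctypeForbidden` and `parse_without_dtd` are hypothetical, and the two-pass design is an assumption made for simplicity. In a real project defusedxml already offers this behaviour.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat


class DoctypeForbidden(ValueError):
    """Raised when the input declares a DTD (illustrative name)."""


def _refuse_doctype(name, system_id, public_id, has_internal_subset):
    # Called by Expat when it meets <!DOCTYPE ...>; raising aborts the parse.
    raise DoctypeForbidden(f"DOCTYPE {name!r} is not allowed")


def parse_without_dtd(data: bytes) -> ET.Element:
    """Reject any document with a DOCTYPE, then parse it normally."""
    guard = expat.ParserCreate()
    guard.StartDoctypeDeclHandler = _refuse_doctype
    # First pass: raises DoctypeForbidden, or expat.ExpatError for malformed XML.
    guard.Parse(data, True)
    # Second pass: safe to build the tree, since no DTD can be present.
    return ET.fromstring(data)
```

Calling `parse_without_dtd` on the test payload from the Testing section raises `DoctypeForbidden` before any entity is looked at, while ordinary DTD-free documents parse as usual.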