View raw source for this post
Everytime I run into a file with an .xml extension, I cringe. Though, admittedly, it’s a file format that you have to be familiar with when it comes to sending and receiving data over the web. R has a package xlm2 to assist in the conversion of nested data to tabular data.
The task here is to take an easy and short xml file and convert it to a dataframe. The xml2 package was specifically designed for the task and is the successor to the xml package.
Let’s face it: you wouldn’t be on this page if you could’ve figured it out using W3schools or the R help menu. Let’s start with some super basic stuff. Have you opened the .xml file using a browser? Do that first and see if you can get a sense of how the data is organized. Second, open the xml file in a linter and see if it’s formatted correctly. (Here’s an example) I’ve lost a ton of time double checking code only to figure out that some .json was missing a “>”. Finally, open it in a good text editor and beautify the code. (I use Atom.) My last project, I “peeled” the outer layers of the onion to get at the data. It was hacky and lacked reproducibility, but it worked.
First, it’s important to get some defintions first. (W3schools has a really helpful set of tutorials on the topic.) A “node” in general and speaking in a broad way is an HTML element.
DOM (Document Object Model) - a tree structure that represents the HTML of the website, and every HTML element is a “node”.
Element Nodes - model the actual HTML elements in the document.
Attribute Nodes - model the various attributes in the different HTML elements. Attributes include id, class, title and style.
Text Nodes - model the text content inside the different HTML elements.
Root Node - the node on the very top of the document tree, usually called the document node.
Parent Node - a node that has children; represents an element that has at least one other element or text nested inside it.
Child Node - a node that has a parent; represents an element or text that is nested inside another element. For example, a
tag is often the child of a tag.
Figure 1: See DOM model on Wikipedia
XML structured data looks like this:
#xml prologue .
XPath is a syntax for defining and navigating parts of an XML document. They can be used across many languages. Xpath has over 200 functions according to W3schools. See the W3 page for xpath syntax.
All of the functions below are from the xlm2 package and the help page.
# get W3 data url
List of 2 $ node: $ doc : - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
xml_name(bkfst)
[1] "breakfast_menu"
xml_structure(bkfst)
# see text xml_text(bkfst)
[1] "Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup650Strawberry Belgian Waffles$7.95Light Belgian waffles covered with strawberries and whipped cream900Berry-Berry Belgian Waffles$8.95Light Belgian waffles covered with an assortment of fresh berries and whipped cream900French Toast$4.50Thick slices made from our homemade sourdough bread600Homestyle Breakfast$6.95Two eggs, bacon or sausage, toast, and our ever-popular hash browns950"
The real key is understanding how to define or set the “xpath” argument.
xml_find_all(bkfst, xpath = "//name")
[1] Belgian Waffles [2] Strawberry Belgian Waffles [3] Berry-Berry Belgian Waffles [4] French Toast [5] Homestyle Breakfast
xml_text(xml_find_all(bkfst, xpath = "//name"))
[1] "Belgian Waffles" "Strawberry Belgian Waffles" [3] "Berry-Berry Belgian Waffles" "French Toast" [5] "Homestyle Breakfast"
library(tibble) name
# A tibble: 5 x 2 name price 1 Belgian Waffles $5.95 2 Strawberry Belgian Waffles $7.95 3 Berry-Berry Belgian Waffles $8.95 4 French Toast $4.50 5 Homestyle Breakfast $6.95
Lists and xml are similar and easily converted within the xml2 package.
document
If you can get a basic understanding of html web pages and xpath syntax, then you should be able to parse xml files efficiently with xlm2 .
(Get bibliographic stuff from “archetype hill”.)
The views, analysis and conclusions presented within this paper represent the author’s alone and not of any other person, organization or government entity. While I have made every reasonable effort to ensure that the information in this article was correct, it will nonetheless contain errors, inaccuracies and inconsistencies. It is a working paper subject to revision without notice as additional information becomes available. Any liability is disclaimd as to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from negligence, accident, or any other cause. The author(s) received no financial support for the research, authorship, and/or publication of this article.
─ Session info ─────────────────────────────────────────────────────────────────────────────────────────────────────── setting value version R version 3.6.3 (2020-02-29) os macOS Catalina 10.15.7 system x86_64, darwin15.6.0 ui X11 language (EN) collate en_US.UTF-8 ctype en_US.UTF-8 tz America/Chicago date 2021-03-05 ─ Packages ─────────────────────────────────────────────────────────────────────────────────────────────────────────── package * version date lib source assertthat 0.2.1 2019-03-21 [1] CRAN (R 3.6.0) blogdown 1.1.14 2021-02-27 [1] Github (rstudio/blogdown@b0fa6ed) bookdown 0.21 2020-10-13 [1] CRAN (R 3.6.3) bslib 0.2.4 2021-01-25 [1] CRAN (R 3.6.2) cachem 1.0.3 2021-02-04 [1] CRAN (R 3.6.2) callr 3.5.1 2020-10-13 [1] CRAN (R 3.6.2) cli 2.3.0 2021-01-31 [1] CRAN (R 3.6.2) codetools 0.2-18 2020-11-04 [1] CRAN (R 3.6.2) colorspace 2.0-0 2020-11-11 [1] CRAN (R 3.6.2) crayon 1.4.1 2021-02-08 [1] CRAN (R 3.6.2) DBI 1.1.1 2021-01-15 [1] CRAN (R 3.6.2) desc 1.2.0 2018-05-01 [1] CRAN (R 3.6.0) devtools * 2.3.2 2020-09-18 [1] CRAN (R 3.6.2) digest 0.6.27 2020-10-24 [1] CRAN (R 3.6.2) dplyr 1.0.4 2021-02-02 [1] CRAN (R 3.6.2) ellipsis 0.3.1 2020-05-15 [1] CRAN (R 3.6.2) evaluate 0.14 2019-05-28 [1] CRAN (R 3.6.0) fastmap 1.1.0 2021-01-25 [1] CRAN (R 3.6.2) fs 1.5.0 2020-07-31 [1] CRAN (R 3.6.2) generics 0.1.0 2020-10-31 [1] CRAN (R 3.6.2) ggplot2 * 3.3.3 2020-12-30 [1] CRAN (R 3.6.2) ggthemes * 4.2.4 2021-01-20 [1] CRAN (R 3.6.2) glue 1.4.2 2020-08-27 [1] CRAN (R 3.6.2) gtable 0.3.0 2019-03-25 [1] CRAN (R 3.6.0) htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 3.6.2) jquerylib 0.1.3 2020-12-17 [1] CRAN (R 3.6.2) jsonlite 1.7.2 2020-12-09 [1] CRAN (R 3.6.2) knitr 1.31 2021-01-27 [1] CRAN (R 3.6.2) lifecycle 0.2.0 2020-03-06 [1] CRAN (R 3.6.0) magrittr 2.0.1 2020-11-17 [1] CRAN (R 3.6.2) memoise 2.0.0 2021-01-26 [1] CRAN (R 3.6.2) munsell 0.5.0 2018-06-12 [1] CRAN (R 3.6.0) pillar 1.4.7 2020-11-20 [1] CRAN (R 3.6.2) pkgbuild 1.2.0 2020-12-15 [1] CRAN (R 3.6.2) pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 3.6.0) pkgload 1.1.0 2020-05-29 [1] CRAN (R 3.6.2) prettyunits 1.1.1 2020-01-24 [1] CRAN (R 3.6.0) processx 3.4.5 2020-11-30 [1] CRAN (R 3.6.2) ps 1.5.0 2020-12-05 [1] CRAN (R 3.6.2) purrr 0.3.4 2020-04-17 [1] CRAN (R 3.6.2) R6 2.5.0 2020-10-28 [1] CRAN (R 3.6.2) remotes 2.2.0 2020-07-21 [1] CRAN (R 3.6.2) rlang 0.4.10 2020-12-30 [1] CRAN (R 3.6.2) rmarkdown 2.7 2021-02-19 [1] CRAN (R 3.6.3) rprojroot 2.0.2 2020-11-15 [1] CRAN (R 3.6.2) sass 0.3.1 2021-01-24 [1] CRAN (R 3.6.2) scales 1.1.1 2020-05-11 [1] CRAN (R 3.6.2) sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.6.0) stringi 1.5.3 2020-09-09 [1] CRAN (R 3.6.2) stringr 1.4.0 2019-02-10 [1] CRAN (R 3.6.0) testthat 3.0.1 2020-12-17 [1] CRAN (R 3.6.2) tibble 3.0.6 2021-01-29 [1] CRAN (R 3.6.2) tidyselect 1.1.0 2020-05-11 [1] CRAN (R 3.6.2) usethis * 2.0.1 2021-02-10 [1] CRAN (R 3.6.2) vctrs 0.3.6 2020-12-17 [1] CRAN (R 3.6.2) withr 2.4.1 2021-01-26 [1] CRAN (R 3.6.2) xfun 0.21 2021-02-10 [1] CRAN (R 3.6.2) xml2 * 1.3.2 2020-04-23 [1] CRAN (R 3.6.2) yaml 2.2.1 2020-02-01 [1] CRAN (R 3.6.0) [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library