{"id":215,"date":"2010-05-12T18:44:05","date_gmt":"2010-05-12T10:44:05","guid":{"rendered":"https:\/\/patrick-nagel.net\/blog\/?p=215"},"modified":"2010-05-13T11:11:21","modified_gmt":"2010-05-13T03:11:21","slug":"how-to-extract-a-list-of-pages-containing-a-string-from-a-mediawiki-xml-dump","status":"publish","type":"post","link":"https:\/\/patrick-nagel.net\/blog\/archives\/215","title":{"rendered":"How to extract a list of pages containing a string from a MediaWiki XML dump"},"content":{"rendered":"<p>Here comes one of those &#8220;I&#8217;ve got to write that down somewhere, and maybe it will be useful for someone else, too&#8221; posts:<\/p>\n<p>I needed a list of the titles of all pages in a MediaWiki XML dump that contain a certain string (&#8220;needle&#8221;). This is how I got it, using <a href=\"http:\/\/xmlstar.sourceforge.net\/\">XMLStarlet<\/a>:<\/p>\n<pre>xml sel -N mw=http:\/\/www.mediawiki.org\/xml\/export-0.3\/ \\\n -t \\\n  -m \"\/mw:mediawiki\/mw:page\/mw:revision\/mw:text[contains(string(.), 'needle')]\" \\\n  -n \\\n  -v \"..\/..\/mw:title\" wikiexport.xml \\\n| xml unesc<\/pre>\n<p>The namespace URI passed to -N must match the xmlns declared on the dump&#8217;s root element (export-0.3 here); otherwise the expression matches nothing.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Here comes one of those &#8220;I&#8217;ve got to write that down somewhere, and maybe it will be useful for someone else, too&#8221; posts: I needed a list of the titles of all pages in a MediaWiki XML dump that contain a certain string (&#8220;needle&#8221;). 
This is how I got it, using XMLStarlet: xml &hellip; <a href=\"https:\/\/patrick-nagel.net\/blog\/archives\/215\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;How to extract a list of pages containing a string from a MediaWiki XML dump&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-215","post","type-post","status-publish","format-standard","hentry","category-uncategorized"],"_links":{"self":[{"href":"https:\/\/patrick-nagel.net\/blog\/wp-json\/wp\/v2\/posts\/215","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/patrick-nagel.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/patrick-nagel.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/patrick-nagel.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/patrick-nagel.net\/blog\/wp-json\/wp\/v2\/comments?post=215"}],"version-history":[{"count":10,"href":"https:\/\/patrick-nagel.net\/blog\/wp-json\/wp\/v2\/posts\/215\/revisions"}],"predecessor-version":[{"id":224,"href":"https:\/\/patrick-nagel.net\/blog\/wp-json\/wp\/v2\/posts\/215\/revisions\/224"}],"wp:attachment":[{"href":"https:\/\/patrick-nagel.net\/blog\/wp-json\/wp\/v2\/media?parent=215"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/patrick-nagel.net\/blog\/wp-json\/wp\/v2\/categories?post=215"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/patrick-nagel.net\/blog\/wp-json\/wp\/v2\/tags?post=215"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}