How to extract a list of pages containing a string from a MediaWiki XML dump

Here comes one of those “I’ve got to write that down somewhere, and maybe it will be useful for someone else, too” posts:

I needed the titles of all pages in a MediaWiki XML dump whose text contains a certain string (“needle”). This is how I got the list, using XMLStarlet:

xml sel -N mw= \
 -t \
  -m "/mw:mediawiki/mw:page/mw:revision/mw:text[contains(string(.), 'needle')]" \
  -n \
  -v "../../mw:title" wikiexport.xml \
| xml unesc
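One thing to watch: `-N mw=` binds the `mw` prefix to a namespace URI, and that URI has to match the `xmlns="..."` attribute on the dump's `<mediawiki>` root element, which differs between export versions. Here is a rough sketch of how the URI could be read from the dump itself instead of typed by hand. The function names `dump_namespace` and `pages_containing` are made up for this example. It assumes the root element sits within the first five lines of the file, that the search string contains no single quote, and that the XMLStarlet binary is called `xml` (on some systems it is installed as `xmlstarlet`).

```shell
#!/bin/sh

# Hypothetical helper: print the default namespace URI declared in the
# first few lines of the dump (the first xmlns="..." attribute found).
dump_namespace() {
    head -n 5 "$1" | grep -o 'xmlns="[^"]*"' | head -n 1 | cut -d'"' -f2
}

# Hypothetical wrapper around the command above: print the titles of all
# pages whose text contains the given string.
# Assumption: the string contains no single quote, since it is pasted
# into the XPath expression as-is.
pages_containing() {
    needle=$1
    dump=$2
    xml sel -N mw="$(dump_namespace "$dump")" \
        -t \
        -m "/mw:mediawiki/mw:page/mw:revision/mw:text[contains(string(.), '$needle')]" \
        -n \
        -v "../../mw:title" "$dump" \
    | xml unesc
}
```

With these two functions defined, `pages_containing needle wikiexport.xml` should do the same as the command above. If the dump contains the full page history, a title is printed once for every matching revision, so piping the result through `sort -u` may be useful.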
