Synthesis Experiments | Benchmark Pt.6

Wikipedia Dumps

Of course, Wikipedia has some of the most up-to-date information on any major world event, including the current pandemic. I had been hesitant to use it because of how the dumps are structured: they come in XML or SQL formats, and I did not know how to extract the article text from them and save it as plain text files. Luckily, Matt showed me a helpful repo that provides a tool for parsing these dumps. However, its documentation says the script is meant for parsing Wiktionaries, and for English only, while I needed one for general articles in several languages. I looked around and found the Wikipedia Extractor script, which does just that. This gave me some hope: I downloaded the gigantic .bz2 dumps for English and French and ran the script to strip the wiki formatting and generate plain text files.
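As a rough sketch of that last step: WikiExtractor writes plain-text files in which each article is wrapped in a lightweight `<doc>` tag, so a small parser can split the output back into individual articles. The invocation in the comment and the file paths are illustrative, not the exact commands I ran, and the sample string below just mimics the extractor's output format.

```python
import re

# Typical invocation on a downloaded dump (illustrative paths only;
# the real .bz2 archives are many gigabytes):
#   python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 -o extracted/
#
# The extractor emits files where each article looks like the sample below.
SAMPLE = '''<doc id="12" url="https://en.wikipedia.org/wiki?curid=12" title="Anarchism">
Anarchism

Anarchism is a political philosophy.
</doc>
<doc id="25" url="https://en.wikipedia.org/wiki?curid=25" title="Autism">
Autism

Autism is a developmental condition.
</doc>
'''

# Match one <doc> block, capturing its title and body text.
DOC_RE = re.compile(
    r'<doc id="(?P<id>[^"]+)"[^>]*title="(?P<title>[^"]+)">\n(?P<text>.*?)\n</doc>',
    re.DOTALL,
)

def parse_docs(raw):
    """Yield (title, text) pairs from WikiExtractor-style output."""
    for m in DOC_RE.finditer(raw):
        yield m.group("title"), m.group("text").strip()

for title, text in parse_docs(SAMPLE):
    print(title, "->", text.splitlines()[0])
```

Splitting on the `<doc>` wrapper like this makes it easy to dump each article into its own text file, one per language.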