While the PDF format is a convenient replacement for paper, with complex permissions and security options, it can present barriers to accessing and manipulating data. An example is the report How COVID is Changing the World: A Statistical Perspective. Although the document is licensed CC-BY, all of the data tables are ‘trapped’ in the PDF format. Rather than manually entering the data tables into a spreadsheet, in this activity you will scrape them out of the PDF.
For this activity, you will be freeing data tables from PDFs and creating a CSV or an Excel sheet with the data. We will be
Find a PDF: Find a document online that is openly licensed but available only as a PDF. If you cannot find one you are interested in, use How COVID is Changing the World.
Download Tabula:
- Download the version of Tabula for your operating system:
- Windows: tabula-win.zip
- Mac OS X: tabula-mac.zip
- Linux/Other: tabula-jar.zip, view README.txt inside for instructions
- Extract the zip file. (Instructions: Windows, Mac)
- Go into the folder you just extracted. Run the “Tabula” program inside.
- A web browser will open. If it doesn’t, open your web browser, and go to http://localhost:8080. There’s Tabula!
- Upload the PDF and extract the data.
Complete this Activity
To complete this activity, either export your extracted table, import it into Google Sheets, and share links to the original PDF and the sheet in the comment box below, or simply copy and paste one of the data tables into the comment box below.
Image Credit: Featured image: On Videotape by Mitchell Joyce (CC BY-NC 2.0)
The exported sheet, direct from Tabula, is located here: https://docs.google.com/spreadsheets/d/1HqQlC4d3ktU1D-qJOn3L4AuAP8ow0rANB2ZsgplMz28/edit?usp=sharing
There are a couple of formatting issues with header cells, but they’re easily fixed, and on the whole, I’m really impressed by Tabula as a tool. I’ll definitely be incorporating it into my own research!
Hi all,
I’ve attached my Google Sheet below. The data contained within the sheet comes from an OECD (2019) report, “Educating 21st Century Children: Emotional Well-being in the Digital Age.” Like Ciara, I had some issues with encoding in this particular chart, but since the data in the table was relatively uncomplicated, I could fix these errors manually. That being said, for more complex data sets, encoding might be tricky to fix. Nonetheless, Tabula is an interesting tool I will consider using in the future.
https://docs.google.com/spreadsheets/d/1Op3_hcBG1bkLo49KRm2-J67Id0mh5DVtVrFh99nBpmw/edit?usp=sharing
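Encoding problems like the ones mentioned above do not always have to be fixed by hand. The comment does not say which characters were affected, so the following is only a minimal sketch under one assumption: that the garbling is UTF-8 text that was decoded as Windows-1252 (which turns "é" into "Ã©", for example). The helper name `fix_mojibake` is hypothetical.

```python
def fix_mojibake(text):
    """Undo UTF-8 text that was wrongly decoded as Windows-1252.

    Assumption: the garbling came from exactly this mix-up. Text that
    does not fit the pattern is returned unchanged.
    """
    try:
        # Re-encode to the bytes that were misread, then decode them correctly.
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    for sample in ["Ã©ducation", "Well-being", "children’s"]:
        print(repr(sample), "->", repr(fix_mojibake(sample)))
```

Cells that are already correct pass through untouched, so the function could be applied to every cell of an exported CSV before importing it into a spreadsheet.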
PDF file link https://unstats.un.org/unsd/ccsa/documents/covid19-report-ccsa.pdf
| Country | Value of medical imports (US$ billion) | Share of imports of all products (%) | Share of world imports (%) | Share of total medical imports: medical equipment (%) | Medical supplies (%) | Medicines (%) | Personal protective products (%) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| World | 1 011.3 | 6 | 100 | 14 | 17 | 56 | 13 |
| 1. United States | 193.1 | 8 | 19 | 16 | 16 | 59 | 10 |
| 2. Germany | 86.7 | 7 | 9 | 12 | 18 | 57 | 13 |
| 3. China | 65.0 | 3 | 6 | 23 | 15 | 46 | 16 |
| 4. Belgium | 56.6 | 13 | 6 | 8 | 12 | 75 | 5 |
| 5. Netherlands | 52.7 | 8 | 5 | 16 | 20 | 55 | 8 |
| 6. Japan | 44.8 | 6 | 4 | 16 | 16 | 56 | 13 |
| 7. United Kingdom | 41.1 | 6 | 4 | 11 | 15 | 62 | 12 |
| 8. France | 40.5 | 6 | 4 | 12 | 20 | 53 | 15 |
| 9. Italy | 37.1 | 8 | 4 | 9 | 15 | 66 | 9 |
| 10. Switzerland | 36.9 | 13 | 4 | 6 | 9 | 80 | 5 |
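Rows pasted as plain text, such as `World 1 011.3 6 100 14 17 56 13`, are awkward to load into a spreadsheet because the thousands separator is a space and country names can contain spaces too. The sketch below shows one possible way to turn such rows into CSV with Python's standard library. The column names in `HEADER` and the row pattern are assumptions made for illustration, not part of the report or of Tabula.

```python
import csv
import io
import re

# Three raw rows copied from the pasted export (space-separated text).
RAW_ROWS = """\
World 1 011.3 6 100 14 17 56 13
1. United States 193.1 8 19 16 16 59 10
2. Germany 86.7 7 9 12 18 57 13
"""

# Hypothetical column names, chosen for illustration only.
HEADER = [
    "country", "value_usd_billion", "share_of_imports_pct",
    "share_of_world_imports_pct", "medical_equipment_pct",
    "medical_supplies_pct", "medicines_pct", "personal_protective_pct",
]

# Assumed row shape: optional rank ("1. "), a country name, a decimal value
# that may use a space as thousands separator, then six integer percentages.
ROW_RE = re.compile(
    r"^(?:\d+\.\s+)?(?P<country>[A-Za-z ]+?)\s+"
    r"(?P<value>\d{1,3}(?: \d{3})*\.\d+)\s+"
    r"(?P<rest>(?:\d+\s+){5}\d+)$"
)


def parse_row(line):
    """Split one raw row into [country, value, six percentages]."""
    match = ROW_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised row: {line!r}")
    value = float(match.group("value").replace(" ", ""))
    shares = [int(x) for x in match.group("rest").split()]
    return [match.group("country"), value] + shares


def rows_to_csv(text):
    """Convert all non-empty raw rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(HEADER)
    for line in text.splitlines():
        if line.strip():
            writer.writerow(parse_row(line))
    return buffer.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(RAW_ROWS))
```

The rank prefix is dropped on purpose, since the row order already carries it. Rows that do not match the assumed shape raise an error rather than being silently misaligned.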
PDF: https://open.library.ubc.ca/cIRcle/collections/graduateresearch/42591/items/1.0380582
The table below was small and easy to extract. I can see applications for scraping data from more complex tables.
Table 4: The number of occurrences with the DOs of specific social media platforms and related communication methods.
Social media platform / communication method // Number of occurrences in DOs
Adult dating app 1
Email 14
Facebook 9
Instagram 2
MSN messages 4
Snapchat 1
Text messages (SMS) 10
Twitter 2
YouTube 1
Social media (unspecified) 1
Total 45
An interesting tool! Still requires some data cleanup, but a nice start: https://docs.google.com/spreadsheets/d/1-ThZ9zMGa5gItzN4lUXQIOENcyRhfndw2jAopvobRak/edit?usp=sharing
It took me a little longer to extract the data. I'm not sure if it was the connection at my end.
I also noticed that I had to format certain sections, and some of the data was showing up as #####, so I had to re-format those.
But a pretty handy tool all the same.
My first time using this. https://docs.google.com/spreadsheets/d/11DgWFAFhK2xu8_zydJuoQi9aIVaogPnoJpaipGs2kvI/edit#gid=0
**Air traffic demand collapses (passenger growth)**
[Chart: y-axis from 20% to -60%; data labels 2.00%, -14.10%, -53.50%]
Only in March, airlines are estimated to lose USD 28 billion in revenues, and airports and air navigation service providers have lost around USD 8 billion and USD 824 million, respectively.
[Text fragment from the adjacent column: "and functional operability in order for it to deliver on its value in overcoming the consequences of this unprecedented crisis."]
[Chart: Monthly passenger traffic (compared to 2019); series 2019 and 2020]
[Chart: Decline in air cargo volume – March 2020 (thousand tonnes); regions World Total, North America, Middle East, Latin America/Caribbean, Europe; data labels -792, 89, -158, -9]
https://drive.google.com/file/d/1ksVSSGR-H2sXrsM71A7QSJ9WRBdFqkgW/view?usp=sharing
This tool is definitely helpful. I played with extracting different datasets and some do require more clean-up than others but overall it’s great.
pdf source: https://unstats.un.org/unsd/ccsa/documents/covid19-report-ccsa.pdf (table Employment in countries with workplace closure)
World2 25968718274066
Low income countries75252314027
Lower-middle income countries1 119983210054097
Upper-middle income countries50239196211531
High income countries5639619964494
Africa26556117711751
Americas4609817988795
Arab States4989176469
Asia and the Pacific1 09257297148665
Europe and Central Asia3939513964594
World without China2 25988719374084
Great tool!
Hi everyone. This is my first time using this tool. Below is the table I extracted of the gender breakdown of physicians and nurses (p. 60 of the report). Based on this experience, I would probably only use it to extract data from relatively simple tables – the clean up involved with some of the outputs from the more complex data visualizations is a bit daunting. I was glad to learn about it, though.
Distribution of physicians and nurses, by sex (Female, Male)
African Region
Nurses 65% 35%
Physicians 28% 72%
Region of the Americas
Nurses 86% 14%
Physicians 46% 54%
Eastern Mediterranean Region
Nurses 79% 21%
Physicians 35% 65%
European Region
Nurses 84% 16%
Physicians 53% 47%
South-East Asia Region
Nurses 79% 21%
Physicians 39% 61%
Western Pacific Region
Nurses 81% 19%
Physicians 41% 59%
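Output with this shape, a region heading followed by one row per occupation, is easier to chart once every row carries its own region. A minimal sketch of that reshaping is below. It works on a hand-cleaned subset of the rows and assumes that the first percentage is Female and the second is Male, following the order of the legend. The function name `to_long_rows` is hypothetical.

```python
import csv
import io

# Hand-cleaned subset of the extracted text (two of the six regions).
RAW = """\
African Region
Nurses 65% 35%
Physicians 28% 72%
Region of the Americas
Nurses 86% 14%
Physicians 46% 54%
"""


def to_long_rows(text):
    """Return one (region, occupation, female_pct, male_pct) tuple per data row.

    Assumption: the two percentages are Female then Male, as in the legend.
    """
    rows = []
    region = None
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        # A data row is an occupation followed by exactly two percentages;
        # any other non-empty line is treated as a region heading.
        is_data = (
            len(parts) == 3
            and parts[1].endswith("%")
            and parts[2].endswith("%")
        )
        if is_data:
            if region is None:
                raise ValueError(f"data row before any region: {line!r}")
            female = int(parts[1].rstrip("%"))
            male = int(parts[2].rstrip("%"))
            rows.append((region, parts[0], female, male))
        else:
            region = line.strip()
    return rows


if __name__ == "__main__":
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["region", "occupation", "female_pct", "male_pct"])
    writer.writerows(to_long_rows(RAW))
    print(out.getvalue())
```

A quick sanity check on the result is that the two percentages in each row add up to 100.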
PDF source: https://unstats.un.org/unsd/ccsa/documents/covid19-report-ccsa.pdf
Extracted data: https://docs.google.com/spreadsheets/d/1Yqdmb_trX2Ht0uqKhC6EfUv3W0gMKKZFDU321B0sqUM/edit?usp=sharing
Impressive! I was skeptical going in, but this tool works surprisingly well. This would be very handy for Open CourseWare content that is available in read-only PDF. I found it worked well on several of the data tables I captured. A couple would not import properly, but I’m not sure if that’s a result of my overly broad selection. If the data that was extracted is ‘close enough,’ then it’s easy to edit later. Very cool!
I used the suggested COVID-19 report to test out this small tool, and I created a Tabula CSV file for the “Trade in medical goods” table on page 24.
https://docs.google.com/spreadsheets/d/1vvAs0QSJEwiQdsLxxFyJFlDmVn_iyaBqZOCaa4-4hxM/edit?usp=sharing
This is a good tool for small tables and has some possibilities for pulling a table of interest into a CSV that can be uploaded to a visualization tool of choice for further examination. I would like to see how it might handle larger complex tables, and the time commitment to data cleaning might be a deterrent. Formatting picked up from the conversion may make the rendering of the CSV challenging (i.e. white spaces for soft returns can throw columns off).
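The white-space problem described above can often be handled before the CSV reaches a visualization tool. The following is a minimal sketch using only Python's standard library. It assumes the stray soft returns sit inside quoted cells, so that the `csv` module still reads each cell as one field. The sample data and the function names are made up for illustration.

```python
import csv
import io
import re


def clean_cell(cell):
    """Collapse line breaks and runs of white space inside a cell to one space."""
    return re.sub(r"\s+", " ", cell).strip()


def clean_csv(text):
    """Return CSV text with every cell normalised by clean_cell."""
    reader = csv.reader(io.StringIO(text, newline=""))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in reader:
        writer.writerow([clean_cell(cell) for cell in row])
    return out.getvalue()


if __name__ == "__main__":
    # Illustrative input: a header cell broken by a soft return and a
    # value with extra spaces.
    sample = 'Country,"Personal protective\nproducts"\n"United  States ",10\n'
    print(clean_csv(sample))
```

If the soft returns are not quoted and really do split a row across lines, this approach is not enough and the rows would need to be rejoined first.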
I have never used this tool before and it is neat, especially for extracting multiple pieces of data from the PDF. However, as described in the unit, it doesn’t always come out super clear. I made the mistake of extracting too much at one time and probably should have just taken one table at a time. Oh well, lesson learned. I just used the PDF that was provided, and the Google Sheet is here: https://docs.google.com/spreadsheets/d/1r2l2sNEaFbIcNSPNflojG7x4dZdVGm7NVnm9MEM2SfU/edit?usp=sharing
Using the provided resource “How COVID is Changing the World: A Statistical Perspective”, I extracted the table on page 24 with Tabula. Having not used this tool before, I’m really interested in continuing to use and test it.
While there is some data clean-up required, using OpenRefine this would be a quick fix. A copy of the exported table is available in the Google Sheet here: https://docs.google.com/spreadsheets/d/1_b2Ls5e9VOYVwmPCEYl-q5RGHJ3oHzYbYKsR63PrsWg/edit?usp=sharing
I tried to extract data from page 40 of the PDF document “How COVID is Changing the World”.
The result was quite interesting. Conversion of the data to CSV was flawless. I really liked the relatively easy and fast process. I need to explore Tabula more and get familiar with its features. It looks like a great tool.
Hi all, this was also my first time using Tabula (or any data-extracting software for that matter)!
I went ahead with the document “How Covid is Changing the World?” and originally tried to capture and export a line graph (Monthly Passenger Traffic, p. 21). Unfortunately, Tabula was only able to scrape text components (such as x and y axes) with some formatting errors, and not the data points themselves. For my second attempt, I tried to scrape a more traditional table format (Learners Not In School, p.51) using the stream and lattice extraction methods; this captured all the text data, but not columns and other formatting. That being said, I could still see this being a major time-saver! I have tried OpenRefine to clean up data previously, and this activity made me wonder if it would be a good, complementary tool to use with Tabula (as well as what other tools exist out there!).
Find my Tabula export here:
https://drive.google.com/file/d/1_Kmw9e55NJqWHo6Sq0wVQynDuIqh3Cs9/view?usp=sharing