15 responses to “Scraping data from a PDF”

  1. Ciara

    The exported sheet, direct from Tabula, is located here: https://docs.google.com/spreadsheets/d/1HqQlC4d3ktU1D-qJOn3L4AuAP8ow0rANB2ZsgplMz28/edit?usp=sharing

    There are a couple of formatting issues with header cells, but they’re easily fixed, and on the whole, I’m really impressed by Tabula as a tool. I’ll definitely be incorporating it into my own research!

  2. Lauren

    Hi all,

    I’ve attached my Google Sheet below. The data contained within the sheet comes from an OECD (2019) report, “Educating 21st Century Children: Emotional Well-being in the Digital Age.” Like Ciara, I had some issues with encoding in this particular chart, but since the data in the table was relatively uncomplicated, I could fix these errors manually. That being said, for more complex data sets, encoding might be tricky to fix. Nonetheless, Tabula is an interesting tool I will consider using in the future.

    https://docs.google.com/spreadsheets/d/1Op3_hcBG1bkLo49KRm2-J67Id0mh5DVtVrFh99nBpmw/edit?usp=sharing

  3. Anber Rana

    PDF file link https://unstats.un.org/unsd/ccsa/documents/covid19-report-ccsa.pdf

    Total imports Medical imports Share of total medical imports
    Country | Value of all products (US$ billion) | Share of imports (%) | Share of world imports (%) | Medical equipment (%) | Medical supplies (%) | Medicines (%) | Personal protective products (%)
    World | 1 011.3 | 6 | 100 | 14 | 17 | 56 | 13
    1. United States | 193.1 | 8 | 19 | 16 | 16 | 59 | 10
    2. Germany | 86.7 | 7 | 9 | 12 | 18 | 57 | 13
    3. China | 65.0 | 3 | 6 | 23 | 15 | 46 | 16
    4. Belgium | 56.6 | 13 | 6 | 8 | 12 | 75 | 5
    5. Netherlands | 52.7 | 8 | 5 | 16 | 20 | 55 | 8
    6. Japan | 44.8 | 6 | 4 | 16 | 16 | 56 | 13
    7. United Kingdom | 41.1 | 6 | 4 | 11 | 15 | 62 | 12
    8. France | 40.5 | 6 | 4 | 12 | 20 | 53 | 15
    9. Italy | 37.1 | 8 | 4 | 9 | 15 | 66 | 9
    10. Switzerland | 36.9 | 13 | 4 | 6 | 9 | 80 | 5

  4. Kieran

    PDF: https://open.library.ubc.ca/cIRcle/collections/graduateresearch/42591/items/1.0380582

    The table below was small and easy to extract. I can see applications for scraping data from more complex tables.

    Table 4: The number of occurrences with the DOs of specific social media platforms and related communication methods.
    Social media platform / communication method // Number of occurrences in DOs
    Adult dating app 1
    Email 14
    Facebook 9
    Instagram 2
    MSN messages 4
    Snapchat 1
    Text messages (SMS) 10
    Twitter 2
    YouTube 1
    Social media (unspecified) 1
    Total 45

  5. Lisa D

    An interesting tool! Still requires some data cleanup, but a nice start: https://docs.google.com/spreadsheets/d/1-ThZ9zMGa5gItzN4lUXQIOENcyRhfndw2jAopvobRak/edit?usp=sharing

  6. Bilkiss

    Took me a little longer to extract the data. Not sure if it was the connection at my end.

    I also noticed that I had to format certain sections, and some of the data was showing up as #####, so I had to re-format those.
    But a pretty handy tool all the same.
    My first time using this. https://docs.google.com/spreadsheets/d/11DgWFAFhK2xu8_zydJuoQi9aIVaogPnoJpaipGs2kvI/edit#gid=0

    **Air traffic demand collapses (passenger growth)**
    [Chart: axis from 20% to -60%; data labels 2.00%, -14.10%, -53.50%]
    Only in March, airlines are estimated to lose USD 28 billion in revenues, and airports and air navigation service providers have lost around USD 8 billion and USD 824 million, respectively.
    [...] and functional operability in order for it to deliver on its value in overcoming the consequences of this unprecedented crisis.
    Monthly passenger traffic (compared to 2019) [Chart: series 2019 and 2020; axis ticks 400, 300, 200]
    Decline in air cargo volume – March 2020 (thousand tonnes) [Chart: labels World Total, North America, Middle East, Latin America/Caribbean, Europe; values -792, 89, -158, -9]

  7. Janet Calderon

    https://drive.google.com/file/d/1ksVSSGR-H2sXrsM71A7QSJ9WRBdFqkgW/view?usp=sharing

    This tool is definitely helpful. I played with extracting different datasets and some do require more clean-up than others but overall it’s great.

  8. Klint Fung

    pdf source: https://unstats.un.org/unsd/ccsa/documents/covid19-report-ccsa.pdf (table Employment in countries with workplace closure)

    World 2 25968718274066
    Low income countries 75252314027
    Lower-middle income countries 1 119983210054097
    Upper-middle income countries 50239196211531
    High income countries 5639619964494
    Africa 26556117711751
    Americas 4609817988795
    Arab States 4989176469
    Asia and the Pacific 1 09257297148665
    Europe and Central Asia 3939513964594
    World without China 2 25988719374084

    Great tool!

  9. Cari

    Hi everyone. This is my first time using this tool. Below is the table I extracted of the gender breakdown of physicians and nurses (p. 60 of the report). Based on this experience, I would probably only use it to extract data from relatively simple tables – the clean up involved with some of the outputs from the more complex data visualizations is a bit daunting. I was glad to learn about it, though.

    Distribution of physicians and nurses, by sex
    Female Male
    African Region
    Nurses 65% 35%
    Physicians 28% 72%
    Region of the Americas
    Nurses 86% 14%
    Physicians 46% 54%
    Eastern Mediterranean Region
    Nurses 79% 21%
    Physicians 35% 65%
    European Region
    Nurses 84% 16%
    Physicians 53% 47%
    South-East Asia Region
    Nurses 79% 21%
    Physicians 39% 61%
    Western Pacific Region
    Nurses 81% 19%
    Physicians 41% 59%

  10. Erik C

    PDF source: https://unstats.un.org/unsd/ccsa/documents/covid19-report-ccsa.pdf

    Extracted data: https://docs.google.com/spreadsheets/d/1Yqdmb_trX2Ht0uqKhC6EfUv3W0gMKKZFDU321B0sqUM/edit?usp=sharing

    Impressive! I was skeptical going in, but this tool works surprisingly well. This would be very handy for Open CourseWare content that is available in read-only PDF. I found it worked well on several of the data tables I captured. A couple would not import properly, but I’m not sure if that’s a result of my overly broad selection. If the data that was extracted is ‘close enough,’ then it’s easy to edit later. Very cool!

  11. Patricia Foster

    I used the suggested COVID-19 report to test out this small tool, and I created a Tabula CSV file for the “Trade in medical goods” table on page 24.

    https://docs.google.com/spreadsheets/d/1vvAs0QSJEwiQdsLxxFyJFlDmVn_iyaBqZOCaa4-4hxM/edit?usp=sharing

    This is a good tool for small tables and has some possibilities for pulling a table of interest into a CSV that can be uploaded to a visualization tool of choice for further examination. I would like to see how it might handle larger complex tables, and the time commitment to data cleaning might be a deterrent. Formatting picked up from the conversion may make the rendering of the CSV challenging (i.e. white spaces for soft returns can throw columns off).

  12. Elizabeth Gillis

    I have never used this tool before, and it is neat, especially for extracting multiple pieces of data from the PDF. However, as described in the unit, the output doesn’t always come out super clear. I made the mistake of extracting too much at one time and probably should have just taken one table at a time. Oh well, lesson learned. I just used the PDF that was provided, and the Google Sheet is here: https://docs.google.com/spreadsheets/d/1r2l2sNEaFbIcNSPNflojG7x4dZdVGm7NVnm9MEM2SfU/edit?usp=sharing

  13. Sarah

    Using the provided resource “How COVID is changing the World: A Statistical Perspective”, I exported the table on page 24 into Tabula. Having not used this tool before, I’m really interested in continuing to use it and test it out.
    While there is some data clean-up required, using OpenRefine this would be a quick fix. A copy of the exported table is available in the Google Sheet here: https://docs.google.com/spreadsheets/d/1_b2Ls5e9VOYVwmPCEYl-q5RGHJ3oHzYbYKsR63PrsWg/edit?usp=sharing

  14. Majid Alimohammadi

    I tried to extract data from page 40 of the PDF document “How Covid is Changing the World?”.

    The result was quite interesting. Conversion of the data to CSV was flawless. I really liked the relatively easy and fast process. I need to explore Tabula more and get familiar with its features. It looks like a great tool.

  15. Hikaru Ikeda

    Hi all, this was also my first time using Tabula (or any data-extracting software for that matter)!
    I went ahead with the document “How Covid is Changing the World?” and originally tried to capture and export a line graph (Monthly Passenger Traffic, p. 21). Unfortunately, Tabula was only able to scrape text components (such as x and y axes) with some formatting errors, and not the data points themselves. For my second attempt, I tried to scrape a more traditional table format (Learners Not In School, p.51) using the stream and lattice extraction methods; this captured all the text data, but not columns and other formatting. That being said, I could still see this being a major time-saver! I have tried OpenRefine to clean up data previously, and this activity made me wonder if it would be a good, complementary tool to use with Tabula (as well as what other tools exist out there!).

    Find my Tabula export here:
    https://drive.google.com/file/d/1_Kmw9e55NJqWHo6Sq0wVQynDuIqh3Cs9/view?usp=sharing
