Camelot is nice and lightweight, but it is currently unmaintained. https://github.com/datalab-to/marker is a good alternative: it combines OCR with PDF parsing and can use LLMs to correct thorny cases. Here is an example invocation for a different dataset: https://github.com/publicmap/amche-atlas/issues/104#issuecomment-2842058569
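
If you stay with Camelot for now, one possible stopgap for the stray-character case in the quoted message is a post-processing pass: check each district cell against a list of known district names and, when it does not match, try pulling the first few characters of the neighbouring cell back. This is only a rough Python sketch under that assumption. `repair_spillover`, `known_names`, `max_shift` and the block name in the example are hypothetical and are not part of Camelot or marker.

```python
def repair_spillover(left, right, known_names, max_shift=3):
    """Move up to `max_shift` leading characters of `right` back onto `left`
    if doing so turns `left` into one of `known_names`.

    Returns the (left, right) pair unchanged when `left` is already known
    or when no shift produces a known name.
    """
    if left in known_names:
        return left, right
    for k in range(1, min(max_shift, len(right)) + 1):
        candidate = left + right[:k]
        if candidate in known_names:
            return candidate, right[k:]
    return left, right


# Assumed reference list; in practice this would come from an official
# district list rather than being typed by hand.
known_districts = {"Dr. B.R. Ambedkar Konaseema"}

# "Example Block" is a placeholder block name, not taken from the PDF.
district, block = repair_spillover(
    "Dr. B.R. Ambedkar Konaseem", "aExample Block", known_districts
)
print(district)
print(block)
```

This only helps where the spilled characters land at the start of the next cell, and it will not fix rows where whole cells are swapped or merged.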

On Mon, Jul 21, 2025 at 2:04 PM Saloni Taneja <[email protected]> wrote:
> Hi everyone,
>
> I’ve been trying to parse the compiled PDFs uploaded by the CGWB here
> <https://cgwb.gov.in/en/ground-water-level-monitoring> (specifically the
> ones under “4. Water Level Data”) which contain four readings per
> monitoring well per year. However, I’ve run into an issue with overlapping
> text across columns, which is leading to jumbled or misaligned outputs.
>
> For instance, on page 5 of the file titled “August Ground Water Level
> 1994–2023”, the district “Dr. B.R. Ambedkar Konaseema” appears as “Dr. B.R.
> Ambedkar Konaseem”, with the missing "a" mistakenly attached to the start
> of the following block name. Camelot (Python) is detecting these characters
> but struggles to resolve them correctly, likely because overlapping text
> layers in the PDF are assigned nearly identical coordinates, causing cell
> misassignments. Another example is all rows corresponding to "Dadra and
> Nagar Haveli and Daman and Diu".
>
> I wanted to check:
>
> 1. Has anyone here successfully parsed this dataset before?
> 2. Am I understanding the complexity of scraping this correctly?
> 3. Does anyone have a contact at CGWB who might be able to share the
>    original Excel files? The PDFs appear to have been exported via iLovePDF
>    from XLSX files. Since these files are already publicly available, I’m
>    hoping the CGWB might be open to sharing the source formats directly, but
>    I'm worried the turnaround times might vary.
>
> Any help, advice, or pointers would be really appreciated. Thanks so much!
>
> Best,
>
> --
> Datameet is a community of Data Science enthusiasts in India. Know more
> about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion visit
> https://groups.google.com/d/msgid/datameet/1e037636-d31e-4cb3-8703-433000a9a573n%40googlegroups.com
