Re: [datameet] Parsing CGWB Groundwater Data PDFs

sreeram kandimalla Mon, 21 Jul 2025 02:31:48 -0700

Camelot is nice and lightweight, but it is currently unmaintained.
https://github.com/datalab-to/marker is a good alternative. It combines
OCR with PDF parsing and can use LLMs to correct thorny cases. Here is
an example invocation for a different dataset:
https://github.com/publicmap/amche-atlas/issues/104#issuecomment-2842058569
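
Whichever extractor you end up with, the spill-over described in the
quoted message (a trailing character of the district name landing at the
start of the next cell) can be patched after extraction if you have a
list of valid district names. This is only a rough sketch under that
assumption: repair_spill, KNOWN_DISTRICTS, and the block name
"ExampleBlock" are made up for illustration and are not taken from the
CGWB files or from any library.

```python
# Hypothetical reference list; in practice this would hold every valid district name.
KNOWN_DISTRICTS = [
    "Dr. B.R. Ambedkar Konaseema",
    "Dadra and Nagar Haveli and Daman and Diu",
]


def repair_spill(row, known_districts, district_col=0):
    """Return a copy of row with spilled characters moved back into the district cell.

    If the district cell is a proper prefix of a known name and the missing
    suffix sits at the start of the next cell, the suffix is moved back.
    Rows that already match, or that cannot be matched, are returned unchanged.
    """
    fixed = list(row)
    if district_col + 1 >= len(fixed):
        return fixed  # no neighbouring cell to borrow characters from

    district = fixed[district_col].strip()
    if district in known_districts:
        return fixed  # nothing to repair

    next_cell = fixed[district_col + 1]
    for name in known_districts:
        if name.startswith(district) and len(name) > len(district):
            spilled = name[len(district):]
            if next_cell.startswith(spilled):
                fixed[district_col] = name
                fixed[district_col + 1] = next_cell[len(spilled):]
                return fixed
    return fixed


# "ExampleBlock" is a placeholder, not a real block name from the PDF.
broken = ["Dr. B.R. Ambedkar Konaseem", "aExampleBlock", "12.3"]
print(repair_spill(broken, KNOWN_DISTRICTS))
```

This only handles a suffix that leaks into the cell immediately to the
right, so the output would still need a manual spot check.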


On Mon, Jul 21, 2025 at 2:04 PM Saloni Taneja <[email protected]>
wrote:

> Hi everyone,
>
> I’ve been trying to parse the compiled PDFs uploaded by the CGWB here
> <https://cgwb.gov.in/en/ground-water-level-monitoring> (specifically the
> ones under “4. Water Level Data”) which contain four readings per
> monitoring well per year. However, I’ve run into an issue with overlapping
> text across columns, which is leading to jumbled or misaligned outputs.
>
> For instance, on page 5 of the file titled “August Ground Water Level
> 1994–2023”, the district “Dr. B.R. Ambedkar Konaseema” appears as “Dr. B.R.
> Ambedkar Konaseem”, with the missing "a" mistakenly attached to the start
> of the following block name. Camelot (Python) is detecting these characters
> but struggles to resolve them correctly, likely because overlapping text
> layers in the PDF are assigned nearly identical coordinates, causing cell
> misassignments. Another example is all rows corresponding to "Dadra and
> Nagar Haveli and Daman and Diu".
>
> I wanted to check:
>
>    1. Has anyone here successfully parsed this dataset before?
>    2. Am I understanding the complexity of scraping this correctly?
>    3. Does anyone have a contact at CGWB who might be able to share the
>    original Excel files? The PDFs appear to have been exported via iLovePDF
>    from XLSX files. Since these files are already publicly available, I’m
>    hoping the CGWB might be open to sharing the source formats directly, but
>    I'm worried the turnaround times might vary.
>
> Any help, advice, or pointers would be really appreciated. Thanks so much!
>
> Best,
>
> --
> Datameet is a community of Data Science enthusiasts in India. Know more
> about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion visit
> https://groups.google.com/d/msgid/datameet/1e037636-d31e-4cb3-8703-433000a9a573n%40googlegroups.com
> .
>

