Hi all,

I've started working on the generic dataset indexing project discussed in
our last meeting on Thursday. The initial implementation targets ATLAS:

https://github.com/jayvenn21/gsoc-dataset-indexing

What it does:

   - Reads the ATLAS TSV (1,938 proteins, 50 metadata fields)
   - Generates a POSIX directory tree with structured JSON metadata per
     protein
   - Includes a search CLI for filtering by organism, resolution, domain
     classifications, etc.
   - Provides a generic lib/ layer, so adding a new dataset (mdCATH, etc.)
     only requires a new ingest adapter
The next steps are adding a second dataset and wiring the index into
CyberShuttle's VFS. Happy to adjust direction based on feedback.


Thanks,

Jayanth
