I am not sure I fully understood your use case, but let me suggest a few
different possible solutions:

1) Query-time join approach: you keep 2 collections, one static with all
the pages, one that just stores lightweight documents describing the
crawling interactions:
   1) id, content -> Pages
   2) pageId, ExperimentId, CrawlingCycleId -> CrawlingInteractions

Then your query will be something like this (to retrieve the pageId):
http://localhost:8983/solr/select?q={!join+from=id+to=pageId}text:query&fq=CrawlingCycleId:[N+TO+K]

Retrieving the entire page is more problematic, as you would have to
reverse the join and join on millions of items. I'm not sure that will
perform well.
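A minimal sketch of building that join query, assuming Solr runs on the default localhost:8983 endpoint and the field names above (id, pageId, CrawlingCycleId); the concrete cycle range 1..5 and the fl parameter are my additions for illustration:

```python
# Build the query-time join request; urlencode handles the special
# characters in the local-params join syntax ({!join ...}).
from urllib.parse import urlencode

params = {
    # Search the Pages side, join its id onto pageId in CrawlingInteractions
    "q": "{!join from=id to=pageId}text:query",
    # Restrict to the crawling cycles of interest (example range 1..5)
    "fq": "CrawlingCycleId:[1 TO 5]",
    # We only need the page ids back
    "fl": "pageId",
    "wt": "json",
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
# The request itself would then be, e.g.:
#   import urllib.request
#   with urllib.request.urlopen(url) as resp:
#       body = resp.read()
```

Note that the result documents come from the "to" side of the join (CrawlingInteractions here), which is why retrieving the full pages would force you to reverse the join.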

2) You use atomic updates [1], and for each experiment and iteration you
just add the fields you want (ExperimentId and CrawlingCycleId). Be careful
here: atomic updates don't mean the entire document isn't rewritten (that
is only true under certain conditions, which I don't think apply to your
use case), but at least the POST requests pushing the documents will be
much more lightweight.
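A sketch of what such an atomic-update payload would look like, assuming a document id "page-1" and the field names above (both placeholders for your schema); this is the JSON you would POST to the /update handler:

```python
import json

# Atomic update: only the listed fields change, identified by the doc id.
# "set" replaces any existing value; "add" appends to a multi-valued field.
payload = [{
    "id": "page-1",
    "ExperimentId": {"set": "exp-42"},
    "CrawlingCycleId": {"add": 3},
}]
body = json.dumps(payload)
print(body)
# POST with Content-Type: application/json, e.g.:
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8983/solr/update?commit=true",
#       data=body.encode(),
#       headers={"Content-Type": "application/json"})
#   urllib.request.urlopen(req)
```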



[1]
https://lucene.apache.org/solr/guide/6_6/updating-parts-of-documents.html




-----
Alessandro Benedetti
Search Consultant, R&D Software Engineer, Director
Sease Ltd. - www.sease.io
--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
