Is there a way to get aggregate word counts over a subset of documents? For example, given the following data:
{ "id": "1", "category": "cat1", "includes": "The green car." },
{ "id": "2", "category": "cat1", "includes": "The red car." },
{ "id": "3", "category": "cat2", "includes": "The black car." }

I'd like to be able to get total term frequency counts per category, e.g.:

<category name="cat1">
  <lst name="the">2</lst>
  <lst name="car">2</lst>
  <lst name="green">1</lst>
  <lst name="red">1</lst>
</category>
<category name="cat2">
  <lst name="the">1</lst>
  <lst name="car">1</lst>
  <lst name="black">1</lst>
</category>

I was initially hoping to do this within Solr, and I tried using the TermVectorComponent. It gives term frequencies for individual documents and term frequencies for the entire index, but it doesn't seem to help with subsets. For example, the TermVectorComponent can tell me that "car" occurs 3 times over all documents in the index and 1 time in document 1, but not that it occurs 2 times over cat1 documents and 1 time over cat2 documents.

Is there a good way to use Solr/Lucene to gather aggregate results like this? So far I've been focusing on just using Solr with XML files, but I could certainly write Java code if necessary.

Thanks,
David
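To make the aggregation I'm after concrete, here is a plain-Java sketch over the example data above. It assumes the document text has been fetched client-side (e.g. from stored fields, one query per category with fq=category:catN) and uses a crude lowercase/strip-punctuation tokenizer as a stand-in for the field's real analyzer; the class and method names are just for illustration:

```java
import java.util.*;

public class CategoryTermCounts {
    // Sum term frequencies per category from stored document text.
    // Each entry is {category, text}; a real implementation would
    // run the field's configured Solr analyzer instead of the crude
    // tokenization below.
    static Map<String, Map<String, Integer>> countTerms(List<String[]> docs) {
        Map<String, Map<String, Integer>> byCategory = new TreeMap<>();
        for (String[] doc : docs) {
            String category = doc[0];
            // Lowercase and drop punctuation, then split on whitespace.
            String[] tokens = doc[1].toLowerCase()
                    .replaceAll("[^a-z ]", "")
                    .split("\\s+");
            Map<String, Integer> counts =
                    byCategory.computeIfAbsent(category, k -> new TreeMap<>());
            for (String token : tokens) {
                if (!token.isEmpty()) counts.merge(token, 1, Integer::sum);
            }
        }
        return byCategory;
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                new String[]{"cat1", "The green car."},
                new String[]{"cat1", "The red car."},
                new String[]{"cat2", "The black car."});
        System.out.println(countTerms(docs));
        // {cat1={car=2, green=1, red=1, the=2}, cat2={black=1, car=1, the=1}}
    }
}
```

This gives the right answer but pulls every document across the wire, which is exactly what I'd like to avoid for a large index; ideally the summing would happen server-side.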