Is there a way to get aggregate word counts over a subset of documents?

For example given the following data:

  {
    "id": "1",
    "category": "cat1",
    "includes": "The green car.",
  },
  {
    "id": "2",
    "category": "cat1",
    "includes": "The red car.",
  },
  {
    "id": "3",
    "category": "cat2",
    "includes": "The black car.",
  }
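
(For reference, I assume the field I'd be counting over would need term
vectors enabled in schema.xml, something like the line below, with the
documents above posted in Solr's usual XML update format:)

  <field name="includes" type="text" indexed="true"
         stored="true" termVectors="true"/>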

I'd like to be able to get total term frequency counts per category, e.g.:

<category name="cat1">
   <lst name="the">2</lst>
   <lst name="car">2</lst>
   <lst name="green">1</lst>
   <lst name="red">1</lst>
</category>
<category name="cat2">
   <lst name="the">1</lst>
   <lst name="car">1</lst>
   <lst name="black">1</lst>
</category>

I was initially hoping to do this within Solr, and I tried using the
TermVectorComponent. It gives term frequencies for individual documents
and term frequencies over the entire index, but it doesn't seem to help
with subsets. For example, the TermVectorComponent can tell me that
"car" occurs 3 times across all documents in the index and once in
document 1, but not that it occurs twice across the cat1 documents and
once across the cat2 documents.
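
This is roughly the kind of request I was making (assuming the term
vector handler wired up as /tvrh in the example solrconfig.xml):

  http://localhost:8983/solr/tvrh?q=category:cat1&tv=true&tv.tf=true

That returns per-document term frequencies for each hit, but nothing
summed across the whole result set.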

Is there a good way to use Solr/Lucene to gather aggregate results like
this? So far I've only been driving Solr with XML files, but I could
certainly write Java code if necessary; a rough sketch of what I have in
mind is below.
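
The sketch assumes a reasonably recent Lucene (5+), termVectors="true"
on the includes field, and a single-token category field; the class and
method names are just placeholders:

  import java.io.IOException;
  import java.util.Map;
  import java.util.TreeMap;

  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.Terms;
  import org.apache.lucene.index.TermsEnum;
  import org.apache.lucene.search.IndexSearcher;
  import org.apache.lucene.search.ScoreDoc;
  import org.apache.lucene.search.TermQuery;
  import org.apache.lucene.search.TopDocs;
  import org.apache.lucene.util.BytesRef;

  public class CategoryTermCounts {

      // Sum term frequencies for the "includes" field over every document
      // in the given category, by reading each hit's term vector.
      public static Map<String, Long> countTermsForCategory(
              IndexReader reader, String category) throws IOException {
          IndexSearcher searcher = new IndexSearcher(reader);
          // Assumes "category" is indexed as a single token (string field).
          TopDocs hits = searcher.search(
                  new TermQuery(new Term("category", category)),
                  Math.max(reader.maxDoc(), 1));
          Map<String, Long> counts = new TreeMap<String, Long>();
          for (ScoreDoc hit : hits.scoreDocs) {
              // Null if this document was indexed without a term vector.
              Terms vector = reader.getTermVector(hit.doc, "includes");
              if (vector == null) {
                  continue;
              }
              TermsEnum termsEnum = vector.iterator();
              BytesRef term;
              while ((term = termsEnum.next()) != null) {
                  // In a single-document term vector, totalTermFreq() is
                  // the term's frequency within that document.
                  counts.merge(term.utf8ToString(),
                          termsEnum.totalTermFreq(), Long::sum);
              }
          }
          return counts;
      }
  }

(TreeMap is only there so the terms come back in alphabetical order; a
HashMap would work just as well.) Does that seem reasonable, or is there
something built in that I'm missing?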

Thanks,

David
