Tim, thanks for the response. I definitely owe you a beer next time you're
in Austin.

I hadn't thought of turning things around the way you suggest. But I don't
think it will work, because of some details I left out of my original email.
First, the relationship between Company and Article is many-to-many. I
could easily adapt your approach to take this into account:

articleID_companyID (key)
articleID
companyID
publishDate
source
company_name
company_desc
contents

So, the search results return article/company pairs, and I group by
companyID. But what makes this messier is that 1) companies contain many,
many fields, all of which are filterable, and 2) companies contain a number
of multi-valued tuple fields; articles are just one example. For instance,
companies have lists of addresses, each with city, state, country, county,
street address, latitude, longitude, #residentsInCounty, etc. Companies also
have products, which have a number of fields of their own.
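
To make the pair-per-document idea concrete, here is a rough sketch of the
request parameters such a grouped query might use. The field names come from
the list above; the boost values and the helper name build_grouped_query are
made up for illustration. Because each document holds exactly one article,
the source filter and the text match are guaranteed to apply to the same
article.

```python
from urllib.parse import urlencode


def build_grouped_query(text, source=None, rows=10):
    """Build a Solr query string that groups article/company pair docs by company.

    The boosts in qf are illustrative assumptions, not tuned values.
    """
    params = [
        ("q", text),
        ("defType", "edismax"),
        # Company name and description are weighted above article contents.
        ("qf", "company_name^3 company_desc^2 contents"),
        # Collapse the article/company pairs into one group per company.
        ("group", "true"),
        ("group.field", "companyID"),
        ("group.limit", "1"),
        ("group.ngroups", "true"),
        ("rows", str(rows)),
    ]
    if source is not None:
        # Filters the same document that matched the text, i.e. the same article.
        params.append(("fq", f'source:"{source}"'))
    return urlencode(params)


if __name__ == "__main__":
    print(build_grouped_query("green energy", source="Reuters"))
```

The resulting string would be appended to the select handler URL of whatever
core holds these documents.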

Extending the article/company pairs approach above to
article/company/address/product/... (plus many other fields) creates a
combinatorial explosion of documents.
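
A small sketch of why the document count blows up: flattening a company into
one document per combination of its multi-valued fields yields the cross
product of those fields. The counts below (50 articles, 5 addresses, 20
products) are invented purely for illustration, as are the function names.

```python
from itertools import product
from math import prod


def flattened_doc_count(multi_valued):
    """Number of flat documents for one company: the product of the list sizes."""
    return prod(len(values) for values in multi_valued.values())


def flatten(company_id, multi_valued):
    """Yield one flat document per combination of values (the cross product)."""
    names = list(multi_valued)
    for combo in product(*(multi_valued[name] for name in names)):
        doc = {"companyID": company_id}
        doc.update(zip(names, combo))
        yield doc


# Hypothetical company with made-up list sizes.
company = {
    "article": [f"article-{i}" for i in range(50)],
    "address": [f"address-{i}" for i in range(5)],
    "product": [f"product-{i}" for i in range(20)],
}

if __name__ == "__main__":
    print(flattened_doc_count(company))  # 50 * 5 * 20 = 5000
```

Every additional multi-valued field multiplies the total again, so a single
company in this example already needs thousands of documents.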

Sorry about not including these extra details in my original question. I
think I simplified it a bit too much.


On Tue, Feb 26, 2013 at 12:17 PM, Timothy Potter <thelabd...@gmail.com> wrote:

> Hi Clint,
>
> Nice to see you on this list!
>
> What about treating each article as the indexed unit (i.e. each
> article is a document) with structure:
>
> articleID
> publishDate
> source
> company_name
> company_desc
> contents
>
> Then you can do grouping by company_name field.
>
> I happen to know you're very familiar with grouping in Solr so there
> must be a reason you're not approaching the problem from this angle.
>
> Cheers,
> Tim
>
> On Tue, Feb 26, 2013 at 10:32 AM, Clint Miller <clint.mill...@gmail.com>
> wrote:
> > Suppose I have companies that have articles associated with them. I want
> > to be able to search for companies based on the text in the articles. For
> > example, suppose the Company and Article classes look like this:
> >
> > Company
> > ----------------------
> > name: String
> > description: String
> > articles: Article[]
> >
> > Article
> > -------------------------
> > publishDate: Date
> > source: String (like Reuters or AP)
> > contents: String
> >
> > I want to be able to do the following types of operations:
> >
> >    1. Search for companies by name, description, or article contents with
> >    name and description boosted higher than article contents.
> >    2. Facet on article sources.
> >    3. Filter on article sources.
> >    4. Boost companies with newer articles.
> >
> > One approach to representing this data is to use the following Solr
> > fields:
> >
> > name_s
> > description_s
> > article_1_publish_date_tdt
> > article_1_source_s
> > article_1_contents_s
> > ...
> > article_publish_dates_tdts
> > article_sources_ss
> > article_contents_ss
> >
> > where article_publish_dates_tdts is a copyField of all the
> > *_publish_date_tdt fields, article_sources_ss is a copyField of all the
> > *_source_s fields, and article_contents_ss is a copyField of all the
> > *_contents_s fields.
> >
> > This structure allows me to do my first 2 types of operations easily
> > enough. To search on name, description, and contents, I just use qf set
> > to name_s, description_s, and article_contents_ss, with name_s and
> > description_s boosted accordingly. I can facet on article sources by
> > using article_sources_ss.
> >
> > But, I'm having trouble figuring out how to filter on article sources or
> > boost companies with newer articles. For example, say I search for "green
> > energy" and filter by source = 'Reuters'. If I use
> > "article_sources_ss:Reuters AND article_contents_ss:Green Energy" that
> > won't give correct results since a company may have 2 articles, one about
> > green energy from AP and another from Reuters that isn't about green
> > energy at all. Yet, that company would match the query.
> >
> > Similarly, based on Tim Potter's online presentation, I understand how to
> > boost based on recent dates if a company has only a single article and a
> > single date. But, I'm not sure how to do the boosting with a list of
> > articles and dates.
> >
> > Is this possible? Will I need to go down a path of writing custom
> > functions? If so, any pointers on a custom function approach I should
> > use?
> >
> > Thank you very much for any help.
>
