Re: Looking for tips on indexing documents containing multi-valued tuple fields

Timothy Potter Tue, 26 Feb 2013 12:11:55 -0800

Ok - I suspected grouping by company_name was too obvious here ;-)

A couple of tricks to think about (not claiming any of these will help) are:


1) Document transformer - a document transformer can pull whatever
company fields you need from a database lookup and add them to each
document in the response. That way you don't have to index every
company field (addresses, etc.) on your documents. It only works for
fields you don't need to search or filter on. Again, this may or may
not be useful in your scenario, but I wanted to make you aware of the
feature. Something like:

    <transformer name="db" class="com.mycompany.LoadFromDatabaseTransformer">
      <str name="connection">jdbc://....</str>
    </transformer>
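
To make the idea concrete, here is a minimal Python sketch of what such
a transformer does. A real one would be a Java class plugged into Solr
through a TransformerFactory and requested with something like
fl=*,[db]. The table, column names, and sample data below are all made
up for illustration; only the enrich-at-response-time idea comes from
the feature itself.

```python
import sqlite3


def load_company_fields(conn, company_id):
    """Look up display-only company fields that are not stored in the index."""
    row = conn.execute(
        "SELECT address, phone FROM company WHERE id = ?", (company_id,)
    ).fetchone()
    if row is None:
        return {}
    return {"company_address": row[0], "company_phone": row[1]}


def transform(conn, docs):
    """Mimic a document transformer: enrich each hit just before it is returned."""
    return [{**doc, **load_company_fields(conn, doc["companyID"])} for doc in docs]


# Hypothetical stand-in for the company database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE company (id TEXT PRIMARY KEY, address TEXT, phone TEXT)")
conn.execute(
    "INSERT INTO company VALUES ('c1', '100 Congress Ave, Austin, TX', '555-0100')"
)

# Hypothetical search hits holding only the indexed fields.
hits = [
    {"articleID": "a1", "companyID": "c1"},
    {"articleID": "a2", "companyID": "c9"},  # no row in the database
]
enriched = transform(conn, hits)
```

Hits whose company is missing from the database come back unchanged,
which is probably the behavior you want from a display-only lookup.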

2) Custom hashing when assigning a document to a shard

Not sure if you're running distributed search right now. With Solr 4.1,
you can assign all articles for the same company to the same shard (see:
https://issues.apache.org/jira/browse/SOLR-2592). This lets you
sidestep the limitations that grouping and joins have in distributed
mode when querying at the company level (which sounds like a primary
use case). Since all the docs for a specific company are in the same
shard, joins and grouping work as if you are not distributed.
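
A rough sketch of the routing idea, assuming the composite ID
convention that came out of SOLR-2592, where the part of the document
ID before '!' acts as the shard key. The CRC32 hash and the shard count
are stand-ins I picked for illustration, not what Solr uses internally:

```python
import zlib

NUM_SHARDS = 4  # hypothetical cluster size


def composite_id(company_id, article_id):
    """Build an ID of the form 'shardKey!docId' so the company drives routing."""
    return f"{company_id}!{article_id}"


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Pick a shard from the part before '!' (CRC32 is only a stand-in hash)."""
    shard_key = doc_id.split("!", 1)[0]
    return zlib.crc32(shard_key.encode("utf-8")) % num_shards


# Every article/company pair for company c1 shares the same prefix.
ids = [composite_id("c1", article) for article in ("a1", "a2", "a3")]
shards = {shard_for(doc_id) for doc_id in ids}
```

Because only the prefix is hashed, all three IDs land on one shard, so
grouping by companyID never has to merge groups across shards.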

3) Investigate Lucene's BlockJoinQuery and work being done in Solr to
take advantage of BJQ

See: 
http://blog.mikemccandless.com/2012/01/searching-relational-content-with.html
https://issues.apache.org/jira/browse/SOLR-3076
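
To show why block join semantics matter for the Reuters / green energy
case in your original email, here is a small Python sketch. It does not
use Lucene at all; the two functions are hypothetical and only contrast
matching against flattened multi-valued fields with matching where a
single child article has to satisfy every clause:

```python
def flat_match(company, source, term):
    """Flattened fields: source and term may come from different articles."""
    sources = {article["source"] for article in company["articles"]}
    contents = " ".join(article["contents"].lower() for article in company["articles"])
    return source in sources and term in contents


def block_join_match(company, source, term):
    """Parent/child semantics: one single article must satisfy both clauses."""
    return any(
        article["source"] == source and term in article["contents"].lower()
        for article in company["articles"]
    )


# Made-up company with one AP article about green energy and one
# unrelated Reuters article.
company = {
    "name": "Acme",
    "articles": [
        {"source": "AP", "contents": "Acme invests in green energy"},
        {"source": "Reuters", "contents": "Acme reports quarterly earnings"},
    ],
}

flat = flat_match(company, "Reuters", "green energy")
joined = block_join_match(company, "Reuters", "green energy")
```

The flattened check matches the company even though no Reuters article
mentions green energy, while the per-article check does not.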


On Tue, Feb 26, 2013 at 12:04 PM, Clint Miller <clint.mill...@gmail.com> wrote:
> Tim, thanks for the response. I definitely owe you a beer next time you're
> in Austin.
>
> I hadn't thought of your approach of turning things around. But, I don't
> think it will work because of some stuff I left out in my original email.
> First, the relationship between Company and Article is many-to-many. I
> could easily adapt your approach to take this into account:
>
> articleID_companyID (key)
> articleID
> companyID
> publishDate
> source
> company_name
> company_desc
> contents
>
> So, the search results return article/company pairs, and I group by
> companyID. But, what makes this messier is that 1) companies contain many,
> many fields, all of which are filterable and 2) companies contain a number
> of multi-valued tuple fields, articles are just one example. For example,
> companies have lists of addresses, which have city, state, country, county,
> street address, latitude, longitude, #residentsInCounty, etc. Or, companies
> have products which have a number of fields.
>
> Modifying the above article/company pairs approach to become an
> article/company/address/product/... (plus many other fields) approach
> creates a combinatorial explosion of documents.
>
> Sorry about not including these extra details in my original question. I
> think I simplified my original question a bit too much.
>
>
> On Tue, Feb 26, 2013 at 12:17 PM, Timothy Potter <thelabd...@gmail.com> wrote:
>
>> Hi Clint,
>>
>> Nice to see you on this list!
>>
>> What about treating each article as the indexed unit (i.e. each
>> article is a document) with structure:
>>
>> articleID
>> publishDate
>> source
>> company_name
>> company_desc
>> contents
>>
>> Then you can do grouping by company_name field.
>>
>> I happen to know you're very familiar with grouping in Solr so there
>> must be a reason you're not approaching the problem from this angle.
>>
>> Cheers,
>> Tim
>>
>> On Tue, Feb 26, 2013 at 10:32 AM, Clint Miller <clint.mill...@gmail.com> wrote:
>> > Suppose I have companies that have articles associated with them. I want
>> > to be able to search for companies based on the text in the articles. For
>> > example, suppose the Company and Article classes look like this:
>> >
>> > Company
>> > ----------------------
>> > name: String
>> > description: String
>> > articles: Article[]
>> >
>> > Article
>> > -------------------------
>> > publishDate: Date
>> > source: String (like Reuters or AP)
>> > contents: String
>> >
>> > I want to be able to do the following types of operations:
>> >
>> >    1. Search for companies by name, description, or article contents with
>> >    name and description boosted higher than article contents.
>> >    2. Facet on article sources.
>> >    3. Filter on article sources.
>> >    4. Boost companies with newer articles.
>> >
>> > One approach to representing this data is to use the following Solr
>> > fields:
>> >
>> > name_s
>> > description_s
>> > article_1_publish_date_tdt
>> > article_1_source_s
>> > article_1_contents_s
>> > ...
>> > article_publish_dates_tdts
>> > article_sources_ss
>> > article_contents_ss
>> >
>> > where article_publish_dates_tdts is a copyField of all the
>> > *_publish_date_tdt fields, article_sources_ss is a copyField of all the
>> > *_source_s fields, and article_contents_ss is a copyField of all the
>> > *_contents_s fields.
>> >
>> > This structure allows me to do my first 2 types of operations easily
>> > enough. To search on name, description, and contents, I just use qf set
>> > to name_s, description_s, and article_contents_ss, with name_s and
>> > description_s boosted accordingly. I can facet on article sources by
>> > using article_sources_ss.
>> >
>> > But, I'm having trouble figuring out how to filter on article sources or
>> > boost companies with newer articles. For example, say I search for "green
>> > energy" and filter by source = 'Reuters'. If I use
>> > "article_sources_ss:Reuters AND article_contents_ss:Green Energy" that
>> > won't give correct results since a company may have 2 articles, one about
>> > green energy from AP and another from Reuters that isn't about green
>> > energy at all. Yet, that company would match the query.
>> >
>> > Similarly, based on Tim Potter's online presentation, I understand how to
>> > boost based on recent dates if a company has only a single article and a
>> > single date. But, I'm not sure how to do the boosting with a list of
>> > articles and dates.
>> >
>> > Is this possible? Will I need to go down a path of writing custom
>> > functions? If so, any pointers on a custom function approach I should
>> > use?
>> >
>> > Thank you very much for any help.
>>
