Solr index size affected by duplication

sagandhi Sun, 18 Nov 2018 06:04:15 -0800

Hi,

This is a sample doc:
<doc>
        <field name="doc_type">parent</field>
        <field name="item">shirt</field>
        <doc>
            <field name="doc_type">child</field>
            <field name="c_COLOR">Red</field>
            <field name="c_SIZE">XL</field>
            <field name="c_PRICE">6</field>
        </doc>
        <field name="p_COLOR">Red</field>
        <field name="p_SIZE">XL</field>
        <field name="p_PRICE">6</field>
</doc>


The parent doc represents an item/object, and the nested docs contain
extended properties of that object.
While searching, the nested docs are filtered out so that the result count
is correct.
That required duplicating the nested doc fields in the parent doc.

This duplication of fields has made the Solr index very large, so I am
planning to remove the duplicated fields and use block join for the nested
doc fields.
That causes another serious problem: if the value I am searching
for is present only in a nested doc, no results are found, because nested
docs are filtered out as a rule. This used to work because, even with the
nested doc filtered out, the parent doc held the same value in its
duplicated fields and was still returned.

I have come up with 2 approaches to solve this.
1. Populate the global field while indexing:
For each field in a nested doc, add its value to the global field of
the parent doc.
<doc>
        <field name="doc_type">parent</field>
        <doc>
            <field name="doc_type">child</field>
            <field name="c_COLOR">Red</field>
            <field name="c_SIZE">XL</field>
            <field name="c_PRICE">6</field>
        </doc>
        <field name="global">Red</field>
        <field name="global">XL</field>
        <field name="global">6</field>
</doc>
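
To illustrate approach 1, here is a minimal sketch of how the parent's global field could be filled on the client side before the doc is sent to Solr. It assumes the doc is built as a Python dict in Solr's JSON update format, with nested docs under the `_childDocuments_` key; the helper name `add_global_values` is made up for this example, and it skips repeated values, which may or may not be what you want.

```python
def add_global_values(parent, child_key="_childDocuments_",
                      prefix="c_", target="global"):
    """Return a copy of `parent` whose `target` field also holds every
    child field value whose name starts with `prefix`.

    Hypothetical helper for illustration; not part of any Solr client.
    """
    enriched = dict(parent)

    # Normalise any existing target value to a list (multi-valued field).
    existing = enriched.get(target, [])
    values = [existing] if isinstance(existing, str) else list(existing)

    for child in parent.get(child_key, []):
        for name, value in child.items():
            # Only copy the nested-doc properties, and skip repeats.
            if name.startswith(prefix) and value not in values:
                values.append(value)

    enriched[target] = values
    return enriched


doc = {
    "doc_type": "parent",
    "item": "shirt",
    "_childDocuments_": [
        {"doc_type": "child", "c_COLOR": "Red", "c_SIZE": "XL", "c_PRICE": "6"},
    ],
}

enriched = add_global_values(doc)
print(enriched["global"])
```

The child docs are left untouched, so block-join queries on the nested fields keep working alongside plain searches on global.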

2. Use a new copy field:
The nested doc fields follow a name pattern (c_*) that no other fields use,
so I can easily create another copy field, c_global, that contains only the
nested doc fields.
While querying, I then use a block join on this copy field along with the
existing global field, like so:

global:(red) OR {!parent which=doc_type:parent}c_global:(red)

Add this to the schema:
<copyField source="c_*" dest="c_global"/>
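
As a rough sketch of how the combined query for approach 2 might be assembled on the client side: the helper name `build_query` and its parameter names are invented for this example, and it assumes the search term is a plain word that needs no Solr query escaping.

```python
from urllib.parse import urlencode


def build_query(term, parent_filter="doc_type:parent",
                parent_field="global", child_field="c_global"):
    """Combine a plain search on the parent's field with a block-join
    search on the nested docs' copy field.

    Hypothetical helper; `term` is assumed to contain no special characters.
    """
    # {!parent which=...} maps matching nested docs to their parent docs.
    block_join = "{!parent which=%s}%s:(%s)" % (parent_filter, child_field, term)
    return "%s:(%s) OR %s" % (parent_field, term, block_join)


query = build_query("red")
params = urlencode({"q": query})  # URL-encoded form for a /select request

print(query)
print(params)
```

Keeping the parent filter and the field names as parameters makes it easy to try the same query shape against a different copy field.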

3. I came across another approach/hack by accident.
I had modified the existing schema to remove the duplicate parent fields, but
the data I used for reindexing still contained them.
So the global field contains values from both the parent and nested fields,
but the indexed doc itself skips the duplicate parent fields because the
schema doesn't have them.
I was able to search for nested doc field values, and the total index size
was smaller than with either of the two approaches above.

Can someone please suggest which option is best, and why?

Thanks!
Soham



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
